Importing big CSV files into relational databases like MariaDB or MySQL

| last update: 2016-11-26

It is quite easy to import small data sets using the import assistants / wizards of graphical database interfaces. However, big data files several gigabytes in size pose a problem for those interfaces.

Using the Linux command line interface removes those hurdles. The import can be scripted and run overnight. The following describes the method I use to load datasets several gigabytes in size into MariaDB. With almost no change the same method can be applied to MySQL. Transferring this to other relational database management systems should be simple.

Step 1: transferring the data files to the database server

For a start the dataset should be transferred to the database server or a system / volume / SAN / NAS which is connected to it via high bandwidth.

Depending on the connection and the size of the data this may take some time. Compression tools like 7z or bzip2 decrease the size of typical data files dramatically. The transfer method should support resuming an interrupted upload instead of starting from scratch. For downloads the command wget -c fileUrl does the trick. Rsync is also well suited for such a task.
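After the transfer it is worth verifying that the file arrived intact by comparing checksums on both ends. A minimal sketch in Python (the file name is a placeholder), which reads the file in chunks so even multi-gigabyte files fit in constant memory:

```python
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# usage (assuming the file was transferred to the current directory):
# print(sha256sum('data.csv'))
```

Compare the output with the checksum computed on the source system, for example via sha256sum data.csv on the shell.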

Step 2: checking and verifying data before input

In practice data sets are quite often messy, even if they were created by subject matter experts. It is common that documentation is missing or vague. The assumed data type for a column often does not fit. Sometimes data ends up in the wrong column.

Small data sets can be checked with a text editor (like vim, emacs, …) or a spreadsheet (like OpenOffice Calc or Microsoft Excel). Yet if the data set is several gigabytes in size, those methods cannot be applied.

If the input format is not binary, having a look at the beginning of the file is helpful. To avoid loading the whole file, on a Linux system the command head can be used. The command head -n 100 data.csv returns the first 100 lines of the file named data.csv. This is very fast because head stops reading after the specified number of lines. If the input format is CSV, this output contains the header and several lines for a first check.

A script and regular expressions can be used to check the whole file line by line against the expected format. It should report the numbers of the lines that do not match so one can have a closer look. Some rudimentary example code in Python (version 3!):


import os
import re

InputDir = 'path to your directory'
InputFile = 'datafile.csv'


def PrepareImport():
    # counters
    ValidHeader = 0
    ValidLines = 0
    InvalidLines = 0
    LineNumber = 1

    # patterns
    # only valid header line: "abc;def"
    HeaderPattern = re.compile('abc;def')
    # valid line pattern: 9 digits, semicolon, text of variable length
    LinePattern = re.compile('[0-9]{9};.*')

    # store line numbers of invalid lines
    InvalidStore = []

    # read; 'utf-8-sig' also accepts files starting with a UTF-8 byte order mark
    with open(os.path.join(InputDir, InputFile), 'r',
              encoding='utf-8-sig') as DataFile:
        for line in DataFile:
            if LinePattern.search(line):
                ValidLines += 1
            elif HeaderPattern.search(line):
                ValidHeader += 1
            else:
                InvalidLines += 1
                InvalidStore.append(LineNumber)

            LineNumber += 1

    # result:
    if not ValidHeader:
        print('header missing or wrong format')

    print('found ' + str(ValidLines) + ' valid lines')

    if InvalidLines > 0:
        print('the following ' + str(InvalidLines) +
              ' lines are not valid: ' + str(InvalidStore))


# run it:
PrepareImport()

Step 3: defining the target table and (if needed) an import table

Before the import can start the structure of the target table must be created. That means columns and their data types have to be defined, the table collation has to be chosen, permissions have to be set and accounts need to be created.

Sometimes the target table is already in place but the structure of the input data does not match. Alternatively, it may be that the target table contains production data and the new data need to be processed before they can be added. In such cases a second table solely for the import can be used. After processing the data can be moved to the target table.

When importing CSV files many applications do not guess which data type each field contains. Therefore all fields are set to something like VARCHAR(255). Yet if the input was checked beforehand, fields can be set to a data type matching their content. This spares converting the columns later.
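Continuing the example from step 2 (nine digits, a semicolon, then text of variable length), a matching table definition might look like the following sketch; the table name, column names and character set are placeholders to adapt:

```sql
CREATE TABLE import_table (
    abc INT UNSIGNED NOT NULL,  -- nine digits fit into INT UNSIGNED
    def TEXT                    -- text of variable, possibly long, length
) DEFAULT CHARSET = utf8mb4;
```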

A simple trick saves time: export the first few lines of the data file, including the header, using the head command and a redirect (head data.csv > file.csv). Now import those lines using your favorite database interface's wizard. If necessary, do not forget to check the box which tells the application to interpret the first line as a header. It will create a table structure, at least with all fields set to VARCHAR. Some applications like phpMyAdmin use the first lines to guess the data types of the columns. Empty the table (for example using the SQL command TRUNCATE TABLE). Now edit the structure: all that is left to do is changing the data types if they do not already match.

Step 4: loading the data

MariaDB and MySQL share the mysqlimport application. A sample command to load a data file looks like this:

mysqlimport --debug-info --ignore-lines=1 --user=importUser -p --fields-terminated-by=',' --use-threads=2 targetDatabase /databaseimports/main.csv

Mysqlimport has many more options which help to read more input variants and to script the process.
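Under the hood mysqlimport issues LOAD DATA INFILE statements and derives the table name from the file name (main.csv is loaded into the table main). The same import can therefore also be run directly from an SQL client; a rough equivalent of the command above:

```sql
LOAD DATA INFILE '/databaseimports/main.csv'
INTO TABLE main
FIELDS TERMINATED BY ','
IGNORE 1 LINES;
```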