Importing big CSV files into relational databases like MariaDB or MySQL

| last update: 2016-11-26

It is quite easy to import small data sets using the import assistants / wizards of graphical database interfaces. However, big data files several gigabytes in size pose a problem for those interfaces.

Using the Linux command line interface removes those hurdles. The import can be scripted and run overnight. The following describes the method I use to load datasets several gigabytes in size into MariaDB. With almost no change the same method can be applied to MySQL. Transferring this to other relational database management systems should be simple.

Step 1: transferring the data files to the database server

For a start the dataset should be transferred to the database server or a system / volume / SAN / NAS which is connected to it via high bandwidth.

Depending on the connection and the size of the data this may take some time. Compression tools like 7z or bzip2 decrease the size of typical data files dramatically. The transfer method should support resuming an interrupted upload instead of starting from scratch. For downloads the command wget -c fileUrl does the trick. Rsync is also well suited for such a task.
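After the transfer it is worth verifying that the file arrived intact by comparing checksums on both ends. A minimal sketch in Python (the file name is a placeholder), which reads the file in chunks so even multi-gigabyte files fit in constant memory:

```python
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# usage (assuming the file was transferred to the current directory):
# print(sha256sum('data.csv'))
```

Compare the output with the checksum computed on the source system, for example via sha256sum data.csv on the shell.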

Step 2: checking and verifying data before input

In practice data sets are quite often messy, even if they were created by subject matter experts. It is common that documentation is missing or vague. The assumed data type for a column often does not fit. Sometimes data ends up in the wrong column.

Small data sets can be checked with a text editor (like vim, emacs, …) or a spreadsheet (like OpenOffice Calc or Microsoft Excel). Yet if the data set is several gigabytes in size, those methods cannot be applied.

If the input format is not binary, having a look at the beginning of the file is helpful. To avoid loading the whole file, on a Linux system the command head can be used. The command head -n 100 data.csv returns the first 100 lines of the file named data.csv. This is very fast because head stops reading after the specified number of lines. If the input format is CSV, this output contains the header and several lines for a first check.

A script and regular expressions can be used to check the whole file line by line against the expected format. It should report the numbers of the lines that do not match so one can have a closer look. Some rudimentary example code in Python (version 3!):


import os
import re

InputDir = 'path to your directory'
InputFile = 'datafile.csv'


def PrepareImport():
    # counters
    ValidHeader = 0
    ValidLines = 0
    InvalidLines = 0
    LineNumber = 1

    # patterns
    # only valid header line: "abc;def"
    HeaderPattern = re.compile('abc;def')
    # valid line pattern: 9 digits, semicolon, text of variable length
    LinePattern = re.compile('[0-9]{9};.*')

    # store line numbers of invalid lines
    InvalidStore = []

    # read; 'utf-8-sig' also accepts files starting with a UTF-8 byte order mark
    with open(os.path.join(InputDir, InputFile), 'r',
              encoding='utf-8-sig') as DataFile:
        for line in DataFile:
            if LinePattern.search(line):
                ValidLines += 1
            elif HeaderPattern.search(line):
                ValidHeader += 1
            else:
                InvalidLines += 1
                InvalidStore.append(LineNumber)

            LineNumber += 1

    # result:
    if not ValidHeader:
        print('header missing or wrong format')

    print('found ' + str(ValidLines) + ' valid lines')

    if InvalidLines > 0:
        print('the following ' + str(InvalidLines) +
              ' lines are not valid: ' + str(InvalidStore))


# run it:
PrepareImport()

Step 3: defining the target table and (if needed) an import table

Before the import can start the structure of the target table must be created. That means columns and their data types have to be defined, the table collation has to be chosen, permissions have to be set and accounts need to be created.

Sometimes the target table is already in place but the structure of the input data does not match. Alternatively, it may be that the target table contains production data and the new data need to be processed before they can be added. In such cases a second table solely for the import can be used. After processing the data can be moved to the target table.

When importing CSV files many applications do not guess which data type each field contains. Therefore all fields are set to something like VARCHAR(255). Yet if the input was checked beforehand, fields can be set to a data type matching their content. This spares converting the columns later.
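Continuing the example from step 2 (nine digits, a semicolon, then text of variable length), a matching table definition might look like the following sketch; the table name, column names and character set are placeholders to adapt:

```sql
CREATE TABLE import_table (
    abc INT UNSIGNED NOT NULL,  -- nine digits fit into INT UNSIGNED
    def TEXT                    -- text of variable, possibly long, length
) DEFAULT CHARSET = utf8mb4;
```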

A simple trick saves time: export the first few lines of the data file, including the header, using the head command and a redirect (head data.csv > file.csv). Now import those lines using your favorite database interface's wizard. If necessary, do not forget to check the box which tells the application to interpret the first line as a header. It will create a table structure, at least with all fields set to VARCHAR. Some applications like phpMyAdmin use the first lines to guess the data types of the columns. Empty the table (for example using the SQL command TRUNCATE TABLE). Now edit the structure: all that is left to do is changing the data types if they do not already match.

Step 4: loading the data

MariaDB and MySQL share the mysqlimport application. A sample command to load a data file looks like this:

mysqlimport --debug-info --ignore-lines=1 --user=importUser -p --fields-terminated-by=',' --use-threads=2 targetDatabase /databaseimports/main.csv

Mysqlimport has many more options which help to read more input variants and to script the process.
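Under the hood mysqlimport issues LOAD DATA INFILE statements and derives the table name from the file name (main.csv is loaded into the table main). The same import can therefore also be run directly from an SQL client; a rough equivalent of the command above:

```sql
LOAD DATA INFILE '/databaseimports/main.csv'
INTO TABLE main
FIELDS TERMINATED BY ','
IGNORE 1 LINES;
```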