Accessing Databases with R

| last update: 2016-10-03

Although R works well with CSV and Excel files, directly accessing a database management system like Oracle, MS SQL Server, PostgreSQL or MySQL / MariaDB has advantages:

  • Access to real time data
  • The ability to use SQL to shape and / or filter data.
  • Avoiding extra steps for export / import of data.
  • ...

Step 1: Preparing the Database Management System

For security reasons the standard setup for most database management systems (dbms) is to accept only connections that originate from the same machine they are running on. This is the localhost which has the special IP adress

If the R-script trying to access the database is running on the very machine please skip to step two. If it resides on a different host several things have to done:

  1. The DBMS configuration has to be changed to allow remote access. If possible access should get limited to specific machines or a subnet. If the world can access the interface it is almost certain that there will be bruteforce attacks trying to guess access credentials.
  2. In many cases an additional firewalls protects the database server. Depending on its configuration an exception for the server and the used port has to defined.
  3. The connection between the R-script and the database server should be encrypted.
  4. A password policy should be enforced to prevent to easy / standard passwords. Besides that it is advisable to configure fail2ban (or similar system services) to block IP addresses caught trying to bruteforce credentials for the database.

Step 2: Setting up an User Account for R

R should never access the DBMS with the credentials of the root user. The R-script containing it may be on a remote machine outside the admin's control. It is fast and easy to create a special user with all permissions R needs to work with the database but none more.

Step 3: Installing the Database Driver in R

R has specific drivers (available via CRAN) for several DBMS:

  • Cassandra: RCassandra
  • MariaDB / MySQL: RMariaDB
  • MongoDB: RMongo
  • Oracle: ROracle
  • PostgreSQL: RPostgreSQL

See CRAN for documentation and other available drivers.

Step 4: Testing Database Access

The setup should be tested as there can be unforeseen problems through routing or security appliances.

EXAMPLE: Accessing a MariaDB System

  1. The MariaDB config file is changed to make the DBMS listen to the world.
  2. The DBMS has to be rebooted to load the new configuration.
  3. User and server certificates are created to allow an encrypted connection.
  4. A database account is created:
    • The Linux command line tool pwgen is used to create a random passphrase with at least 25 characters to make bruteforce attacks futile.
    • The account will not be limited to access the dbms from a specific host (% instead of localhost or an IP).
    • The account's permissions get limited to read a specifc database. No other rights get granted.
    • The use of a client certificate is enforced.
    • This results in something like the following command:
  5. fail2ban gets configured to detect brute force attacks. Three tries is a good default, but the ban time should be increased significantly.
  6. On the machine running the R-Script an upgrade of R and the install of the corresponding database driver get done. As it is MariaDB RMySQL is compatible.
  7. A test script is run on the client:

# load the library matching the database software

# create a connection object and 
# store all necessary connection parameters
con <- dbConnect(RMySQL::MySQL(), 
dbname = "databaseName", 
username ="loginUserName", 
password = "secret", 
host = 'IP_or_Domain_without_http', 
port = 3306)

# prepare the query:
# TestTable has been created beforehand
query = "SELECT x, y FROM TestTable";

# send the query and name the result:
rs = dbSendQuery(con, query);

# store the result into a dataframe
# (using dbFetch as fetch is deprecated)
df = dbFetch(rs, -1);

# If all results fit in memory dbGetQuery can 
# be used to send, fetch and clear at once. 
# See RMySQL docs for this.

# close the connection.

# start working with the dataframe: