Accessing Databases with R
Rüdiger Voigt | last update: 2019-08-02
Although R works well with CSV and Excel files, directly accessing a database management system like Oracle, MS SQL Server, PostgreSQL or MySQL / MariaDB has advantages:
- Access to real time data
- The ability to use SQL to shape and / or filter data.
- Avoiding extra steps for export / import of data.
Step 1: Preparing the Database Management System
For security reasons the standard setup for most database management systems (dbms) is to accept only connections that originate from the same machine they are running on. This is the
localhost which has the special IP adress
If the R-script trying to access the database is running on the very machine please skip to step two. If it resides on a different host several things have to be done:
- The DBMS configuration has to be changed to allow remote access. If possible access should get limited to specific machines or a subnet. If the world can access the interface it is almost certain that there will be bruteforce attacks trying to guess access credentials. It is advisable to configure fail2ban (or similar system services) to block IP addresses caught trying to bruteforce credentials for the database.
- In many cases an additional firewall protects the database server. Depending on its configuration an exception for the server and the used port has to defined.
- The connection between the R-script and the database server should be encrypted.
Step 2: Setting up an User Account
R should never access the DBMS with the credentials of the root user. The script containing it may be on a remote machine outside the admin's control. It is fast and easy to create a special user with all permissions R needs to work with the database but none more.
A password policy should be enforced to prevent too easy / standard passwords.
pwgen tool (available on Linux) is useful to generate random passwords. The command
pwgen -B -c -n -s -y 64 for example generates multiple passwords with a length of 64 characters which avoid ambiguous characters (like O and 0), include at least one capital letter, include numbers, are hard to memorize (i.e random), and contain special characters.
Step 3: Installing the Database Driver in R
An upgrade of R is often advisable. Then the specific driver for the DBMS has to be installed. Many are available via CRAN:
- Cassandra: RCassandra
- MariaDB / MySQL: RMariaDB
- MongoDB: RMongo
- Oracle: ROracle
- PostgreSQL: RPostgreSQL
See CRAN for documentation and other available drivers.
Step 4: Testing Database Access
The setup should be tested as there can be unforeseen problems through routing or security appliances.
EXAMPLE: Accessing a MariaDB System
- The MariaDB config file had been changed to make the DBMS listen to the world.
- The DBMS has to be rebooted to load the new configuration.
- fail2ban was configured to detect brute force attacks on the DBMS login. Three tries was a good default, but the ban time was increased significantly to multiple days.
- User and server certificates were created to allow an encrypted connection.
- A database account was created:
- The Linux command line tool pwgen was used to create a random passphrase with more than 32 characters to make bruteforce attacks futile.
- The account was not limited to access the dbms from a specific host (% instead of localhost or an IP).
- The account's permissions are limited to read a specific database. No other rights were granted.
- The use of a client certificate was enforced.
- On the machine running the R-Script an upgrade of R and the install of the corresponding database driver was done.
- A test script was run on the client:
# load the library matching the database software library('RMariaDB') # create a connection object and # store all necessary connection parameters con <- dbConnect(RMariaDB::MariaDB(), dbname = "databaseName", username ="loginUserName", password = "secret", host = 'IP_or_Domain_without_http', port = 3306) # prepare the query: # TestTable has been created beforehand query = "SELECT x, y FROM TestTable;" # send the query and name the result: rs = dbSendQuery(con, query) # actually get the result and store it # into a dataframe # (using dbFetch as fetch is deprecated) df = dbFetch(rs, -1) # close the connection. dbDisconnect(con) # start working with the dataframe: str(df) plot(df)