Data loading

In this chapter, we will see how we can load the data in the three formats that have been published (MySQL, TSV and HDF).

Working with MySQL

Follow this link to see how to install MySQL and setup a database.

Configuring MySQL settings

Here, one can either either configure a my.cnf file, or store the settings manually in a config.json file as given below:

{
    "host": "localhost", 
    "password": "password!", 
    "user": "username", 
    "port": "port", 
    "database": "db_name", 
    "datadir": "dataFolder"
}

Loading MySQL dumps into the database

For each of the dump files (.sql), load them with the command mysql db_name < dump-file.sql. db_name is the name of database to contain the tables, and the dump-file.sql is the file containing the table structure and data.


We have written a short bash script loader.sh for this, where we give the path to the directory containing dump files (-d $dataDir) and the name of the MySQL database (-n $database):

> loader.sh -d $dataDir -n $database

Reading MySQL data

There are different ways for retrieving data lying in a MySQL database. Here, we show two approaches: one with Pandas and one with a custom function.

We can use the RMySQL package to connect to the MySQL database.

install.packages("RMySQL")
library(RMySQL)
mydb = dbConnect(MySQL(), user=user, password=passwd, host=host, port=as.integer(port), dbname=database)
rs = dbSendQuery(mydb, "select * from metadata")
data = fetch(rs, n=5)
data

Loding TSV files

It is fairly straightforward to work with the .tsv files.

For example, with Pandas:

import os
import pandas as pd
tsvFile = os.path.join(dataDir, 'metadata.tsv')
df = pd.read_csv(tsvFile, sep='\t')
df.head()

To read the .tsv files in R one can use either base functions or for example readr.

Loading HDF files

HDF5 is a data software library that is built for fast I/O processing and storage.

Note that the rRNA.h5 file consists of 4737 batches, each of a 10000 rows. In order to load it properly, loop through the keys to load all of the data. The key format is table_{i} and i=0,1,2,…,4736.

The Pandas library contains the function pandas.read_hdf.

import os
import pandas as pd
h5File = os.path.join(dataDir, 'metadata.h5')
df = pd.read_hdf(h5File)
df.head()

There are no straight forward ways in R to load the HDF5 file types, although there are several libraries that claim to do it, like hdf5r and rhdf5. Let me know if you can get it to work!