Data loading¶
In this chapter, we will see how to load the data in the three formats in which it has been published (MySQL, TSV and HDF).
Working with MySQL¶
Follow this link to see how to install MySQL and set up a database.
Configuring MySQL settings¶
Here, one can either configure a my.cnf file, or store the settings manually in a config.json file as given below:
{
"host": "localhost",
"password": "password!",
"user": "username",
"port": "port",
"database": "db_name",
"datadir": "dataFolder"
}
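The settings stored in config.json can then be reused from Python. The following is a minimal sketch, assuming the file lies in the working directory and that mysql-connector-python (installed further below in this chapter) is used for the connection:
import json
import mysql.connector as connection

# Read the connection settings from config.json (fields as shown above).
with open("config.json") as fh:
    cfg = json.load(fh)

# The port is stored as a string in the example config, so cast it to an integer.
mydb = connection.connect(host=cfg["host"], port=int(cfg["port"]),
                          database=cfg["database"], user=cfg["user"],
                          passwd=cfg["password"], use_pure=True)
mydb.close()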
Loading MySQL dumps into the database¶
For each of the dump files (.sql), load them with the command mysql db_name < dump-file.sql, where db_name is the name of the database to contain the tables and dump-file.sql is the file containing the table structure and data.
We have written a short bash script loader.sh for this, where we give the path to the directory containing dump files (-d $dataDir
) and the name of the MySQL database (-n $database
):
> loader.sh -d $dataDir -n $database
Script: loader.sh
#!/bin/sh

instructions() {
    echo "Load MySQL dump file(s) into a MySQL database." >&2
    echo "Usage: $0 [-d DIRECTORY] [-n DATABASE]" >&2
    echo "  -d   Directory with dump files." >&2
    echo "  -n   Name of database to load tables into." >&2
}

# The leading ':' enables silent error handling, so the \? and : cases
# below receive the offending option in $OPTARG.
while getopts :d:n:h option
do
    case "${option}" in
        d) dumpdir=${OPTARG};;
        n) dbname=${OPTARG};;
        h) instructions; exit 0;;
        \?)
            echo "Option '-$OPTARG' is not a valid option." >&2
            instructions
            exit 1
            ;;
        :)
            echo "Option '-$OPTARG' needs an argument." >&2
            instructions
            exit 1
            ;;
    esac
done

# Find every .sql dump file in the directory and load it into the database.
find "$dumpdir" -type f -name "*.sql" | while read -r dumpfile; do
    mysql "$dbname" < "$dumpfile"
    echo "Loaded $dumpfile into MySQL database $dbname."
done
Reading MySQL data¶
There are different ways of retrieving data from a MySQL database. Here, we show two approaches: one with Pandas and one with a custom function.
Pandas
It is possible to use the pandas.read_sql function to query the database, although this also requires mysql-connector-python to be installed.
python -m pip install mysql-connector-python
import mysql.connector as connection
import pandas as pd
mydb = connection.connect(host=host, port=port, database=database, user=user, passwd=passwd, use_pure=True)
query = "select * from metadata"
df = pd.read_sql(query, mydb)
mydb.close()
df.head()
Custom function
This is a custom function that calls the mysql command-line client installed on your system; the connection settings are passed through the args= argument of the function.
import subprocess
import pandas as pd
from io import StringIO
def query_db(query, args=''):
    # Build the shell command that runs the query through the mysql client.
    cmd = "mysql {} -e \"{}\"".format(args, query)
    p = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if p.returncode > 0:
        print("Failed to query database with error:")
        print(p.stderr.decode())
    else:
        # The client returns tab-separated output, which Pandas can parse directly.
        df = pd.read_csv(StringIO(p.stdout.decode()), sep='\t')
        return df
cli_args = f"--database={database} --host={host} --port={port} --user={user} --password={passwd}"
df = query_db("select * from metadata", args=cli_args)
df.head()
We can use the RMySQL package to connect to the MySQL database.
install.packages("RMySQL")
library(RMySQL)
mydb = dbConnect(MySQL(), user=user, password=passwd, host=host, port=as.integer(port), dbname=database)
rs = dbSendQuery(mydb, "select * from metadata")
data = fetch(rs, n=5)
data
Loading TSV files¶
It is fairly straightforward to work with the .tsv files.
For example, with Pandas:
import os
import pandas as pd
tsvFile = os.path.join(dataDir, 'metadata.tsv')
df = pd.read_csv(tsvFile, sep='\t')
df.head()
To read the .tsv files in R, one can use either base functions or, for example, readr.
Base R
data <- read.csv(file.path(dataDir, 'metadata.tsv'), sep='\t')
readr
install.packages("readr")
data <- readr::read_tsv(file.path(dataDir, 'metadata.tsv'))
Loading HDF files¶
HDF5 is a data format and accompanying software library built for fast I/O and efficient storage.
Note that the rRNA.h5 file consists of 4737 batches, each of 10000 rows. To load it in full, loop through the keys to load all of the data; the key format is table_{i} with i = 0, 1, 2, …, 4736 (a sketch of this is given after the example below).
The Pandas library contains the function pandas.read_hdf
.
import os
import pandas as pd
h5File = os.path.join(dataDir, 'metadata.h5')
df = pd.read_hdf(h5File)
df.head()
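For the batched rRNA.h5 file described above, a minimal sketch, assuming the file lies in dataDir and contains the keys table_0 through table_4736, loops over the keys and concatenates the batches:
import os
import pandas as pd

h5File = os.path.join(dataDir, 'rRNA.h5')

# Read each batch by its key (table_0 ... table_4736) and concatenate them.
batches = [pd.read_hdf(h5File, key=f'table_{i}') for i in range(4737)]
df = pd.concat(batches, ignore_index=True)
df.head()
Here ignore_index=True resets the row index across batches; drop it if the original per-batch index should be preserved.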