Storage and Compute Infrastructures¶

Storage¶

Package format¶

Commonly, when data is delivered by the sequencing facility, they will come in what is called a “tar” format. This is a way of grouping files into one package. Note, as opposed to the “zip” format, files in a tar archive are only packaged into one file. Files that are in the archive may or may not be compressed, but that is done independently of the packaging. Files inside of such an archive are commonly compressed with a tool called “gzip”.

Note: commonly the tools to unpack a tar archive are not available on a windows machine. These tools are available on linux and in the terminal on Macs (Mac systems are linux based). However, on windows the terminal program Git Bash does contain these tools, in case there is a need for unpacking on a windows machine.

To unpack a tar file the following command can be used:

tar -xvf mypackage.tar

The -x means extract, the -v means show me the progress on the screen and -f means the filename of the archive is given on the command line.

Once an archive file has been unpackaged, there will likely be a folder or a directory present on your computer filled with files ending in “.gz”. This indicates that your sequencing files are compressed, most likely using the tool mentioned above. Many sequence analysis tools are able to work with compressed files, thus these can be left as is. If not, the “gzip” tool can be used to unpack the files. In true linux fashion, however, the command for unpacking these files is not “gzip”, it is “gunzip”.

Space¶

Sequencing data can consume quite a bit of space. Generally speaking, one set of Illumina paired end read files for one isolate will take up about 0.5-1GB of space. However, in the course of processing this is commonly likely to expand somewhere between 5-10 times the original space of the raw data. As is described later, the analysis of the data mostly consists of processing the files, and then storing the results as a new set of files. Hence, it is likely that this process will produce somewhere between 5-10 different sets of derived data. The space consumed by these derived files are likely to shrink by each step, but many of these processing steps will not reduce the file size drastically.

Here are some size estimates from an assembly pipeline consisting of commonly used tools. Reads here means the raw untrimmed data from the sequencer. Work folder here indicates the size of the files produced by the pipeline, i.e. trimmed files, bam files used for polishing etc. These do not necessarily need to be kept. Results indicates the size of the output that would be used further on, i.e. assemblies, annotation files, antibiotic resistance results, etc. These data are likely to be kept and used onwards.

#isolates	Reads	Results	Work	Total
10	5 Gb	750 Mb	10 Gb	15 Gb
100	40 Gb	10 Gb	100 Gb	150 Gb
500	200 Gb	50 Gb	500 Gb	750 Gb