Data preprocessing

Nucleotide sequences in sequencing files can be of low quality. The sequences therefore need to be processed so that the overall quality of the sequence file is improved before it is used in any kind of data analysis. Adapters might still be attached to the sequenced fragments, and these are usually removed before further processing. Also, while Illumina data these days arrive already basecalled (i.e. the signal has been translated into DNA letters), this might not be the case for PacBio and Nanopore data. For those platforms, basecalling might have to be performed before adapter and quality trimming.

Quality and adapter trimming

Before further analysis, it is common to evaluate the quality of the data and to remove any adapters found in the reads, as well as low-quality regions. Commonly used tools often do both of these things.
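To make adapter removal concrete, here is a minimal sketch in Python. It assumes exact matching only and an adapter at the 3' end of the read; real trimmers also tolerate mismatches and indels. The function name trim_adapter and the min_overlap parameter are illustrative and are not taken from any particular tool.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only.

    Illustrative sketch: real trimming tools also allow mismatches.
    """
    # Case 1: the full adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter is present at the read end.
    # Try the longest possible overlap first.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read


adapter = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", adapter))  # full adapter inside the read
print(trim_adapter("ACGTACGTAGATCG", adapter))            # partial adapter at the end
```

The min_overlap threshold is a trade-off: a low value removes more adapter remnants but also clips a few genuine bases that happen to match the start of the adapter.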

Quality is denoted per base via the Phred score (Q), which expresses the probability P that the base call is wrong: Q = -10 log10(P). Q20 thus corresponds to a 1 in 100 chance of error, and Q30 to 1 in 1000. For Illumina data, the quality of a read will commonly be quite high in the beginning (Q30-40), but then fall along the read, dipping towards the end. Commonly anything below Q15-Q20 is regarded as bad, and portions of the reads where the average quality gets too low are generally trimmed, i.e. removed from the read. The first read (R1) in a pair commonly has better quality than the second read (R2).
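The relation between Phred scores, error probabilities and trimming can be sketched in Python as follows. This assumes the common Phred+33 fastq encoding and a simple sliding-window rule (cut the read at the first window whose mean quality drops below a threshold). The function names, window size and threshold are illustrative choices, not the defaults of any specific tool.

```python
def phred_to_error_prob(q):
    """Convert a Phred score to the probability that the base is wrong."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a fastq quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, window=4, min_mean_q=20, offset=33):
    """Cut the read at the start of the first window with mean quality < min_mean_q."""
    quals = decode_qualities(quality_string, offset)
    for start in range(0, len(quals) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_mean_q:
            return sequence[:start], quality_string[:start]
    # No window fell below the threshold: keep the read as it is.
    return sequence, quality_string


print(phred_to_error_prob(30))                    # error probability at Q30
print(decode_qualities("II5#"))                   # per-base Phred scores
print(trim_3prime("ACGTACGTAC", "IIIIII####"))    # low-quality tail removed
```

Because the cut is made at the start of the failing window, a few good bases just before the low-quality stretch can be lost; a larger window smooths out single bad bases at the cost of a less precise cut position.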

Nanopore basecalling and trimming

Nanopore sequence data is delivered in the fast5 file format, which contains the raw signal data. That data can be translated into fastq files using dedicated basecallers such as Guppy or Bonito. Guppy comes with two different models for basecalling: a fast model and a high accuracy model. As the names indicate, the high accuracy model gives more accurate basecalling, and better detection and binning of barcoded reads, than the fast model. The average quality scores of sequences generated by Oxford Nanopore instruments are between 7 and 14, with quality being variable along the reads. Any sequences with a Q-value below 7 are usually discarded. In addition, trimming the first bases of each read (10-50) improves the overall quality score of the reads. Trimming of adapters and low-quality bases at the end of the sequences is also performed.
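The filtering and head trimming described above can be sketched in Python as follows. This is a minimal illustration that assumes reads are already basecalled and available as (name, sequence, quality string) tuples with Phred+33 encoding. The mean read quality is computed here by averaging error probabilities rather than raw Q-values, which is an assumption about how the read-level Q-value is defined; the function names and default values are illustrative.

```python
import math


def mean_read_quality(quals):
    """Mean read quality, computed by averaging per-base error probabilities."""
    if not quals:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_error)


def filter_and_headcrop(records, min_q=7, headcrop=10, offset=33):
    """Drop reads with mean quality below min_q, then remove the first bases."""
    kept = []
    for name, seq, qual in records:
        quals = [ord(ch) - offset for ch in qual]
        if mean_read_quality(quals) < min_q:
            continue  # read is too poor overall
        if len(seq) <= headcrop:
            continue  # nothing would be left after cropping
        kept.append((name, seq[headcrop:], qual[headcrop:]))
    return kept


reads = [
    ("good_read", "A" * 30, "5" * 30),  # '5' encodes Q20
    ("poor_read", "A" * 30, "$" * 30),  # '$' encodes Q3
]
for name, seq, qual in filter_and_headcrop(reads):
    print(name, len(seq))
```

Averaging error probabilities gives a lower, more conservative read quality than averaging the Q-values directly, because a few very poor bases dominate the mean.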

Pacific Biosciences data

PacBio sequences are delivered as BAM files, in which the bases do not have meaningful quality scores, even though the actual quality of the bases is highly variable. Depending on the sequencing mode used, Continuous Long Reads (CLR) or Circular Consensus Sequencing (CCS), the PacBio reads can be corrected or not. The raw PacBio sequences can be converted into fastq or fasta files. When converted to fastq, every quality score is written as an exclamation mark, “!”, which encodes a Phred score of 0. CLR reads can easily be converted to fastq using the program bam2fastx, but they keep these uninformative quality scores. Such reads are best used in combination with Illumina reads to generate a hybrid assembly. CCS reads are demultiplexed and can be filtered on the number of passes using the SMRT portal software. More passes give better sequences afterwards. The CCS reads can then be converted to fastq reads with ccs, which uses each of the subreads in an alignment to polish the reads and generate high-quality bases. It also removes the hairpin sequences from the CCS reads. At that point, only limited or no trimming of the reads is needed.
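Two of the points above can be illustrated with a short Python sketch: recognising reads whose quality string consists only of the placeholder “!” (Phred 0 under the Phred+33 encoding), and filtering CCS reads on the number of passes. The record layout (dictionaries with the hypothetical keys "name" and "passes") and the threshold of 3 passes are assumptions made for illustration; in practice this filtering is done by the PacBio software.

```python
def has_placeholder_qualities(quality_string, offset=33):
    """True if every quality character decodes to Phred 0 ('!' with offset 33)."""
    return bool(quality_string) and all(
        ord(ch) - offset == 0 for ch in quality_string
    )


def filter_by_passes(reads, min_passes=3):
    """Keep reads supported by at least min_passes passes (hypothetical records)."""
    return [read for read in reads if read["passes"] >= min_passes]


print(has_placeholder_qualities("!!!!!!!!"))  # CLR-style placeholder qualities
print(has_placeholder_qualities("II5!"))      # contains real quality values

ccs_reads = [
    {"name": "read_a", "passes": 2},
    {"name": "read_b", "passes": 8},
]
print([read["name"] for read in filter_by_passes(ccs_reads)])
```

A check like has_placeholder_qualities is useful before quality trimming, since any quality-based trimmer would otherwise discard such reads entirely.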

Software availability

There are many tools available for doing QC and adapter trimming. The following paper, although not new, contains a good overview of the process and of the effect of some commonly used tools for Illumina data (Fabro et al., 2013). Note that all these tools can be used for both paired-end and mate-pair sequencing data. However, they usually do not account for the particularities of the mate-pair sequencing protocol and often discard more data than necessary. NxTrim is a trimming software optimized for mate-pair sequencing. For Nanopore and PacBio data there are fewer options available. Good starting points are tools such as NanoPack and Pauvre, which give information about the quality of the sequence data.

Note: it might not be necessary to do quality trimming and adapter removal in cases where mapping is the primary approach. The adapters are unlikely to match anything in the reference sequence, and mapping tools commonly take the quality score of each base into account and may leave low-quality regions out of the alignment.