Listeria monocytogenes analysis

The gram-positive bacterium Listeria monocytogenes is the causative agent of listeriosis, a foodborne disease (reviewed in Buchanan et al. 2017). Listeriosis can be particularly severe, and potentially deadly, in elderly and immunocompromised patients. In pregnant women, it can cause miscarriage or stillbirth. The delay between exposure and illness, and the possibility that contaminated products (e.g. frozen products) are consumed over a long period, make it challenging to identify epidemiological links between infection cases (e.g. Datta and Burall 2018).

Matle et al. (2020) reviewed the different aspects of listeriosis. They provide an overview of known L. monocytogenes virulence factors, as well as diagnostics and treatment options. The state-of-the-art dry-lab approaches employed in the study of L. monocytogenes are described in Luth et al. (2018).

Four evolutionary lineages have been identified in L. monocytogenes. These bacteria can be found in a variety of hosts and in environmental samples (Buchanan et al. 2017). The L. monocytogenes genome is approximately 3 Mb, with approximately 2,900 genes (Den Bakker et al. 2010). At least 13 serotypes have been identified (see Figure 1 in Ragon et al. 2008). Early studies showed that, despite this diversity, the majority of human infections are caused by isolates belonging to serotypes 1/2a, 1/2b and 4b (e.g. Burall et al. 2017). Serotyping is based on an antigen-antibody reaction using somatic (O) and flagellar (H) antigens. Although serotyping has traditionally been used to characterize L. monocytogenes isolates, it has gradually been replaced by molecular typing methods that provide enhanced discriminatory power and therefore represent a more suitable approach for epidemiological studies (see e.g. Datta and Burall 2018, which is not open access, and Matle et al. 2020, for an overview of the methods that have been employed for L. monocytogenes analyses).

Typing methods

L. monocytogenes molecular typing is a constantly evolving field. An ideal typing method offers not only high discriminatory power, but also high reproducibility and the possibility of automation. Nowadays, different techniques can be applied for L. monocytogenes molecular typing, namely:

  • Pulsed Field Gel Electrophoresis (PFGE) - PFGE is a restriction fragment length analysis (Dalmasso et al. 2014) that was long considered the “gold-standard” for L. monocytogenes typing because, in the pre-WGS era, it offered the best compromise between time and discriminatory power. PulseNet has used this method to connect cases of disease through the comparison of their DNA fingerprints, and consequently to identify potential outbreaks. Despite its robustness, PFGE is time-consuming, difficult to standardize (Van Walle et al. 2018) and does not always have sufficient discriminatory power for outbreak delineation. Even so, until the advent of NGS technologies it played an important role in L. monocytogenes surveillance and in the resolution of multiple outbreaks.
  • MLVA (Multiple-locus variable-number tandem repeat analysis) - Given the drawbacks of PFGE, other typing methods started being considered as alternatives, or at least complements, to PFGE analysis. MLVA is another DNA fingerprinting method. It has the advantage of discriminating between fast-evolving strains that may look identical by PFGE. However, it requires highly trained technicians and has no protocol standardized across pathogens. For this reason, PulseNet uses it as a complement to PFGE for some microorganisms, but not for L. monocytogenes, rather than as a routine typing method. MLVA is therefore not a standard method for Listeria surveillance, although the scientific community uses it to explore the diversity of these bacteria (e.g. Saleh-Lakha et al. 2013, Lunestad et al. 2013). Chenal-Francisque et al. (2013) compare the performance of MLVA with that of PFGE and MLST.
  • Multiplex-PCR for classifying 5 serogroups - Consisting of the amplification of 5 different genes (lmo0737, lmo1118, ORF2110, ORF2819 and prs), this method was developed in order to facilitate serotyping discrimination by quickly classifying L. monocytogenes into 5 serogroups (Borucki and Call 2003, Doumith et al. 2004, Matle et al. 2020). Nevertheless, despite being a quick method to implement, it has low discriminatory power, which makes it less suitable for outbreak detection and investigation.
  • MLST (Multi-Locus Sequence Typing) - DNA sequencing allows unambiguous identification of genetic differences by direct comparison of allele sequences between samples, and sequencing information can be easily shared between laboratories. Therefore, DNA sequencing provides a robust solution for molecular typing. In this context, a 7-gene MLST method (based on housekeeping genes) was developed for L. monocytogenes (Salcedo et al. 2003, Matle et al. 2020). Sequence types (ST) represent unique combinations of the MLST alleles. Clonal complexes (CC) are groups of STs in which each ST differs by no more than one allele from at least one other ST of the same CC (Ragon et al. 2008, Henri et al. 2016). A significant drawback of this method is that it requires multiple PCR reactions, which cannot be multiplexed.
  • Ribosomal multi-locus typing (rMLST) - rMLST has also been employed for strain characterisation (Jolley et al. 2012). This typing method has recently been employed for WGS data quality control of L. monocytogenes to detect potential intra-species contamination (admixture) of sequencing data (Low et al. 2019).
  • MVLST (Multi-virulence-locus sequence typing) - Similar to MLST, but considering a set of virulence (prfA, inlB, and inlC) and virulence-related genes (dal, lisR, and clpP), which has been shown to accurately differentiate epidemic clones (see Lomonaco et al. 2013, Cantinelli et al. 2013, Burall et al. 2017b).
  • WGS (Whole-Genome Sequencing) - With the advent of NGS technologies, WGS has improved the investigation of small listeriosis outbreaks and is currently regarded as the new “gold-standard” in the analysis of L. monocytogenes (Nadon et al. 2017). By providing information at the genomic level, WGS not only allows highly discriminatory typing (cgMLST, wgMLST and SNP-typing), but also provides backward compatibility with previously mentioned molecular typing methods, such as the 7-gene MLST, Multiplex-PCR, rMLST and MVLST, which, for this reason, will tend to continue to be used. Furthermore, it allows the analysis of specific genes, such as virulence factors and antimicrobial resistance genes. Genetic clustering using WGS can be performed on any distance measure (e.g. derived from allelic differences detected using cgMLST typing) or by evolutionary-model-based clustering (i.e. phylogenetics) relying on variant/SNP detection. PulseNet has been implementing WGS for Listeria surveillance and outbreak monitoring. Their results have shown that using WGS increases the number of outbreaks detected, and that earlier outbreak detection facilitates timely action, thus limiting the extent of outbreaks. Similar studies in the EU have confirmed those findings (Nielsen et al. 2017, Van Walle et al. 2018, Moura et al. 2016). For this reason, efforts have been made to make WGS widely used for Listeria surveillance, replacing PFGE and serotyping methods. In the near future, WGS-based Listeria surveillance is expected to be implemented in most developed countries.
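The sequence type and clonal complex definitions given in the MLST item above can be illustrated with a minimal sketch. The ST labels and allele numbers below are invented for illustration, and grouping by chained single-allele differences is one reading of the definition, not the procedure of any particular database.

```python
from itertools import combinations

# Hypothetical 7-locus allelic profiles (sequence type -> allele numbers).
# Labels and numbers are invented; they are not real L. monocytogenes STs.
PROFILES = {
    "ST_A": (3, 1, 1, 1, 3, 1, 3),
    "ST_B": (3, 1, 1, 1, 3, 1, 5),  # one allele away from ST_A
    "ST_C": (3, 1, 2, 1, 3, 1, 5),  # one allele away from ST_B
    "ST_D": (7, 6, 8, 8, 6, 2, 1),  # unrelated to the others
}


def allele_differences(profile_a, profile_b):
    """Count the loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def clonal_complexes(profiles, max_diff=1):
    """Group STs so that each ST is within max_diff alleles of another ST in its group."""
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent links to the representative of the group.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for st_a, st_b in combinations(profiles, 2):
        if allele_differences(profiles[st_a], profiles[st_b]) <= max_diff:
            parent[find(st_a)] = find(st_b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)
    return sorted(sorted(group) for group in groups.values())


print(clonal_complexes(PROFILES))
```

In this toy data ST_A and ST_C differ at two loci, yet they end up in the same complex because both are one allele away from ST_B.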

“One Health” surveillance and WGS of L. monocytogenes

The identification of infection sources is essential for outbreak resolution. Hence, an integrated analysis of clinical, food and veterinary samples relying on the concept of One Health is key to achieving a good surveillance system. As shown by the PulseNet network, the high discriminatory power of WGS increases the chances of finding the source of infection, and possibly reduces the time that it takes to identify it. Indeed, as reported by the WHO, the use of WGS on Listeria strains has resulted in more accurate detection of clusters and allowed more outbreaks to be successfully resolved. However, several factors are hindering the implementation of a generalized WGS-based surveillance system. For instance, while the notification of clinical cases of listeriosis is mandatory in most EU member states, most of the monitoring data on L. monocytogenes in animals and food are generated by monitoring schemes that are not harmonised across member states and for which mandatory reporting requirements do not exist (ECDC 2019). Moreover, WGS requires laboratories to be equipped with expensive technologies and to employ highly skilled technicians able to analyze the data, which many countries cannot afford. Furthermore, the resources for implementation of WGS differ between sectors (human health, animal health and food safety), thus complicating the implementation of “One Health” surveillance. Despite these issues, because L. monocytogenes was the bacterium selected to start implementing WGS at a European level, it is already some steps ahead of other pathogenic agents regarding WGS-based surveillance. Indeed, L. monocytogenes is currently the pathogen most frequently typed by WGS for surveillance and outbreak investigation in all sectors in Europe (ECDC et al. 2019), and according to the ECDC roadmap, the capacity of the member states to use WGS as a complement or replacement technology for PFGE is already significant and progressing fast.

WGS lab protocol

DNA extraction

Before DNA extraction, L. monocytogenes is cultured in the laboratory (usually in liquid media). For proper growth, these bacteria need a medium containing the seven amino acids for which they are auxotrophic (arginine, cysteine, glutamine, isoleucine, leucine, methionine, and valine) and four additional vitamins (biotin, riboflavin, thiamine, and thioctic acid). Brain Heart Infusion (BHI) is a nutrient-rich medium containing all these ingredients and is therefore the most commonly used medium for Listeria culture (see Jones and D’Orazio 2013 for more details). An overnight incubation at 37 °C in BHI is usually performed before DNA extraction. There is no standard methodology or kit for L. monocytogenes DNA extraction. However, commonly used kits include the DNeasy Blood and Tissue kit (Qiagen) and the Wizard Genomic DNA Purification kit (Promega).

Sequencing technology

There is no preferred WGS technology for sequencing L. monocytogenes. As in other fields, Illumina paired-end sequencing is the most commonly used strategy. Because of the number of samples that can be handled in a single run and the longer reads it can produce, the MiSeq seems to be the instrument of choice for many labs. Long-read sequencing technologies are now becoming more frequently used (alone and/or in combination with short-read sequencing), because of improvements in their error rates and price and because they offer a chance to improve or complete genome assemblies.

Bioinformatics protocol

Mapping or assembly

The first step after receiving the sequencing data for your samples is to evaluate the sequencing quality and to trim and clean the reads (see Data preprocessing).
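To make the idea of quality evaluation and trimming concrete, here is a minimal sketch. It assumes Phred+33 encoded FASTQ quality strings and a simple rule that removes low-quality bases from the 3' end only; the threshold of 20 is an illustrative choice, and dedicated preprocessing tools implement more elaborate strategies.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(char) - offset for char in quality_line]


def mean_quality(quality_line):
    """Return the mean Phred score of a read (0.0 for an empty read)."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores) if scores else 0.0


def trim_3prime(sequence, quality_line, min_quality=20):
    """Remove trailing bases whose quality is below min_quality."""
    scores = phred_scores(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
read_sequence, read_quality = "ACGTAC", "IIII##"
print(trim_3prime(read_sequence, read_quality))
```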

The cleaned sequence data can then be used for downstream analysis following one of two approaches (or both in parallel, see [Data production](../Pipelines/data_production.md)):

  • De novo genome assembly of the sample(s),
  • Read mapping of each sample on a reference sequence (obtained from a database or by de novo genome assembly of one of your samples).

Both approaches are commonly used for L. monocytogenes.

De novo genome assemblers that can be used for L. monocytogenes include SPAdes and Velvet. Both perform very well and are freely available. There are command-line pipelines, such as INNUca, which incorporate these programs and can automatically run all the analyses from quality control to genome assembly. If a platform with predefined pipelines (which usually does not require bioinformatics skills) is preferred, CLC Genomics Workbench and Ridom SeqSphere+ can be used for L. monocytogenes.

As for read mapping, BWA and Bowtie are often used in L. monocytogenes analysis. The Center for Food Safety and Applied Nutrition (CFSAN) of the US Food and Drug Administration developed the CFSAN SNP pipeline, which is tailored to create high-quality SNP matrices for sequences from closely related pathogens. This pipeline covers all the steps of the analysis, from read mapping to the calculation of SNP distances and reconstruction of the haplotypes. Therefore, despite requiring some bioinformatics skills, it may represent an alternative to developing your own pipeline, and it is commonly used for L. monocytogenes analysis (e.g. Hurley et al. 2019, Scaltriti et al. 2020, Portmann et al. 2018). As with genome assembly, a platform with predefined pipelines can be used for read mapping. In this case, Ridom SeqSphere+ is the most commonly used one.
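As a rough illustration of the two routes described above, the sketch below builds basic command lines for SPAdes (assembly) and BWA (mapping). It assumes both tools are installed and on the PATH; the file names are placeholders, only the most basic options are shown, and real workflows usually add further steps and parameters.

```python
import subprocess


def spades_command(reads_1, reads_2, out_dir, threads=4):
    """Build a basic paired-end SPAdes assembly command."""
    return ["spades.py", "-1", reads_1, "-2", reads_2, "-o", out_dir, "-t", str(threads)]


def bwa_commands(reference, reads_1, reads_2, threads=4):
    """Build the BWA commands to index a reference and map paired-end reads to it."""
    return [
        ["bwa", "index", reference],
        ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2],
    ]


def run(command, stdout_path=None):
    """Run one command; bwa mem writes SAM to stdout, so it can be redirected to a file."""
    if stdout_path is None:
        subprocess.run(command, check=True)
    else:
        with open(stdout_path, "w") as handle:
            subprocess.run(command, check=True, stdout=handle)


# Placeholder file names, printed rather than executed.
print(" ".join(spades_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample_assembly")))
for command in bwa_commands("reference.fasta", "sample_R1.fastq.gz", "sample_R2.fastq.gz"):
    print(" ".join(command))
```

Keeping command construction separate from execution makes it possible to inspect or log the exact commands before anything is run.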

Choosing a reference genome

Should an analysis require a reference genome, its choice is a crucial step. Analyses relying on read-mapping approaches might be strongly influenced by the choice of reference, as the genetic distance between the reference and the sample may influence the performance of downstream steps, namely SNP/INDEL calling (Pightling et al. 2014, Pightling et al. 2015). The reference can be picked from the samples at hand (after genome assembly) or from a public database. If a sample is used as the reference genome, studies on L. monocytogenes usually perform a preliminary analysis (e.g. hierarchical clustering based on some distance, such as Mash, ANI or allelic differences from, e.g., MLST analysis) and then select one strain from each cluster to use as reference. If instead a publicly available genome is used, the most widely used ones are L. monocytogenes strain EGD-e, which is the reference genome assembly in the NCBI database, strain CFSAN029793 (e.g. Ottesen et al. 2020, Chen et al. 2017), strain 08-5578 and strain HPB5622 (Pightling et al. 2014, Pightling et al. 2015). EGD-e is the reference strain for Lineage II and F2365 is the reference strain for Lineage I (Knudsen et al. 2017). If the objective is to discriminate between highly related samples that may or may not belong to a single outbreak, using one outbreak isolate as reference and combining multiple analysis approaches might maximize the resolution of your analyses (e.g. Chen et al. 2017).
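One possible way to select a strain from each cluster is sketched below. It assumes that cluster assignments and pairwise distances are already available, and it picks the isolate with the smallest total distance to the other members of its cluster (the medoid). The medoid rule, the isolate names and the distances are illustrative assumptions; the text above does not prescribe a selection criterion.

```python
def pick_references(clusters, distances):
    """Return, for each cluster, the member with the smallest summed distance to the others."""
    references = {}
    for cluster_id, members in clusters.items():

        def total_distance(candidate):
            return sum(
                distances[frozenset((candidate, other))]
                for other in members
                if other != candidate
            )

        # Sorting first makes ties resolve to the alphabetically first isolate.
        references[cluster_id] = min(sorted(members), key=total_distance)
    return references


# Hypothetical clusters and pairwise distances (e.g. allelic differences).
clusters = {
    "cluster_1": ["iso_A", "iso_B", "iso_C"],
    "cluster_2": ["iso_D", "iso_E"],
}
distances = {
    frozenset(("iso_A", "iso_B")): 3,
    frozenset(("iso_A", "iso_C")): 10,
    frozenset(("iso_B", "iso_C")): 5,
    frozenset(("iso_D", "iso_E")): 2,
}

print(pick_references(clusters, distances))
```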

Getting SNPs

How to detect SNPs is described earlier.

Briefly, there are three different approaches.

  • Perform de novo genome assembly of each sample, and then align their genomic sequences. Studies of L. monocytogenes very often use SPAdes, CLC Genomics Workbench or Ridom SeqSphere+ to obtain the assembly, and progressiveMauve, BLAST or MUSCLE to align the genomes. This multiple sequence alignment is the input for phylogenetic and clustering analysis (see sections on phylogeny and clustering). If, instead of the whole genome, only the SNPs in genes are of interest, alignments can be obtained with pangenome tools such as Roary or Panaroo.
  • Use a reference genome onto which the reads of all the samples are mapped, and then use a variant-calling pipeline to determine the polymorphic positions. Studies of L. monocytogenes mostly use BWA and Bowtie for read mapping, and GATK and VarScan for variant calling. Of note, many of them also use the CFSAN SNP pipeline for both processes.
  • Determine the polymorphic positions in the samples by analyzing k-mer patterns using kSNP. This approach needs either the genome assemblies or the cleaned reads. This is the least frequently used approach for L. monocytogenes.

Each of these approaches will provide information about the genetic variability in the dataset. This information can then be used to perform SNP-based clustering and phylogenetic analysis.
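Whichever approach is used, the result can usually be represented as aligned sequences from which variable positions and pairwise SNP distances are derived. The sketch below shows this on a toy alignment. The isolate names and sequences are invented, and ignoring any column or position containing a gap or ambiguous base is one common convention assumed here, not a rule stated in this document.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def variable_sites(alignment):
    """Return 0-based alignment columns where unambiguous bases differ between samples."""
    sequences = [seq.upper() for seq in alignment.values()]
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for position in range(length):
        column = {seq[position] for seq in sequences}
        # Skip columns with gaps or ambiguity codes; keep columns with >1 base.
        if column <= VALID_BASES and len(column) > 1:
            sites.append(position)
    return sites


def snp_distance(seq_a, seq_b):
    """Count positions where both sequences have an unambiguous base and differ."""
    return sum(
        1
        for base_a, base_b in zip(seq_a.upper(), seq_b.upper())
        if base_a in VALID_BASES and base_b in VALID_BASES and base_a != base_b
    )


def snp_matrix(alignment):
    """Return pairwise SNP distances keyed by (sample_a, sample_b)."""
    return {
        (a, b): snp_distance(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


# Toy alignment with invented sequences.
alignment = {
    "iso_1": "ACGTACGTAC",
    "iso_2": "ACGTACGAAC",
    "iso_3": "ACCTAC-AAC",
}

print(variable_sites(alignment))
print(snp_matrix(alignment))
```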

Getting alleles and allele differences

The allele sequences in the samples at hand can be retrieved by:

  • Replacing the nucleotides of the reference genome with the observed alternative alleles (see the previous section), and then retrieving the sequence of each gene of interest using the genome annotation of the reference.
  • Obtaining the de novo genome assembly of each sample, and then either:
    • Annotating the genome. Prokka is a commonly used program for L. monocytogenes genome annotation. BLAST or MUSCLE can be used to align the predicted genes to the set of genes of interest and identify homology relations. Alternatively, a less commonly used approach in the study of L. monocytogenes genomes is to use a program like eggNOG-mapper to perform functional annotation.
    • Using BLAST or MUSCLE to align a set of genes of interest on the genome assembly and identify the respective homologs.
  • Using an allele caller, such as chewBBACA, that provides locus-specific alignments in an automated manner, which makes it a good option to determine the allelic profile of samples.

It is important to note that nowadays there are several platforms which can automatically do all this analysis. Several of these are mentioned in the xMLST section.
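Once the sequence of each locus has been retrieved, turning it into an allele number is conceptually a lookup against the known alleles of a scheme. The sketch below illustrates that idea only. The locus names, sequences and numbering are invented, matching is by exact sequence identity, and new sequences simply receive the next free number; real allele callers also perform similarity searches and validate coding sequences, which is not modelled here.

```python
def call_alleles(sample_genes, scheme):
    """Return {locus: allele number} for one sample, registering unseen alleles.

    scheme maps each locus to {allele sequence: allele number} and is updated
    in place when a new allele is found. Missing loci are reported as None.
    """
    profile = {}
    for locus, known_alleles in scheme.items():
        sequence = sample_genes.get(locus)
        if sequence is None:
            profile[locus] = None
            continue
        sequence = sequence.upper()
        if sequence not in known_alleles:
            known_alleles[sequence] = max(known_alleles.values(), default=0) + 1
        profile[locus] = known_alleles[sequence]
    return profile


# Hypothetical two-locus scheme and one sample.
scheme = {
    "locus_1": {"ATGAAATAA": 1, "ATGAAGTAA": 2},
    "locus_2": {"ATGCCCTAA": 1},
}
sample = {"locus_1": "atgaagtaa", "locus_2": "ATGCCTTAA"}

profile = call_alleles(sample, scheme)
print(profile)
```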

Allele-based typing

Allele-based typing consists of clustering samples according to the different alleles present in a population for a given set of genes (e.g. the core genome). With the advent of WGS, the 7-locus MLST approach was broadened into cgMLST and wgMLST approaches. In this context, two public cgMLST schemes have been widely used for allele-based L. monocytogenes analysis: a 1,701-locus scheme proposed by Ruppitsch et al. (2015) and a 1,748-locus scheme proposed by Moura et al. (2016). Although a standardized approach for cgMLST analysis would be ideal, in practice both schemes seem to work well and provide similar results (Van Walle et al. 2018), and neither is preferred. Nevertheless, automated platforms usually offer only a single scheme. For example, BIGSdb uses the cgMLST scheme proposed by Moura et al. (2016), while Ridom SeqSphere+ uses the scheme proposed by Ruppitsch et al. (2015) (available at https://www.cgmlst.org/ncs). Besides cgMLST, some studies also perform an MVLST analysis, considering a set of virulence-related genes, which has been shown to accurately differentiate epidemic clones (see Lomonaco et al. 2013).

Platforms available for cgMLST typing of L. monocytogenes include BIGSdb (Jolley & Maiden 2010), BioNumerics, CGE, IRIDA, Pathogenwatch, and Ridom SeqSphere+. However, cgMLST analysis can also be done outside a platform with software such as chewBBACA and MentaLiST.
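Whatever tool produces the allelic profiles, the quantity compared between isolates is the number of allelic differences. A minimal sketch follows, assuming profiles are dictionaries of locus to allele number and that loci missing in either isolate are ignored in the pairwise comparison. That handling of missing data is an assumption, and platforms differ on this point; the profiles are invented.

```python
from itertools import combinations


def allelic_distance(profile_a, profile_b):
    """Count loci called in both profiles at which the allele numbers differ."""
    differences = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        if allele_a is None or allele_b is None:
            continue  # locus not called in one of the two isolates
        if allele_a != allele_b:
            differences += 1
    return differences


def distance_matrix(profiles):
    """Return pairwise allelic distances keyed by (isolate_a, isolate_b)."""
    return {
        (a, b): allelic_distance(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical five-locus profiles; None marks a locus that was not called.
profiles = {
    "iso_1": {"l1": 1, "l2": 1, "l3": 1, "l4": 1, "l5": 1},
    "iso_2": {"l1": 1, "l2": 2, "l3": 1, "l4": None, "l5": 1},
    "iso_3": {"l1": 3, "l2": 2, "l3": 1, "l4": 4, "l5": 1},
}

print(distance_matrix(profiles))
```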

SNP-based typing

A SNP-based approach relies on the comparison of SNPs in a population. This strategy can be seen as an alternative to the allele-based approach, but many studies actually perform both and assess the overlap of the results. For a SNP-based analysis, all of the SNPs present in the samples need to be identified and used to obtain clustering information. Examples of publicly available pipelines for SNP-based typing are:

  • Center for Food Safety and Applied Nutrition (CFSAN) HqSNPs pipeline
  • Lyve-SET pipeline for HqSNPs typing
  • SNVPhyl (Public Health Agency of Canada)
  • PHEnix (The Public Health England SNP calling pipeline)

Outbreak definition

As defined by the World Health Organization, “a disease outbreak is the occurrence of cases of disease in excess of what would normally be expected in a defined community, geographical area or season”. For foodborne diseases, outbreaks can be defined as two or more cases linked to the same food source (Hoezler et al. 2018). Henri et al. (2017) showed that clustering of a diverse dataset of L. monocytogenes isolates of food origin using three different approaches (cgMLST, wgMLST and SNP-based phylogeny) was highly concordant.

Regarding the interpretation of WGS data and the respective genetic clusters, in the particular case of L. monocytogenes, some thresholds appear to be more commonly used to define a cluster/outbreak. They are:

  • ≤4 or ≤7 cgMLST allelic differences (AD) considering any of the above-mentioned cgMLST schemes (Van Walle et al. 2018). Cabal et al. (2019) also considered clusters of ≤10 AD.
  • Using the PHE SnapperDB pipeline, Nielsen et al. (2017) found that the maximum pairwise HqSNP distance between outbreak isolates was < 5 SNPs in 5 of the 9 outbreaks studied, and between 8 and 21 HqSNPs in the remaining 4.
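The effect of the allelic difference thresholds listed above can be explored with a small sketch that groups isolates whenever they are connected by a chain of pairwise distances at or below the threshold. Single-linkage grouping is an assumption made here for illustration, since the cited studies may define clusters differently, and the distances are invented.

```python
def single_linkage_clusters(isolates, distances, threshold):
    """Group isolates linked, directly or through a chain, by distances <= threshold."""
    neighbours = {isolate: set() for isolate in isolates}
    for (iso_a, iso_b), distance in distances.items():
        if distance <= threshold:
            neighbours[iso_a].add(iso_b)
            neighbours[iso_b].add(iso_a)

    clusters, assigned = [], set()
    for isolate in sorted(isolates):
        if isolate in assigned:
            continue
        # Depth-first walk over the linked isolates.
        component, stack = set(), [isolate]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        assigned |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairwise cgMLST allelic differences.
isolates = ["iso_1", "iso_2", "iso_3", "iso_4"]
distances = {
    ("iso_1", "iso_2"): 3,
    ("iso_2", "iso_3"): 6,
    ("iso_1", "iso_3"): 8,
    ("iso_3", "iso_4"): 10,
    ("iso_1", "iso_4"): 15,
    ("iso_2", "iso_4"): 12,
}

for threshold in (4, 7, 10):
    print(threshold, single_linkage_clusters(isolates, distances, threshold))
```

With this toy data the same four isolates form three, two or one cluster depending on the threshold.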

However, although cutoff thresholds (e.g. allelic differences for cgMLST, or pairwise SNP differences between isolates belonging to a single cluster) are commonly used, they depend on your workflow. For example, Chen et al. (2017) demonstrated that pairwise differences in SNP/allele counts were not by themselves a necessary and sufficient criterion to distinguish cheese outbreak isolates from related non-outbreak isolates. Moreover, thresholds are not directly transposable between studies; it is therefore good practice to assess the sensitivity and specificity of the workflow when evaluating which threshold could be appropriate.
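One way to assess the sensitivity and specificity of a threshold is to compare it against isolate pairs whose epidemiological relationship is already known. The sketch below assumes such a labelled set exists; the distances and labels are invented, and a pair is called "linked" when its distance is at or below the threshold.

```python
def threshold_performance(pairs, threshold):
    """Return (sensitivity, specificity) of calling pairs linked when distance <= threshold.

    pairs is an iterable of (distance, same_outbreak) tuples, where same_outbreak
    is the epidemiologically established truth for that pair of isolates.
    """
    true_pos = false_pos = true_neg = false_neg = 0
    for distance, same_outbreak in pairs:
        predicted_linked = distance <= threshold
        if predicted_linked and same_outbreak:
            true_pos += 1
        elif predicted_linked and not same_outbreak:
            false_pos += 1
        elif not predicted_linked and same_outbreak:
            false_neg += 1
        else:
            true_neg += 1
    sensitivity = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    specificity = true_neg / (true_neg + false_pos) if (true_neg + false_pos) else None
    return sensitivity, specificity


# Hypothetical labelled pairs: (pairwise distance, belongs to the same outbreak).
labelled_pairs = [
    (2, True), (5, True), (9, True),
    (6, False), (15, False), (30, False),
]

for threshold in (4, 7, 10):
    print(threshold, threshold_performance(labelled_pairs, threshold))
```

In this invented example no threshold achieves both perfect sensitivity and perfect specificity, which mirrors the point that a distance cutoff alone may not separate outbreak from non-outbreak isolates.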

Virulence and AMR

Several genes are important for the ability of Listeria to cause infection and are medically relevant, such as internalins (inlA, inlB, inlF, or inlJ, essential for adhesion and invasion) or the prfA-regulated virulence gene cluster (pVGC) (Vázquez-Boland et al. 2001, Ward et al. 2004, Poimenidou et al. 2018). Listeria are naturally susceptible to penicillin, ampicillin, amoxicillin, gentamicin, erythromycin, tetracycline, rifampicin, co-trimoxazole, vancomycin and imipenem (Gómez et al. 2014, Byrne et al. 2016). Nevertheless, antimicrobial resistance to one or several of these compounds has been reported (e.g. Boháčová et al. 2018, Escolar et al. 2017, Kevenk et al. 2015). For this reason, monitoring virulence- and antimicrobial resistance-related genes is highly relevant for determining the best course of action in the presence of a case of infection or even an outbreak. As mentioned in the Virulence and AMR detection section, where more details can be found, this is done by comparing the genome to a database comprising a set of genes of interest. Examples of predefined resistome databases are mentioned in the same section.