Escherichia coli analysis

Escherichia coli are Gram-negative bacteria that may reside in the intestinal tract of most warm-blooded animals, contributing to a healthy microbiota. However, some of these bacteria are pathogenic and may be transmitted through contaminated water or food. E. coli are divided into six different pathotypes, of which phage-encoded Shiga toxin-producing E. coli (STEC), also known as Verocytotoxin-producing E. coli (VTEC), are the ones most commonly associated with foodborne outbreaks (CDC). Indeed, Shiga toxins (Stx) are thought to be the key virulence factors in STEC infections (Gyles, 2007). STEC is the third most relevant human foodborne bacterial pathogen, just behind Campylobacter and Salmonella (EFSA 2019). Amézquita-López et al. (2018) review the possible routes of STEC transmission, as well as STEC classification, virulence factors and antimicrobial resistance.

Considering the relevance of STEC for human health, different methods have been applied to determine their diversity and to associate this diversity with pathogenic traits. E. coli serotyping is based on somatic (O) and flagellar (H) antigens, and so far more than 400 STEC serotypes have been identified (Amézquita-López et al. 2018). Moreover, these serotypes are also divided into seropathotypes (from A to E), according to their association with outbreaks and hemolytic-uremic syndrome (Karmali et al. 2003). The STEC O157:H7 serotype belongs to seropathotype A and is responsible for the majority of outbreaks. For this reason, it is the main focus of many studies (Amézquita-López et al. 2018). However, in recent years the epidemiology of this disease has been shifting, with an increasing number of non-O157:H7 STEC infections (Shen et al. 2015, Lang et al. 2019). As with other species, STEC serotyping can be time-consuming and has limited discriminatory power for epidemiological studies. Therefore, molecular typing methods have been developed and are also used to assess STEC diversity.

Typing methods

STEC molecular typing is an evolving field in which better typing methods are continually sought. A good typing method is not only highly discriminatory, but also reproducible and amenable to automation. STEC molecular typing can be performed through:

  • PFGE (Pulsed-Field Gel Electrophoresis) - PFGE is a restriction fragment length analysis that was long considered the most discriminatory typing method for STEC in the pre-WGS era (Amézquita-López et al. 2018). It has been the “gold standard” for the PulseNet network, and has been used by public health authorities and food regulators for outbreak investigations. Several studies have suggested that combining PFGE with other typing methods may increase the discriminatory power and help to determine the sources of outbreak infections (Amézquita-López et al. 2018).
  • MLVA (Multiple-Locus Variable-Number Tandem Repeat Analysis) - MLVA is a PCR-based typing method and was the second major typing tool used by the PulseNet network before WGS. This method is fast and may also be able to differentiate fast-evolving bacteria with a similar PFGE profile. Therefore, MLVA has been used to complement PFGE results, providing a useful resource during outbreaks (Parsons et al. 2016).
  • MLST (Multi-Locus Sequence Typing) - As for other bacteria, MLST methods based on seven loci have been developed for E. coli. Two protocols have been established: one specifically developed for STEC (aspC, clpX, fadD, icdA, lysP, mdh and uidA; STEC center) and a more general one for E. coli (adk, fumC, gyrB, icd, mdh, recA and purA; Wirth et al. 2006). MLST can provide faster results than PFGE, and it is highly reproducible.
  • WGS (Whole-Genome Sequencing) - With the advent of NGS technologies, WGS was shown to be useful for STEC outbreak investigation (Parsons et al. 2016). By providing information at the genomic level, WGS not only allows highly discriminatory typing (cgMLST, wgMLST and SNP typing), but also maintains backward compatibility with the previously mentioned typing methods, for example through in silico serotyping and 7-loci MLST. For this reason, these older methods will likely continue to be used. Furthermore, WGS allows the analysis of specific genes, such as virulence factors and antimicrobial resistance genes. Genetic clustering using WGS can be based on any distance measure (e.g. allelic differences detected by cgMLST typing) or on evolutionary model-based clustering (i.e. phylogenetics) relying on variant/SNP detection. The PulseNet network is making efforts to implement WGS as a routine tool to replace PFGE and MLVA.
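
As a minimal sketch of how a seven-locus MLST result is turned into a sequence type (ST), the Python snippet below looks up an allelic profile in a profile table. The locus order follows the general E. coli scheme listed above; the function name, the allele numbers and the ST values are made up for illustration and do not come from any real MLST database.

```python
from typing import Dict, Optional, Tuple

# Loci of the general E. coli scheme, in the order given in the text.
LOCI = ("adk", "fumC", "gyrB", "icd", "mdh", "recA", "purA")

def sequence_type(profile: Dict[str, int],
                  table: Dict[Tuple[int, ...], int]) -> Optional[int]:
    """Return the ST whose allelic profile matches exactly, or None.

    Raises KeyError if the profile lacks one of the seven loci.
    """
    key = tuple(profile[locus] for locus in LOCI)
    return table.get(key)

# Hypothetical profile table: allele numbers per locus -> ST.
toy_table = {
    (1, 1, 1, 1, 1, 1, 1): 1001,
    (1, 1, 2, 1, 1, 1, 1): 1002,
}

isolate = {"adk": 1, "fumC": 1, "gyrB": 2, "icd": 1,
           "mdh": 1, "recA": 1, "purA": 1}
print(sequence_type(isolate, toy_table))  # 1002
```

An isolate with an allele combination that is absent from the table returns None, which in practice would correspond to a novel ST that has to be submitted to the scheme curators.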

“One Health” surveillance and WGS of STEC

The identification of infection sources is essential for outbreak monitoring. Hence, an integrated analysis of clinical, food and veterinary samples, relying on the concept of One Health, is key to achieving a good surveillance system. As shown by the PulseNet network, the high discriminatory power of WGS increases the chances of finding the bacterial source of infection, and possibly reduces the time it takes. Indeed, WGS analysis has proven to be an effective way to determine the genetic clustering of STEC isolates, as well as the source of infections (Parsons et al. 2016, Jenkins et al. 2019, Nouws et al. 2020, Joensen et al. 2014, Chattaway et al. 2016). For instance, WGS-based STEC surveillance has been implemented with success in England and Denmark (Parsons et al. 2016, Dallman et al. 2021). However, these efforts have mainly focused on STEC from patients. At the EU level, it has been proposed that WGS-based STEC surveillance be delayed until the technological transition has been made for listeriosis (ECDC roadmap).

WGS lab protocol

DNA extraction

Before DNA extraction, STEC is cultured in the laboratory. Commonly used media for STEC include tryptic soy broth, E. coli broth and buffered peptone water (Amézquita-López et al. 2018), as well as more specific growth media. Regarding DNA extraction, there is no standard protocol or kit, but a protocol designed for Gram-negative bacteria is recommended.

Sequencing technology

There is no preferred WGS technology for sequencing STEC. As in other fields, Illumina paired-end sequencing is the most commonly used strategy. Because of the number of samples that can be handled in a single run and the longer reads they can produce, MiSeq machines seem to be the choice of the majority of labs.

Bioinformatics protocol

Mapping or assembly

The first step after receiving the sequencing data is to evaluate the sequencing quality and to trim and clean the reads (see Data preprocessing).

The cleaned sequence data can then be used for downstream analysis following one of two approaches (or both in parallel, see Data production):

  • De novo genome assembly of the sample(s),
  • Read mapping of each sample on a reference sequence (obtained from a database or by de novo genome assembly of one of your samples).

It is important to note that both approaches have advantages and disadvantages. The decision on which of them to follow should be made according to what is more appropriate for the data at hand and for the purpose of the analyses. De novo genome assembly of all sequenced isolates, followed by their annotation, seems to be a common approach in studies including STEC genomes. A commonly used de novo genome assembler for STEC is SPAdes (Iramiot et al. 2020, Reid et al. 2020, Sonda et al. 2018). It performs very well and is freely available. There are command-line pipelines, such as INNUca, which incorporate such programs and can automatically perform all the analyses from quality control to genome assembly. If a platform with predefined pipelines (which usually does not require bioinformatics skills) is preferred, EnteroBase is available for E. coli. For read mapping, BWA is a commonly used tool (Holmes et al. 2015, Iramiot et al. 2020, Parsons et al. 2016, Dallman et al. 2021). However, as mentioned before, these are commonly used approaches, not recommendations. Other methodologies, pipelines or platforms may also be used.

Choosing a reference genome

Should an analysis require a reference genome, its choice is a crucial step. Analyses relying on read-mapping approaches might be strongly influenced by the choice of reference, as the genetic distance between the reference and the sample may influence the performance of downstream steps, namely SNP/INDEL calling (Pightling et al. 2014, Pightling et al. 2015). The reference can be picked from the samples (after genome assembly) or from a public database. EnteroBase is a good resource for choosing a reference for this species.

Serotyping

Besides the wet-lab approach for serotype determination of STEC samples, in silico approaches using WGS data can also be applied (Joensen et al. 2015, Ingle et al. 2016). SRST2 can be used to determine the serotype without the need for de novo genome assembly, by comparing the genomic reads directly to the database (Ingle et al. 2016). SerotypeFinder is another alternative for in silico determination of the E. coli serotype, requiring sequencing reads or a genome assembly as input. BioNumerics (using the database from SerotypeFinder) and EnteroBase are examples of platforms where this function is available.

Getting SNPs

Analysis of SNPs is a frequently used approach for the analysis of STEC samples (Parsons et al. 2016).

SNP detection is described in more detail in an earlier section. Briefly, there are three different approaches:

  • Perform de novo genome assembly of each sample and then align their genomic sequences.
  • Map the reads of all samples to a reference genome, and then use a variant-calling pipeline to determine the polymorphic positions. The CFSAN SNP Pipeline is commonly used and performs both processes (read mapping and variant calling). Snippy and SNVPhyl are also commonly used alternatives for STEC analyses.
  • Determine the polymorphic positions in the sample by analyzing the k-mer pattern using kSNP. For this approach either the genome assembly or the genomic reads must be provided. This is not a commonly used approach for STEC analyses.
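
To illustrate what "polymorphic positions" means in the first two approaches, the following Python sketch scans a set of already aligned sequences and reports the columns where isolates differ, together with a pairwise SNP distance. It is only a toy illustration: the function names and the example sequences are assumptions made here, and real pipelines additionally filter on read depth, mapping quality and recombinant or repetitive regions.

```python
from typing import Dict, List

VALID = set("ACGT")  # gaps and ambiguous bases (e.g. '-', 'N') are ignored

def polymorphic_positions(alignment: Dict[str, str]) -> List[int]:
    """Return 0-based alignment columns where at least two different
    unambiguous bases are observed."""
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    positions = []
    for i in range(length):
        bases = {s[i] for s in seqs} & VALID
        if len(bases) > 1:
            positions.append(i)
    return positions

def snp_distance(a: str, b: str) -> int:
    """Count columns where both sequences carry a different unambiguous base."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a.upper(), b.upper())
               if x in VALID and y in VALID and x != y)

# Made-up aligned sequences for three hypothetical isolates.
toy = {"iso1": "ACGTACGT", "iso2": "ACGTACGA", "iso3": "ACCTAC-A"}
print(polymorphic_positions(toy))             # [2, 7]
print(snp_distance(toy["iso1"], toy["iso3"]))  # 2
```

Column 6 is not reported because the only difference there is a gap, which this sketch treats as missing data rather than as a variant.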

Getting alleles and allele differences

The allele sequences of the samples can be retrieved by:

  • Replacing the nucleotides of the reference genome with the observed alternative alleles, and then retrieving the sequence of each gene of interest according to the genome annotation of the reference.
  • Obtaining the de novo genome assembly of each sample, and performing the respective genome annotation. Prokka is a commonly used annotation program for STEC.
  • Using an allele caller, such as chewBBACA, which provides locus-specific alignments in an automated manner and is a good option for determining the allelic profile of samples.
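
Once the sequence of each locus has been retrieved, it has to be converted into an allele number. The Python sketch below shows the basic idea under simplifying assumptions: identical sequences receive the same number, unseen sequences receive the next free number, and absent loci are recorded as missing. The function name `call_alleles`, the data layout and the short sequences are hypothetical, and this is not how chewBBACA or any specific platform is implemented (real callers also validate coding sequences and handle partial or paralogous hits).

```python
from typing import Dict, Optional

def call_alleles(sample_loci: Dict[str, str],
                 known: Dict[str, Dict[str, int]]) -> Dict[str, Optional[int]]:
    """Assign an allele number per locus by exact sequence match.

    `known` maps locus -> {allele sequence -> allele number} and is
    updated in place when a new allele is observed.
    """
    profile: Dict[str, Optional[int]] = {}
    for locus, alleles in known.items():
        seq = sample_loci.get(locus)
        if seq is None:
            profile[locus] = None  # locus not found in this sample
            continue
        seq = seq.upper()
        if seq not in alleles:
            # New allele: give it the next free number for this locus.
            alleles[seq] = max(alleles.values(), default=0) + 1
        profile[locus] = alleles[seq]
    return profile

# Toy scheme with two loci and made-up sequences.
scheme = {"adk": {"ATGC": 1}, "fumC": {}}
print(call_alleles({"adk": "ATGC", "fumC": "GGTA"}, scheme))  # {'adk': 1, 'fumC': 1}
print(call_alleles({"adk": "ATGA"}, scheme))                  # {'adk': 2, 'fumC': None}
```

Because numbers are assigned in order of first observation, two databases that receive the same alleles in a different order will number them differently, even when the underlying scheme is identical.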

It is important to note that there are now several platforms that can perform this entire analysis automatically. Two of the more commonly used for E. coli are EnteroBase and BioNumerics. These platforms provide assembly, serotyping and allele calling. Several of these platforms are mentioned in the xMLST section.

Allele based typing

Allele-based typing consists of deriving clustering information from the different alleles present in a population for a given set of genes (e.g. the core genome). With the advent of WGS, the 7-loci MLST approach was broadened to cgMLST or wgMLST approaches. In this context, there is a public cgMLST scheme which has been used for allele-based STEC analysis. This scheme comprises 2,513 loci and is available in the most commonly used platforms, such as EnteroBase and Ridom SeqSphere+. Notably, although the scheme used by the platforms is the same, their allele calling is independent, and therefore there may be some nomenclature incompatibilities between the different platforms.
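
The clustering step usually starts from a pairwise distance between allelic profiles. A minimal Python sketch of such a distance is given below, assuming that each profile is a mapping from locus name to allele number and that missing loci are stored as None and skipped in the comparison. The locus names, allele numbers and function names are invented for the example; platforms differ in how they treat missing data.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

Profile = Dict[str, Optional[int]]

def allelic_distance(p: Profile, q: Profile) -> int:
    """Number of loci called in both profiles that carry different alleles."""
    return sum(
        1
        for locus, allele in p.items()
        if allele is not None
        and q.get(locus) is not None
        and allele != q[locus]
    )

def distance_matrix(profiles: Dict[str, Profile]) -> Dict[Tuple[str, str], int]:
    """Pairwise allelic distances, keyed by (isolate_a, isolate_b)."""
    return {
        (a, b): allelic_distance(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }

# Hypothetical four-locus profiles (a real cgMLST scheme has thousands of loci).
toy_profiles = {
    "iso1": {"l1": 1, "l2": 4, "l3": 2, "l4": None},
    "iso2": {"l1": 1, "l2": 5, "l3": 2, "l4": 7},
    "iso3": {"l1": 3, "l2": 5, "l3": None, "l4": 7},
}
print(distance_matrix(toy_profiles))
# {('iso1', 'iso2'): 1, ('iso1', 'iso3'): 2, ('iso2', 'iso3'): 1}
```

The resulting distances can then be passed to any clustering or tree-building method, for example a minimum spanning tree.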

SNP based typing

A SNP-based approach relies on the comparison of SNPs in a population. This strategy can be seen as an alternative to the allele-based approach, but many studies actually perform both and assess the overlap of the results. Although WGS-based typing follows an allele-based approach for the majority of important bacterial pathogens, SNP-based typing is frequently used for STEC. For instance, Public Health England has long performed WGS-based STEC surveillance following a well-established pipeline (PHEnix) for surveillance and outbreak detection (Dallman et al. 2021, Dallman et al. 2015). This pipeline relies mostly on variant calling with GATK after read mapping with BWA-MEM, followed by clustering analysis with SnapperDB.

Examples of other available pipelines for SNP-based typing are:

  • Center for Food Safety and Applied Nutrition (CFSAN) HqSNPs pipeline
  • Lyve-SET pipeline for HqSNPs typing
  • SNVPhyl (Public Health Agency of Canada)
  • PHEnix (The Public Health England SNP calling pipeline)

Outbreak definition

As defined by the World Health Organization, “a disease outbreak is the occurrence of cases of disease in excess of what would normally be expected in a defined community, geographical area or season”. WGS data provide high discriminatory power, allowing isolates from different geographical areas and from clinical, animal or environmental sources to be clustered according to their genomic similarity. This contributes not only to earlier detection of outbreaks and determination of contamination sources, but also to the detection of more outbreaks, as has been reported by the PulseNet network for Listeria. It is difficult to establish a clear definition of an outbreak cluster, i.e. a threshold at which two isolates are considered to belong to the same genetic cluster, thus linking two cases of infection. Previous studies have shown that outbreak-related isolates differ by up to five SNPs across the whole genome, and therefore this is a commonly used threshold to define an outbreak-related cluster (Dallman et al. 2021, Holmes et al. 2018, Dallman et al. 2015).
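
The five-SNP threshold can be applied to a pairwise SNP distance table as sketched below in Python. The sketch assumes single-linkage clustering, meaning that an isolate joins a cluster if it is within the threshold of at least one member; this linkage rule, the function name and the distances are assumptions made for illustration and are not taken from a specific pipeline.

```python
from typing import Dict, List, Tuple

def single_linkage_clusters(isolates: List[str],
                            distances: Dict[Tuple[str, str], int],
                            threshold: int = 5) -> List[List[str]]:
    """Group isolates connected by chains of pairs with distance <= threshold."""
    parent = {i: i for i in isolates}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for i in isolates:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))

# Made-up pairwise SNP distances between four hypothetical isolates.
toy_dist = {("A", "B"): 2, ("B", "C"): 5, ("A", "C"): 7,
            ("A", "D"): 40, ("B", "D"): 38, ("C", "D"): 41}
print(single_linkage_clusters(["A", "B", "C", "D"], toy_dist))
# [['A', 'B', 'C'], ['D']]
```

Note that A and C end up in the same cluster although they are seven SNPs apart, because both are within five SNPs of B. Whether such chaining is desirable depends on the surveillance question, which is one reason why a fixed threshold should be interpreted together with epidemiological data.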

Virulence and AMR

Several genes are important for the ability of E. coli to cause infection and are medically relevant, and many of these are associated with different pathogroups. Relevant virulence-associated genes for STEC include the different stx subtypes (e.g. stx1a, stx2a, stx2d) and other genes such as eae and aggR (ref), whereas for extraintestinal pathogenic E. coli (ExPEC) other virulence genes, such as pap, fimH, sfa, iha, hlyA, cnf1 or sat, are of importance (e.g. Hung et al. 2019, Wang et al. 2009, Rodríguez-Villodres et al. 2019). Because stx subtypes can be highly similar, a specific database associated with VirulenceFinder has been created. Natural evolution, horizontal transfer of antimicrobial resistance elements and the use of antibiotics have all contributed to the emergence of multi-drug resistant isolates, which are increasingly observed and have become a worrying issue (Poirel et al. 2018). Of particular concern is the acquisition of genes conferring resistance to broad-spectrum cephalosporins, carbapenems, aminoglycosides and (fluoro)quinolones (Poirel et al. 2018). For this reason, monitoring of virulence- and antimicrobial resistance-related genes is highly relevant for determining the best course of action when facing a case of infection or an outbreak. As mentioned in the Virulence and AMR detection section, where more details can be found, this is done by comparing the genome to a database comprising a set of genes of interest. Examples of predefined resistome databases are mentioned in the same section.