Salmonella analysis

Gram-negative bacteria of the genus Salmonella are a major cause of foodborne illness. Two Salmonella species have been identified: Salmonella enterica and Salmonella bongori. Despite harboring only two species, the genus is divided into several subspecies and then further into different serotypes. Isolates are often reported by the name of the genus followed by the name of the serotype, without mentioning the species or subspecies name (Eng et al. 2015). Moreover, Salmonella isolates are usually classified as typhoidal or non-typhoidal, according to whether they cause typhoid or paratyphoid fever, or salmonellosis, respectively.

Salmonella serotyping is performed using the White-Kauffmann-Le Minor scheme (Grimont and Weill 2007, Guibourdenche et al. 2010), which uses somatic (O), flagellar (H), and capsular (Vi) antigens. This is one of the “gold standards” for Salmonella classification and is widely used in outbreak, surveillance and epidemiological studies. So far, more than 2,500 serotypes have been identified, and many of them seem to be particularly associated with certain niches (CDC). Thus, serotyping may guide public authorities during outbreak investigations. Nevertheless, a small number of globally distributed serotypes are responsible for the majority of outbreaks, and in these cases serotyping does not provide enough resolution. Moreover, the existence of so many serotypes requires laboratories to maintain a large stock of high-quality typing antisera and antigens for conventional serotyping of Salmonella. In this context, molecular typing methods have acquired a key role in Salmonella surveillance and outbreak investigation.
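As a small illustration of how an antigenic formula from this scheme is structured, the Python sketch below splits a formula of the form O:H1:H2 into its somatic and flagellar parts and looks it up in a two-entry table. The function names and the table are assumptions made for illustration; the table is only an excerpt and not a substitute for the full scheme, and the capsular (Vi) antigen is not handled.

```python
# Two well-known formulas, used here only as an illustrative excerpt of the scheme.
EXAMPLE_SEROTYPES = {
    "1,4,[5],12:i:1,2": "Typhimurium",
    "1,9,12:g,m:-": "Enteritidis",
}


def parse_antigenic_formula(formula):
    """Split 'O:H1:H2' into lists of antigen factors; '-' means the phase is absent."""
    parts = formula.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected O:H1:H2, got {formula!r}")
    o_antigens, h1, h2 = ([] if part == "-" else part.split(",") for part in parts)
    return {"O": o_antigens, "H1": h1, "H2": h2}


def lookup_serotype(formula, table=EXAMPLE_SEROTYPES):
    """Return the serotype name for a formula, or None if it is not in the table."""
    return table.get(formula.strip())
```

A formula that is absent from the table simply returns None, which mirrors the situation where an isolate cannot be assigned to a named serotype.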

Typing methods

Salmonella molecular typing can be performed through:

  • Pulsed Field Gel Electrophoresis (PFGE) - PFGE is a restriction fragment length analysis that has long been considered one of the “gold standards” for Salmonella typing, together with serotyping, due to its relatively high discriminatory power. It was until recently the “gold standard” of the PulseNet network, and has been used by public health authorities and food regulators for outbreak investigations.
  • MLVA (Multiple-Locus Variable-number tandem repeat Analysis) - MLVA is a PCR-based typing method and a major typing tool used by the PulseNet network. It can differentiate fast-evolving bacteria even when they look similar by PFGE, and it is faster and less laborious. Therefore, MLVA is usually performed as a complement to PFGE or instead of it, providing a useful resource during outbreaks. As this analysis is serotype-specific, different Salmonella serotypes usually require different MLVA schemes, so isolates have to be serotyped before the MLVA scheme is selected.
  • MLST (Multi-locus Sequence Typing) - As for other bacteria, an MLST method based on seven housekeeping genes (aroC, dnaN, hemD, hisD, thrA, sucA, and purE) has been developed for Salmonella (Achtman et al. 2012). MLST can provide faster and more reproducible results than PFGE. However, its discriminatory power is lower than that of PFGE and MLVA, and similar to that of serotyping.
  • Microarrays - The Salmonella genoserotyping array (SGSA) is a microarray developed as an alternative to the usual serotyping method. This method presents very good results for the 57 most commonly reported serotypes, but fails for many others. Therefore, it is more useful for fast screening of those 57 serotypes, but not for the others. This method has been improved in SGSA v2.
  • CRISPR - This method uses the diversity of spacers present at CRISPR loci to distinguish bacterial strains (Fabre et al. 2012). Amplified CRISPR loci PCR products are sequenced and analyzed to assign each locus to an allelic type in order to determine the allelic profile of each isolate, and their evolutionary relation. A CRISPR–multi-virulence-locus sequence typing (MVLST) approach using the genes sseL and fimH has also been developed (Shariat et al. 2013). A comparative analysis revealed that CRISPR–MVLST has a higher discriminatory power than the usual MLST, but lower discrimination than PFGE. This represents an expensive non-standardized protocol.
  • WGS (Whole-Genome Sequencing) - With the advent of NGS technologies, WGS has improved the investigation of small salmonellosis outbreaks (Kubota et al. 2019). By providing information at the genomic level, WGS not only allows highly discriminatory typing (cgMLST, wgMLST and SNP typing), but also maintains backward compatibility with previously mentioned molecular typing methods, such as molecular serotyping and 7-gene MLST, which for this reason will tend to remain in use. Furthermore, it allows the analysis of specific genes, such as virulence factors and antimicrobial resistance genes. Genetic clustering using WGS can be based on any distance measure (e.g. allelic differences detected using cgMLST typing) or on evolutionary-model-based clustering (i.e. phylogenetics) relying on variant/SNP detection. The PulseNet network, as well as ECDC and EFSA, are making efforts to implement WGS as a routine tool to replace PFGE and MLVA. Nevertheless, in the case of Salmonella this is still not a routine procedure.
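To illustrate how the seven-gene MLST scheme turns allele calls into a type, the following Python sketch looks up a sequence type from an allelic profile. The profile table and the allele and ST labels are made-up placeholders, not values from the real Salmonella scheme; a real analysis would load the curated profile table of the scheme in use.

```python
# Loci of the seven-gene Salmonella MLST scheme (Achtman et al. 2012).
MLST_LOCI = ("aroC", "dnaN", "hemD", "hisD", "purE", "sucA", "thrA")

# Hypothetical profile table: allele numbers (in MLST_LOCI order) -> sequence type.
# Both the allele numbers and the ST labels are placeholders for illustration.
EXAMPLE_PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): "ST-A",
    (1, 1, 2, 1, 1, 1, 1): "ST-B",
}


def assign_sequence_type(alleles, profiles=EXAMPLE_PROFILES):
    """Return the sequence type for a dict locus -> allele number, or None if unknown."""
    missing = [locus for locus in MLST_LOCI if locus not in alleles]
    if missing:
        raise ValueError(f"missing allele calls for: {', '.join(missing)}")
    key = tuple(alleles[locus] for locus in MLST_LOCI)
    return profiles.get(key)
```

A return value of None corresponds to a combination of alleles that is not yet in the profile table, which in practice would be submitted to the scheme curators as a candidate new type.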

“One Health” surveillance and WGS of Salmonella

The identification of infection sources is essential for outbreak resolution. Hence, an integrated analysis of clinical, food and veterinary samples relying on the concept of One Health is the key to achieving a good surveillance system. As shown by the PulseNet network for Listeria, the high discriminatory power of WGS increases the chances of finding the bacterial source of infection, and possibly reduces the time that it takes. Indeed, as reported by the WHO, the use of WGS increased the resolution of Salmonella cluster analysis and contributed to the identification of recurrent sources of infection. Furthermore, the integrated WGS analysis of food and human samples at international level during a multi-country Salmonella outbreak allowed the identification of the source of infection in Germany (Inns et al. 2015), reflecting the ease with which WGS data can be shared, analyzed and compared. However, several factors are hindering the implementation of a generalized WGS-based surveillance system. For instance, the resources for implementing WGS differ between sectors (human health, animal health and food safety), which complicates the implementation of “One Health” surveillance. For this reason, it has been decided that the technological transition to WGS-based surveillance at European level will be performed first for Listeria, and only afterwards for other bacteria, such as Salmonella (ECDC roadmap).

WGS lab protocol

DNA extraction

Regarding DNA extraction, there is no standard protocol or kit, but a protocol designed for Gram-negative bacteria is recommended.

Sequencing technology

There is no preferred WGS technology for sequencing Salmonella. As in other fields, Illumina paired-end sequencing is the most commonly used strategy. Due to the number of samples that can be handled in a single run and the possibility of longer reads, MiSeq instruments seem to be the choice of the majority of labs. Long-read sequencing technologies are now becoming more frequently used, and there is an apparent tendency to sequence Salmonella genomes using both short- and long-read technologies.

For Illumina sequencing, the choice of library preparation procedure may have adverse effects on in silico serotyping. For example, the Nextera XT library preparation kit seems to introduce a GC bias, which negatively affects O-antigen recognition due to increased fragmentation (Uelze, 2019). The new version, Nextera Flex, is therefore recommended over the XT kit.

Bioinformatics protocol

Mapping or assembly

The first step after receiving the sequencing data of your samples is to evaluate the sequencing quality and to trim and clean the reads.
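As a rough idea of what quality trimming does, the Python sketch below decodes Phred+33 quality strings and removes low-quality bases from the 3' end of a read. The function names and the threshold of 20 are assumptions chosen for illustration; dedicated trimming tools apply more elaborate rules (adapter removal, sliding windows, minimum length filters).

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Cut bases off the 3' end until the last base reaches min_quality."""
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string must have the same length")
    scores = phred33_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def mean_quality(quality_string):
    """Average Phred score of a read, a simple per-read quality summary."""
    scores = phred33_scores(quality_string)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a read whose last two bases carry the quality character '#' (Phred 2) loses those two bases, while bases with quality 'I' (Phred 40) are kept.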

The cleaned sequence data can then be used for downstream analysis following one of two approaches (or both in parallel, check Data production):

  • De novo genome assembly of the sample(s),
  • Read mapping of each sample on a reference sequence (obtained from a database or by de novo genome assembly of one of your samples)

It is important to note that both approaches have advantages and disadvantages, and the decision on which of them to follow should be made according to what is more appropriate for the data you have at hand and the purpose of your analyses. De novo genome assembly of all sequenced isolates followed by their annotation seems to be a common approach in studies including Salmonella genomes. Nevertheless, a parallel read mapping approach is also followed, especially when a SNP-based analysis will be performed (see the following sections).

De novo genome assemblers that can be used for Salmonella include SPAdes and SKESA. Both perform very well and are freely available. A major difference between them is that SKESA cannot use long reads produced by Oxford Nanopore or Pacific Biosciences machines. For this reason, it is not a good option for hybrid assemblies combining short and long reads, which is the tendency in the field. In this context, Unicycler, a pipeline tailored to hybrid assembly that combines SPAdes with other tools, is commonly used for Salmonella genomes. Other command-line pipelines, such as INNUca, automate all the analyses from quality control to genome assembly. If a platform with predefined pipelines is needed instead, INNUENDO, BioNumerics, Ridom SeqSphere+ and EnteroBase can be used for Salmonella.

As for read mapping, BWA is a common choice for Salmonella. Alternatively, the CFSAN SNP pipeline, which is tailored to create high-quality SNP matrices for sequences from closely related pathogens, is commonly used for Salmonella. This pipeline covers all the steps of the analysis, from read mapping to calculation of SNP distances and reconstruction of the haplotypes. Therefore, despite requiring some bioinformatics skills, it may be a good alternative to developing your own pipeline. Note that, as mentioned before, these are commonly used approaches, not recommendations. Other methodologies, pipelines or even platforms may be used for your analysis.

Choosing a reference genome

Should the analysis require a reference genome, its choice is a crucial step. Analyses relying on read-mapping approaches might be strongly influenced by reference choice, as the genetic distance between the reference and the sample may influence the performance of downstream steps, namely SNP/INDEL calling (Pightling et al. 2014, Pightling et al. 2015). The reference can be picked from the samples themselves (after genome assembly) or from a public database. In both cases the reference must be chosen according to the serotype of each isolate. For this reason, it is essential to determine the serotype before read mapping, and no single reference genome from public databases is used for all analyses. However, a closed bacterial genome is preferable.
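The selection rule described above (same serotype first, closed genome preferred) can be sketched in Python as follows. The record fields and the genome names in the usage example are hypothetical; this is only one way to encode the rule, and it does not consider genetic distance beyond the serotype label.

```python
def choose_reference(sample_serotype, candidates):
    """Pick a reference genome of the same serotype, preferring closed genomes.

    candidates: list of dicts with the (hypothetical) keys
    'name', 'serotype' and 'closed' (True for a complete, closed genome).
    Returns the chosen name, or None if no candidate shares the serotype.
    """
    wanted = sample_serotype.strip().lower()
    matching = [c for c in candidates if c["serotype"].strip().lower() == wanted]
    if not matching:
        return None
    # Closed genomes sort first; the name breaks ties so the choice is reproducible.
    matching.sort(key=lambda c: (not c["closed"], c["name"]))
    return matching[0]["name"]
```

For instance, with one draft and one closed candidate of the same serotype, the closed genome is returned:

```python
candidates = [
    {"name": "draft_assembly_01", "serotype": "Enteritidis", "closed": False},
    {"name": "closed_genome_02", "serotype": "Enteritidis", "closed": True},
    {"name": "closed_genome_03", "serotype": "Typhimurium", "closed": True},
]
print(choose_reference("Enteritidis", candidates))
```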

Salmonella serotyping

As mentioned before, determining the Salmonella serotype is an important step for further analysis. For instance, a read-mapping approach and the downstream analysis give better results if the reference genome is of the same serotype as the sample. Serotype determination can be performed with the White-Kauffmann-Le Minor scheme, or with an in silico pipeline. The most commonly used programs for in silico serotype determination in Salmonella are SISTR and SeqSero (and its second version, SeqSero2). Although less commonly used, the “bacterial analysis pipeline” is also an option. The Salmonella Type Finder is a pipeline developed by the Center for Genomic Epidemiology which uses SRST2 and SeqSero. EnteroBase is a complete pipeline combining several tools and includes both SISTR and SeqSero2.

Getting SNPs

How to detect SNPs is described earlier. Briefly, there are three different approaches.

  • Perform de novo genome assembly of each sample (check above), and then align their genomic sequences. Salmonella analyses usually use MAUVE to align the genomes. This multi-sequence alignment can then be passed to SNP-sites to extract the variant sites.
  • Use a reference genome where the reads of all the samples will be mapped, and then use a variant-calling pipeline to determine the polymorphic positions. CFSAN SNP is a commonly used pipeline which performs both processes (read mapping and variant calling). Snippy and SNVPhyl are also commonly used alternatives for Salmonella genomes.
  • Determine the polymorphic positions in the sample by analyzing the k-mer pattern using kSNP. For this approach you can provide either the genome assembly or the cleaned genomic reads. This is the least frequently used approach for Salmonella.
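To make the first approach more concrete, the Python sketch below finds the variable columns of a multi-sequence alignment, which is conceptually what happens when variant sites are extracted from aligned genomes. It is a simplified illustration, not the implementation of any of the tools named above; the function name and the toy alignment are assumptions.

```python
def variable_sites(alignment):
    """Return the polymorphic columns of an alignment.

    alignment: dict mapping sample name -> aligned sequence (all the same length).
    Returns a list of (1-based position, {sample: base}) for columns where at
    least two different unambiguous bases (A, C, G, T) are observed.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for index in range(lengths.pop()):
        column = {name: seq[index].upper() for name, seq in alignment.items()}
        # Gaps and ambiguous bases (e.g. '-' or 'N') are ignored when counting alleles.
        observed = {base for base in column.values() if base in "ACGT"}
        if len(observed) > 1:
            sites.append((index + 1, column))
    return sites
```

With the toy alignment `{"s1": "ACGTA", "s2": "ACCTA", "s3": "AC-TT"}`, positions 3 and 5 are reported as variable, and the gap in s3 at position 3 does not count as an allele.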

Each of these approaches provides you with information about the genetic variability in the dataset. This information can then be used to perform SNP-based clustering and phylogenetic analysis. Alternatively, if a read mapping approach is followed, the reference nucleotide can be replaced by the observed allele at each variant position, thereby reconstructing the haplotype of each sample. This is the approach used by the CFSAN SNP pipeline.
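The haplotype reconstruction idea can be sketched in Python as a substitution of observed alleles into the reference sequence. This is a minimal illustration under the assumption that only SNPs (no insertions or deletions) are applied; it is not the code of the CFSAN SNP pipeline, and the function name and input format are hypothetical.

```python
def apply_snps(reference, snps):
    """Rebuild a sample haplotype by substituting observed alleles into the reference.

    reference: reference sequence as a string.
    snps: dict mapping 1-based position -> (reference base, observed base).
    """
    bases = list(reference)
    for position, (ref_base, alt_base) in snps.items():
        if not 1 <= position <= len(bases):
            raise ValueError(f"position {position} is outside the reference")
        if bases[position - 1] != ref_base:
            # Guards against SNP calls made on a different reference sequence.
            raise ValueError(f"reference base mismatch at position {position}")
        bases[position - 1] = alt_base
    return "".join(bases)
```

Because only substitutions are applied, the reconstructed sequence keeps the coordinates of the reference, so the annotation of the reference can still be used to locate genes in each sample.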

Getting alleles and allele differences

The allele sequences of the samples can be retrieved by:

  • Replacing the nucleotide of the reference genome by the observed alternative allele (see the previous section), and then retrieving the sequence of each gene of interest according to the genome annotation of the reference.
  • Obtaining the de novo genome assembly of each sample, and performing the respective genome annotation. Prokka and the NCBI Prokaryotic Genome Annotation Pipeline are commonly used programs for Salmonella genome annotation. GLIMMER and RASTtk are also used.
  • Some allele callers, such as chewBBACA, provide locus-specific alignments in an automated manner, being a good option to determine the allelic profile of samples.

It is important to note that several platforms can now perform all of this analysis automatically. One of the most commonly used for Salmonella is EnteroBase, which provides assembly, serotyping and allele calling. Several of these platforms are mentioned in the xMLST section.

Allele based typing

Allele-based typing consists of retrieving clustering information from the different alleles present in a population for a given set of genes (e.g. the core genome). With the advent of WGS, the 7-locus MLST approach was broadened to a cgMLST approach. In this context, there is a public cgMLST scheme which has been widely used in Salmonella enterica analysis. This scheme comprises 3,002 loci, and is available in the most commonly used platforms, such as EnteroBase and Ridom SeqSphere+. EnteroBase and BioNumerics also use a wgMLST scheme which, besides the previously mentioned cgMLST loci, includes the accessory genes.

Platforms available for cgMLST typing of Salmonella include EnteroBase, INNUENDO, BioNumerics, CGE, IRIDA, Pathogenwatch, and Ridom SeqSphere+.
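The comparison behind allele-based clustering can be sketched in Python as a count of loci at which two profiles carry different alleles. How missing loci are treated differs between platforms; ignoring them, as done here, is only one possible convention, and the function name and locus names are hypothetical.

```python
def allelic_distance(profile_a, profile_b, missing=None):
    """Count the loci at which two allelic profiles differ.

    profile_a, profile_b: dicts mapping locus name -> allele identifier.
    Loci absent from either profile, or set to `missing`, are ignored.
    """
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared_loci
        if profile_a[locus] != missing
        and profile_b[locus] != missing
        and profile_a[locus] != profile_b[locus]
    )
```

For example, the profiles `{"locus1": 1, "locus2": 2, "locus3": None}` and `{"locus1": 1, "locus2": 3, "locus3": 4}` are at a distance of 1: locus2 differs, and locus3 is skipped because it was not called in the first profile.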

SNP based typing

A SNP-based approach relies on the comparison of SNPs in a population. This strategy can be seen as an alternative to the allele-based approach, but many studies actually perform both and assess the overlap of the results. For a SNP-based analysis, all of the SNPs that are present in the samples need to be identified and used to obtain clustering information. Examples of publicly available pipelines for SNP-based typing are:

  • Center for Food Safety and Applied Nutrition (CFSAN) HqSNPs pipeline
  • Lyve-SET pipeline for HqSNPs typing
  • SNVPhyl (Public Health Agency of Canada)
  • PHEnix (The Public Health England SNP calling pipeline)
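Whatever pipeline produces them, the SNP calls are typically summarized as pairwise distances between isolates. The Python sketch below computes such distances from equal-length sequences (for example, concatenated variant sites). It is a generic illustration, not the method of any pipeline listed above; skipping positions with gaps or ambiguous bases is an assumption, and the isolate names are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Number of positions where both sequences have an unambiguous, different base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for base_a, base_b in zip(seq_a.upper(), seq_b.upper())
        if base_a in "ACGT" and base_b in "ACGT" and base_a != base_b
    )


def pairwise_snp_distances(sequences):
    """Return {(name_a, name_b): distance} for every pair of isolates."""
    return {
        (name_a, name_b): snp_distance(sequences[name_a], sequences[name_b])
        for name_a, name_b in combinations(sorted(sequences), 2)
    }
```

The resulting dictionary can be used directly for clustering or converted to a matrix for tree building.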

Outbreak definition

As defined by the World Health Organization, “a disease outbreak is the occurrence of cases of disease in excess of what would normally be expected in a defined community, geographical area or season”. WGS data provide high discriminatory power, allowing isolates (from different geographical areas, and from clinical, animal or environmental sources) to be clustered according to their genomic similarity. This contributes not only to earlier detection of outbreaks and determination of contamination sources, but also to the detection of more outbreaks, as has been reported by the PulseNet network for Listeria. Nevertheless, it is still difficult to establish a clear outbreak cluster definition for Salmonella, that is, a threshold below which two isolates are considered to belong to the same genetic cluster, thus linking two cases of infection.
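To show what applying such a threshold means in practice, the Python sketch below groups isolates by single-linkage clustering: two isolates end up in the same cluster if they are connected by a chain of pairs whose distance does not exceed the threshold. The threshold values used in the example are arbitrary placeholders, since, as noted above, no agreed value exists for Salmonella; single linkage is also only one of several possible clustering rules.

```python
def single_linkage_clusters(isolates, distances, threshold):
    """Group isolates linked by pairwise distances <= threshold.

    isolates: iterable of isolate names.
    distances: dict mapping (name_a, name_b) -> allelic or SNP distance.
    Returns a sorted list of clusters, each a sorted list of names.
    """
    parent = {name: name for name in isolates}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (name_a, name_b), distance in distances.items():
        if distance <= threshold:
            parent[find(name_a)] = find(name_b)

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())
```

```python
distances = {
    ("A", "B"): 2, ("B", "C"): 4, ("A", "C"): 6,
    ("A", "D"): 31, ("B", "D"): 29, ("C", "D"): 30,
}
# Placeholder thresholds: the grouping changes with the value chosen.
print(single_linkage_clusters("ABCD", distances, threshold=5))
print(single_linkage_clusters("ABCD", distances, threshold=3))
```

Note that with single linkage isolates A and C are clustered together at threshold 5 even though their direct distance is 6, because both are close to B.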

Virulence and AMR

Several genes are important for the ability of Salmonella to cause infection and are medically relevant, such as motility genes, fimbrial adhesins and metabolic genes (Ilyas et al. 2017, Eng et al. 2015). In the particular case of Salmonella, horizontally transferred genes strongly influence the course of infection, as they can lead to the emergence of new phenotypes and favor adaptation to new niches (Ilyas et al. 2017, Ochman et al. 2000). Such events are important not only for the ability of Salmonella to infect humans, but also for the acquisition of resistance to antimicrobial drugs (Wang et al. 2019). Chloramphenicol, ampicillin, and trimethoprim–sulfamethoxazole are the first-line antimicrobial drugs used to treat Salmonella infections. However, over the years resistance to one or several of these drugs (leading to multidrug-resistant isolates) has been emerging (Eng et al. 2015). Alternative antimicrobials have been used, but resistance to these alternatives is also appearing, and antimicrobial resistance in Salmonella is considered a global threat (Marchello et al. 2020, Eng et al. 2015). For this reason, monitoring virulence- and antimicrobial resistance-related genes is of great relevance for determining the best course of action in the presence of a case of infection or even an outbreak. As mentioned in the Virulence and AMR detection section, where more details can be found, this is performed by comparing the genome to a database comprising a set of genes of interest. Examples of predefined resistome databases are mentioned in the same section.
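The idea of comparing a genome to a database of genes of interest can be sketched in Python with a simple k-mer screen: a gene is reported when most of its k-mers are found in the assembly, on either strand. This is only a toy illustration of the principle, not the method of any specific tool or database; the gene names, sequences, k-mer size and cutoff are all made-up assumptions, and real analyses rely on alignment-based or curated detection tools.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence containing A, C, G and T."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def kmers(sequence, k):
    """Set of all substrings of length k in the sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def screen_genes(assembly, gene_db, k=5, min_fraction=0.9):
    """Report genes whose k-mers are mostly present in the assembly.

    gene_db: dict mapping gene name -> gene sequence (hypothetical database).
    Returns {gene name: fraction of gene k-mers found} for genes passing the cutoff.
    """
    # Index both strands, since a gene may lie on the reverse strand of a contig.
    assembly_kmers = kmers(assembly, k) | kmers(reverse_complement(assembly), k)
    hits = {}
    for name, gene_sequence in gene_db.items():
        gene_kmers = kmers(gene_sequence, k)
        if not gene_kmers:
            continue  # gene shorter than k, cannot be screened
        fraction = len(gene_kmers & assembly_kmers) / len(gene_kmers)
        if fraction >= min_fraction:
            hits[name] = fraction
    return hits
```

In a real setting k would be much larger than 5 and the cutoff would be tuned to the database; the small values here only keep the toy sequences short.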