Software Information

This is a general guide to the software employed throughout the BacterialTyper pipeline.

EDirect

The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the National Center for Biotechnology Information (NCBI).

Entrez Direct (EDirect) provides access to the NCBI’s suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a UNIX terminal window. Functions take search terms from command-line arguments. Individual operations are combined to build multi-step queries. Record retrieval and formatting normally complete the process.

EDirect also includes an argument-driven function that simplifies the extraction of data from document summaries or other results that are returned in structured XML format. This can eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect commands and UNIX utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
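As a minimal sketch of such a multi-step query, the Python snippet below assembles an esearch step piped into an efetch step. The helper names and the example query are hypothetical and chosen only for illustration; they are not part of BacterialTyper. Only the esearch -db/-query and efetch -format options come from EDirect itself.

```python
import shlex


def build_edirect_pipeline(database, query, record_format="fasta"):
    """Return argv lists for the two steps: esearch ... | efetch ...

    Hypothetical helper for illustration; it only builds the commands
    and does not contact NCBI.
    """
    search_cmd = ["esearch", "-db", database, "-query", query]
    fetch_cmd = ["efetch", "-format", record_format]
    return search_cmd, fetch_cmd


def as_shell_line(search_cmd, fetch_cmd):
    """Join both steps into one shell line with safely quoted arguments."""
    return " | ".join(shlex.join(cmd) for cmd in (search_cmd, fetch_cmd))


# Example query (an assumption chosen for illustration).
search, fetch = build_edirect_pipeline(
    "nucleotide", "Staphylococcus aureus[ORGN] AND plasmid[TITLE]"
)
print(as_shell_line(search, fetch))
```

To actually run the query, the standard output of the first command can be connected to the standard input of the second, for instance with subprocess.Popen. This requires EDirect to be installed and network access to NCBI.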

Read further information about the E-utilities at https://www.ncbi.nlm.nih.gov/books/NBK25501/ and about EDirect at https://www.ncbi.nlm.nih.gov/books/NBK179288/

FastQC

FastQC [5] is a quality control tool for high-throughput sequence data. It aims to provide a simple way to run quality control checks on raw sequence data coming from high-throughput sequencing pipelines. It provides a modular set of analyses that give a quick impression of whether your data has any problems you should be aware of before doing any further analysis.

The main functions of FastQC are:

  • Import of data from BAM, SAM or FastQ files (any variant)

  • Providing a quick overview to tell you in which areas there may be problems

  • Summary graphs and tables to quickly assess your data

  • Export of results to an HTML based permanent report

  • Offline operation to allow automated generation of reports without running the interactive application
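The offline mode listed above can be sketched as follows. This is a hedged illustration, not the way BacterialTyper necessarily calls FastQC: the function name and file names are hypothetical, while --outdir, --threads, --extract and --noextract are FastQC command-line options.

```python
def fastqc_command(read_files, outdir, threads=1, extract=False):
    """Build a FastQC command line for non-interactive (offline) use.

    Hypothetical helper: it returns the argument list without running it.
    """
    if not read_files:
        raise ValueError("at least one input file is required")
    cmd = ["fastqc", "--outdir", outdir, "--threads", str(threads)]
    # --extract / --noextract control whether the zipped report is unpacked.
    cmd.append("--extract" if extract else "--noextract")
    # Passing files on the command line makes FastQC skip the interactive GUI.
    cmd.extend(read_files)
    return cmd


# File names below are placeholders.
print(fastqc_command(["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "qc", threads=2))
```

The returned list can be passed to subprocess.run. Note that FastQC expects the output directory to exist already, so it may need to be created first.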

Read further information about FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

IQ-TREE

IQ-TREE is a software package with a strong emphasis on phylogenomic inference. It has been developed since 2011 as open-source software under the GNU GPL license.

Main goals are:

  • Accuracy: Proposing novel computational methods that perform better than existing approaches.

  • Speed: Allowing fast analysis on big data sets and utilizing high performance computing platforms.

  • Flexibility: Facilitating the inclusion of new (phylogenomic) models and sequence data types.

  • Versatility: Implementing a broad range of commonly-used maximum likelihood analyses.

The name IQ-TREE comes from the fact that it is the successor of IQPNNI and TREE-PUZZLE software.

See additional details in: http://www.iqtree.org/about/

KMA

The k-mer alignment (KMA) software [6] implements an alignment method that allows raw reads to be aligned directly against entire databases, without the need for similarity reduction. KMA uses an extra mapping step in which the template of each input sequence is found and scored with the ConClave algorithm.

KMA - Output Guide

Explanation of the columns

Template: the name of the template sequence.

Score: the global alignment score of the template.

Expected: the expected alignment score if all mapping reads were smeared over all templates in the database.

Template length: the length of the template in nucleotides.

template_id: the percent identity of the found template, over the full template length.

template_coverage: the percentage of the template that is covered by the query.

query_id: the percent identity between the query and template sequence, over the length of the matching query sequence.

query_coverage: the length of the matching query sequence divided by the template length.

Depth: the number of times the template has been covered by the query.

q_value: the quantile from McNemar's test, used to test whether the current template is a significant hit.

p_value: the p-value corresponding to the obtained q_value.
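The columns above can be read programmatically. The sketch below assumes the tab-separated .res file written by the KMA command-line program, whose header names (Template_Identity, Template_Coverage and so on) differ slightly from the labels used above. The function names, the thresholds and the example rows are made up for illustration and are not real results.

```python
import csv
import io


def read_kma_res(handle):
    """Parse a KMA result table into a list of dicts keyed by column name.

    Assumes a tab-separated table whose first line is a header starting
    with '#'. Values are kept as stripped strings.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip().lstrip("#") for name in next(reader)]
    hits = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        hits.append(dict(zip(header, (value.strip() for value in fields))))
    return hits


def filter_hits(hits, min_identity=90.0, min_coverage=90.0):
    """Keep templates above illustrative identity and coverage cutoffs."""
    return [
        hit for hit in hits
        if float(hit["Template_Identity"]) >= min_identity
        and float(hit["Template_Coverage"]) >= min_coverage
    ]


# Made-up example table; the numbers do not come from a real KMA run.
EXAMPLE_RES = (
    "#Template\tScore\tExpected\tTemplate_length\tTemplate_Identity\t"
    "Template_Coverage\tQuery_Identity\tQuery_Coverage\tDepth\tq_value\tp_value\n"
    "geneA\t   12000\t     100\t    1000\t  99.80\t 100.00\t  99.80\t 100.00\t  12.10\t 11800.0\t1.0e-26\n"
    "geneB\t     300\t     100\t    1200\t  45.00\t  50.00\t  90.00\t 200.00\t   0.30\t   100.0\t1.0e-05\n"
)

hits = read_kma_res(io.StringIO(EXAMPLE_RES))
good = filter_hits(hits)
print([hit["Template"] for hit in good])
```

For a real run, the same reader can be given an open file handle instead of the in-memory example string.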

See additional details here: https://cge.cbs.dtu.dk/services/KMA/output.php

MLSTar

MLSTar is an R package for performing MLST analysis. It also works as an interface to PubMLST through its RESTful API, automatically downloading and collecting the required files.

Reference: https://peerj.com/articles/5098/

See additional details in: https://github.com/iferres/MLSTar

Prokka

Prokka [7] is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files. Prokka is a contraction of “prokaryotic annotation”.

See additional details in: https://github.com/tseemann/prokka

PhyML

PhyML is a software package that estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences. The main strength of PhyML lies in the large number of substitution models coupled to various options to search the space of phylogenetic tree topologies, going from very fast and efficient methods to slower but generally more accurate approaches. PhyML was designed to process moderate to large data sets.
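The trade-off between substitution model and topology search can be illustrated with a small command builder. This is a sketch under stated assumptions: the function name and file name are hypothetical, and the executable is assumed to be called phyml. The options -i (PHYLIP alignment), -d (nt or aa), -m (model), -s (NNI, SPR or BEST) and -b (bootstrap replicates) are PhyML command-line options.

```python
def phyml_command(alignment, data_type="nt", model="GTR", search="NNI", bootstrap=0):
    """Build a PhyML command line (hypothetical helper, nothing is executed).

    search="NNI" is the fast option; "SPR" is slower but usually explores
    tree space more thoroughly; "BEST" tries both.
    """
    if data_type not in ("nt", "aa"):
        raise ValueError("data_type must be 'nt' or 'aa'")
    if search not in ("NNI", "SPR", "BEST"):
        raise ValueError("search must be 'NNI', 'SPR' or 'BEST'")
    return [
        "phyml",
        "-i", alignment,       # input alignment in PHYLIP format
        "-d", data_type,       # nucleotide or amino acid data
        "-m", model,           # substitution model name
        "-s", search,          # tree topology search strategy
        "-b", str(bootstrap),  # number of bootstrap replicates (0 = none)
    ]


# Placeholder file name; a slower but more thorough search with 100 replicates.
print(phyml_command("core_alignment.phy", search="SPR", bootstrap=100))
```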

SPAdes

SPAdes – St. Petersburg genome assembler – is an assembly toolkit containing various assembly pipelines [8].

See additional details in: http://cab.spbu.ru/software/spades/ and https://github.com/ablab/spades

Snippy

Snippy finds SNPs between a haploid reference genome and your NGS sequence reads. It finds both substitutions (SNPs) and insertions/deletions (indels). It can then take a set of Snippy results that use the same reference and generate a core SNP alignment (and ultimately a phylogenomic tree).
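The two stages described above (per-sample variant calling, then a core SNP alignment) can be sketched as two command builders. The helper names, sample names and paths are hypothetical; the options shown (--cpus, --outdir, --ref, --R1, --R2 for snippy, and --ref, --prefix for snippy-core) are assumed to match Snippy version 4, since earlier releases used a different snippy-core interface.

```python
def snippy_command(reference, r1, r2, outdir, cpus=4):
    """Build the per-sample Snippy call (hypothetical helper, not executed)."""
    return [
        "snippy",
        "--cpus", str(cpus),
        "--outdir", outdir,
        "--ref", reference,
        "--R1", r1,
        "--R2", r2,
    ]


def snippy_core_command(reference, sample_dirs, prefix="core"):
    """Build the snippy-core call that combines several Snippy output folders.

    All sample folders must have been produced with the same reference.
    """
    if len(sample_dirs) < 2:
        raise ValueError("a core alignment needs at least two samples")
    return ["snippy-core", "--ref", reference, "--prefix", prefix, *sample_dirs]


# Placeholder sample names and paths.
samples = ["sampleA", "sampleB"]
per_sample = [
    snippy_command("ref.gbk", s + "_R1.fastq.gz", s + "_R2.fastq.gz", "snps_" + s)
    for s in samples
]
core = snippy_core_command("ref.gbk", ["snps_" + s for s in samples])
print(per_sample[0])
print(core)
```

Each list can be handed to subprocess.run in turn; the snippy-core step should only run after every per-sample call has finished.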

See additional details in: https://github.com/tseemann/snippy and https://www.slideshare.net/torstenseemann/snippy-balti-bioinformatics-brum-uk-tue-5-may-2015

Trimmomatic

Trimmomatic [9] is a fast, multithreaded command-line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.

Trimmomatic works with FASTQ files (using phred + 33 or phred + 64 quality scores, depending on the Illumina pipeline used). Files compressed using either gzip or bzip2 are supported, and are identified by use of .gz or .bz2 file extensions.
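Because the quality encoding has to match the data, it can help to check it before trimming. The sketch below is a heuristic written for illustration, not Trimmomatic's own detection logic, and the helper names, output file naming scheme and paths are hypothetical. The PE mode, the -phred33/-phred64 switches and the ILLUMINACLIP, SLIDINGWINDOW and MINLEN steps are Trimmomatic options, with parameter values taken from the examples in its manual.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality encoding from FASTQ quality lines.

    Heuristic (an assumption): characters below ';' (ASCII 59) cannot
    occur in phred+64 data, and characters above 'J' (ASCII 74) are
    unusual for Illumina phred+33 data.
    Returns "phred33", "phred64", or None when the sample is ambiguous.
    """
    codes = [ord(char) for line in quality_strings for char in line]
    if not codes:
        return None
    if min(codes) < 59:
        return "phred33"
    if max(codes) > 74:
        return "phred64"
    return None


def trimmomatic_pe_command(jar, r1, r2, prefix, encoding="phred33", adapters=None):
    """Build a paired-end Trimmomatic call (hypothetical helper, not executed)."""
    if encoding not in ("phred33", "phred64"):
        raise ValueError("encoding must be 'phred33' or 'phred64'")
    cmd = [
        "java", "-jar", jar, "PE", "-" + encoding, r1, r2,
        # Output order: forward paired, forward unpaired,
        # reverse paired, reverse unpaired.
        prefix + "_R1_paired.fastq.gz", prefix + "_R1_unpaired.fastq.gz",
        prefix + "_R2_paired.fastq.gz", prefix + "_R2_unpaired.fastq.gz",
    ]
    if adapters:
        cmd.append("ILLUMINACLIP:" + adapters + ":2:30:10")
    cmd.extend(["SLIDINGWINDOW:4:15", "MINLEN:36"])
    return cmd


# Placeholder quality line and file names.
encoding = guess_phred_offset(["IIIIHHGG#!"]) or "phred33"
print(trimmomatic_pe_command("trimmomatic.jar", "s_R1.fastq.gz", "s_R2.fastq.gz",
                             "s_trim", encoding, adapters="TruSeq3-PE.fa"))
```

The .gz extensions on the output names rely on the extension-based compression handling described above.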

See additional information in: http://www.usadellab.org/cms/?page=trimmomatic and http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf