Interpretation for developers

Here we include some guidelines and interpretation of intermediate results for developing the BacterialTyper project.

ARIBA results description

ARIBA generates many output files that are common to all databases. They are listed here as an example.

  • assembled_seqs.fa.gz: reference sequences identified

  • assemblies.fa.gz: query sequences retrieved from sample

  • assembled_genes.fa.gz: encoding genes from assemblies.fa

  • debug.report.tsv: initial report.tsv before filtering

  • log.clusters.gz: log details for each cluster

  • report.tsv: report for each sample

  • version_info.txt: version and additional information

File devel/info/ARIBA_explained.csv contains the description of the columns in the ARIBA result file generated.

| Column | Description |
|---|---|
| ariba_ref_name | ARIBA name of the reference sequence chosen from the cluster (renamed to stop some tools breaking) |
| ref_name | original name of the reference sequence chosen from the cluster, before renaming |
| gene | 1=gene; 0=non-coding (same as metadata column 2) |
| var_only | 1=variant only; 0=presence/absence (same as metadata column 3) |
| flag | cluster flag |
| reads | number of reads in this cluster |
| cluster | name of cluster |
| ref_len | length of reference sequence |
| ref_base_assembled | number of reference nucleotides assembled by this contig |
| pc_ident | %identity between reference sequence and contig |
| ctg | name of contig matching reference |
| ctg_len | length of contig |
| ctg_cov | mean mapped read depth of this contig |
| known_var | is this a known SNP from reference metadata? 1 or 0 |
| var_type | the type of variant. Currently only SNP is supported |
| var_seq_type | variant sequence type. If known_var=1: n or p for nucleotide or protein |
| known_var_change | if known_var=1: the wild/variant change, e.g. I42L |
| has_known_var | if known_var=1: 1 or 0 for whether or not the assembly has the variant |
| ref_ctg_change | amino acid or nucleotide change between reference and contig, e.g. I42L |
| ref_ctg_effect | effect of change between reference and contig, e.g. SYN, NONSYN (amino acid changes only) |
| ref_start | start position of variant in reference |
| ref_end | end position of variant in reference |
| ref_nt | nucleotide(s) in reference at variant position |
| ctg_start | start position of variant in contig |
| ctg_end | end position of variant in contig |
| ctg_nt | nucleotide(s) in contig at variant position |
| smtls_total_depth | total read depth at variant start position in contig, as reported by mpileup |
| smtls_nts | nucleotides on contig, as reported by mpileup. The first is the contig nucleotide |
| smtls_nts_depth | depths on contig, as reported by mpileup. One number per nucleotide in the previous column |
| var_description | description of variant from reference metadata |
| free_text | other free text about reference sequence from reference metadata |
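As an illustration of how these columns can be used, here is a minimal sketch that reads a report.tsv and lists the clusters in which a known variant was found in the assembly. The function names and the example rows are made up for this sketch, and we assume that the header line may start with a `#` character; it is not the parser used inside BacterialTyper.

```python
import csv
import io


def read_ariba_report(handle):
    """Yield one dict per report row, keyed by column name."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return  # empty file: nothing to yield
    # Assumption: the header line may start with '#', which is stripped here.
    header[0] = header[0].lstrip("#")
    for fields in reader:
        yield dict(zip(header, fields))


def clusters_with_known_variant(rows):
    """Return sorted cluster names where a known variant is present in the assembly."""
    return sorted({
        row["cluster"]
        for row in rows
        if row.get("known_var") == "1" and row.get("has_known_var") == "1"
    })


# Made-up report restricted to the columns used in this sketch.
example = (
    "#ariba_ref_name\tcluster\tknown_var\thas_known_var\tref_ctg_change\n"
    "gyrA_1\tcluster_1\t1\t1\tS83L\n"
    "gyrA_1\tcluster_1\t1\t0\t.\n"
    "blaTEM\tcluster_2\t0\t.\t.\n"
)
rows = list(read_ariba_report(io.StringIO(example)))
print(clusters_with_known_variant(rows))
```

With a real report, the same functions can be called on an open file handle instead of the in-memory example.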

Prokka output files description

File devel/info/prokka_output_files.csv contains the description of the different output files generated by Prokka.

See additional details in: https://github.com/tseemann/prokka#output-files

| Extension | Description |
|---|---|
| .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
| .gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
| .fna | Nucleotide FASTA file of the input contig sequences. |
| .faa | Protein FASTA file of the translated CDS sequences. |
| .ffn | Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) |
| .sqn | An ASN1 format ‘Sequin’ file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
| .fsa | Nucleotide FASTA file of the input contig sequences, used by tbl2asn to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
| .tbl | Feature Table file, used by ‘tbl2asn’ to create the .sqn file. |
| .err | Unacceptable annotations: the NCBI discrepancy report. |
| .log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
| .txt | Statistics relating to the annotated features found. |
| .tsv | Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product |
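For example, the .tsv file can be summarised by counting features per type. The sketch below assumes that the file starts with a header line using the column names listed above; the function name and the example rows are hypothetical and only serve as an illustration.

```python
import csv
import io
from collections import Counter


def count_feature_types(handle):
    """Count the annotated features of each type (column 'ftype')."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["ftype"] for row in reader)


# Made-up feature table; only the 'ftype' column is used by the function.
example = (
    "locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n"
    "S1_00001\tCDS\t1356\tdnaA\t\t\texample product A\n"
    "S1_00002\tCDS\t1134\tdnaN\t\t\texample product B\n"
    "S1_00003\ttRNA\t76\t\t\t\texample tRNA\n"
)
counts = count_feature_types(io.StringIO(example))
print(dict(counts))
```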

Snippy output files description

File devel/info/snippy_output_files.csv contains the description of the different output files generated by Snippy.

See additional details in: https://github.com/tseemann/snippy#output-files

| Extension | Description |
|---|---|
| .tab | A simple tab-separated summary of all the variants |
| .csv | A comma-separated version of the .tab file |
| .html | An HTML version of the .tab file |
| .vcf | The final annotated variants in VCF format |
| .bed | The variants in BED format |
| .gff | The variants in GFF3 format |
| .bam | The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates. |
| .bam.bai | Index for the .bam file |
| .log | A log file with the commands run and their outputs |
| .aligned.fa | A version of the reference but with - at positions with depth=0 and N for 0 < depth < --mincov (does not have variants) |
| .consensus.fa | A version of the reference genome with all variants instantiated |
| .consensus.subs.fa | A version of the reference genome with only substitution variants instantiated |
| .raw.vcf | The unfiltered variant calls from Freebayes |
| .filt.vcf | The filtered variant calls from Freebayes |
| .vcf.gz | Compressed .vcf file via BGZIP |
| .vcf.gz.csi | Index for the .vcf.gz (via bcftools index) |
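The encoding used in the .aligned.fa file (- for depth=0 and N for low depth) makes it easy to estimate how much of the reference was covered. The following is a minimal sketch with hypothetical function names and a made-up sequence. Note that N characters already present in the reference would also be counted as low depth here.

```python
def read_fasta(text):
    """Parse FASTA text into a dict {sequence name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the description line as the name.
            name = line[1:].split(maxsplit=1)[0] if len(line) > 1 else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def coverage_summary(sequence):
    """Count zero-depth ('-') and low-depth ('N') positions in one sequence."""
    sequence = sequence.upper()
    zero_depth = sequence.count("-")
    low_depth = sequence.count("N")
    return {
        "length": len(sequence),
        "zero_depth": zero_depth,
        "low_depth": low_depth,
        "covered": len(sequence) - zero_depth - low_depth,
    }


# Made-up aligned sequence split over two lines.
example = ">chr1 demo\nACGT--NNAC\nGT-A\n"
summary = {name: coverage_summary(seq) for name, seq in read_fasta(example).items()}
print(summary)
```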

PhiSpy

PhiSpy identifies prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.

PhiSpy training sets available

File devel/info/PhiSpy_training-sets.txt contains the description of the different training sets available for bacteriophage analysis using PhiSpy.

| trainingSet | Tag | Set_Name | genus | species | other |
|---|---|---|---|---|---|
| 0 | Generic | testSet_genericAll | All available | All available | All available |
| 1 | Single_species | trainSet_32002.17 | Achromobacter | denitrificans | strain_PR1 |
| 2 | Single_species | trainSet_272558.23 | Bacillus | halodurans | C-125 |
| 3 | Single_species | trainSet_224308.360 | Bacillus | subtilis | subsp._subtilis_str._168 |
| 4 | Single_species | trainSet_411479.31 | Bacteroides | uniformis | ATCC_8492_strain_81A2 |
| 5 | Single_species | trainSet_206672.37 | Bifidobacterium | longum | NCC2705 |
| 6 | Single_species | trainSet_224914.79 | Brucella | melitensis | 16M |
| 7 | Single_species | trainSet_190650.21 | Caulobacter | crescentus | CB15 |
| 8 | Single_species | trainSet_195102.53 | Clostridium | perfringens | str._13 |
| 9 | Single_species | trainSet_212717.31 | Clostridium | tetani | E88 |
| 10 | Single_species | trainSet_243230.96 | Deinococcus | radiodurans | R1 |
| 11 | Single_species | trainSet_226185.9 | Enterococcus | faecalis | V583 |
| 12 | Single_species | trainSet_1351.557 | Enterococcus | faecalis | strain_V583 |
| 13 | Single_species | trainSet_199310.168 | Escherichia | coli | CFT073 |
| 14 | Single_species | trainSet_83333.998 | Escherichia | coli | K12 |
| 15 | Single_species | trainSet_83334.295 | Escherichia | coli | O157-H7 |
| 16 | Single_species | trainSet_155864.289 | Escherichia | coli | O157-H7_EDL933 |
| 17 | Single_species | trainSet_71421.45 | Haemophilus | influenzae | Rd_KW20 |
| 18 | Single_species | trainSet_272623.42 | Lactococcus | lactis | subsp._lactis_Il1403 |
| 19 | Single_species | trainSet_272626.22 | Listeria | innocua | Clip11262 |
| 20 | Single_species | trainSet_169963.176 | Listeria | monocytogenes | EGD-e |
| 21 | Single_species | trainSet_266835.41 | Mesorhizobium | loti | MAFF303099 |
| 22 | Single_species | trainSet_83331.121 | Mycobacterium | tuberculosis | CDC1551 |
| 23 | Single_species | trainSet_83332.460 | Mycobacterium | tuberculosis | H37Rv |
| 24 | Single_species | trainSet_122586.26 | Neisseria | meningitidis | MC58 |
| 25 | Single_species | trainSet_122587.18 | Neisseria | meningitidis | Z2491 |
| 26 | Single_species | trainSet_272843.53 | Pasteurella | multocida | subsp._multocida_str._Pm70 |
| 27 | Single_species | trainSet_208964.452 | Pseudomonas | aeruginosa | PAO1 |
| 28 | Single_species | trainSet_160488.79 | Pseudomonas | putida | KT2440 |
| 29 | Single_species | trainSet_267608.42 | Ralstonia | solanacearum | GMI1000 |
| 30 | Single_species | trainSet_220341.87 | Salmonella | enterica | subsp._enterica_serovar_Typhi_str._CT18 |
| 31 | Single_species | trainSet_211586.69 | Shewanella | oneidensis | MR-1 |
| 32 | Single_species | trainSet_198214 | Shigella | flexneri | 2a_str._301 |
| 33 | Single_species | trainSet_1280.10152 | Staphylococcus | aureus | strain_Sa_Newman_UoM |
| 34 | Single_species | trainSet_196620.15 | Staphylococcus | aureus | subsp._aureus_MW2 |
| 35 | Single_species | trainSet_158878.38 | Staphylococcus | aureus | subsp._aureus_Mu50 |
| 36 | Single_species | trainSet_160490.61 | Streptococcus | pyogenes | M1_GAS |
| 37 | Single_species | trainSet_198466.10 | Streptococcus | pyogenes | MGAS315 |
| 38 | Single_species | trainSet_186103.26 | Streptococcus | pyogenes | MGAS8232 |
| 39 | Single_species | trainSet_243277.252 | Vibrio | cholerae | O1_biovar_eltor_str._N16961 |
| 40 | Single_species | trainSet_190486.46 | Xanthomonas | axonopodis | pv._citri_str._306 |
| 41 | Single_species | trainSet_160492.65 | Xylella | fastidiosa | 9a5c |
| 42 | Single_species | trainSet_183190.38 | Xylella | fastidiosa | Temecula1 |
| 43 | Single_species | trainSet_214092.200 | Yersinia | pestis | CO92 |
| 44 | Single_species | trainSet_187410.24 | Yersinia | pestis | KIM |
| 53 | Single_species | trainSet_1367847.3 | Paracoccus | aminophilus | JCM_7686 |
| 54 | Single_species | trainSet_318586.5 | Paracoccus | denitrificans | PD1222 |
| 55 | Single_species | trainSet_1525717.3 | Paracoccus | sanguinis | 5503 |
| 56 | Single_species | trainSet_1660154.3 | Paracoccus | sp. | SCN_68-21 |
| 57 | Single_species | trainSet_147645.106 | Paracoccus | yeei | TT13 |
| 45 | Multi_species | trainSet_Bacillus | Bacillus | halodurans/subtilis | subsp._subtilis_str._168/C-125 |
| 46 | Multi_species | trainSet_Clostridium | Clostridium | tetani/perfringens | str._13/E88 |
| 47 | Multi_species | trainSet_Ecoli | Escherichia | coli | K12/O157-H7/O157-H7_EDL933/CFT073 |
| 48 | Multi_species | trainSet_Efec | Enterococcus | faecalis | V583/strain_V583 |
| 49 | Multi_species | trainSet_Listeria | Listeria | monocytogenes/innocua | EGD-e/Clip11262 |
| 50 | Multi_species | trainSet_Mtb | Mycobacterium | tuberculosis | CDC1551/H37Rv |
| 51 | Multi_species | trainSet_Nmeningitidis | Neisseria | meningitidis | MC58/Z2491 |
| 52 | Multi_species | trainSet_Paracoccus | Paracoccus | sp./yeei/sanguinis/denitrificans/aminophilus | PD1222ff/SCN_68-21ff/5503ff/JCM_7686ff/TT13ff |
| 58 | Multi_species | trainSet_Pseudomonas | Pseudomonas | putida/aeruginosa | PAO1/KT2440 |
| 59 | Multi_species | trainSet_Saureus | Staphylococcus | aureus | subsp._aureus_Mu50/strain_Sa_Newman_UoMf/subsp._aureus_MW2 |
| 60 | Multi_species | trainSet_Spyogenes | Streptococcus | pyogenes | M1_GAS/MGAS8232/MGAS315 |
| 61 | Multi_species | trainSet_Xfastidiosa | Xylella | fastidiosa | 9a5c/Temecula1 |
| 62 | Multi_species | trainSet_Ypestis | Yersinia | pestis | CO92/KIM |
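A list like this one can be queried to find the training sets available for a given genus. The sketch below assumes a tab-separated layout with the six column names shown above; the actual layout of devel/info/PhiSpy_training-sets.txt may differ, and the function name is hypothetical. The example rows are copied from the list.

```python
import csv
import io


def training_sets_for_genus(handle, genus):
    """Return (Tag, Set_Name) pairs for every training set of the given genus."""
    # Assumption: tab-separated text with a header line.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (row["Tag"], row["Set_Name"])
        for row in reader
        if row["genus"].lower() == genus.lower()
    ]


example = (
    "trainingSet\tTag\tSet_Name\tgenus\tspecies\tother\n"
    "0\tGeneric\ttestSet_genericAll\tAll available\tAll available\tAll available\n"
    "43\tSingle_species\ttrainSet_214092.200\tYersinia\tpestis\tCO92\n"
    "44\tSingle_species\ttrainSet_187410.24\tYersinia\tpestis\tKIM\n"
    "62\tMulti_species\ttrainSet_Ypestis\tYersinia\tpestis\tCO92/KIM\n"
)
matches = training_sets_for_genus(io.StringIO(example), "yersinia")
print(matches)
```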

PhiSpy results

Results generated by PhiSpy are text files containing the annotation and information regarding the identified inserted bacteriophages.

The original output has some limitations, so we implemented some improvements to make the results clearer and easier to interpret.

Original results

See original details in: https://github.com/linsalrob/PhiSpy#output-files

There are several files generated:

  • prophage.tbl: This file has two columns separated by tabs [id, location].

    The id is in the format: pp_number, where number is a sequential number of the prophage (starting at 1).

    Location is in the format: contig_start_stop that encompasses the prophage.

  • prophage_tbl.tsv: This is a tab-separated file. The file contains all the genes of the genome. The tenth column represents the status of a gene: if it is 0, the gene is a bacterial gene; otherwise, it is a phage-like gene. This file has 16 columns:

      1. fig_no: the id of each gene;

      2. function: function of the gene;

      3. contig;

      4. start: start location of the gene;

      5. stop: end location of the gene;

      6. position: a sequential number of the gene (starting at 1);

      7. rank: rank of each gene provided by random forest;

      8. my_status: status of each gene based on random forest;

      9. pp: classification of each gene based on their function;

      10. Final_status: the status of each gene. For prophages, this column has the number of the prophage as listed in prophage.tbl above; if the column contains a 0, we believe that it is a bacterial gene;

      11. start of attL;

      12. end of attL;

      13. start of attR;

      14. end of attR;

      15. sequence of attL;

      16. sequence of attR.

  • prophage_coordinates.tsv: This file has the prophage ID, contig, start, stop, and potential att sites identified for the phages.

  • prophage.gff3: General Feature Format (v3) file containing the annotation of the phages identified. This is a contribution that the BacterialTyper developer (Jose F. Sanchez-Herrero) made to the original PhiSpy code.

  • testSet.txt: Results of the Shannon score generated during the makeTest module of PhiSpy, required by the subsequent random forest classifier.

  • classify.tsv: Results of the random forest classifier called within the classification module of PhiSpy.
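The formats described above can be read with a few lines of code. This is a minimal sketch, assuming the location format contig_start_stop of prophage.tbl and the 16-column layout of prophage_tbl.tsv as described; the function names and example values are hypothetical. Lines whose tenth column is not a number (for instance, a header line, if present) are skipped.

```python
import csv
import io
from collections import Counter


def parse_location(location):
    """Split a 'contig_start_stop' location into (contig, start, stop)."""
    # Split from the right, because contig names may contain underscores.
    contig, start, stop = location.rsplit("_", 2)
    return contig, int(start), int(stop)


def genes_per_prophage(handle):
    """Count genes per prophage number using the tenth column (Final_status)."""
    counts = Counter()
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) < 10:
            continue
        status = fields[9]
        # 0 marks a bacterial gene; non-numeric values are ignored.
        if status.isdigit() and status != "0":
            counts[int(status)] += 1
    return dict(counts)


def _row(gene_id, final_status):
    """Build a made-up 16-column line; only Final_status matters here."""
    fields = [gene_id, "hypothetical protein", "contig_1", "1", "300", "1",
              "0.0", "0", "0", str(final_status), "0", "0", "0", "0", "", ""]
    return "\t".join(fields)


example = "\n".join([_row("g1", 0), _row("g2", 1), _row("g3", 1), _row("g4", 2)]) + "\n"
gene_counts = genes_per_prophage(io.StringIO(example))
print(gene_counts)
print(parse_location("NODE_1_length_5000_1200_3400"))
```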

Modified results

All original files are named prophage or classify, independently of the sample name. Also, some files are not necessary for a regular user to interpret the results and obtain the number of prophage regions and their details.

Within BacterialTyper, we rename the original PhiSpy result files according to the sample names provided. As some tab-separated files do not contain headers, we generate tab-separated files with headers and a summary Excel file for better interpretation and integration of results.

File conversion:

  • prophage_tbl.tsv:

    Rename it to ‘SampleName’_PhiSpy-classification_genes.tsv

    Include it in a summary Excel file.

  • prophage.gff3:

    Rename it to ‘SampleName’_PhiSpy-prophage.gff3

  • prophage_coordinates.tsv:

    Rename it to ‘SampleName’_PhiSpy-prophage-coordinates.tsv

    Add header containing the following columns:

    • prophage_ID

    • Contig

    • Start

    • End

    • attL_Start

    • attL_End

    • attR_Start

    • attR_End

    • attL_Seq

    • attR_Seq

    • Longest_Repeat_flanking_phage

    Include it in a summary Excel file.

  • prophage.tbl:

    Move it to a temporary folder generated, as it contains redundant information.

  • classify.tsv:

    Move it to a temporary folder generated

  • testSet.txt:

    Move it to a temporary folder generated

  • Additional Excel file: ‘SampleName’_bacteriophage_summary.xlsx
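The conversion steps above could be sketched as follows. This is only an illustration and not the code used within BacterialTyper: the function name, the name of the temporary folder (tmp) and the example content are assumptions, and the generation of the summary Excel file is left out.

```python
import tempfile
from pathlib import Path

COORDINATES_HEADER = [
    "prophage_ID", "Contig", "Start", "End", "attL_Start", "attL_End",
    "attR_Start", "attR_End", "attL_Seq", "attR_Seq",
    "Longest_Repeat_flanking_phage",
]
RENAMED = {
    "prophage_tbl.tsv": "{sample}_PhiSpy-classification_genes.tsv",
    "prophage.gff3": "{sample}_PhiSpy-prophage.gff3",
}
TEMPORARY = ["prophage.tbl", "classify.tsv", "testSet.txt"]


def convert_phispy_results(out_dir, sample):
    """Rename PhiSpy files by sample name and return the remaining file names."""
    out_dir = Path(out_dir)
    tmp_dir = out_dir / "tmp"  # hypothetical name for the temporary folder
    tmp_dir.mkdir(exist_ok=True)
    for old_name, template in RENAMED.items():
        source = out_dir / old_name
        if source.exists():
            source.rename(out_dir / template.format(sample=sample))
    coordinates = out_dir / "prophage_coordinates.tsv"
    if coordinates.exists():
        # Write a copy with a header line, then drop the headerless original.
        target = out_dir / f"{sample}_PhiSpy-prophage-coordinates.tsv"
        target.write_text("\t".join(COORDINATES_HEADER) + "\n" + coordinates.read_text())
        coordinates.unlink()
    for name in TEMPORARY:
        source = out_dir / name
        if source.exists():
            source.rename(tmp_dir / name)
    return sorted(path.name for path in out_dir.iterdir() if path.is_file())


# Demonstration on made-up files inside a throwaway folder.
with tempfile.TemporaryDirectory() as folder:
    for name in list(RENAMED) + TEMPORARY:
        (Path(folder) / name).write_text("placeholder\n")
    (Path(folder) / "prophage_coordinates.tsv").write_text(
        "pp1\tcontig_1\t100\t900\t100\t112\t888\t900\tACGTACGTACGTA\tACGTACGTACGTA\tACGTACGTACGTA\n"
    )
    result = convert_phispy_results(folder, "S1")
    moved = sorted(path.name for path in (Path(folder) / "tmp").iterdir())
    first_line = (Path(folder) / "S1_PhiSpy-prophage-coordinates.tsv").read_text().splitlines()[0]
print(result)
```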