Interpretation for developers ¶

Here we include some guidelines and interpretation of intermediate results for developing the BacterialTyper project.

ARIBA results description ¶

ARIBA generates a lot of information that is common to all databases. The main output files are listed here as an example.

assembled_seqs.fa.gz: reference sequences identified

assemblies.fa.gz: query sequences retrieved from sample

assembled_genes.fa.gz: encoding genes from assemblies.fa

debug.report.tsv: initial report.tsv before filtering

log.clusters.gz: log details for each cluster

report.tsv: report for each sample

version_info.txt: version and additional information

File devel/info/ARIBA_explained.csv contains the description of the columns in the ARIBA result file generated.

Column	Description
ariba_ref_name	ARIBA name of the reference sequence chosen from the cluster (renamed to stop some tools breaking)
ref_name	original name of reference sequence chosen from cluster before renaming
gene	1=gene; 0=non-coding (same as metadata column 2)
var_only	1=variant only; 0=presence/absence (same as metadata column 3)
flag	cluster flag
reads	number of reads in this cluster
cluster	name of cluster
ref_len	length of reference sequence
ref_base_assembled	number of reference nucleotides assembled by this contig
pc_ident	%identity between reference sequence and contig
ctg	name of contig matching reference
ctg_len	length of contig
ctg_cov	mean mapped read depth of this contig
known_var	is this a known SNP from reference metadata? 1 or 0
var_type	The type of variant. Currently only SNP supported
var_seq_type	Variant sequence type. if known_var=1: n or p for nucleotide or protein
known_var_change	if known_var=1: the wild/variant change eg I42L
has_known_var	if known_var=1: 1 or 0 for whether or not the assembly has the variant
ref_ctg_change	amino acid or nucleotide change between reference and contig eg I42L
ref_ctg_effect	effect of change between reference and contig eg SYN; NONSYN (amino acid changes only)
ref_start	start position of variant in reference
ref_end	end position of variant in reference
ref_nt	nucleotide(s) in reference at variant position
ctg_start	start position of variant in contig
ctg_end	end position of variant in contig
ctg_nt	nucleotide(s) in contig at variant position
smtls_total_depth	total read depth at variant start position in contig reported by mpileup
smtls_nts	nucleotides on contig as reported by mpileup. The first is the contig nucleotide
smtls_nts_depth	depths on contig as reported by mpileup. One number per nucleotide in the previous column
var_description	description of variant from reference metadata
free_text	other free text about reference sequence from reference metadata
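
As an illustration of how these columns can be consumed, the sketch below reads a report.tsv-like file with Python's standard csv module and keeps the rows where a known variant is present in the assembly. The function names, the identity threshold, and the example rows are hypothetical. The code also assumes that the header line may start with a '#' character, which is stripped before the column names are used.

```python
import csv
import io


def read_ariba_report(handle):
    """Yield one dict per row of an ARIBA report.tsv-like file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the header line may be prefixed with '#'; strip it if present.
    header[0] = header[0].lstrip("#")
    for row in reader:
        yield dict(zip(header, row))


def known_variants(rows, min_ident=90.0):
    """Keep rows describing a known variant that the assembly actually has."""
    return [
        r for r in rows
        if r["known_var"] == "1"
        and r["has_known_var"] == "1"
        and float(r["pc_ident"]) >= min_ident
    ]


# Hypothetical two-row excerpt using a subset of the columns listed above.
example = (
    "#ariba_ref_name\tcluster\tpc_ident\tknown_var\thas_known_var\tknown_var_change\n"
    "geneA\tcluster_1\t99.5\t1\t1\tI42L\n"
    "geneB\tcluster_2\t97.0\t0\t.\t.\n"
)
hits = known_variants(read_ariba_report(io.StringIO(example)))
for hit in hits:
    print(hit["ariba_ref_name"], hit["known_var_change"])
```

Values are kept as strings and only converted where needed, because the example assumes that columns such as has_known_var may hold a placeholder instead of a number.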

Prokka output files description ¶

File devel/info/prokka_output_files.csv contains the description of the different output files generated by Prokka.

See additional details in: https://github.com/tseemann/prokka#output-files

Extension	Description
.gff	This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk	This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna	Nucleotide FASTA file of the input contig sequences.
.faa	Protein FASTA file of the translated CDS sequences.
.ffn	Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqn	An ASN1 format ‘Sequin’ file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa	Nucleotide FASTA file of the input contig sequences, used by tbl2asn to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl	Feature Table file, used by ‘tbl2asn’ to create the .sqn file.
.err	Unacceptable annotations: the NCBI discrepancy report.
.log	Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txt	Statistics relating to the annotated features found.
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
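
A quick way to inspect the .tsv file is to count the features per type. The following is a minimal sketch that assumes the file starts with a header line holding the column names listed above. The function name and the example rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_feature_types(handle):
    """Count rows per 'ftype' in a Prokka-style .tsv file with a header line."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["ftype"] for row in reader)


# Hypothetical excerpt; the column names follow the description above.
example = (
    "locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n"
    "TAG_00001\tCDS\t1362\tdnaA\t\t\tChromosomal replication initiator protein\n"
    "TAG_00002\tCDS\t1134\tdnaN\t\t\tBeta sliding clamp\n"
    "TAG_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n"
)
counts = count_feature_types(io.StringIO(example))
for ftype, number in sorted(counts.items()):
    print(ftype, number)
```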

Snippy output files description ¶

File devel/info/snippy_output_files.csv contains the description of the different output files generated by Snippy.

See additional details in: https://github.com/tseemann/snippy#output-files

Extension	Description
.tab	A simple tab-separated summary of all the variants
.csv	A comma-separated version of the .tab file
.html	A HTML version of the .tab file
.vcf	The final annotated variants in VCF format
.bed	The variants in BED format
.gff	The variants in GFF3 format
.bam	The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.
.bam.bai	Index for the .bam file
.log	A log file with the commands run and their outputs
.aligned.fa	A version of the reference but with - at positions with depth=0 and N for 0 < depth < --mincov (does not have variants)
.consensus.fa	A version of the reference genome with all variants instantiated
.consensus.subs.fa	A version of the reference genome with only substitution variants instantiated
.raw.vcf	The unfiltered variant calls from Freebayes
.filt.vcf	The filtered variant calls from Freebayes
.vcf.gz	Compressed .vcf file via BGZIP
.vcf.gz.csi	Index for the .vcf.gz (via bcftools index)
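
Based on the description of the .aligned.fa file above, a rough coverage summary can be obtained by counting the - and N characters per sequence. The sketch below is only an illustration: the function name and the example sequence are hypothetical, it assumes uppercase N as described in the table, and any N already present in the reference would be counted as well.

```python
import io


def coverage_summary(handle):
    """Count zero-depth ('-') and low-depth ('N') positions per FASTA record."""
    summary = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the description line as the sequence name.
            name = line[1:].split()[0]
            summary[name] = {"length": 0, "zero_depth": 0, "low_depth": 0}
        elif name is not None:
            summary[name]["length"] += len(line)
            summary[name]["zero_depth"] += line.count("-")
            summary[name]["low_depth"] += line.count("N")
    return summary


# Hypothetical record split over two sequence lines.
example = ">contig1 example\nACGT--NNAC\nGT-N\n"
stats = coverage_summary(io.StringIO(example))
print(stats["contig1"])
```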

PhiSpy ¶

PhiSpy identifies prophages in bacterial (and probably archaeal) genomes. Given an annotated genome, it uses several approaches to identify the most likely prophage regions.

PhiSpy training Sets available ¶

File devel/info/PhiSpy_training-sets.txt contains the description of the different training sets available for bacteriophage analysis using PhiSpy.

trainingSet	Tag	Set_Name	genus	species	other
0	Generic	testSet_genericAll	All available	All available	All available
1	Single_species	trainSet_32002.17	Achromobacter	denitrificans	strain_PR1
2	Single_species	trainSet_272558.23	Bacillus	halodurans	C-125
3	Single_species	trainSet_224308.360	Bacillus	subtilis	subsp._subtilis_str._168
4	Single_species	trainSet_411479.31	Bacteroides	uniformis	ATCC_8492_strain_81A2
5	Single_species	trainSet_206672.37	Bifidobacterium	longum	NCC2705
6	Single_species	trainSet_224914.79	Brucella	melitensis	16M
7	Single_species	trainSet_190650.21	Caulobacter	crescentus	CB15
8	Single_species	trainSet_195102.53	Clostridium	perfringens	str._13
9	Single_species	trainSet_212717.31	Clostridium	tetani	E88
10	Single_species	trainSet_243230.96	Deinococcus	radiodurans	R1
11	Single_species	trainSet_226185.9	Enterococcus	faecalis	V583
12	Single_species	trainSet_1351.557	Enterococcus	faecalis	strain_V583
13	Single_species	trainSet_199310.168	Escherichia	coli	CFT073
14	Single_species	trainSet_83333.998	Escherichia	coli	K12
15	Single_species	trainSet_83334.295	Escherichia	coli	O157-H7
16	Single_species	trainSet_155864.289	Escherichia	coli	O157-H7_EDL933
17	Single_species	trainSet_71421.45	Haemophilus	influenzae	Rd_KW20
18	Single_species	trainSet_272623.42	Lactococcus	lactis	subsp._lactis_Il1403
19	Single_species	trainSet_272626.22	Listeria	innocua	Clip11262
20	Single_species	trainSet_169963.176	Listeria	monocytogenes	EGD-e
21	Single_species	trainSet_266835.41	Mesorhizobium	loti	MAFF303099
22	Single_species	trainSet_83331.121	Mycobacterium	tuberculosis	CDC1551
23	Single_species	trainSet_83332.460	Mycobacterium	tuberculosis	H37Rv
24	Single_species	trainSet_122586.26	Neisseria	meningitidis	MC58
25	Single_species	trainSet_122587.18	Neisseria	meningitidis	Z2491
26	Single_species	trainSet_272843.53	Pasteurella	multocida	subsp._multocida_str._Pm70
27	Single_species	trainSet_208964.452	Pseudomonas	aeruginosa	PAO1
28	Single_species	trainSet_160488.79	Pseudomonas	putida	KT2440
29	Single_species	trainSet_267608.42	Ralstonia	solanacearum	GMI1000
30	Single_species	trainSet_220341.87	Salmonella	enterica	subsp._enterica_serovar_Typhi_str._CT18
31	Single_species	trainSet_211586.69	Shewanella	oneidensis	MR-1
32	Single_species	trainSet_198214	Shigella	flexneri	2a_str._301
33	Single_species	trainSet_1280.10152	Staphylococcus	aureus	strain_Sa_Newman_UoM
34	Single_species	trainSet_196620.15	Staphylococcus	aureus	subsp._aureus_MW2
35	Single_species	trainSet_158878.38	Staphylococcus	aureus	subsp._aureus_Mu50
36	Single_species	trainSet_160490.61	Streptococcus	pyogenes	M1_GAS
37	Single_species	trainSet_198466.10	Streptococcus	pyogenes	MGAS315
38	Single_species	trainSet_186103.26	Streptococcus	pyogenes	MGAS8232
39	Single_species	trainSet_243277.252	Vibrio	cholerae	O1_biovar_eltor_str._N16961
40	Single_species	trainSet_190486.46	Xanthomonas	axonopodis	pv._citri_str._306
41	Single_species	trainSet_160492.65	Xylella	fastidiosa	9a5c
42	Single_species	trainSet_183190.38	Xylella	fastidiosa	Temecula1
43	Single_species	trainSet_214092.200	Yersinia	pestis	CO92
44	Single_species	trainSet_187410.24	Yersinia	pestis	KIM
53	Single_species	trainSet_1367847.3	Paracoccus	aminophilus	JCM_7686
54	Single_species	trainSet_318586.5	Paracoccus	denitrificans	PD1222
55	Single_species	trainSet_1525717.3	Paracoccus	sanguinis	5503
56	Single_species	trainSet_1660154.3	Paracoccus	sp.	SCN_68-21
57	Single_species	trainSet_147645.106	Paracoccus	yeei	TT13
45	Multi_species	trainSet_Bacillus	Bacillus	halodurans/subtilis	subsp._subtilis_str._168/C-125
46	Multi_species	trainSet_Clostridium	Clostridium	tetani/perfringens	str._13/E88
47	Multi_species	trainSet_Ecoli	Escherichia	coli	K12/O157-H7/O157-H7_EDL933/CFT073
48	Multi_species	trainSet_Efec	Enterococcus	faecalis	V583/strain_V583
49	Multi_species	trainSet_Listeria	Listeria	monocytogenes/innocua	EGD-e/Clip11262
50	Multi_species	trainSet_Mtb	Mycobacterium	tuberculosis	CDC1551/H37Rv
51	Multi_species	trainSet_Nmeningitidis	Neisseria	meningitidis	MC58/Z2491
52	Multi_species	trainSet_Paracoccus	Paracoccus	sp./yeei/sanguinis/denitrificans/aminophilus	PD1222ff/SCN_68-21ff/5503ff/JCM_7686ff/TT13ff
58	Multi_species	trainSet_Pseudomonas	Pseudomonas	putida/aeruginosa	PAO1/KT2440
59	Multi_species	trainSet_Saureus	Staphylococcus	aureus	subsp._aureus_Mu50/strain_Sa_Newman_UoMf/subsp._aureus_MW2
60	Multi_species	trainSet_Spyogenes	Streptococcus	pyogenes	M1_GAS/MGAS8232/MGAS315
61	Multi_species	trainSet_Xfastidiosa	Xylella	fastidiosa	9a5c/Temecula1
62	Multi_species	trainSet_Ypestis	Yersinia	pestis	CO92/KIM
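
The table above can be queried to find the training sets that match a given organism. The sketch below assumes a tab-separated file with the header shown above. The function name is hypothetical and the example rows are copied from the table.

```python
import csv
import io


def find_training_sets(handle, genus, species=None):
    """Return Set_Name values whose genus (and optionally species) match."""
    reader = csv.DictReader(handle, delimiter="\t")
    matches = []
    for row in reader:
        if row["genus"] != genus:
            continue
        # Multi-species rows join species names with '/', so compare per item.
        if species is None or species in row["species"].split("/"):
            matches.append(row["Set_Name"])
    return matches


# Three rows copied from the table above.
example = (
    "trainingSet\tTag\tSet_Name\tgenus\tspecies\tother\n"
    "19\tSingle_species\ttrainSet_272626.22\tListeria\tinnocua\tClip11262\n"
    "20\tSingle_species\ttrainSet_169963.176\tListeria\tmonocytogenes\tEGD-e\n"
    "49\tMulti_species\ttrainSet_Listeria\tListeria\tmonocytogenes/innocua\tEGD-e/Clip11262\n"
)
innocua_sets = find_training_sets(io.StringIO(example), "Listeria", "innocua")
print(innocua_sets)
```

When no set matches, a caller could fall back to the generic set listed in the first row of the table. That choice is left out of the sketch.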

PhiSpy results ¶

Results generated by PhiSpy are text files containing the annotation and information regarding the identified inserted bacteriophages.

The original output has some limitations, so we implemented some improvements to make the results clearer and easier to interpret.

Original results ¶

See original details in: https://github.com/linsalrob/PhiSpy#output-files

There are several files generated:

prophage.tbl: This file has two tab-separated columns [id, location]. The id is in the format pp_number, where number is a sequential number of the prophage (starting at 1). The location is in the format contig_start_stop and encompasses the prophage.
prophage_tbl.tsv: This is a tab-separated file. The file contains all the genes of the genome. The tenth column represents the status of a gene. If this column is 1 then the gene is a phage-like gene; otherwise it is a bacterial gene. This file has 16 columns:
1. fig_no: the id of each gene;
2. function: function of the gene;
3. contig;
4. start: start location of the gene;
5. stop: end location of the gene;
6. position: a sequential number of the gene (starting at 1);
7. rank: rank of each gene provided by random forest;
8. my_status: status of each gene based on random forest;
9. pp: classification of each gene based on their function;
10. Final_status: the status of each gene. For prophages, this column has the number of the prophage as listed in prophage.tbl above. If the column contains a 0, we believe that it is a bacterial gene;
11. start of attL;
12. end of attL;
13. start of attR;
14. end of attR;
15. sequence of attL;
16. sequence of attR.
prophage_coordinates.tsv: This file has the prophage ID, contig, start, stop, and potential att sites identified for the phages.
prophage.gff3: General Feature Format file (v3) containing the annotation of the phages identified. This output was contributed to the original PhiSpy code by a BacterialTyper developer (Jose F. Sanchez-Herrero):
- https://github.com/linsalrob/PhiSpy/PhiSpyModules/writers.py
- https://github.com/linsalrob/PhiSpy/pull/10
testSet.txt: Results of the Shannon score generated during the makeTest module of PhiSpy, required by the random forest classifier that follows.
classify.tsv: Results of the random forest classifier called within the classification module of PhiSpy.
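
The location format of prophage.tbl described above can be split back into its parts. The sketch below is a minimal illustration with hypothetical function names and example values. It splits from the right because contig names may themselves contain underscores.

```python
import io


def parse_prophage_location(location):
    """Split 'contig_start_stop' into (contig, start, stop)."""
    # Split from the right so underscores inside the contig name are kept.
    contig, start, stop = location.rsplit("_", 2)
    return contig, int(start), int(stop)


def read_prophage_tbl(handle):
    """Map each prophage id to its (contig, start, stop) tuple."""
    prophages = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        pp_id, location = line.split("\t")
        prophages[pp_id] = parse_prophage_location(location)
    return prophages


# Hypothetical two-line example following the [id, location] layout.
example = "pp_1\tcontig_2_12000_45000\npp_2\tcontig_7_300_9800\n"
prophages = read_prophage_tbl(io.StringIO(example))
for pp_id, (contig, start, stop) in prophages.items():
    print(pp_id, contig, stop - start + 1)
```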

Modified results ¶

All original files generated are named prophage or classify, independently of the sample name. Also, some files are not necessary for a regular user to interpret the results and obtain the number of prophage regions and their details.

Within BacterialTyper, we rename the original PhiSpy result files according to the sample names provided. As some tab files do not contain headers, we generate tab files with headers and a summary Excel file for better interpretation and integration of results.

File conversion:

prophage_tbl.tsv:
- Rename it to ‘SampleName’_PhiSpy-classification_genes.tsv
- Include it in a summary Excel file.

prophage.gff3:
- Rename it to ‘SampleName’_PhiSpy-prophage.gff3

prophage_coordinates.tsv:
- Rename it to ‘SampleName’_PhiSpy-prophage-coordinates.tsv
- Add a header containing the following columns: prophage_ID, Contig, Start, End, attL_Start, attL_End, attR_Start, attR_End, attL_Seq, attR_Seq, Longest_Repeat_flanking_phage
- Include it in a summary Excel file.

prophage.tbl:
- Move it to a temporary folder generated, as its information is redundant.

classify.tsv:
- Move it to a temporary folder generated.

testSet.txt:
- Move it to a temporary folder generated.

Additional Excel file: ‘SampleName’_bacteriophage_summary.xlsx
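
The file conversion above can be summarised as a mapping from original to final file names. The following sketch is not the BacterialTyper implementation. The function names, folder names, and sample name are hypothetical, and the code only computes destination paths and header lines without touching the file system.

```python
from pathlib import PurePosixPath

# Original PhiSpy file name -> renamed file, as listed above.
RENAMED = {
    "prophage_tbl.tsv": "{sample}_PhiSpy-classification_genes.tsv",
    "prophage.gff3": "{sample}_PhiSpy-prophage.gff3",
    "prophage_coordinates.tsv": "{sample}_PhiSpy-prophage-coordinates.tsv",
}
# Files moved to a temporary folder without renaming.
TO_TEMP = ("prophage.tbl", "classify.tsv", "testSet.txt")

COORDINATES_HEADER = [
    "prophage_ID", "Contig", "Start", "End",
    "attL_Start", "attL_End", "attR_Start", "attR_End",
    "attL_Seq", "attR_Seq", "Longest_Repeat_flanking_phage",
]


def plan_conversion(sample, out_dir, tmp_dir):
    """Map each original PhiSpy file name to its destination path."""
    plan = {}
    for original, template in RENAMED.items():
        new_name = template.format(sample=sample)
        plan[original] = str(PurePosixPath(out_dir) / new_name)
    for original in TO_TEMP:
        plan[original] = str(PurePosixPath(tmp_dir) / original)
    return plan


def add_header(rows, header=COORDINATES_HEADER):
    """Prepend a tab-separated header line to headerless rows."""
    return ["\t".join(header)] + list(rows)


# Hypothetical sample and folder names.
plan = plan_conversion("sampleA", "results", "results/tmp")
for original, destination in plan.items():
    print(original, "->", destination)
```

Keeping the plan separate from the actual move makes it easy to check the resulting names before any file is renamed.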
