Interpretation for developers
Here we include guidelines for developers of the BacterialTyper project and notes on how to interpret the intermediate results it produces.
ARIBA results description
ARIBA generates a set of output files that is common to all databases. The files are listed here as an example.
- assembled_seqs.fa.gz: reference sequences identified
- assemblies.fa.gz: query sequences retrieved from sample
- assembled_genes.fa.gz: encoding genes from assemblies.fa
- debug.report.tsv: initial report.tsv before filtering
- log.clusters.gz: log details for each cluster
- report.tsv: report for each sample
- version_info.txt: version and additional information
The file devel/info/ARIBA_explained.csv contains the description of the columns in the result file generated by ARIBA.
Column | Description |
---|---|
ariba_ref_name | ARIBA name of the reference sequence chosen from the cluster (it needs to be renamed to stop some tools breaking) |
ref_name | original name of the reference sequence chosen from the cluster, before renaming |
gene | 1=gene; 0=non-coding (same as metadata column 2) |
var_only | 1=variant only; 0=presence/absence (same as metadata column 3) |
flag | cluster flag |
reads | number of reads in this cluster |
cluster | name of cluster |
ref_len | length of reference sequence |
ref_base_assembled | number of reference nucleotides assembled by this contig |
pc_ident | %identity between reference sequence and contig |
ctg | name of contig matching reference |
ctg_len | length of contig |
ctg_cov | mean mapped read depth of this contig |
known_var | is this a known SNP from reference metadata? 1 or 0 |
var_type | the type of variant. Currently only SNP is supported |
var_seq_type | variant sequence type. If known_var=1: n or p for nucleotide or protein |
known_var_change | if known_var=1: the wild/variant change, e.g. I42L |
has_known_var | if known_var=1: 1 or 0 for whether or not the assembly has the variant |
ref_ctg_change | amino acid or nucleotide change between reference and contig, e.g. I42L |
ref_ctg_effect | effect of change between reference and contig, e.g. SYN; NONSYN (amino acid changes only) |
ref_start | start position of variant in reference |
ref_end | end position of variant in reference |
ref_nt | nucleotide(s) in reference at variant position |
ctg_start | start position of variant in contig |
ctg_end | end position of variant in contig |
ctg_nt | nucleotide(s) in contig at variant position |
smtls_total_depth | total read depth at variant start position in contig, as reported by mpileup |
smtls_nts | nucleotides on contig, as reported by mpileup. The first is the contig nucleotide |
smtls_nts_depth | depths on contig, as reported by mpileup. One number per nucleotide in the previous column |
var_description | description of variant from reference metadata |
free_text | other free text about reference sequence from reference metadata |
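As an illustration of how the columns described above can be used, the following minimal sketch reads a report.tsv-style file and keeps the rows in which a known variant was found in the assembly (known_var=1 and has_known_var=1). The function names are hypothetical and are not part of BacterialTyper or ARIBA; the sketch also assumes that the header line may start with a `#`, which it strips, and the example rows are made up and use only a subset of the columns.

```python
import csv
import io


def read_ariba_report(handle):
    """Yield one dict per row of a tab-separated ARIBA-style report.

    Assumption: the first line is the header; a leading '#' on a
    column name (if present) is removed.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    for row in reader:
        yield dict(zip(header, row))


def known_variants_present(rows):
    """Keep rows describing a known variant that the assembly carries."""
    return [
        row for row in rows
        if row.get("known_var") == "1" and row.get("has_known_var") == "1"
    ]


# Made-up example with a subset of the columns, for illustration only.
example = (
    "#ariba_ref_name\tref_name\tknown_var\thas_known_var\tknown_var_change\n"
    "ref1\torig1\t1\t1\tI42L\n"
    "ref2\torig2\t1\t0\tA10T\n"
    "ref3\torig3\t0\t.\t.\n"
)
hits = known_variants_present(read_ariba_report(io.StringIO(example)))
print([(hit["ref_name"], hit["known_var_change"]) for hit in hits])
```

For a real file, the same functions can be applied to an open file handle instead of the in-memory example.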
Prokka output files description
The file devel/info/prokka_output_files.csv contains the description of the different output files generated by Prokka.
See additional details in: https://github.com/tseemann/prokka#output-files
Extension | Description |
---|---|
.gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
.gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
.fna | Nucleotide FASTA file of the input contig sequences. |
.faa | Protein FASTA file of the translated CDS sequences. |
.ffn | Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) |
.sqn | An ASN1 format ‘Sequin’ file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
.fsa | Nucleotide FASTA file of the input contig sequences, used by tbl2asn to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
.tbl | Feature Table file, used by ‘tbl2asn’ to create the .sqn file. |
.err | Unacceptable annotations: the NCBI discrepancy report. |
.log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
.txt | Statistics relating to the annotated features found. |
.tsv | Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product |
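The .tsv file is the easiest of these outputs to process programmatically. The sketch below counts features by type using the ftype column; it assumes that the file starts with a header line whose column names match those listed in the table above (adjust the names if your Prokka version differs). The function name and the example rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_feature_types(handle):
    """Count annotated features per feature type (column 'ftype')."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["ftype"] for row in reader)


# Made-up rows using the column names from the table above.
example = (
    "locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n"
    "X_00001\tCDS\t900\t\t\t\thypothetical protein\n"
    "X_00002\tCDS\t1200\t\t\t\thypothetical protein\n"
    "X_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n"
)
counts = count_feature_types(io.StringIO(example))
print(counts["CDS"], counts["tRNA"])
```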
Snippy output files description
The file devel/info/snippy_output_files.csv contains the description of the different output files generated by Snippy.
See additional details in: https://github.com/tseemann/snippy#output-files
Extension | Description |
---|---|
.tab | A simple tab-separated summary of all the variants |
.csv | A comma-separated version of the .tab file |
.html | A HTML version of the .tab file |
.vcf | The final annotated variants in VCF format |
.bed | The variants in BED format |
.gff | The variants in GFF3 format |
.bam | The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates. |
.bam.bai | Index for the .bam file |
.log | A log file with the commands run and their outputs |
.aligned.fa | A version of the reference but with - at position with depth=0 and N for 0 < depth < --mincov (does not have variants) |
.consensus.fa | A version of the reference genome with all variants instantiated |
.consensus.subs.fa | A version of the reference genome with only substitution variants instantiated |
.raw.vcf | The unfiltered variant calls from Freebayes |
.filt.vcf | The filtered variant calls from Freebayes |
.vcf.gz | Compressed .vcf file via BGZIP |
.vcf.gz.csi | Index for the .vcf.gz (via bcftools index) |
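The encoding used in the .aligned.fa file (- for depth=0 and N for 0 < depth < --mincov) makes it possible to estimate how much of the reference was poorly covered. The following sketch simply counts both characters in a FASTA file; the function name is hypothetical, and the count of N is only an approximation of low-depth positions, since it assumes the reference itself contains no N characters.

```python
def summarize_aligned(lines):
    """Count zero-depth ('-') and low-depth ('N') positions in FASTA lines."""
    counts = {"zero_depth": 0, "low_depth": 0, "total": 0}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        counts["zero_depth"] += line.count("-")
        counts["low_depth"] += line.count("N")
        counts["total"] += len(line)
    return counts


# Made-up sequence, for illustration only.
summary = summarize_aligned([">ref", "ACGT--NNAC", "GGNN"])
print(summary)
```

With a real file, pass the open file handle, for example `summarize_aligned(open("snps.aligned.fa"))`, where the file name is only an example.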
PhiSpy
PhiSpy identifies prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.
PhiSpy training sets available
The file devel/info/PhiSpy_training-sets.txt contains the description of the different training sets available for bacteriophage analysis using PhiSpy.
trainingSet | Tag | Set_Name | genus | species | other |
---|---|---|---|---|---|
0 | Generic | testSet_genericAll | All available | All available | All available |
1 | Single_species | trainSet_32002.17 | Achromobacter | denitrificans | strain_PR1 |
2 | Single_species | trainSet_272558.23 | Bacillus | halodurans | C-125 |
3 | Single_species | trainSet_224308.360 | Bacillus | subtilis | subsp._subtilis_str._168 |
4 | Single_species | trainSet_411479.31 | Bacteroides | uniformis | ATCC_8492_strain_81A2 |
5 | Single_species | trainSet_206672.37 | Bifidobacterium | longum | NCC2705 |
6 | Single_species | trainSet_224914.79 | Brucella | melitensis | 16M |
7 | Single_species | trainSet_190650.21 | Caulobacter | crescentus | CB15 |
8 | Single_species | trainSet_195102.53 | Clostridium | perfringens | str._13 |
9 | Single_species | trainSet_212717.31 | Clostridium | tetani | E88 |
10 | Single_species | trainSet_243230.96 | Deinococcus | radiodurans | R1 |
11 | Single_species | trainSet_226185.9 | Enterococcus | faecalis | V583 |
12 | Single_species | trainSet_1351.557 | Enterococcus | faecalis | strain_V583 |
13 | Single_species | trainSet_199310.168 | Escherichia | coli | CFT073 |
14 | Single_species | trainSet_83333.998 | Escherichia | coli | K12 |
15 | Single_species | trainSet_83334.295 | Escherichia | coli | O157-H7 |
16 | Single_species | trainSet_155864.289 | Escherichia | coli | O157-H7_EDL933 |
17 | Single_species | trainSet_71421.45 | Haemophilus | influenzae | Rd_KW20 |
18 | Single_species | trainSet_272623.42 | Lactococcus | lactis | subsp._lactis_Il1403 |
19 | Single_species | trainSet_272626.22 | Listeria | innocua | Clip11262 |
20 | Single_species | trainSet_169963.176 | Listeria | monocytogenes | EGD-e |
21 | Single_species | trainSet_266835.41 | Mesorhizobium | loti | MAFF303099 |
22 | Single_species | trainSet_83331.121 | Mycobacterium | tuberculosis | CDC1551 |
23 | Single_species | trainSet_83332.460 | Mycobacterium | tuberculosis | H37Rv |
24 | Single_species | trainSet_122586.26 | Neisseria | meningitidis | MC58 |
25 | Single_species | trainSet_122587.18 | Neisseria | meningitidis | Z2491 |
26 | Single_species | trainSet_272843.53 | Pasteurella | multocida | subsp._multocida_str._Pm70 |
27 | Single_species | trainSet_208964.452 | Pseudomonas | aeruginosa | PAO1 |
28 | Single_species | trainSet_160488.79 | Pseudomonas | putida | KT2440 |
29 | Single_species | trainSet_267608.42 | Ralstonia | solanacearum | GMI1000 |
30 | Single_species | trainSet_220341.87 | Salmonella | enterica | subsp._enterica_serovar_Typhi_str._CT18 |
31 | Single_species | trainSet_211586.69 | Shewanella | oneidensis | MR-1 |
32 | Single_species | trainSet_198214 | Shigella | flexneri | 2a_str._301 |
33 | Single_species | trainSet_1280.10152 | Staphylococcus | aureus | strain_Sa_Newman_UoM |
34 | Single_species | trainSet_196620.15 | Staphylococcus | aureus | subsp._aureus_MW2 |
35 | Single_species | trainSet_158878.38 | Staphylococcus | aureus | subsp._aureus_Mu50 |
36 | Single_species | trainSet_160490.61 | Streptococcus | pyogenes | M1_GAS |
37 | Single_species | trainSet_198466.10 | Streptococcus | pyogenes | MGAS315 |
38 | Single_species | trainSet_186103.26 | Streptococcus | pyogenes | MGAS8232 |
39 | Single_species | trainSet_243277.252 | Vibrio | cholerae | O1_biovar_eltor_str._N16961 |
40 | Single_species | trainSet_190486.46 | Xanthomonas | axonopodis | pv._citri_str._306 |
41 | Single_species | trainSet_160492.65 | Xylella | fastidiosa | 9a5c |
42 | Single_species | trainSet_183190.38 | Xylella | fastidiosa | Temecula1 |
43 | Single_species | trainSet_214092.200 | Yersinia | pestis | CO92 |
44 | Single_species | trainSet_187410.24 | Yersinia | pestis | KIM |
53 | Single_species | trainSet_1367847.3 | Paracoccus | aminophilus | JCM_7686 |
54 | Single_species | trainSet_318586.5 | Paracoccus | denitrificans | PD1222 |
55 | Single_species | trainSet_1525717.3 | Paracoccus | sanguinis | 5503 |
56 | Single_species | trainSet_1660154.3 | Paracoccus | sp. | SCN_68-21 |
57 | Single_species | trainSet_147645.106 | Paracoccus | yeei | TT13 |
45 | Multi_species | trainSet_Bacillus | Bacillus | halodurans/subtilis | subsp._subtilis_str._168/C-125 |
46 | Multi_species | trainSet_Clostridium | Clostridium | tetani/perfringens | str._13/E88 |
47 | Multi_species | trainSet_Ecoli | Escherichia | coli | K12/O157-H7/O157-H7_EDL933/CFT073 |
48 | Multi_species | trainSet_Efec | Enterococcus | faecalis | V583/strain_V583 |
49 | Multi_species | trainSet_Listeria | Listeria | monocytogenes/innocua | EGD-e/Clip11262 |
50 | Multi_species | trainSet_Mtb | Mycobacterium | tuberculosis | CDC1551/H37Rv |
51 | Multi_species | trainSet_Nmeningitidis | Neisseria | meningitidis | MC58/Z2491 |
52 | Multi_species | trainSet_Paracoccus | Paracoccus | sp./yeei/sanguinis/denitrificans/aminophilus | PD1222ff/SCN_68-21ff/5503ff/JCM_7686ff/TT13ff |
58 | Multi_species | trainSet_Pseudomonas | Pseudomonas | putida/aeruginosa | PAO1/KT2440 |
59 | Multi_species | trainSet_Saureus | Staphylococcus | aureus | subsp._aureus_Mu50/strain_Sa_Newman_UoMf/subsp._aureus_MW2 |
60 | Multi_species | trainSet_Spyogenes | Streptococcus | pyogenes | M1_GAS/MGAS8232/MGAS315 |
61 | Multi_species | trainSet_Xfastidiosa | Xylella | fastidiosa | 9a5c/Temecula1 |
62 | Multi_species | trainSet_Ypestis | Yersinia | pestis | CO92/KIM |
PhiSpy results
Results generated by PhiSpy are text files containing the annotation and information regarding the identified inserted bacteriophages.
The original output has some limitations, so we implemented several changes to make the results clearer and easier to interpret.
Original results
See original details in: https://github.com/linsalrob/PhiSpy#output-files
There are several files generated:
- prophage.tbl: This file has two tab-separated columns [id, location]. The id is in the format pp_number, where number is a sequential number of the prophage (starting at 1). The location is in the format contig_start_stop and encompasses the prophage.
- prophage_tbl.tsv: This is a tab-separated file that contains all the genes of the genome. The tenth column represents the status of a gene: a non-zero value marks a phage-like gene; otherwise it is a bacterial gene. This file has 16 columns:
  - fig_no: the id of each gene;
  - function: function of the gene;
  - contig;
  - start: start location of the gene;
  - stop: end location of the gene;
  - position: a sequential number of the gene (starting at 1);
  - rank: rank of each gene provided by random forest;
  - my_status: status of each gene based on random forest;
  - pp: classification of each gene based on their function;
  - Final_status: the status of each gene. For prophages, this column has the number of the prophage as listed in prophage.tbl above; if the column contains a 0, we believe that it is a bacterial gene;
  - start of attL;
  - end of attL;
  - start of attR;
  - end of attR;
  - sequence of attL;
  - sequence of attR.
- prophage_coordinates.tsv: This file has the prophage ID, contig, start, stop, and potential att sites identified for the phages.
- prophage.gff3: General Feature Format (GFF3) file containing the annotation of the phages identified. This file is a contribution that the BacterialTyper developer (Jose F. Sanchez-Herrero) made to the original PhiSpy code.
- testSet.txt: Results of the Shannon score generated during the makeTest module of PhiSpy, which are required by the subsequent random forest classifier.
- classify.tsv: Results of the random forest classifier call within the classification module of PhiSpy.
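The formats described above for prophage.tbl and prophage_tbl.tsv can be read with a few lines of Python. The sketch below is only an illustration with hypothetical function names, not the code used by PhiSpy or BacterialTyper. It assumes that contig names may themselves contain underscores, so the contig_start_stop location is split from the right, and it skips any line of prophage_tbl.tsv whose tenth column is not an integer (for example, a header line). The example rows are made up.

```python
from collections import Counter


def parse_prophage_tbl(lines):
    """Map each prophage id to (contig, start, stop) from prophage.tbl lines."""
    regions = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        pp_id, location = line.split("\t")
        # Split from the right: the contig name may contain underscores.
        contig, start, stop = location.rsplit("_", 2)
        regions[pp_id] = (contig, int(start), int(stop))
    return regions


def genes_per_prophage(lines):
    """Count genes per prophage using the tenth column (Final_status)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or not fields[9].isdigit():
            continue  # header line or malformed row
        status = int(fields[9])
        if status > 0:  # 0 means the gene is considered bacterial
            counts[status] += 1
    return counts


# Made-up rows, for illustration only.
regions = parse_prophage_tbl(["pp_1\tcontig_12_1500_42000\n"])


def gene_row(final_status):
    # 9 leading columns, Final_status, then 6 att-related columns.
    return "\t".join(["x"] * 9 + [str(final_status)] + ["0"] * 6)


gene_counts = genes_per_prophage([gene_row(1), gene_row(1), gene_row(0)])
print(regions, dict(gene_counts))
```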
Modified results
All original files are named prophage or classify, independently of the sample name. Also, some files are not necessary for a regular user to interpret the results and obtain the number of prophage regions and their details.
Within BacterialTyper, we rename the original PhiSpy result files according to the sample names provided. Because some tab-separated files do not contain headers, we also generate tab-separated files with headers and a summary Excel file for better interpretation and integration of results.
File conversion:
- prophage_tbl.tsv:
  - Rename it to ‘SampleName’_PhiSpy-classification_genes.tsv
  - Include it in a summary Excel file.
- prophage.gff3:
  - Rename it to ‘SampleName’_PhiSpy-prophage.gff3
- prophage_coordinates.tsv:
  - Rename it to ‘SampleName’_PhiSpy-prophage-coordinates.tsv
  - Add a header containing the following columns:
    - prophage_ID
    - Contig
    - Start
    - End
    - attL_Start
    - attL_End
    - attR_Start
    - attR_End
    - attL_Seq
    - attR_Seq
    - Longest_Repeat_flanking_phage
  - Include it in a summary Excel file.
- prophage.tbl:
  - Move it to a temporary folder generated. It contains redundant information.
- classify.tsv:
  - Move it to a temporary folder generated.
- testSet.txt:
  - Move it to a temporary folder generated.

Additional Excel file: ‘SampleName’_bacteriophage_summary.xlsx
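The conversion of prophage_coordinates.tsv described above (new name plus header) could be sketched as follows. This is a minimal illustration and not the actual BacterialTyper implementation: the function name is hypothetical, the sketch writes a new file and leaves the original in place, and the example row is made up.

```python
import tempfile
from pathlib import Path

# Header columns as listed above.
COORDINATE_COLUMNS = [
    "prophage_ID", "Contig", "Start", "End",
    "attL_Start", "attL_End", "attR_Start", "attR_End",
    "attL_Seq", "attR_Seq", "Longest_Repeat_flanking_phage",
]


def write_with_header(src, sample_name, out_dir):
    """Copy a headerless coordinates file to a sample-named file with a header."""
    src = Path(src)
    dest = Path(out_dir) / f"{sample_name}_PhiSpy-prophage-coordinates.tsv"
    header = "\t".join(COORDINATE_COLUMNS) + "\n"
    dest.write_text(header + src.read_text())
    return dest


# Demonstration in a temporary folder with one made-up row (11 fields).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "prophage_coordinates.tsv"
    original.write_text("pp1\tcontig_1\t100\t2000\t" + "\t".join(["."] * 7) + "\n")
    renamed = write_with_header(original, "sampleA", tmp)
    output_name = renamed.name
    output_lines = renamed.read_text().splitlines()

print(output_name, len(output_lines))
```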