SRA Taxonomy Analysis Tool
Overview
The NCBI SRA Taxonomy Analysis Tool (STAT) calculates the taxonomic distribution of reads from next-generation sequencing runs. It maps individual sequencing reads to a taxonomic hierarchy and reports the taxonomic composition of the reads within each sequencing run.
Method
STAT maps sequencing reads to a taxonomic hierarchy using a two-step strategy based on exact matches between query reads and precomputed k-mer dictionary databases. In the first pass, a small, "coarse" reference dictionary database is used to identify the organisms matching a read set. In the second pass, organism-specific slices from a "fine" reference dictionary database are used to compute the distribution of reads among the identified taxonomy classes (species and higher-order taxonomy nodes). When multiple taxonomy nodes are mapped for a single spot, STAT uses the lowest non-ambiguous mapping.
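The two-pass strategy described above can be illustrated with a minimal Python sketch. It assumes toy in-memory dictionaries, a k of 4 for readability (STAT uses 32-mers), made-up taxonomy IDs, and hypothetical function names (`kmers`, `two_pass`). It is not STAT's implementation and does not model how ambiguous spots are resolved.

```python
from collections import Counter

K = 4  # toy k-mer length for readability; STAT dictionaries use 32-mers


def kmers(read, k=K):
    """Yield every k-mer of a read (exact substrings, no mismatches)."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def two_pass(reads, coarse, fine):
    """Hypothetical two-pass lookup.

    coarse: dict mapping k-mer -> tax ID (small dictionary)
    fine:   dict mapping tax ID -> {k-mer -> tax ID} (per-organism slices)
    Returns a Counter of tax ID -> number of reads with at least one hit.
    """
    # Pass 1: find which organisms are present in the read set at all.
    present = set()
    for read in reads:
        for km in kmers(read):
            if km in coarse:
                present.add(coarse[km])

    # Pass 2: load only the fine slices of organisms found in pass 1.
    active = {}
    for tax_id in present:
        active.update(fine.get(tax_id, {}))

    counts = Counter()
    for read in reads:
        hits = {active[km] for km in kmers(read) if km in active}
        for tax_id in hits:
            counts[tax_id] += 1
    return counts


# Toy data: all tax IDs below are invented for illustration.
reads = ["ACGTAC", "TTTTGG"]
coarse = {"ACGT": 10}
fine = {10: {"CGTA": 11, "GTAC": 12}, 20: {"TTGG": 21}}
result = two_pass(reads, coarse, fine)
```

In this toy run only organism 10 is detected in the first pass, so the fine slice for organism 20 is never consulted, which is the point of using a small coarse dictionary first.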
STAT k-mer dictionaries are built using an iterative MinHash-based approach against reference genomic databases. For every fixed-length segment of incoming reference nucleotide sequence, the k-mer representing that segment is the one with the minimum FNV-1 hash value. Several strategies were used to enhance the specificity and accuracy of STAT results. Low-complexity k-mers composed of >50% homopolymer or dinucleotide repeats (e.g. AAAAAA or ACACACACACA) were filtered from the dictionaries, and discrete k-mers belonging to multiple taxonomic references were "merged" at the lowest common taxonomic node shared between the references. Finally, the specificity of the representative k-mers was determined by searching them against the source reference genomic database. When representative k-mers were found in multiple taxonomic reference nodes, they were merged at the lowest common taxonomic node as above.
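The merging step can be sketched as a lowest-common-node computation over a parent map. The following is a minimal Python illustration under stated assumptions: the tree, the taxonomy IDs, the k-mer labels and the function names (`lineage`, `lowest_common_node`, `merge_kmers`) are all hypothetical, and the real dictionaries are far larger and built differently.

```python
def lineage(tax_id, parent):
    """Return the path from tax_id up to the root (the root's parent is None)."""
    path = []
    while tax_id is not None:
        path.append(tax_id)
        tax_id = parent[tax_id]
    return path


def lowest_common_node(tax_ids, parent):
    """Lowest node that is an ancestor of, or equal to, every given tax ID."""
    tax_ids = list(tax_ids)
    common = set(lineage(tax_ids[0], parent))
    for t in tax_ids[1:]:
        common &= set(lineage(t, parent))
    # Walking up from any input, the first node shared by all lineages is the lowest.
    for node in lineage(tax_ids[0], parent):
        if node in common:
            return node


def merge_kmers(kmer_to_taxa, parent):
    """Assign each k-mer to the lowest node shared by all references containing it."""
    return {km: lowest_common_node(taxa, parent) for km, taxa in kmer_to_taxa.items()}


# Toy hierarchy (invented IDs): 1 = root, 2 and 5 = genera, 3, 4, 6 = species.
parent = {1: None, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5}
observed = {
    "kmer_a": {3},     # unique to one species: stays at the species
    "kmer_b": {3, 4},  # shared by two species of genus 2: merged at the genus
    "kmer_c": {3, 6},  # shared across genera: merged at the root
}
merged = merge_kmers(observed, parent)
```

A k-mer that is unique to one reference keeps its species-level assignment, while a k-mer shared across distant references moves up the hierarchy and becomes less specific.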
Genome references
The NCBI RefSeq genomic database was supplemented with the viral genome set from nt and used as the source for k-mer creation in both "coarse" and "fine" sets.
Taxonomy hierarchy
Reference sequences were mapped to the taxonomy hierarchy using the NCBI Taxonomy database. The database contained 2,383,364 taxonomy nodes in March 2020.
Segment sizes and K-mer selection
K-mer dictionaries were built by computationally slicing reference genomes into sequential segments and selecting 32-mers to represent each segment. The "coarse" k-mer dictionary uses variable segment lengths proportional to genome size, ranging from 200 to 8000 nt. The "fine" k-mer dictionary uses a constant 64 nt segment length for all genomes; with a 32-mer index this yields a 32x reduction in space, under the assumption that every spot contains at least one error-free 64-mer.
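Segment slicing and minimum-hash selection can be sketched in Python as follows. This is an illustration only, with several assumptions: it uses the 64-bit variant of FNV-1 over the ASCII bytes of each k-mer, considers only k-mers lying entirely inside a segment, and ignores reverse complements and low-complexity filtering. The function names (`fnv1_64`, `representative_kmers`) are hypothetical, and STAT's actual encoding and hash width may differ.

```python
FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1_64(data: bytes) -> int:
    """64-bit FNV-1 hash: multiply by the prime, then XOR in each byte."""
    h = FNV_OFFSET
    for byte in data:
        h = (h * FNV_PRIME) & MASK64
        h ^= byte
    return h


def representative_kmers(seq, segment_len=64, k=32):
    """Pick one k-mer per sequential segment: the one with the smallest hash."""
    selected = []
    for start in range(0, len(seq), segment_len):
        segment = seq[start:start + segment_len]
        candidates = [segment[i:i + k] for i in range(len(segment) - k + 1)]
        if candidates:  # a trailing segment shorter than k yields no k-mer
            selected.append(min(candidates, key=lambda km: fnv1_64(km.encode())))
    return selected


# Toy call with small parameters so the segments are easy to inspect.
reps = representative_kmers("ACGTACGTTTGGCCAA", segment_len=8, k=4)
```

Because the representative is chosen by hash value rather than by position, the same k-mer is selected whenever the same segment content is seen again, which keeps the dictionary deterministic while storing one k-mer per segment.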
Frequently Asked Questions
Are all SRA Runs analyzed using the coarse and fine databases?
Yes, each public run is analyzed with both databases.
Can I get the software?
Yes. The source code is available on GitHub and can be built as follows:
git clone https://github.com/ncbi/ncbi-vdb.git
git clone https://github.com/ncbi/sra-tools.git
git clone https://github.com/ncbi/ngs-tools.git --branch tax
cd ncbi-vdb
./configure
make
cd ../sra-tools
./configure
make -C ngs
cd ../ngs-tools
./configure
make -C tools/tax
Helper *.sh scripts can be found in the ./examples folder.
How can I cite you?
The publication associated with this tool is available here: https://pubmed.ncbi.nlm.nih.gov/34544477/
Contact SRA
Contact SRA staff for assistance at [email protected]