SRA Aligned Read Format

Overview

SRA aligned reads for SARS-CoV-2 were created to help the COVID-19 research community identify next-generation sequencing (NGS) data of interest more quickly. The format is compressed, which speeds up data retrieval, and it includes pre-assembled contigs that make the data easier to explore.

Approach

Briefly, Saute was used to assemble contigs by guided assembly, with the SARS-CoV-2 RefSeq genomic sequence as the guide. The initial scope of the project is limited to runs deposited in SRA that were generated on the Illumina platform, have a read length of at least 75, and have at least 100 hits for SARS-CoV-2 in the SARS-CoV-2 Detection Tool. When contigs were successfully assembled, the reads were mapped back to the contigs and coverage was calculated. The taxonomy of the contigs was then assessed by two methods: first, the contigs were analyzed with the SARS-CoV-2 Detection Tool; second, they were checked by megablast against the nucleotide BLAST database.

The SRA aligned reads, the results of these analyses, and the associated BioSample and experiment metadata are available from AWS as part of its Open Data Project. The metadata is formatted for exploration with the AWS Athena service.

Installation and Setup

For instructions on setting up Athena, see the "Get started in Athena" guide.

Example Usage

To access the SRA aligned reads for a Run of interest, you can build a path to the object in S3 storage by inserting the Run accession into this address:  

s3://sra-pub-sars-cov2/RAO/<RUN_accession>/<RUN_accession>.realign

 
The object can then be downloaded with the AWS CLI, shown here for Run SRR11517741:

aws s3 cp s3://sra-pub-sars-cov2/RAO/SRR11517741/SRR11517741.realign ./
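
The path construction and download above can be wrapped in two small shell functions. This is a minimal sketch: the function names realign_uri and fetch_realign are made up for illustration, and only the bucket layout and the aws s3 cp call come from the text above.

```shell
# Hypothetical helpers; only the bucket layout comes from the address above.

# Print the S3 URI of the aligned-read object for one Run accession.
realign_uri() {
    printf 's3://sra-pub-sars-cov2/RAO/%s/%s.realign\n' "$1" "$1"
}

# Download the object for one Run into a directory (default: current directory).
fetch_realign() {
    aws s3 cp "$(realign_uri "$1")" "${2:-.}/"
}

# Example: fetch_realign SRR11517741
```

If no AWS credentials are configured, adding the AWS CLI option --no-sign-request to the copy command may be needed. That the bucket accepts anonymous requests is an assumption here, not something this page states.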

 
You can dump the contigs from an SRA aligned reads file with the SRA Toolkit (the examples below use a file downloaded for Run DRR220595):
 

dump-ref-fasta ./DRR220595.realign
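
To get a quick summary of the dumped contigs, the FASTA text can be piped through a short awk filter. The sketch below assumes dump-ref-fasta writes standard FASTA to standard output; the function name fasta_lengths is illustrative and not part of the SRA Toolkit.

```shell
# Hypothetical helper: print "name<TAB>length" for each FASTA record on stdin.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len   # flush the previous record
               name = substr($1, 2); len = 0; next } # first word after ">" is the name
        { len += length($0) }                        # sequence lines add to the length
        END  { if (name != "") print name "\t" len }
    '
}

# Example: dump-ref-fasta ./DRR220595.realign | fasta_lengths
```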

Example output

You can view the SRA aligned reads in SAM format using the SRA Toolkit:
 

sam-dump ./DRR220595.realign

@HD VN:1.6 SO:coordinate
@SQ SN:NC_045512.2:1:1:23364 LN:997
@SQ SN:NC_045512.2:1:2:23359 LN:997
@SQ SN:NC_045512.2:1:3:23004 LN:997
@RG ID:default
@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -a -Q -t 10 -x sr DRR220595.vars -
 
1 163 NC_045512.2:1:1:23364 1 0 12S138M = 127 276 GCCCTTTTAGTGGGAACATCCAATAATTTAGTTGATCTTAGAACGTCTTGTTTTAGTGTTTGTGCTTTAGCGTCTGGTATTACGCATCAAACTGTAAAACCAGGTCACTTTAACAAGGATTTCTACGACTTTGCAGAGAAGGCTGGTATG ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? NH:i:1 NM:i:0
 
2 83 NC_045512.2:1:1:23364 1 0 15S135M = 1 135 CCTGCCCTTTTAGTGGGAACATCCAATAATTTAGTTGATCTTAGAACGTCTTGTTTTAGTGTTTGTGCTTTAGCATCTGGTATTACGCATCAAACTGTAAAACCAGGTCACTTTAACAGGGATTTCTACGACTTTGCAGAGAAGGCTGGT ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? NH:i:1 NM:i:2
 
3 163 NC_045512.2:1:1:23364 1 0 36S114M = 33 182 CTCATGCAGTTTGTTGGAGATCCTGCCCTTTTAGTGGGAACATCCAATAATTTAGTTGATCTTAGAACGTCTTGTTTTAGTGTTTGTGCTTTAGCGTCTGGTATTACGCATCAAACTGTAAAACCAGGTCACTTTAACAAGGATTTCTAC ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? NH:i:1 NM:i:0
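
Because sam-dump writes SAM text, its output can be summarized with ordinary command-line tools. As a minimal sketch, the function below counts alignment records per contig, assuming tab-delimited SAM records with the reference name in the third column. The name reads_per_contig is illustrative only.

```shell
# Hypothetical helper: count alignment records per reference (contig) name.
# Reads SAM text on stdin; header lines start with "@" and are skipped.
reads_per_contig() {
    awk -F '\t' '!/^@/ && NF >= 11 { n[$3]++ }
                 END { for (c in n) print c "\t" n[c] }' | sort
}

# Example: sam-dump ./DRR220595.realign | reads_per_contig
```

Unaligned records, if present, carry "*" as the reference name and are counted under that key.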
 


BLAST DB

The contigs and peptides are also available in BLAST DB format:

Contigs are stored in a nucleotide BLAST database located in s3://sra-pub-sars-cov2/BLAST_DB/CONTIGS/.
Peptides derived from VIGOR3 annotation of the contigs are stored in a protein BLAST database located in s3://sra-pub-sars-cov2/BLAST_DB/PEPTIDES/.
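
One way to work with these databases locally is to mirror a database directory with the AWS CLI. The sketch below is an assumption-laden example: the function names are made up, only the two S3 prefixes come from the text above, and the size of the download is not stated on this page.

```shell
# Hypothetical helpers; only the two S3 prefixes come from the text above.

# Print the S3 prefix for a database kind: CONTIGS (nucleotide) or PEPTIDES (protein).
blastdb_uri() {
    case "$1" in
        CONTIGS|PEPTIDES) printf 's3://sra-pub-sars-cov2/BLAST_DB/%s/\n' "$1" ;;
        *) echo "unknown database kind: $1" >&2; return 1 ;;
    esac
}

# Mirror one database directory into ./CONTIGS/ or ./PEPTIDES/ (may be a large download).
fetch_blastdb() {
    uri=$(blastdb_uri "$1") || return 1
    aws s3 sync "$uri" "./$1/"
}

# Example: fetch_blastdb CONTIGS
```

The database base name to pass to BLAST+ programs such as blastn or blastp is not given here; it has to be read from the downloaded file names.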

Contact SRA

Contact SRA staff for assistance at [email protected]


Last updated: 2021-02-19T16:19:15Z