Download SRA sequence data using Amazon Web Services (AWS)
SRA Data in the AWS Registry of Open Data
Amazon Web Services publicly hosts SRA data through the Registry of Open Data. SRA has several datasets in the AWS Registry of Open Data, all of which can be accessed free of charge through either an HTTPS or S3 URL. One dataset contains public SRA data in the originally submitted format from select high-value and newly released studies. Another dataset acts as a centralized repository of SARS-CoV-2-related sequences submitted to NCBI. It includes both the original files submitted by the principal investigator and SRA-processed sequences (including normalized sequence files and SRA aligned read format files) that require the SRA Toolkit for analysis. This dataset also includes metadata searchable in AWS Athena by BLAST result, taxonomic analysis, and more, to allow rapid discovery of the data most relevant to your research.
Coronaviridae Datasets
- Runs directory contains normalized sequence data, accessible in multiple formats (fastq, sam, fasta) via the SRA Toolkit and organized by Run accession.
- sra-src directory contains the submitted sequence files in their original format, organized by Run accession.
- VCF directory contains SRA generated VCF files, organized by Run accession.
AWS CLI Access (No AWS account required):
aws s3 ls s3://sra-pub-sars-cov2/ --no-sign-request
Public data
- Contains all public SRA Runs organized by Run accession.
AWS CLI Access (No AWS account required):
aws s3 ls s3://sra-pub-run-odp/ --no-sign-request
Public user-submitted files
- Contains submitted sequence files in their original format, organized by Run accession.
AWS CLI Access (No AWS account required):
aws s3 ls s3://sra-pub-src-2/ --no-sign-request
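To look at a single Run rather than a whole bucket, the listing commands above can be narrowed to one accession. The following is a minimal sketch, assuming that Runs in the sra-pub-run-odp bucket sit under an sra/<accession>/ prefix; that layout is an assumption, so confirm it with a top-level listing first. The function names are hypothetical and SRR000001 is only a placeholder accession.

```shell
#!/usr/bin/env bash
# Build the S3 prefix for one Run accession in the public SRA bucket.
# ASSUMPTION: objects are stored under sra/<accession>/ ; verify with
# `aws s3 ls s3://sra-pub-run-odp/ --no-sign-request` before relying on it.
sra_run_prefix() {
  local accession="$1"
  printf 's3://sra-pub-run-odp/sra/%s/\n' "$accession"
}

# List the objects stored for one accession without AWS credentials.
list_sra_run() {
  aws s3 ls "$(sra_run_prefix "$1")" --no-sign-request
}

# Example (requires network access and the AWS CLI):
#   list_sra_run SRR000001
```

Keeping the prefix logic in its own function makes it easy to adjust if the bucket layout differs from the assumption.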
Accessing SRA Data in AWS
If you know your Run accessions of interest, you can access the data in several ways. To download files from the AWS Console using a browser, visit the HTTPS URL for the Coronaviridae dataset, Public SRA data, or Public user-submitted files respectively:
- https://s3.console.aws.amazon.com/s3/buckets/sra-pub-sars-cov2/
- https://s3.console.aws.amazon.com/s3/buckets/sra-pub-run-odp/
- https://s3.console.aws.amazon.com/s3/buckets/sra-pub-src-2/
From there you can navigate the directory structure using the provided graphical interface and you can search a given directory for your accession of interest using the provided search box near the top of the page. Once you have navigated to a specific file of interest you can click the Object URL link or use the Object actions button to copy the file to your own S3 bucket or download a copy to local storage.
To access files from within AWS, e.g. from an EC2 instance, you can use the AWS CLI to perform an S3 copy or sync, using a command like this:
aws s3 cp s3://sra-pub-sars-cov2/README.txt $HOME/README.txt
These data can also be accessed using various other tools and libraries. Access to files in the AWS Registry of Open Data is free. This is true whether you use the HTTPS or S3 URL. For S3 URLs, the transfer is free even if it crosses an AWS region boundary; there is no inter-regional data transfer fee.
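Because the same object can be reached through either an S3 URL or an HTTPS URL, it can help to convert between the two forms. The sketch below assumes the standard virtual-hosted-style address, https://<bucket>.s3.amazonaws.com/<key>; the function name is hypothetical, and the prefix and local directory in the sync example are placeholders.

```shell
#!/usr/bin/env bash
# Convert an s3://bucket/key URI into a virtual-hosted-style HTTPS URL.
# ASSUMPTION: the bucket is reachable at https://<bucket>.s3.amazonaws.com/.
# The input must contain a key after the bucket name.
s3_to_https() {
  local uri="${1#s3://}"      # strip the s3:// scheme
  local bucket="${uri%%/*}"   # text before the first slash
  local key="${uri#*/}"       # text after the first slash
  printf 'https://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
}

# Examples (require network access):
#   Download one file over HTTPS with curl:
#     curl -fL -o README.txt "$(s3_to_https s3://sra-pub-sars-cov2/README.txt)"
#   Mirror a whole prefix with the AWS CLI (placeholder prefix and directory):
#     aws s3 sync s3://sra-pub-sars-cov2/<prefix>/ ./local-dir/ --no-sign-request
```

The HTTPS form works with any HTTP client, while the S3 form is what the AWS CLI copy and sync commands expect.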
If you don't know the Run accessions you are interested in, you can start by searching in the SRA Run Selector, AWS Athena, or SRA Entrez.
A full list of Coronaviridae-containing SRA runs as detected with NCBI's kmer analysis tool is available here: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/AccList/ .
Introduction for First Time Users
Amazon Elastic Compute Cloud (EC2) is the Amazon Web Service you use to create and run virtual machines in the cloud. AWS calls these virtual machines 'instances'. You will need to install your bioinformatics tools for data analysis and the SRA Toolkit for accessing the SRA data.
Creating an AWS Instance
Users will need to set up and manage their AWS accounts on their own.
Please work with your organization for credential and billing questions. If using a personal account, this guide attempts to stay within AWS Free Tier for users who are still eligible.
Users of this guide are expected to have experience using a Unix command-line interface.
Sign In and Enter the Amazon EC2 Console
Sign in using your AWS account: Amazon AWS Console.
Create an AWS Instance
Please follow this Amazon step-by-step guide that will help you launch a Linux virtual machine on Amazon EC2 within Amazon AWS Free Tier.
Please make sure to create your EC2 instance in the US East (N. Virginia) us-east-1 region.
Connect to the Instance
Use either a Unix/OSX terminal or your preferred SSH application to connect, following the same steps as the Amazon tutorial linked above.
- The username for this AMI is ec2-user.
Terminate the Instance
- Remember to terminate the EC2 instance from the AWS console when you have finished using it. If you do not terminate the instance, charges can be generated on your account even when no users are connected.
- Data stored on the EC2 instance will be deleted when the instance is terminated. Users will likely want persistent S3 storage to hold the results of their work.
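The two points above can be combined into one step: copy results to S3 first, then terminate the instance only if the copy succeeded. This is a minimal sketch, not a prescribed procedure. The function name, the bucket s3://my-results-bucket, and the instance ID are hypothetical, and it assumes the credentials in use are allowed to write to that bucket and to terminate the instance.

```shell
#!/usr/bin/env bash
# Copy a results directory to S3, then terminate the EC2 instance.
# Arguments: <local results dir> <destination S3 URI> <instance ID>
# The instance is terminated only when the sync step exits successfully.
save_and_terminate() {
  local results_dir="$1" dest_uri="$2" instance_id="$3"
  aws s3 sync "$results_dir" "$dest_uri" || return 1
  aws ec2 terminate-instances --instance-ids "$instance_id"
}

# Example with hypothetical names (requires AWS credentials):
#   save_and_terminate ./results s3://my-results-bucket/run1/ i-0123456789abcdef0
```

Ordering the commands this way avoids losing data if the upload fails partway through.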
The SRA Toolkit in AWS
Installing The SRA Toolkit in your instance
Once you are connected, you will be able to work in a Unix-like command-line environment where you can install and configure the SRA Toolkit.
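One possible way to install the toolkit on a fresh instance is to unpack a prebuilt tarball into your home directory. The sketch below assumes that NCBI publishes the current release as sratoolkit.current-<platform>.tar.gz under ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/ and that centos_linux64 is a suitable platform suffix for your AMI; check both against the official installation instructions. The function names and the install location are hypothetical.

```shell
#!/usr/bin/env bash
# Build the download URL for the current SRA Toolkit release.
# ASSUMPTION: tarballs are named sratoolkit.current-<platform>.tar.gz;
# the default platform suffix here is centos_linux64.
sra_toolkit_url() {
  local platform="${1:-centos_linux64}"
  printf 'https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-%s.tar.gz\n' "$platform"
}

# Download and unpack the toolkit, then add its bin directory to PATH
# for the current shell session. Default destination: $HOME/sratoolkit.
install_sra_toolkit() {
  local dest="${1:-$HOME/sratoolkit}"
  mkdir -p "$dest" || return 1
  # --strip-components=1 drops the versioned top-level folder of the tarball.
  curl -fsSL "$(sra_toolkit_url)" | tar -xz -C "$dest" --strip-components=1 || return 1
  export PATH="$dest/bin:$PATH"
}

# Example (requires network access):
#   install_sra_toolkit
#   prefetch --version
```

Printing the version afterwards is a quick check that the binaries are on your PATH.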
Using the SRA Toolkit in AWS
- For downloading public SRA data from our cloud buckets to your cloud storage, you can use the SRA Toolkit utilities as described in the SRA Download Guide.
- For downloading dbGaP data from our cloud buckets to your cloud storage, you need to use a jwt.cart file as described in "Downloading dbGaP data with JWT".
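For public data, a typical toolkit workflow is to fetch the Run with prefetch and then convert it to FASTQ with fasterq-dump. The sketch below shows that pattern under stated assumptions: the function name is hypothetical, SRR000001 is a placeholder accession, and the SRA Download Guide remains the authoritative reference for options. The dbGaP case with a jwt.cart file is not covered here.

```shell
#!/usr/bin/env bash
# Download one public Run and convert it to FASTQ.
# Arguments: <Run accession> [output directory, default: current directory]
fetch_run_as_fastq() {
  local accession="$1" outdir="${2:-.}"
  mkdir -p "$outdir" || return 1
  # prefetch stores the Run in a subdirectory named after the accession.
  prefetch "$accession" --output-directory "$outdir" || return 1
  # fasterq-dump reads the prefetched Run and writes FASTQ files to outdir.
  fasterq-dump "$outdir/$accession" --outdir "$outdir"
}

# Example (requires the SRA Toolkit and network access):
#   fetch_run_as_fastq SRR000001 "$HOME/sra-data"
```

Running prefetch first keeps the download separate from the conversion, so a failed conversion can be retried without transferring the Run again.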
YouTube Video Tutorial - Setting up AWS - demo
Engage
NCBI wants your feedback on SRA in the Cloud. Contact [email protected] with questions or if you would like to provide input on new functionality.