Download a virus genome data package
Download an NCBI Datasets Virus Genome Data Package using the NCBI Datasets command-line tools
This guide describes how to download an NCBI Datasets Virus Genome Data Package for all genomes available in NCBI Virus using the NCBI Datasets command-line tools.
The default virus genome data package includes genome sequences and metadata. Options are available to also include CDS and protein FASTA sequences, annotation, and BioSample metadata. Refer to the datasets command-line interface (CLI) reference for all available flags and subcommands.
Download a virus data package with all monkeypox genomes by taxon
Download virus genome sequences and metadata using the organism name or NCBI Taxonomy ID.
datasets download virus genome taxon monkeypox
Download a virus data package by NCBI accession
For virus genomes, each accession should be a GenBank nucleotide accession, not a genome assembly accession (which starts with GCA or GCF).
datasets download virus genome accession NC_063383.1
Download a virus data package for a list of virus genome accessions.
Use the --inputfile flag and provide a text file with one accession per line.
datasets download virus genome accession --inputfile virus_accession_list.txt
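The input file can be prepared with standard shell tools. The following is a minimal sketch, not part of the datasets CLI: it assumes a hypothetical raw list named raw_accessions.txt, and the accessions in it other than NC_063383.1 are illustrative. It removes blank lines, assembly accessions (GCA_/GCF_), and duplicates so that the result has one nucleotide accession per line.

```shell
#!/bin/sh
# Hypothetical raw list; file name and most accessions are for illustration only.
cat > raw_accessions.txt <<'EOF'
NC_063383.1

GCF_009858895.2
NC_045512.2
NC_063383.1
EOF

# Drop blank lines, drop genome assembly accessions (GCA_/GCF_),
# and remove duplicates while keeping first-seen order.
grep -v '^[[:space:]]*$' raw_accessions.txt \
  | grep -v -E '^GC[AF]_' \
  | awk '!seen[$0]++' > virus_accession_list.txt

# Show the cleaned list, one accession per line.
cat virus_accession_list.txt

# The cleaned file can then be passed to the command above:
#   datasets download virus genome accession --inputfile virus_accession_list.txt
```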
Choosing which data files to include in the data package
Virus data packages contain genome sequences and metadata by default. You can add data files, or include only metadata, by using --include with one or more terms. For a full list of available data files, see the datasets reference.
Below are a few examples of using the --include flag to choose which data files to include in the data package.
Get genome and protein sequences for the monkeypox reference genome:
datasets download virus genome taxon monkeypox --refseq --include genome,protein
Get genome, CDS, protein sequences and biosample report for the SARS-CoV-2 reference genome:
datasets download virus genome taxon sars-cov-2 --refseq --include genome,protein,cds,biosample
Get a data package with only the primary virus data report (metadata):
datasets download virus genome taxon sars-cov-2 --include none
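As the examples above show, multiple terms are passed to --include as a single comma-separated value. The sketch below is a hypothetical helper, build_include, that joins terms this way. It only accepts the terms used in this guide (genome, protein, cds, biosample, none); that restriction is an assumption of the sketch, and the datasets reference lists the full set. The script prints the resulting command rather than running it.

```shell
#!/bin/sh
# build_include: hypothetical helper that joins its arguments into a
# comma-separated value suitable for --include.
build_include() {
  for term in "$@"; do
    case "$term" in
      # Only the terms shown in this guide; the CLI reference lists more.
      genome|protein|cds|biosample|none) ;;
      *) echo "unknown include term: $term" >&2; return 1 ;;
    esac
  done
  # Join in a subshell so the caller's IFS is left unchanged.
  ( IFS=,; printf '%s\n' "$*" )
}

include=$(build_include genome protein cds biosample)

# Print the command for inspection instead of executing it.
echo "datasets download virus genome taxon sars-cov-2 --refseq --include $include"
```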
Filtering by genome assembly properties
When downloading a virus genome data package, you can filter the results by different properties, including the following:
- reference status
- annotation status
- geographic location
- completeness
- release date
- update date
- host
- Pango lineage (SARS-CoV-2 only)
Get data for the monkeypox reference genome:
datasets download virus genome taxon monkeypox --refseq
Get data for Influenza A genomes isolated from dogs:
datasets download virus genome taxon 'Influenza A' --host dog
Get data for Influenza A genomes isolated in the USA:
datasets download virus genome taxon 641809 --geo-location USA
Get data for Influenza A genomes isolated in the U.S. state of California (CA):
datasets download virus genome taxon 641809 --usa-state CA
Get data for monkeypox genomes released after January 1, 2021:
datasets download virus genome taxon monkeypox --released-after 01/01/2021
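When the same filter is applied to several taxa, a small wrapper script can keep the packages apart. The following is a rough sketch under stated assumptions: the run and package_name helpers are hypothetical, the file naming scheme is arbitrary, and the --filename flag is used to give each package its own name instead of the default ncbi_dataset.zip. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# run: print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# package_name: hypothetical naming scheme, e.g. "Influenza A" -> influenza_a.zip
package_name() {
  printf '%s.zip\n' "$(printf '%s' "$1" | tr ' ' '_' | tr '[:upper:]' '[:lower:]')"
}

# Request the reference genomes for each taxon, one package per taxon.
for taxon in monkeypox 'Influenza A'; do
  run datasets download virus genome taxon "$taxon" --refseq \
    --filename "$(package_name "$taxon")"
done
```

Note that the printed form does not show quoting, so a taxon name containing a space appears unquoted in the dry-run output even though it is passed as a single argument when executed.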