Download a virus genome data package

Download an NCBI Datasets Virus Genome Data Package using the NCBI Datasets command-line tools

This guide describes how to download an NCBI Datasets Virus Genome Data Package for all genomes available in NCBI Virus using the NCBI Datasets command-line tools.

The default virus genome data package includes genome sequences and metadata. Options are available to include CDS and protein FASTA sequences, annotation, and BioSample metadata. Refer to the datasets command-line (CLI) reference for all available flags and subcommands.

Download a virus data package with all monkeypox genomes by taxon

Download a virus genome data package using the organism name or NCBI Taxonomy ID.

datasets download virus genome taxon monkeypox

Download a virus data package by NCBI accession

For virus genomes, accessions should be GenBank nucleotide accessions rather than genome assembly accessions (which start with GCA or GCF).

datasets download virus genome accession NC_063383.1
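The accession rule above can be checked before calling the tool. The following is a minimal sketch that assumes only the GCA/GCF prefix rule stated in this guide. The function name is_assembly_accession is hypothetical and is not part of the datasets CLI, and GCF_000000000.1 is a placeholder, not a real assembly.

```shell
#!/bin/sh
# Hypothetical helper (not part of the datasets CLI): succeeds when the
# argument looks like a genome assembly accession (GCA_ or GCF_ prefix).
is_assembly_accession() {
    case "$1" in
        GCA_*|GCF_*) return 0 ;;
        *)           return 1 ;;
    esac
}

# NC_063383.1 comes from this guide; GCF_000000000.1 is a placeholder.
for acc in NC_063383.1 GCF_000000000.1; do
    if is_assembly_accession "$acc"; then
        echo "$acc: assembly accession, use a nucleotide accession instead"
    else
        echo "$acc: not an assembly accession"
    fi
done
```

The check only rules out assembly accessions. It does not confirm that an accession exists in NCBI Virus.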

Download a virus data package for a list of virus genome accessions

Use the --inputfile flag and provide a text file with one accession per line.

datasets download virus genome accession --inputfile virus_accession_list.txt
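As a rough sketch of preparing the input file, the snippet below writes one accession per line and runs the download only when the datasets CLI is on the PATH. The file name matches the command above. The two accessions are illustrative: NC_063383.1 is taken from this guide, and NC_045512.2 is added here as an assumption.

```shell
#!/bin/sh
# Write one nucleotide accession per line (illustrative accessions).
printf '%s\n' NC_063383.1 NC_045512.2 > virus_accession_list.txt

# Sanity check: count the non-empty lines in the list.
count=$(grep -c . virus_accession_list.txt)
echo "accessions listed: $count"

# Run the download only when the datasets CLI is installed.
if command -v datasets >/dev/null 2>&1; then
    datasets download virus genome accession --inputfile virus_accession_list.txt
else
    echo "datasets CLI not found; skipping download"
fi
```

Keeping the list in a file makes the same set of accessions easy to reuse across downloads.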

Choosing which data files to include in the data package

Virus data packages contain genome sequences and metadata by default. Use the --include flag with one or more comma-separated terms to add data files or to include only metadata. For a full list of available data files, see the datasets reference.

Below are a few examples of using the --include flag to choose which data files to include in the data package.

Get genome and protein sequences for the monkeypox reference genome:

datasets download virus genome taxon monkeypox --refseq --include genome,protein

Get genome, CDS, and protein sequences and the BioSample report for the SARS-CoV-2 reference genome:

datasets download virus genome taxon sars-cov-2 --refseq --include genome,protein,cds,biosample

Get a data package with only the primary virus data report (metadata):

datasets download virus genome taxon sars-cov-2 --include none

Filtering by genome assembly properties

When downloading a virus genome data package, you can filter the results by different properties, including the following:

  • reference status
  • annotation status
  • geographic location
  • completeness
  • release date
  • update date
  • host
  • Pango lineage (SARS-CoV-2 only)

Get data for the monkeypox reference genome:

datasets download virus genome taxon monkeypox --refseq

Get data for Influenza A genomes isolated from dogs:

datasets download virus genome taxon 'Influenza A' --host dog

Get data for Influenza A genomes isolated in the USA:

datasets download virus genome taxon 641809 --geo-location USA

Get data for Influenza A genomes isolated in the U.S. state of California (CA):

datasets download virus genome taxon 641809 --usa-state CA

Get data for monkeypox genomes released after January 1, 2021:

datasets download virus genome taxon monkeypox --released-after 01/01/2021
Generated November 25, 2024