NCBI Datasets Virus Data Package
Sequences and metadata for a set of virus GenBank genomes or SARS-CoV-2 proteins
The NCBI Datasets Virus Data Package contains sequences and metadata for a set of requested virus GenBank genomes or SARS-CoV-2 proteins. The data package may include genome, coding sequence (CDS) and protein sequences in FASTA format, and a data report containing metadata in JSON Lines format.
Downloading the Package
You can download an NCBI Datasets Virus Data Package using the NCBI Datasets command-line tools.
Package Content
NCBI Datasets Virus Genome Data Package
datasets download virus genome taxon monkeypox --filename monkeypox.zip
unzip monkeypox.zip -d monkeypox
tree monkeypox/
monkeypox/
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- genomic.fna
        `-- virus_dataset.md
NCBI Datasets SARS-CoV-2 Protein Data Package
(Note: this package does not contain SARS-CoV-2 genome sequences.)
datasets download virus protein S --filename spike-protein.zip
unzip spike-protein.zip -d spike-protein
tree spike-protein/
spike-protein/
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- protein.faa
        `-- virus_dataset.md
Virus Data Report
The virus data report contains metadata describing the genomes and proteins in the data package. The file is in JSON Lines format, where each line is the metadata for one genome or one protein. Use the dataformat tool for easy conversion to a tabular format of selected fields.
- Path: ncbi_dataset/data/data_report.jsonl
- Schema: Virus Data Report
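Because each line of the data report describes one genome or one protein, counting the non-empty lines gives the number of records in the package. The following is a minimal sketch, assuming a POSIX shell with `awk`; the function name `count_report_records` is illustrative and is not part of the NCBI Datasets tools.

```shell
# Count the records in a JSON Lines report (one record per non-empty line).
# count_report_records is a hypothetical helper, not an NCBI Datasets command.
count_report_records() {
  # NF is non-zero only for lines that contain at least one field,
  # so blank lines are skipped; "n + 0" prints 0 for an empty file.
  awk 'NF { n++ } END { print n + 0 }' "$1"
}
```

For the genome package unpacked above, the call would be `count_report_records monkeypox/ncbi_dataset/data/data_report.jsonl`.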
Virus Annotation Report
The virus annotation report contains metadata describing the annotated genes and proteins on the genomes in the data package. The file is in JSON Lines format, where each line is the metadata for one genome or one protein. Use the dataformat tool for easy conversion to a tabular format of selected fields.
- Path: ncbi_dataset/data/annotation_report.jsonl
- Schema: Virus Annotation Report (available soon)
FASTA Sequence Files
Genomic FASTA
Nucleotide sequence of the viral genome.
- Path: ncbi_dataset/data/genomic.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>MW583405.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/TX-CDC-9N37-8996/2021, complete genome
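In the example defline, the sequence identifier is the text between `>` and the first space. A minimal sketch for listing the identifiers in a FASTA file, assuming a POSIX shell with `awk`; `list_fasta_ids` is a hypothetical helper name:

```shell
# Print the sequence identifier (the text between ">" and the first space)
# for every record in a FASTA file.
# list_fasta_ids is a hypothetical helper, not an NCBI Datasets command.
list_fasta_ids() {
  # Only defline rows start with ">"; $1 is the first whitespace-separated field.
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

Run against a file containing the defline above, `list_fasta_ids` prints `MW583405.1`.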
CDS FASTA
Nucleotide sequence of the coding sequence of each protein and mature peptide.
- Path: ncbi_dataset/data/cds.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NC_045512.2:21563-25384 surface glycoprotein [organism=Severe acute respiratory syndrome coronavirus 2] [isolate=Wuhan-Hu-1]
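The identifier in the example CDS defline has the shape `accession:start-stop`. The sketch below splits that identifier into its parts and derives a length, assuming the two numbers are inclusive coordinates on the genome (an interpretation of the example, not something stated in the schema). It handles only a single `start-stop` range and silently skips any other shape; `cds_coords` is a hypothetical helper name.

```shell
# Print accession, start, stop and length (tab-separated) for each CDS defline
# whose identifier looks like "accession:start-stop".
# cds_coords is a hypothetical helper, not an NCBI Datasets command.
cds_coords() {
  awk '/^>/ {
    id = substr($1, 2)                       # drop the leading ">"
    if (match(id, /:[0-9]+-[0-9]+$/)) {      # single range only
      acc = substr(id, 1, RSTART - 1)        # text before the colon
      split(substr(id, RSTART + 1), pos, "-")
      # Assumption: coordinates are inclusive, so length = stop - start + 1.
      printf "%s\t%d\t%d\t%d\n", acc, pos[1], pos[2], pos[2] - pos[1] + 1
    }
  }' "$1"
}
```

For a file containing the defline above, this prints `NC_045512.2`, `21563`, `25384` and `3822` separated by tabs.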
Protein FASTA
Protein sequence of each protein and mature peptide.
- Path: ncbi_dataset/data/protein.faa
- Schema: Protein FASTA
Example FASTA Defline:
>QMT27626.1:1-180 leader protein [polyprotein=ORF1ab polyprotein] [organism=Severe acute respiratory syndrome coronavirus 2] [isolate=SARS-CoV-2/human/USA/WA-S1488/2020]
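The example protein defline carries its attributes as bracketed `[key=value]` pairs. A minimal sketch for pulling those pairs out of a single defline, assuming a `grep` that supports `-o` (GNU and BSD versions do) and `awk`; `defline_attributes` is a hypothetical helper name:

```shell
# Print each "[key=value]" attribute of a FASTA defline as "key<TAB>value".
# defline_attributes is a hypothetical helper, not an NCBI Datasets command.
defline_attributes() {
  # grep -o emits every bracketed group on its own line.
  printf '%s\n' "$1" | grep -o '\[[^]]*\]' | awk '{
    s = substr($0, 2, length($0) - 2)   # strip the surrounding brackets
    i = index(s, "=")                   # split on the first "=" only
    if (i > 0) print substr(s, 1, i - 1) "\t" substr(s, i + 1)
  }'
}
```

Passing the defline above as the argument yields three rows, with the keys `polyprotein`, `organism` and `isolate`.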
Other files
Virus README
The virus README describes the available SARS-CoV-2 data packages, their content and options for querying.
- Path: ncbi_dataset/data/virus_dataset.md
README.md
The README contains a general project description common to all data packages.
- Path: README.md
Dataset catalog
The dataset catalog lists each data file contained within or referenced by this package. Each data file is associated with a content type and location.
- Path: ncbi_dataset/data/dataset_catalog.json
MD5 checksum file
The MD5 checksum file contains MD5 hash values for each file contained in the data package after decompression. These hash values can be used as a checksum to verify that a file has not changed as the result of an error during download or decompression. Each line of the MD5 checksum file corresponds to a file in the package after decompression, where the first column contains the MD5 hash value and the second column contains the path to the file.
- Path: md5sum.txt
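The check described above can be sketched as a small shell function that reads each hash and path from `md5sum.txt` and recomputes the hash of the unpacked file. This assumes the GNU `md5sum` utility is installed (macOS ships `md5` instead) and that the paths in `md5sum.txt` are relative to the package's top-level directory; `verify_package_md5` is a hypothetical helper name.

```shell
# Verify every file listed in <package_dir>/md5sum.txt.
# Returns 0 if all hashes match, 1 if any file is missing or differs.
# verify_package_md5 is a hypothetical helper, not an NCBI Datasets command.
verify_package_md5() {
  pkg_dir=$1
  status=0
  # First column: expected MD5 hash; remainder of the line: file path.
  while read -r expected path; do
    [ -n "$expected" ] || continue            # skip blank lines
    if [ ! -f "$pkg_dir/$path" ]; then
      echo "MISSING: $path" >&2
      status=1
      continue
    fi
    # md5sum prints "<hash>  -" for stdin; keep only the hash.
    actual=$(md5sum < "$pkg_dir/$path" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
      echo "MISMATCH: $path" >&2
      status=1
    fi
  done < "$pkg_dir/md5sum.txt"
  return "$status"
}
```

For the genome package unpacked above, `verify_package_md5 monkeypox && echo "all files OK"` reports success only when every listed file is present and unchanged.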
Related information
- Learn how to download virus genome data using the NCBI Datasets command-line tools