NCBI Datasets Genome Package
Sequences, annotation and metadata for a set of requested genome assemblies
The NCBI Datasets Genome Data Package contains genome sequences and metadata for a set of requested assembled genomes. The data package can be customized to include any combination of genome, transcript and protein sequences in FASTA format, annotation in GFF3, GTF, and GBFF formats, additional metadata as a sequence data report in JSON Lines format, and a subset of metadata in tabular format.
Package Content
This example shows the contents of the genome data package for the human reference genome, GRCh38 (GCF_000001405.40).
datasets download genome accession GCF_000001405.40 --filename GRCh38.zip --no-progressbar
unzip -q GRCh38.zip -d GRCh38
tree GRCh38/
GRCh38/
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- GCF_000001405.40
        |   `-- GCF_000001405.40_GRCh38.p14_genomic.fna
        |-- assembly_data_report.jsonl
        `-- dataset_catalog.json
Genome data report
The genome data report contains metadata describing the genomes in the data package. The file is in JSON Lines format, where each line is the metadata for one genome. Use the dataformat tool for easy conversion to a tabular format of selected fields.
- Path:
ncbi_dataset/data/assembly_data_report.jsonl
- Schema: Genome Data Report
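Because each line of the report is an independent JSON document, the file can be read one line at a time. The following is a minimal Python sketch of that pattern. The sample records are illustrative, and the key names used (`accession`, `organism`, `organismName`) are assumptions that should be checked against the Genome Data Report schema.

```python
import json

def read_jsonl(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative records only. A real report carries many more fields, and the
# key names used here ("accession", "organism", "organismName") are assumptions.
sample = [
    '{"accession": "GCF_000001405.40", "organism": {"organismName": "Homo sapiens"}}',
    '',
    '{"accession": "GCF_000001635.27", "organism": {"organismName": "Mus musculus"}}',
]

records = list(read_jsonl(sample))
for record in records:
    # .get() keeps the sketch tolerant of fields that are absent from a record.
    name = record.get("organism", {}).get("organismName", "unknown")
    print(record.get("accession"), name)
```

With an unzipped package, an open file handle for `ncbi_dataset/data/assembly_data_report.jsonl` can be passed to `read_jsonl` in place of the sample list.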
Genome data table
The genome data table is a tabular representation of a subset of metadata in the genome data report and is only provided through the NCBI Datasets Genomes website. Each row of the data table represents one genome in the data package.
The columns of the data table are Organism Scientific Name, Organism Common Name, Organism Qualifier, Taxonomy id, Assembly Name, Assembly Accession, Source, Annotation, Level, Contig N50, Size, Submission Date, Gene Count, BioProject and BioSample.
- Path:
ncbi_dataset/data/data_summary.tsv
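As a rough illustration, a table like this can be loaded with Python's standard `csv` module. The sketch assumes the file is tab-delimited with a header row whose names match the columns listed above. The sample content is a made-up excerpt with only a few of those columns.

```python
import csv
import io

# Made-up excerpt for illustration. It assumes a header row that uses the
# column names listed above and tab characters as delimiters.
sample_tsv = (
    "Organism Scientific Name\tAssembly Name\tAssembly Accession\tLevel\n"
    "Homo sapiens\tGRCh38.p14\tGCF_000001405.40\tChromosome\n"
)

def read_summary(handle):
    """Return the table rows as dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = read_summary(io.StringIO(sample_tsv))
for row in rows:
    print(row["Assembly Accession"], row["Assembly Name"], row["Level"])
```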
Genome specific files
Each genome is placed in its own subdirectory inside the data package. The directory name is the assembly accession, for example GCF_000001405.40.
Sequence data report
The sequence data report describes all nucleotide sequences that comprise the genome assembly. The file is in JSON Lines format, where each line describes one nucleotide sequence. Use the dataformat tool for easy conversion to a tabular format of selected fields.
- Path:
ncbi_dataset/data/<assembly_accession>/sequence_report.jsonl
- Schema: Genome Sequence Data Report
FASTA sequence Files
Genomic FASTA
Assembled chromosomes, unlocalized sequences for which the chromosome is known, and unplaced sequences for which the chromosome is unknown, are contained in a single FASTA file by default.
- Path:
ncbi_dataset/data/<assembly_accession>/*_genomic.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
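A defline such as the one above splits into an identifier (the first whitespace-delimited token after `>`) and a free-text description. The Python sketch below reads FASTA records on that basis. The sequence lines in the sample are invented for illustration and are not real genome data.

```python
def read_fasta(lines):
    """Yield (accession, description, sequence) for each FASTA record."""
    accession, description, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if accession is not None:
                yield accession, description, "".join(chunks)
            # The first token after '>' is the identifier; the rest is free text.
            parts = line[1:].split(None, 1)
            accession = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif line:
            chunks.append(line.strip())
    if accession is not None:
        yield accession, description, "".join(chunks)

# The defline is the example shown above; the sequence lines are made up.
sample = [
    ">NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly\n",
    "NNNNACGT\n",
    "ACGTNN\n",
]

records = list(read_fasta(sample))
for accession, description, sequence in records:
    print(accession, len(sequence), description)
```

The sketch holds each full sequence in memory, which is simple but may be heavy for large chromosomes. Accumulating only the length per record is a lighter alternative when the bases themselves are not needed.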
Transcript FASTA
- Path:
ncbi_dataset/data/<assembly_accession>/rna.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NM_000014.6 Homo sapiens alpha-2-macroglobulin (A2M), transcript variant 1, mRNA
Protein FASTA
- Path:
ncbi_dataset/data/<assembly_accession>/protein.faa
- Schema: Protein FASTA
Example FASTA Defline:
>NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]
Annotation Files
Genome GFF3
The genome annotation file in GFF3 format describes genes and other features annotated on each genome.
- Path:
ncbi_dataset/data/<assembly_accession>/genomic.gff
- Schema: Genome GFF3
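GFF3 feature lines have nine tab-separated columns, with `key=value` pairs separated by semicolons in the last column. The Python sketch below parses such lines and counts feature types. The sample features, coordinates and identifiers are invented, and the sketch does not decode percent-escaped characters in attribute values.

```python
from collections import Counter

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Return a dict for one feature line, or None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9 holds semicolon-separated key=value pairs.
    feature["attributes"] = dict(
        pair.split("=", 1)
        for pair in feature["attributes"].split(";")
        if "=" in pair
    )
    return feature

# Invented features for illustration; not taken from a real annotation.
sample = [
    "##gff-version 3\n",
    "NC_000001.11\tExample\tgene\t100\t900\t.\t+\t.\tID=gene-DEMO1;Name=DEMO1\n",
    "NC_000001.11\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna-DEMO1;Parent=gene-DEMO1\n",
]

features = [f for f in map(parse_gff3_line, sample) if f is not None]
counts = Counter(f["type"] for f in features)
print(dict(counts))
```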
Genome GBFF
The genome sequence and annotation file in GBFF format includes genomic sequence and describes genes and other features annotated on each genome.
- Path:
ncbi_dataset/data/<assembly_accession>/genomic.gbff
- Schema: GenBank Flat File
Genome GTF
The genome annotation file in GTF format describes genes and other features annotated on each genome.
- Path:
ncbi_dataset/data/<assembly_accession>/genomic.gtf
- Schema: Genome GTF
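GTF uses the same nine-column layout as GFF3 but writes the attribute column as `key "value";` pairs, and a key may appear more than once. The Python sketch below parses that column only. The attribute string is invented for illustration, and collecting values into lists is a design choice of this sketch rather than something prescribed by the package.

```python
import re

# Matches one attribute of the form: key "value"
GTF_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def parse_gtf_attributes(text):
    """Return a dict mapping each attribute key to the list of its values."""
    attributes = {}
    for key, value in GTF_ATTRIBUTE.findall(text):
        # Keys can repeat, so values are collected in order of appearance.
        attributes.setdefault(key, []).append(value)
    return attributes

# Invented attribute column for illustration.
sample = ('gene_id "DEMO1"; transcript_id "DEMO1.1"; '
          'db_xref "ExampleDB:1"; db_xref "ExampleDB:2";')

attrs = parse_gtf_attributes(sample)
print(attrs)
```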
Other files
README.md
The README contains a general project description common to all data packages.
- Path:
README.md
Dataset catalog
The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.
- Path:
ncbi_dataset/data/dataset_catalog.json
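Since the catalog is plain JSON, its file list can be walked with a few lines of Python. The sketch below is hypothetical: the key names (`assemblies`, `files`, `filePath`, `fileType`, `accession`) and the file type labels in the sample are assumptions made for illustration and should be compared with a real `dataset_catalog.json` before use.

```python
import json

# Hypothetical catalog content. The key names and file type labels are
# assumptions for illustration, not a confirmed schema.
sample_catalog = json.loads("""
{
  "assemblies": [
    {"files": [{"filePath": "assembly_data_report.jsonl",
                "fileType": "DATA_REPORT"}]},
    {"accession": "GCF_000001405.40",
     "files": [{"filePath": "GCF_000001405.40/GCF_000001405.40_GRCh38.p14_genomic.fna",
                "fileType": "GENOMIC_NUCLEOTIDE_FASTA"}]}
  ]
}
""")

def list_files(catalog):
    """Yield (accession, file path, content type) for each catalogued file."""
    for entry in catalog.get("assemblies", []):
        for item in entry.get("files", []):
            yield entry.get("accession"), item.get("filePath"), item.get("fileType")

listing = list(list_files(sample_catalog))
for accession, path, file_type in listing:
    print(accession, file_type, path)
```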
MD5 checksum file
The MD5 checksum file contains MD5 hash values for each file contained in the data package after decompression. These hash values can be used as a checksum to verify that a file has not changed as the result of an error during download or decompression. Each line of the MD5 checksum file corresponds to a file in the package after decompression, where the first column contains the MD5 hash value and the second column contains the path to the file.
- Path:
md5sum.txt
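The check described above can be sketched in Python with `hashlib`. The sketch assumes that the two columns are separated by whitespace and that the paths are relative to the directory containing `md5sum.txt`. The function names are illustrative, and the demonstration builds a throwaway directory instead of using a real package.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_package(package_dir, checksum_name="md5sum.txt"):
    """Return the listed paths whose MD5 differs from the checksum file."""
    mismatches = []
    with open(os.path.join(package_dir, checksum_name)) as handle:
        for line in handle:
            if not line.strip():
                continue
            # Assumed layout: hash, whitespace, path relative to package_dir.
            expected, relative_path = line.split(None, 1)
            relative_path = relative_path.strip()
            actual = md5_of_file(os.path.join(package_dir, relative_path))
            if actual != expected.lower():
                mismatches.append(relative_path)
    return mismatches

# Self-contained demonstration using a temporary stand-in for a package.
with tempfile.TemporaryDirectory() as package_dir:
    readme = os.path.join(package_dir, "README.md")
    with open(readme, "w") as handle:
        handle.write("demo\n")
    with open(os.path.join(package_dir, "md5sum.txt"), "w") as handle:
        handle.write(f"{md5_of_file(readme)}  README.md\n")
    result_ok = verify_package(package_dir)

    # Simulate corruption by changing the file after the checksum was recorded.
    with open(readme, "a") as handle:
        handle.write("changed\n")
    result_bad = verify_package(package_dir)

print(result_ok, result_bad)
```

For the example package, `verify_package("GRCh38")` would return an empty list if every listed file matches its recorded hash.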
Related information
Retrieve a genome data package using one of these tools:
- Browse and download datasets at the NCBI Datasets Genome Page
- Browse and download with the command line tool