NCBI Datasets Genome Package

Sequences, annotation and metadata for a set of requested genome assemblies

The NCBI Datasets Genome Data Package contains genome sequences and metadata for a set of requested assembled genomes. The data package can be customized to include any combination of genome, transcript and protein sequences in FASTA format, annotation in GFF3, GTF, and GBFF formats, additional metadata as a sequence data report in JSON Lines format, and a subset of metadata in tabular format.

Package Content

This example shows the contents of the genome data package for the human reference genome, GRCh38 (GCF_000001405.40).

datasets download genome accession GCF_000001405.40 --filename GRCh38.zip --no-progressbar
unzip -q GRCh38.zip -d GRCh38
tree GRCh38/

GRCh38/
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- GCF_000001405.40
        |   `-- GCF_000001405.40_GRCh38.p14_genomic.fna
        |-- assembly_data_report.jsonl
        `-- dataset_catalog.json
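
The same listing can be obtained without unpacking the archive. The following is a minimal Python sketch using only the standard library `zipfile` module; the helper name `list_package_files` is hypothetical and is not part of any NCBI tool.

```python
import zipfile


def list_package_files(zip_path):
    """Return the sorted paths of all files (not directories) in a data package zip."""
    with zipfile.ZipFile(zip_path) as package:
        return sorted(
            info.filename for info in package.infolist() if not info.is_dir()
        )
```

Calling `list_package_files("GRCh38.zip")` on the archive downloaded above should return paths such as `README.md` and `md5sum.txt`, matching the tree shown here.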

Genome data report

The genome data report contains metadata describing the genomes in the data package. The file is in JSON Lines format, where each line is the metadata for one genome. Use the dataformat tool for easy conversion to a tabular format of selected fields.
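
Because each line is a self-contained JSON document, the report can also be read with a general-purpose JSON parser. The sketch below is one way to do this in Python; the function names are hypothetical, and the field name used in the usage note (`accession`) is an assumption about the report schema that should be checked against the actual file.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. at the end of the file
                yield json.loads(line)


def collect_field(path, field):
    """Return the value of a top-level field for every record that has it."""
    return [record[field] for record in read_jsonl(path) if field in record]
```

For the package shown above, `collect_field("GRCh38/ncbi_dataset/data/assembly_data_report.jsonl", "accession")` would list one value per genome, assuming the records carry a top-level `accession` field.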

Genome data table

The genome data table is a tabular representation of a subset of the metadata in the genome data report and is only provided through the NCBI Datasets Genomes website. Each row of the data table represents one genome in the data package.

The columns of the data table are Organism Scientific Name, Organism Common Name, Organism Qualifier, Taxonomy id, Assembly Name, Assembly Accession, Source, Annotation, Level, Contig N50, Size, Submission Date, Gene Count, BioProject, and BioSample.

  • Path: ncbi_dataset/data/data_summary.tsv
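
A tab-separated table like this can be loaded with the Python `csv` module. This is a sketch only: it assumes the first row of `data_summary.tsv` is a header whose labels match the column names listed above, and the function names are hypothetical.

```python
import csv


def read_data_table(path):
    """Return the rows of a tab-separated file as dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def select_columns(rows, columns):
    """Return one tuple per row holding only the requested columns.

    A column that is absent from a row is returned as an empty string.
    """
    return [tuple(row.get(column, "") for column in columns) for row in rows]
```

For example, `select_columns(read_data_table(path), ["Assembly Accession", "Level"])` would give one `(accession, level)` pair per genome if the header uses those labels.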

Genome specific files

Each genome is placed in its own subdirectory inside the data package. The directory name is the assembly accession, for example GCF_000001405.40.

Sequence data report

The sequence data report describes all nucleotide sequences that comprise the genome assembly. The file is in JSON Lines format, where each line describes one nucleotide sequence. Use the dataformat tool for easy conversion to a tabular format of selected fields.

FASTA sequence Files

Genomic FASTA

By default, a single FASTA file contains the assembled chromosomes, the unlocalized sequences (those for which the chromosome is known), and the unplaced sequences (those for which the chromosome is unknown).

  • Path: ncbi_dataset/data/<assembly_accession>/*_genomic.fna
  • Schema: Nucleotide FASTA

Example FASTA Defline:

>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly

Transcript FASTA

Example FASTA Defline:

>NM_000014.6 Homo sapiens alpha-2-macroglobulin (A2M), transcript variant 1, mRNA

Protein FASTA

  • Path: ncbi_dataset/data/<assembly_accession>/protein.faa
  • Schema: Protein FASTA

Example FASTA Defline:

>NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]
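
In all three example deflines, the text up to the first space is the sequence accession and the remainder is a free-text description. The following Python sketch splits a defline on that basis; the function names are hypothetical, and the split rule is inferred from the examples above rather than from a formal specification.

```python
def parse_defline(defline):
    """Split a FASTA defline into (sequence identifier, description)."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")
    # Everything before the first space is the identifier; the rest is free text.
    identifier, _, description = defline[1:].strip().partition(" ")
    return identifier, description


def iter_deflines(path):
    """Yield (identifier, description) for every record in a FASTA file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                yield parse_defline(line)
```

Applied to the protein example, `parse_defline(">NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]")` returns `("NP_000005.3", "alpha-2-macroglobulin isoform a precursor [Homo sapiens]")`.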

Annotation Files

Genome GFF3

The genome annotation file in GFF3 format describes genes and other features annotated on each genome.

  • Path: ncbi_dataset/data/<assembly_accession>/genomic.gff
  • Schema: Genome GFF3

Genome GBFF

The genome sequence and annotation file in GBFF format includes genomic sequence and describes genes and other features annotated on each genome.

Genome GTF

The genome annotation file in GTF format describes genes and other features annotated on each genome.

  • Path: ncbi_dataset/data/<assembly_accession>/genomic.gtf
  • Schema: Genome GTF
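
GFF3 and GTF share a nine-column, tab-delimited layout and differ mainly in how the ninth (attributes) column is written. The Python sketch below illustrates that difference. It relies on the public GFF3 and GTF conventions rather than on anything stated on this page, the function names are hypothetical, and it ignores edge cases such as semicolons inside quoted GTF values or repeated GTF keys (the last value wins).

```python
from collections import Counter
from urllib.parse import unquote

# Column names follow GFF3; GTF calls the eighth column "frame".
COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_feature_line(line):
    """Return a dict for one feature line, or None for comments and blank lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    feature = dict(zip(COLUMNS, fields))
    feature["start"] = int(feature["start"])  # coordinates are 1-based, inclusive
    feature["end"] = int(feature["end"])
    return feature


def parse_gff3_attributes(text):
    """GFF3 column 9: semicolon-separated key=value pairs, percent-encoded."""
    attributes = {}
    for pair in text.strip().split(";"):
        if pair.strip():
            key, _, value = pair.partition("=")
            attributes[key.strip()] = unquote(value)
    return attributes


def parse_gtf_attributes(text):
    """GTF column 9: semicolon-terminated pairs written as key "value"."""
    attributes = {}
    for pair in text.strip().split(";"):
        pair = pair.strip()
        if pair:
            key, _, value = pair.partition(" ")
            attributes[key] = value.strip().strip('"')
    return attributes


def count_feature_types(path):
    """Count features by type (column 3) in a GFF3 or GTF file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("##FASTA"):  # GFF3 may embed sequence after this
                break
            feature = parse_feature_line(line)
            if feature is not None:
                counts[feature["type"]] += 1
    return counts
```

For instance, `count_feature_types("genomic.gff")` gives a quick tally of how many `gene`, `mRNA`, `exon`, and other features a file contains, which is a simple sanity check after download.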

Other files

README.md

The README contains a general project description common to all data packages.

  • Path: README.md

Dataset catalog

The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.

  • Path: ncbi_dataset/data/dataset_catalog.json
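
The catalog can be inspected with any JSON parser. The sketch below walks the whole document and collects every object that carries both a location and a content type. The key names `filePath` and `fileType` are assumptions about the catalog schema, not something stated on this page, so they are exposed as parameters; the function names are hypothetical.

```python
import json


def iter_catalog_files(node, path_key="filePath", type_key="fileType"):
    """Recursively yield (location, content type) pairs from parsed catalog JSON.

    path_key and type_key are assumed key names; adjust them if the catalog
    in your package uses different ones.
    """
    if isinstance(node, dict):
        if path_key in node and type_key in node:
            yield node[path_key], node[type_key]
        for value in node.values():
            yield from iter_catalog_files(value, path_key, type_key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_catalog_files(item, path_key, type_key)


def load_catalog_files(catalog_path):
    """Read a dataset catalog file and return its (location, content type) pairs."""
    with open(catalog_path, encoding="utf-8") as handle:
        return list(iter_catalog_files(json.load(handle)))
```

If `load_catalog_files` returns an empty list for the `dataset_catalog.json` in an unpacked package, the assumed key names do not match and the file should be inspected directly.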

MD5 checksum file

The MD5 checksum file contains MD5 hash values for each file contained in the data package after decompression. These hash values can be used as a checksum to verify that a file has not changed as the result of an error during download or decompression. Each line of the MD5 checksum file corresponds to a file in the package after decompression, where the first column contains the MD5 hash value and the second column contains the path to the file.

  • Path: md5sum.txt
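
The check described above can be scripted. The following Python sketch recomputes each hash with the standard library `hashlib` module and compares it with the recorded value. It assumes that the paths in `md5sum.txt` are relative to the directory that contains that file, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(package_dir, checksum_file="md5sum.txt"):
    """Return (path, status) pairs; status is 'OK', 'FAILED', or 'MISSING'."""
    package_dir = Path(package_dir)
    results = []
    with open(package_dir / checksum_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # First column: MD5 hash. Second column: path to the file.
            expected, relative_path = line.split(None, 1)
            relative_path = relative_path.lstrip("*")  # md5sum binary-mode marker
            target = package_dir / relative_path
            if not target.is_file():
                status = "MISSING"
            elif md5_of_file(target) == expected.lower():
                status = "OK"
            else:
                status = "FAILED"
            results.append((relative_path, status))
    return results
```

For the package unpacked above, `verify_package("GRCh38")` should report `OK` for every listed file; any `FAILED` or `MISSING` entry suggests downloading the package again.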

Generated November 25, 2024