NCBI Datasets Gene Package

Sequences and metadata for a set of requested genes

The NCBI Datasets Gene Data Package contains sequences and metadata for a set of requested genes. Users can simultaneously select gene, transcript, protein, CDS, 5'-UTR and 3'-UTR sequences in FASTA format, data reports containing metadata in JSON Lines format, and a subset of metadata in tabular format. There are two types of gene data package: a eukaryotic gene data package and a prokaryotic gene data package. The differences between the two types are described below.

Package content

NCBI Datasets Eukaryotic Gene Data Package

This example of Human Breast Cancer gene 1 (symbol: BRCA1; GeneID: 672) illustrates the contents of a eukaryotic gene data package, with the default files included.

datasets download gene gene-id 672 --filename human-brca1.zip
unzip human-brca1.zip -d human-brca1
tree human-brca1

human-brca1
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- protein.faa
        `-- rna.fna

NCBI Datasets Prokaryotic Gene Data Package

This example of E. coli restriction endonuclease (WP_000769114.1) illustrates the contents of a typical prokaryotic gene data package, with its default files.

datasets download gene accession WP_000769114.1 --filename endonuclease.zip
unzip endonuclease.zip -d endonuclease
tree endonuclease

endonuclease
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- annotation_report.jsonl
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- gene.fna
        `-- protein.faa

Data Package Files

Data Reports

Gene data report

The gene data report contains metadata describing the genes in the data package. The file is in JSON Lines format, where each line is the metadata for one gene. The dataformat tool is available for easy conversion to a tabular format of selected fields. The content of the gene data report differs in the eukaryotic and prokaryotic data packages. For details, see the schemas below.

Eukaryotic gene data report (Gene Report)

Prokaryotic gene data report (Prokaryotic gene report)

Gene product report

The gene product report contains metadata describing the gene products, including transcripts and proteins, for the genes in the data package. The file is in JSON Lines format, where each line is the metadata for one gene. The dataformat tool is available for easy conversion to a tabular format of selected fields.

Path: ncbi_dataset/data/product_report.jsonl
Schema: Gene Product Report

Gene annotation report

The gene annotation report contains metadata describing the annotated locations of the genes in the data package and is only provided for WP_ accessions. The file is in JSON Lines format, where each line is the metadata for one gene. Use the dataformat tool for easy conversion to a tabular format of selected fields.
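Because each report stores one record per line, ordinary line-oriented tools are enough for a first look at a report. The following is a minimal sketch, not part of the NCBI tooling: the function names `count_report_records` and `print_report_record` are hypothetical, and the sketch assumes only that blank lines, if any, are not records.

```shell
# Count the records in a JSON Lines report (one non-empty line per record).
# Usage: count_report_records REPORT.jsonl
count_report_records() {
    grep -c . "$1"
}

# Print the Nth record (1-based) of a report, skipping blank lines.
# Usage: print_report_record REPORT.jsonl N
print_report_record() {
    awk -v n="$2" 'NF { i++ } NF && i == n { print; exit }' "$1"
}
```

For example, `count_report_records human-brca1/ncbi_dataset/data/data_report.jsonl` would report how many records the BRCA1 package shown above contains.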

Gene data table

The gene data table is a tabular representation of a subset of the metadata in the gene data report and is only provided for eukaryotic genes. Each row of the data table represents one transcript of each gene in the data package.

The columns of the data table are Gene ID, Symbol, Gene name, Gene type, Scientific name, Transcripts, and Query.
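A column can be pulled out of such a table by its header name. The sketch below is illustrative only and rests on two assumptions that this page does not state: that the table is tab-separated and that its first row holds the column names listed above. The function name `table_column` is hypothetical.

```shell
# Print one column of a tab-separated table, selected by its header name.
# Assumes the first row is a header row; prints nothing if the name is absent.
# Usage: table_column TABLE_FILE "Column name"
table_column() {
    awk -F '\t' -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    ' "$1"
}
```

Selecting by name rather than by position keeps the command working if the column order ever differs from the list above.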

FASTA sequence files

You can request any of the following FASTA sequence files.

Gene FASTA

Example FASTA Defline:

>NC_000004.12:c122621066-122610108 IL21 [organism=Homo sapiens] [GeneID=59067] [chromosome=4]

Transcript FASTA

Example FASTA Defline:

>NM_021803.4 IL21 [organism=Homo sapiens] [GeneID=59067] [transcript=1]

Protein FASTA

Example FASTA Defline:

>NP_001193935.1 IL21 [organism=Homo sapiens] [GeneID=59067] [isoform=2 precursor]

5p-utr FASTA

  • Path: ncbi_dataset/data/5p_utr.fna
  • Schema: Nucleotide FASTA

Example FASTA Defline:

>NM_007297.4:1-194 BRCA1 [organism=Homo sapiens] [GeneID=672] [transcript=3] [region=5'utr]

3p-utr FASTA

  • Path: ncbi_dataset/data/3p_utr.fna

Example FASTA Defline:

>NM_007297.4:5646-7028 BRCA1 [organism=Homo sapiens] [GeneID=672] [transcript=3] [region=3'utr]

CDS FASTA

  • Path: ncbi_dataset/data/cds.fna

Example FASTA Defline:

>NM_007297.4:195-5645 BRCA1 [organism=Homo sapiens] [GeneID=672] [transcript=3] [region=cds]
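The deflines shown in this section share one shape: a sequence identifier followed by bracketed `[key=value]` tags, including `[GeneID=...]`. As a rough sketch of how that shape can be used, the hypothetical helpers below count the records in a FASTA file and list the identifier and GeneID of each record. They assume every defline follows the pattern of the examples above, which is not guaranteed for files from other sources.

```shell
# Count the sequence records in a FASTA file (one defline per record).
# Usage: count_fasta_records FILE
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print "identifier<TAB>GeneID" for every defline carrying a [GeneID=...] tag.
# Usage: list_gene_ids FILE
list_gene_ids() {
    awk '/^>/ && match($0, /\[GeneID=[0-9]+\]/) {
        # Skip the 8 characters of "[GeneID=" and drop the closing "]".
        id = substr($0, RSTART + 8, RLENGTH - 9)
        # The identifier is the first field without its leading ">".
        print substr($1, 2) "\t" id
    }' "$1"
}
```

Running `list_gene_ids` on a UTR or CDS file prints identifiers that include the coordinate range, such as `NM_007297.4:1-194`, because the range is part of the first defline field.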

Other files

README.md

The README contains a general project description common to all data packages.

  • Path: README.md

Dataset catalog

The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.

  • Path: ncbi_dataset/data/dataset_catalog.json

MD5 checksum file

The MD5 checksum file contains MD5 hash values for each file contained in the data package after decompression. These hash values can be used as a checksum to verify that a file has not changed as the result of an error during download or decompression. Each line of the MD5 checksum file corresponds to a file in the package after decompression, where the first column contains the MD5 hash value and the second column contains the path to the file.

  • Path: md5sum.txt
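The check described above can be sketched as a small shell function. This is an illustration, not an NCBI-provided script: the name `verify_package` is hypothetical, and the sketch assumes that the `md5sum` command from GNU coreutils is installed, that the two columns are separated by whitespace, and that paths in `md5sum.txt` are relative to the directory containing it.

```shell
# Verify each file listed in md5sum.txt against its recorded MD5 hash.
# Usage: verify_package PACKAGE_DIR   (the directory that contains md5sum.txt)
# Returns 0 if every file matches, 1 if any file is missing or differs.
verify_package() {
    pkg_dir=$1
    status=0
    # Each line: <MD5 hash> <path relative to the package directory>
    while read -r expected path; do
        [ -n "$expected" ] || continue   # skip blank lines
        actual=$(md5sum "$pkg_dir/$path" | cut -d ' ' -f 1)
        if [ "$actual" = "$expected" ]; then
            echo "OK       $path"
        else
            echo "MISMATCH $path"
            status=1
        fi
    done < "$pkg_dir/md5sum.txt"
    return "$status"
}
```

With the BRCA1 example unzipped as shown earlier, `verify_package human-brca1` would print one line per file and return a non-zero status if any file failed the check.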

Retrieve a gene package using one of these tools:

Generated November 25, 2024