NCBI Datasets Gene Package
Sequences and metadata for a set of requested genes
The NCBI Datasets Gene Data Package contains sequences and metadata for a set of requested genes. Users can request any combination of gene, transcript, protein, CDS, 5'-UTR and 3'-UTR sequences in FASTA format, data reports containing metadata in JSON Lines format, and a subset of metadata in tabular format. There are two types of gene data package: eukaryotic and prokaryotic. The differences between them are described below.
Package content
NCBI Datasets Eukaryotic Gene Data Package
This example of Human Breast Cancer gene 1 (symbol: BRCA1; GeneID: 672) illustrates the contents of a eukaryotic gene data package, with the default files included.
datasets download gene gene-id 672 --filename human-brca1.zip
unzip human-brca1.zip -d human-brca1
tree
human-brca1
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- protein.faa
        `-- rna.fna
NCBI Datasets Prokaryotic Gene Data Package
This example of E. coli restriction endonuclease (WP_000769114.1) illustrates the contents of a typical prokaryotic gene data package, with its default files.
datasets download gene accession WP_000769114.1 --filename endonuclease.zip
unzip endonuclease.zip -d endonuclease
tree
endonuclease
|-- README.md
|-- md5sum.txt
`-- ncbi_dataset
    `-- data
        |-- annotation_report.jsonl
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- gene.fna
        `-- protein.faa
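To check which files a downloaded package contains without unzipping it, the archive can be inspected directly. The following is a minimal sketch using Python's standard `zipfile` module; the function name `list_package_files` is hypothetical and not part of any NCBI tool.

```python
import zipfile


def list_package_files(zip_path):
    """Return the sorted file paths stored in a data package zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Directory entries end with "/"; keep only real files.
        return sorted(
            name for name in archive.namelist() if not name.endswith("/")
        )
```

For example, calling `list_package_files("endonuclease.zip")` on the archive downloaded above should return the same paths that `tree` shows after extraction.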
Data Package Files
Data Reports
Gene data report
The gene data report contains metadata describing the genes in the data package. The file is in JSON Lines format, where each line is the metadata for one gene. The dataformat tool is available for easy conversion to a tabular format of selected fields. The content of the gene data report differs in the eukaryotic and prokaryotic data packages. For details, see the schemas below.
Eukaryotic gene data report (Gene Report)
- Path: ncbi_dataset/data/data_report.jsonl
- Schema: Gene Data Report
Prokaryotic gene data report (Prokaryotic gene report)
- Path: ncbi_dataset/data/data_report.jsonl
- Schema: Gene Data Report
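Because each line of a data report is a complete JSON document, the file can be read one record at a time without loading it whole. The sketch below assumes only that layout; `read_jsonl` is a hypothetical helper, and it makes no assumptions about field names, which are defined by the schema for each report.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. at the end of the file
                yield json.loads(line)
```

For example, `sum(1 for _ in read_jsonl("human-brca1/ncbi_dataset/data/data_report.jsonl"))` counts the genes in the eukaryotic package shown above. The same helper should also work for the product and annotation reports, since they share the JSON Lines format.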
Gene product report
The gene product report contains metadata describing the gene products, including transcripts and proteins, for the genes in the data package. The file is in JSON Lines format, where each line is the metadata for one gene. The dataformat tool is available for easy conversion to a tabular format of selected fields.
- Path: ncbi_dataset/data/product_report.jsonl
- Schema: Gene Product Report
Gene annotation report
The gene annotation report contains metadata describing the annotated locations of the genes in the data package and is only provided for WP_ accessions. The file is in JSON Lines format, where each line is the metadata for one gene. Use the dataformat tool for easy conversion to a tabular format of selected fields.
- Path: ncbi_dataset/data/annotation_report.jsonl
- Schema: Gene Annotation Report
Gene data table
The gene data table is a tabular representation of a subset of the metadata in the gene data report and is only provided for eukaryotic genes. Each row of the data table represents one transcript of a gene in the data package.
The columns of the data table are Gene ID, Symbol, Gene name, Gene type, Scientific name, Transcripts, and Query.
- Path: ncbi_dataset/data/data_table.tsv
- Schema: Gene Data Report Schema
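The data table can be loaded with any tab-separated reader. The sketch below uses Python's `csv` module and assumes that the first row of the file is a header row; the exact header strings in the file are not confirmed here, so check them before relying on a column name. `read_data_table` is a hypothetical helper name.

```python
import csv


def read_data_table(path):
    """Return the rows of a tab-separated data table as a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: the first row holds the column names.
        return list(csv.DictReader(handle, delimiter="\t"))
```

Since each row describes one transcript, a gene with several transcripts appears on several rows; grouping the rows by the gene identifier column recovers a per-gene view.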
FASTA sequence files
You can request any of the following six FASTA sequence files.
Gene FASTA
- Path: ncbi_dataset/data/gene.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NC_000004.12:c122621066-122610108 IL21 [organism=Homo sapiens] [GeneID=59067] [chromosome=4]
Transcript FASTA
- Path: ncbi_dataset/data/rna.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NM_021803.4 IL21 [organism=Homo sapiens] [GeneID=59067] [transcript=1]
Protein FASTA
- Path: ncbi_dataset/data/protein.faa
- Schema: Protein FASTA
Example FASTA Defline:
>NP_001193935.1 IL21 [organism=Homo sapiens] [GeneID=59067] [isoform=2 precursor]
5p-utr FASTA
- Path: ncbi_dataset/data/5p_utr.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NM_007297.4:1-194 BRCA1 [organism=Homo sapiens] [GeneID=672] [transcript=3] [region=5'utr]
3p-utr FASTA
- Path: ncbi_dataset/data/3p_utr.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NM_007297.4:5646-7028 BRCA1 [organism=Homo sapiens] [GeneID=672] [transcript=3] [region=3'utr]
CDS FASTA
- Path: ncbi_dataset/data/cds.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NM_007297.4:195-5645 BRCA1 [organism=Homo sapiens] [GeneID=672] [transcript=3] [region=cds]
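The example deflines above share a common layout: a sequence identifier, a gene symbol, and then a series of bracketed `key=value` attributes. The following sketch splits a defline into those parts. It assumes that every defline follows the layout of the examples shown here, which may not hold for all packages (prokaryotic deflines in particular are not illustrated above); `parse_defline` is a hypothetical helper name.

```python
import re

# Matches one bracketed attribute such as "[GeneID=672]".
_ATTRIBUTE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(defline):
    """Split a defline into sequence ID, gene symbol, and attributes."""
    text = defline.lstrip(">").strip()
    # Everything before the first "[" holds the sequence ID and the symbol.
    head, _, _ = text.partition("[")
    parts = head.split()
    return {
        "id": parts[0] if parts else "",
        "symbol": parts[1] if len(parts) > 1 else "",
        "attributes": dict(_ATTRIBUTE.findall(text)),
    }
```

Applied to the 5p-utr example, this gives the ID `NM_007297.4:1-194`, the symbol `BRCA1`, and an attribute dictionary in which `region` maps to `5'utr`. All attribute values are returned as strings.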
Other files
README.md
The README contains a general project description common to all data packages.
- Path: README.md
Dataset catalog
The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.
- Path: ncbi_dataset/data/dataset_catalog.json
MD5 checksum file
The MD5 checksum file contains MD5 hash values for each file contained in the data package after decompression. These hash values can be used as a checksum to verify that a file has not changed as the result of an error during download or decompression. Each line of the MD5 checksum file corresponds to a file in the package after decompression, where the first column contains the MD5 hash value and the second column contains the path to the file.
- Path: md5sum.txt
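The two-column layout described above makes verification straightforward. The sketch below recomputes each hash with Python's `hashlib` and reports any file that does not match. It assumes that the paths in `md5sum.txt` are relative to the directory the package was extracted into and that the columns are separated by whitespace; `verify_md5` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def verify_md5(package_dir, checksum_name="md5sum.txt"):
    """Return the paths whose MD5 hash differs from the recorded value."""
    root = Path(package_dir)
    mismatches = []
    lines = (root / checksum_name).read_text(encoding="utf-8").splitlines()
    for line in lines:
        if not line.strip():
            continue
        # First column: MD5 hash. Second column: path to the file
        # (assumed relative to the package root).
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        actual = hashlib.md5((root / rel_path).read_bytes()).hexdigest()
        if actual != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

An empty list from `verify_md5("human-brca1")` means every listed file matches its recorded hash. A file that is listed but missing raises `FileNotFoundError` rather than being reported as a mismatch.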
Related information
Retrieve a gene package using one of these tools:
- See the NCBI Datasets Gene Page to download gene data packages from the web
- Learn how to download gene data packages using our command-line tools