Gene report
Gene record metadata
The downloaded gene package contains a gene data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the gene data report file is a hierarchical JSON object that represents a single gene record. The schema of the gene record is defined in the tables below. Each row describes either a single field in the report or a sub-structure, which is a collection of fields.
The outermost structure of the report is GeneDescriptor.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
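Because each line of the report is an independent JSON object, the file can be streamed one record at a time. The following is a minimal Python sketch, assuming the package has been unzipped so that the path above exists; `iter_gene_records` is an illustrative helper name and is not part of any NCBI tool.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_gene_records(path: str) -> Iterator[dict]:
    """Yield one parsed gene record (a GeneDescriptor object) per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines such as a trailing newline
                yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the text above; adjust it to where the package was unzipped.
    for gene in iter_gene_records("ncbi_dataset/data/data_report.jsonl"):
        print(gene["geneId"], gene.get("symbol"))
```

Streaming line by line avoids loading the whole report into memory, which matters for packages that contain many genes.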
Sample report
{
  "annotations": [
    {
      "annotationName": "GCF_000001405.40-RS_2024_08",
      "annotationReleaseDate": "2024-08-23",
      "assemblyAccession": "GCF_000001405.40",
      "assemblyName": "GRCh38.p14",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    },
    {
      "annotationName": "GCF_009914755.1-RS_2024_08",
      "annotationReleaseDate": "2024-08-23",
      "assemblyAccession": "GCF_009914755.1",
      "assemblyName": "T2T-CHM13v2.0",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_060943.1",
          "genomicRange": {
            "begin": "61441599",
            "end": "61449907",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    }
  ],
  "chromosomes": [
    "19"
  ],
  "commonName": "human",
  "description": "alpha-1-B glycoprotein",
  "ensemblGeneIds": [
    "ENSG00000121410"
  ],
  "geneGroups": [
    {
      "id": "1",
      "method": "NCBI Ortholog"
    }
  ],
  "geneId": "1",
  "nomenclatureAuthority": {
    "authority": "HGNC",
    "identifier": "HGNC:5"
  },
  "omimIds": [
    "138670"
  ],
  "orientation": "minus",
  "proteinCount": 1,
  "summary": [
    {
      "description": "The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]"
    }
  ],
  "swissProtAccessions": [
    "P04217"
  ],
  "symbol": "A1BG",
  "synonyms": [
    "A1B",
    "ABG",
    "GAB",
    "HYST2477"
  ],
  "taxId": "9606",
  "taxname": "Homo sapiens",
  "transcriptCount": 1,
  "transcriptTypeCounts": [
    {
      "count": 1,
      "type": "PROTEIN_CODING"
    }
  ],
  "type": "PROTEIN_CODING"
}
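The sample report nests each gene location under annotations, then genomicLocations, then genomicRange. Although begin and end are typed uint64 in the schema, the sample serializes them as JSON strings. The sketch below walks that nesting and converts the coordinates to integers. `gene_spans` is a hypothetical helper name, and `SAMPLE_RECORD` is a trimmed copy of the sample report above.

```python
# Trimmed copy of the sample report: only the fields this sketch reads.
SAMPLE_RECORD = {
    "geneId": "1",
    "symbol": "A1BG",
    "annotations": [
        {
            "assemblyName": "GRCh38.p14",
            "genomicLocations": [
                {
                    "genomicAccessionVersion": "NC_000019.10",
                    "genomicRange": {
                        "begin": "58345183",
                        "end": "58353492",
                        "orientation": "minus",
                    },
                }
            ],
        },
        {
            "assemblyName": "T2T-CHM13v2.0",
            "genomicLocations": [
                {
                    "genomicAccessionVersion": "NC_060943.1",
                    "genomicRange": {
                        "begin": "61441599",
                        "end": "61449907",
                        "orientation": "minus",
                    },
                }
            ],
        },
    ],
}


def gene_spans(record: dict) -> list:
    """Return (assembly name, accession, begin, end, orientation) per location."""
    spans = []
    for annotation in record.get("annotations", []):
        for location in annotation.get("genomicLocations", []):
            rng = location["genomicRange"]
            spans.append(
                (
                    annotation["assemblyName"],
                    location["genomicAccessionVersion"],
                    int(rng["begin"]),  # serialized as a string in the sample
                    int(rng["end"]),
                    rng["orientation"],
                )
            )
    return spans


for span in gene_spans(SAMPLE_RECORD):
    print(span)
```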
GeneDescriptor Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
symbol | symbol | Symbol | string | gene symbol | GNAS |
description | description | Description | string | gene name | GNAS complex locus |
taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
commonName | common-name | Common Name | string | Common name of the organism | human |
type | gene-type | Gene Type | GeneType | ||
rnaType | rna-type | RNA Type | RnaType | ||
orientation | orientation | Orientation | Orientation | ||
referenceStandards repeated | ref-standard- | Reference Standard | GenomicRegion | Clinical reference standard NG | |
genomicRegions repeated | genomic-region- | Genomic Region | GenomicRegion | Pseudogene, non-genic regulatory element and other genomic region NG | |
chromosomes repeated | chromosomes | Chromosomes | string | | 1 X,Y |
nomenclatureAuthority | name- | Nomenclature | NomenclatureAuthority | ||
swissProtAccessions repeated | swissprot-accessions | SwissProt Accessions | string | ||
ensemblGeneIds repeated | ensembl-geneids | Ensembl GeneIDs | string | ||
omimIds repeated | omim-ids | OMIM IDs | string | ||
synonyms repeated | synonyms | Synonyms | string | ||
replacedGeneId | replaced-gene-id | Replaced NCBI GeneID | uint64 | The NCBI Gene ID for the gene that was merged into the current gene record | |
annotations repeated | annotation- | Annotation | Annotation | ||
transcriptCount | transcript-count | Transcripts | uint32 | ||
proteinCount | protein-count | Proteins | uint32 | ||
transcriptTypeCounts repeated | | | TranscriptTypeCount | | |
geneGroups repeated | group- | Gene Group | GeneGroup | ||
summary repeated | summary-source | Summary Source | GeneSummary |
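The sample report contains no rnaType, referenceStandards, genomicRegions, or replacedGeneId keys, which suggests that fields without a value are omitted from the JSON rather than written as empty. Code that reads a GeneDescriptor may therefore want to supply defaults. The sketch below is one way to do that; `summarize_gene` and its output keys are hypothetical names, and it assumes that geneId is always present and that a missing type can be treated as UNKNOWN.

```python
def summarize_gene(record: dict) -> dict:
    """Flatten a few GeneDescriptor fields, supplying defaults for absent keys."""
    authority = record.get("nomenclatureAuthority", {})
    return {
        "gene_id": int(record["geneId"]),  # assumption: geneId is always present
        "symbol": record.get("symbol", ""),
        "gene_type": record.get("type", "UNKNOWN"),  # assumption: absent means UNKNOWN
        "chromosomes": record.get("chromosomes", []),
        "synonyms": record.get("synonyms", []),
        "nomenclature_id": authority.get("identifier"),
        "replaced_gene_id": record.get("replacedGeneId"),  # None when no merge occurred
    }


example = {
    "geneId": "1",
    "symbol": "A1BG",
    "type": "PROTEIN_CODING",
    "chromosomes": ["19"],
    "nomenclatureAuthority": {"authority": "HGNC", "identifier": "HGNC:5"},
}
print(summarize_gene(example))
```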
Annotation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
assemblyAccession | assembly-accession | Assembly Accession | string | ||
assemblyName | assembly-name | Assembly Name | string | ||
annotationName | release-name | Release Name | string | ||
annotationReleaseDate | release-date | Release Date | string | ||
genomicLocations repeated | genomic-range- | Genomic Range | GenomicLocation |
GeneGroup Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
id | id | Identifier | string | ||
method | method | Method | string |
GeneSummary Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
source | source | Source | string | ||
description | description | Description | string | ||
date | date | Date | string |
GenomicLocation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
genomicAccessionVersion | accession | Accession | string | ||
sequenceName | seq-name | Seq Name | string | ||
genomicRange | range- | Range | |||
exons repeated | exon- | Exons | Range |
GenomicRegion Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneRange | gene-range- | Gene Range | SeqRangeSet | The range of this Gene record on this genomic region. | |
type | genomic-region-type | Genomic Region Type | GenomicRegion.GenomicRegionType |
NomenclatureAuthority Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
authority | authority | Authority | string | The nomenclature authority for this gene record | HGNC |
identifier | id | ID | string | The nomenclature authority identifier for this gene record | HGNC:4392 |
Range Structure
A 1-based range on a sequence record.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
begin | start | Start | uint64 | ||
end | stop | Stop | uint64 | ||
orientation | orientation | Orientation | Orientation | ||
order | order | Order | uint32 | ||
ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. |
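Many programming languages index sequences from 0 and exclude the end position, so a 1-based Range usually needs converting before it is used to slice a sequence. The sketch below assumes that end is inclusive, which the table does not state explicitly; the function names are illustrative.

```python
def to_zero_based_half_open(rng: dict) -> tuple:
    """Convert a 1-based Range (end assumed inclusive) to a 0-based [start, stop) pair."""
    begin = int(rng["begin"])  # values may arrive as JSON strings
    end = int(rng["end"])
    return begin - 1, end


def range_length(rng: dict) -> int:
    """Number of bases covered by the range, under the same inclusive-end assumption."""
    start, stop = to_zero_based_half_open(rng)
    return stop - start


# Coordinates of the GRCh38.p14 location in the sample report.
sample_range = {"begin": "58345183", "end": "58353492", "orientation": "minus"}
start, stop = to_zero_based_half_open(sample_range)
print(start, stop, range_length(sample_range))
```

With this convention, `sequence[start:stop]` in Python would select the bases covered by the range on the plus strand; handling the minus orientation (reverse complement) is left out of the sketch.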
SeqRangeSet Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
range repeated | range- | Range | Series of intervals on above accession_version |
TranscriptTypeCount Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
type | | | Transcript.TranscriptType | | |
count | coming soon | coming soon | uint32 |
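Each TranscriptTypeCount entry pairs a transcript type with a count, so the repeated transcriptTypeCounts field can be collapsed into a per-type tally. In the sample report the counts add up to transcriptCount; whether that holds for every record is an assumption, not something the schema states. `transcript_type_totals` is a hypothetical helper name.

```python
from collections import Counter


def transcript_type_totals(record: dict) -> Counter:
    """Tally transcripts per Transcript.TranscriptType name for one gene record."""
    totals = Counter()
    for entry in record.get("transcriptTypeCounts", []):
        totals[entry["type"]] += int(entry["count"])
    return totals


# Values copied from the sample report.
record = {
    "transcriptCount": 1,
    "transcriptTypeCounts": [{"count": 1, "type": "PROTEIN_CODING"}],
}
totals = transcript_type_totals(record)
print(dict(totals), sum(totals.values()) == record["transcriptCount"])
```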
GeneType Enumeration
NB: GeneType values match Entrez Gene
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
tRNA | 1 | |
rRNA | 2 | |
snRNA | 3 | |
scRNA | 4 | |
snoRNA | 5 | |
PROTEIN_CODING | 6 | |
PSEUDO | 7 | these will have NG or NR |
TRANSPOSON | 8 | |
miscRNA | 9 | |
ncRNA | 10 | |
BIOLOGICAL_REGION | 11 | these will have NG |
OTHER | 255 |
GenomicRegion.GenomicRegionType Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
REFSEQ_GENE | 1 | |
PSEUDOGENE | 2 | |
BIOLOGICAL_REGION | 3 | |
OTHER | 4 |
Orientation Enumeration
Name | Number | Description |
---|---|---|
none | 0 | |
plus | 1 | |
minus | 2 |
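In the sample report, orientation appears as the enumeration name (for example "minus") rather than the number. The sketch below mirrors the table as a Python enum and maps each value to a strand symbol. The "+", "-", and "." symbols follow the common GFF convention and are an assumption of this example, not part of the report schema.

```python
from enum import IntEnum


class Orientation(IntEnum):
    """Mirror of the Orientation enumeration table above."""
    none = 0
    plus = 1
    minus = 2


# Assumption: GFF-style strand symbols; the report itself only uses the names.
STRAND_SYMBOL = {
    Orientation.none: ".",
    Orientation.plus: "+",
    Orientation.minus: "-",
}


def strand_symbol(name: str) -> str:
    """Translate an orientation name from the report into a strand symbol."""
    return STRAND_SYMBOL[Orientation[name]]  # raises KeyError for unknown names


print(strand_symbol("minus"))
```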
RnaType Enumeration
Name | Number | Description |
---|---|---|
rna_UNKNOWN | 0 | |
premsg | 1 | |
tmRna | 2 |
Transcript.TranscriptType Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
PROTEIN_CODING | 1 | |
NON_CODING | 2 | |
PROTEIN_CODING_MODEL | 3 | |
NON_CODING_MODEL | 4 |
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |