Gene report

Gene record metadata

The downloaded gene package contains a gene data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the gene data report file is a hierarchical JSON object that represents a single gene record. The tables below define the schema of the gene record: each row describes either a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is GeneDescriptor.
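Because every line is a complete JSON object, the report can be read one record at a time. The following is a minimal sketch, not an official NCBI client: the function names `iter_gene_records` and `read_gene_report` are hypothetical, and the path passed to `read_gene_report` is assumed to point at an unzipped gene package.

```python
import json


def iter_gene_records(lines):
    """Yield one gene record (a dict) per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, such as a trailing newline
            yield json.loads(line)


def read_gene_report(path):
    """Read a whole data report file into a list of gene records.

    `path` is assumed to be something like
    "ncbi_dataset/data/data_report.jsonl" inside an unzipped package.
    """
    with open(path, encoding="utf-8") as handle:
        return list(iter_gene_records(handle))
```

For large packages, iterating with `iter_gene_records` over the open file avoids holding every record in memory at once.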

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform gene data reports from JSON Lines to tabular formats.

Sample report

{
  "annotations": [
    {
      "annotationName": "GCF_000001405.40-RS_2024_08",
      "annotationReleaseDate": "2024-08-23",
      "assemblyAccession": "GCF_000001405.40",
      "assemblyName": "GRCh38.p14",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    },
    {
      "annotationName": "GCF_009914755.1-RS_2024_08",
      "annotationReleaseDate": "2024-08-23",
      "assemblyAccession": "GCF_009914755.1",
      "assemblyName": "T2T-CHM13v2.0",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_060943.1",
          "genomicRange": {
            "begin": "61441599",
            "end": "61449907",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    }
  ],
  "chromosomes": [
    "19"
  ],
  "commonName": "human",
  "description": "alpha-1-B glycoprotein",
  "ensemblGeneIds": [
    "ENSG00000121410"
  ],
  "geneGroups": [
    {
      "id": "1",
      "method": "NCBI Ortholog"
    }
  ],
  "geneId": "1",
  "nomenclatureAuthority": {
    "authority": "HGNC",
    "identifier": "HGNC:5"
  },
  "omimIds": [
    "138670"
  ],
  "orientation": "minus",
  "proteinCount": 1,
  "summary": [
    {
      "description": "The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]"
    }
  ],
  "swissProtAccessions": [
    "P04217"
  ],
  "symbol": "A1BG",
  "synonyms": [
    "A1B",
    "ABG",
    "GAB",
    "HYST2477"
  ],
  "taxId": "9606",
  "taxname": "Homo sapiens",
  "transcriptCount": 1,
  "transcriptTypeCounts": [
    {
      "count": 1,
      "type": "PROTEIN_CODING"
    }
  ],
  "type": "PROTEIN_CODING"
}
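
The nesting in the sample report (annotations, then genomicLocations, then genomicRange) can be flattened into one row per location. This is a rough sketch under stated assumptions: `gene_locations` and the output key names are hypothetical, and `sample` is a trimmed copy of the record above rather than a full report line.

```python
def gene_locations(record):
    """Return one dict per genomic location found under record["annotations"]."""
    rows = []
    for annotation in record.get("annotations", []):
        for location in annotation.get("genomicLocations", []):
            rng = location.get("genomicRange", {})
            if "begin" not in rng or "end" not in rng:
                continue  # skip locations without a usable range
            rows.append({
                "assembly": annotation.get("assemblyName"),
                "accession": location.get("genomicAccessionVersion"),
                # begin/end are quoted strings in the sample, so convert them
                "begin": int(rng["begin"]),
                "end": int(rng["end"]),
                "orientation": rng.get("orientation"),
            })
    return rows


# Trimmed copy of the sample record shown above (first annotation only).
sample = {
    "annotations": [
        {
            "assemblyName": "GRCh38.p14",
            "genomicLocations": [
                {
                    "genomicAccessionVersion": "NC_000019.10",
                    "genomicRange": {
                        "begin": "58345183",
                        "end": "58353492",
                        "orientation": "minus",
                    },
                    "sequenceName": "19",
                }
            ],
        }
    ],
    "geneId": "1",
    "symbol": "A1BG",
}

locations = gene_locations(sample)
```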

GeneDescriptor Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
| symbol | symbol | Symbol | string | gene symbol | GNAS |
| description | description | Description | string | gene name | GNAS complex locus |
| taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
| taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
| commonName | common-name | Common Name | string | Common name of the organism | human |
| type | gene-type | Gene Type | GeneType | | |
| rnaType | rna-type | RNA Type | RnaType | | |
| orientation | orientation | Orientation | Orientation | | |
| referenceStandards repeated | ref-standard- | Reference Standard | GenomicRegion | Clinical reference standard NG | |
| genomicRegions repeated | genomic-region- | Genomic Region | GenomicRegion | Pseudogene, non-genic regulatory element and other genomic region NG | |
| chromosomes repeated | chromosomes | Chromosomes | string | | 1; X,Y |
| nomenclatureAuthority | name- | Nomenclature | NomenclatureAuthority | | |
| swissProtAccessions repeated | swissprot-accessions | SwissProt Accessions | string | | |
| ensemblGeneIds repeated | ensembl-geneids | Ensembl GeneIDs | string | | |
| omimIds repeated | omim-ids | OMIM IDs | string | | |
| synonyms repeated | synonyms | Synonyms | string | | |
| replacedGeneId | replaced-gene-id | Replaced NCBI GeneID | uint64 | The NCBI Gene ID for the gene that was merged into the current gene record | |
| annotations repeated | annotation- | Annotation | Annotation | | |
| transcriptCount | transcript-count | Transcripts | uint32 | | |
| proteinCount | protein-count | Proteins | uint32 | | |
| transcriptTypeCounts repeated | | | TranscriptTypeCount | | |
| geneGroups repeated | group- | Gene Group | GeneGroup | | |
| summary repeated | summary-source | Summary Source | GeneSummary | | |

Annotation Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| assemblyAccession | assembly-accession | Assembly Accession | string | | |
| assemblyName | assembly-name | Assembly Name | string | | |
| annotationName | release-name | Release Name | string | | |
| annotationReleaseDate | release-date | Release Date | string | | |
| genomicLocations repeated | genomic-range- | Genomic Range | GenomicLocation | | |

GeneGroup Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| id | id | Identifier | string | | |
| method | method | Method | string | | |

GeneSummary Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| source | source | Source | string | | |
| description | description | Description | string | | |
| date | date | Date | string | | |

GenomicLocation Structure

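| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| genomicAccessionVersion | accession | Accession | string | | |
| sequenceName | seq-name | Seq Name | string | | |
| genomicRange | range- | | Range | | |
| exons repeated | exon- | Exons | Range | | |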

GenomicRegion Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| geneRange | gene-range- | Gene Range | SeqRangeSet | The range of this Gene record on this genomic region. | |
| type | genomic-region-type | Genomic Region Type | GenomicRegion.GenomicRegionType | | |

NomenclatureAuthority Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| authority | authority | Authority | string | The nomenclature authority for this gene record | HGNC |
| identifier | id | ID | string | The nomenclature authority identifier for this gene record | HGNC:4392 |

Range Structure

A 1-based range on a sequence record.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| begin | start | Start | uint64 | | |
| end | stop | Stop | uint64 | | |
| orientation | orientation | Orientation | Orientation | | |
| order | order | Order | uint32 | | |
| ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. | |
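
Working with 1-based coordinates usually comes down to two small conversions: computing a length and translating to the 0-based, half-open convention that Python slicing uses. The sketch below assumes that both `begin` and `end` are inclusive, which this page does not state explicitly; the function names are hypothetical.

```python
def range_length(begin, end):
    """Length in bases of a 1-based range, assuming both ends are inclusive."""
    begin, end = int(begin), int(end)  # values may arrive as JSON strings
    if begin < 1 or end < begin:
        raise ValueError(f"invalid 1-based range: {begin}..{end}")
    return end - begin + 1


def to_zero_based_half_open(begin, end):
    """Convert a 1-based inclusive range to a 0-based (start, stop) pair."""
    begin, end = int(begin), int(end)
    if begin < 1 or end < begin:
        raise ValueError(f"invalid 1-based range: {begin}..{end}")
    return begin - 1, end


# Coordinates taken from the GRCh38.p14 location in the sample report.
length = range_length("58345183", "58353492")
start, stop = to_zero_based_half_open("58345183", "58353492")
```

Under the inclusive assumption, `stop - start` equals `length`, so `sequence[start:stop]` would select exactly the bases covered by the range.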

SeqRangeSet Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
| range repeated | range- | | Range | Series of intervals on above accession_version | |

TranscriptTypeCount Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| type | | | Transcript.TranscriptType | | |
| count | coming soon | coming soon | uint32 | | |

GeneType Enumeration

NB: GeneType values match Entrez Gene

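| Name | Number | Description |
|---|---|---|
| UNKNOWN | 0 | |
| tRNA | 1 | |
| rRNA | 2 | |
| snRNA | 3 | |
| scRNA | 4 | |
| snoRNA | 5 | |
| PROTEIN_CODING | 6 | |
| PSEUDO | 7 | these will have NG or NR |
| TRANSPOSON | 8 | |
| miscRNA | 9 | |
| ncRNA | 10 | |
| BIOLOGICAL_REGION | 11 | these will have NG |
| OTHER | 255 | |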
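
The sample report stores the gene type by name ("PROTEIN_CODING"), while the table above also lists a number for each value. A minimal sketch of mirroring the enumeration in Python so that either form can be accepted; the `GeneType` class and `parse_gene_type` helper are illustrative and are not generated from the official schema.

```python
from enum import IntEnum


class GeneType(IntEnum):
    """Names and numbers copied from the GeneType table above."""
    UNKNOWN = 0
    tRNA = 1
    rRNA = 2
    snRNA = 3
    scRNA = 4
    snoRNA = 5
    PROTEIN_CODING = 6
    PSEUDO = 7
    TRANSPOSON = 8
    miscRNA = 9
    ncRNA = 10
    BIOLOGICAL_REGION = 11
    OTHER = 255


def parse_gene_type(value):
    """Accept an enum name (as in the JSON sample) or its number."""
    if isinstance(value, str):
        return GeneType[value]  # raises KeyError for an unknown name
    return GeneType(value)      # raises ValueError for an unknown number


coding = parse_gene_type("PROTEIN_CODING")
other = parse_gene_type(255)
```

Lookup is case-sensitive because the table mixes casing styles (`tRNA`, `PSEUDO`), so names are matched exactly as listed.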

GenomicRegion.GenomicRegionType Enumeration

| Name | Number | Description |
|---|---|---|
| UNKNOWN | 0 | |
| REFSEQ_GENE | 1 | |
| PSEUDOGENE | 2 | |
| BIOLOGICAL_REGION | 3 | |
| OTHER | 4 | |

Orientation Enumeration

| Name | Number | Description |
|---|---|---|
| none | 0 | |
| plus | 1 | |
| minus | 2 | |

RnaType Enumeration

| Name | Number | Description |
|---|---|---|
| rna_UNKNOWN | 0 | |
| premsg | 1 | |
| tmRna | 2 | |

Transcript.TranscriptType Enumeration

| Name | Number | Description |
|---|---|---|
| UNKNOWN | 0 | |
| PROTEIN_CODING | 1 | |
| NON_CODING | 2 | |
| PROTEIN_CODING_MODEL | 3 | |
| NON_CODING_MODEL | 4 | |

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
|---|---|---|---|---|---|
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
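
In the sample report, fields typed uint64 (geneId, taxId, begin, end) appear as quoted strings, while uint32 fields (proteinCount, transcriptCount) appear as JSON numbers. This is consistent with the standard protocol buffers JSON mapping, which writes 64-bit integers as strings. A small sketch of converting such fields after parsing; the helper name and the `UINT64_FIELDS` tuple are hypothetical and cover only the top-level uint64 fields listed in the GeneDescriptor table.

```python
# Top-level GeneDescriptor fields typed uint64 in the tables above.
UINT64_FIELDS = ("geneId", "taxId", "replacedGeneId")


def coerce_uint64_fields(record, fields=UINT64_FIELDS):
    """Return a shallow copy of `record` with the named fields converted to int.

    Fields that are absent are left out; the input dict is not modified.
    """
    converted = dict(record)
    for name in fields:
        if name in converted:
            converted[name] = int(converted[name])
    return converted


# Values copied from the sample report.
raw = {"geneId": "1", "taxId": "9606", "proteinCount": 1, "symbol": "A1BG"}
typed = coerce_uint64_fields(raw)
```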
Generated November 25, 2024