Prokaryote gene report
Prokaryote gene record identifiers, protein info, and taxonomic scope
The downloaded prokaryote package contains a prokaryote gene data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the prokaryote gene data report file is a hierarchical JSON object that represents a single prokaryote gene record. The schema of the prokaryote gene record is defined in the tables below, where each row describes either a single field in the report or a sub-structure (a collection of fields). The outermost structure of the report is ProkaryoteGene.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform prokaryote gene data reports from JSON Lines to tabular formats.
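For readers who want to inspect the report without the dataformat tool, the following is a minimal Python sketch that reads JSON Lines records and writes a tab-separated table. It is not the dataformat implementation: the function name `jsonl_to_tsv` and the `COLUMNS` mapping are hypothetical, and the column headers are simply copied from the Table Column Name entries of the ProkaryoteGene table. Only top-level scalar fields are handled.

```python
import csv
import io
import json

# Column header -> JSON field name, copied from the ProkaryoteGene table.
# This mapping is an illustrative choice, not something the package defines.
COLUMNS = {
    "Accession": "accession",
    "Gene Symbol": "geneSymbol",
    "Protein Name": "proteinName",
    "Protein Length": "proteinLength",
    "Number of Genome Mappings": "numberOfGenomeMappings",
}


def jsonl_to_tsv(lines, out):
    """Write a header row, then one TSV row per non-empty JSON Lines record."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS.keys())
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Fields absent from a record become empty cells.
        writer.writerow([record.get(field, "") for field in COLUMNS.values()])


# One record from the sample report, serialized on a single line as it
# would appear in data_report.jsonl.
sample_line = json.dumps({
    "accession": "WP_001435165.1",
    "geneSymbol": "merC",
    "numberOfGenomeMappings": 15,
    "proteinLength": 137,
    "proteinName": "organomercurial transporter MerC",
})

buffer = io.StringIO()
jsonl_to_tsv([sample_line], buffer)
print(buffer.getvalue(), end="")
```

To run it against a downloaded package, the same function can be given an open file handle for ncbi_dataset/data/data_report.jsonl in place of the one-element list.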
Sample report
{
  "accession": "WP_001435165.1",
  "geneSymbol": "merC",
  "numberOfGenomeMappings": 15,
  "proteinLength": 137,
  "proteinName": "organomercurial transporter MerC",
  "proteinNameEvidence": {
    "accession": "NF010318.0",
    "category": "HMM",
    "source": "NCBI Protein Cluster (PRK)"
  },
  "taxonomyScope": {
    "organismName": "Gammaproteobacteria",
    "taxId": 1236
  }
}
ProkaryoteGene Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | The RefSeq WP_ prefixed accession for the protein sequence. | WP_000443665.1 |
geneSymbol | gene-symbol | Gene Symbol | string | The gene symbol | ligA |
proteinName | protein-name | Protein Name | string | The protein name | NAD-dependent DNA ligase LigA |
proteinLength | protein-length | Protein Length | uint32 | Length of the protein | 671 |
taxonomyScope | | | Organism | | |
numberOfGenomeMappings | mapping-count | Number of Genome Mappings | uint32 | The number of nucleotide mappings | 7642 |
proteinNameEvidence | name-evidence- | Protein Name Evidence | ProkaryoteGene.ProteinNameEvidence | | |
description | description | Description | string | Description | Catalyzes the formation of a phosphodiester at the site of a single-strand break in duplex DNA |
ecNumber repeated | ec-number | EC Number | string | EC Number | 6.5.1.2 |
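As an illustration of how the hierarchical record maps onto typed objects, here is a hedged Python sketch using dataclasses. The class names mirror the structure names in this document, but the snake_case attribute names, the reduced `TaxonomyScope` class (which models only the two Organism fields seen in the sample report), and the `parse_prokaryote_gene` helper are assumptions made for this example, not part of any NCBI library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinNameEvidence:
    accession: Optional[str] = None
    category: Optional[str] = None
    source: Optional[str] = None


@dataclass
class TaxonomyScope:
    # Hypothetical, reduced view of Organism: only taxId and organismName.
    tax_id: Optional[int] = None
    organism_name: Optional[str] = None


@dataclass
class ProkaryoteGene:
    accession: Optional[str] = None
    gene_symbol: Optional[str] = None
    protein_name: Optional[str] = None
    protein_length: Optional[int] = None
    number_of_genome_mappings: Optional[int] = None
    description: Optional[str] = None
    ec_number: List[str] = field(default_factory=list)  # repeated field
    taxonomy_scope: Optional[TaxonomyScope] = None
    protein_name_evidence: Optional[ProteinNameEvidence] = None


def parse_prokaryote_gene(record: dict) -> ProkaryoteGene:
    """Build a ProkaryoteGene from one decoded JSON object; absent fields stay None."""
    scope = record.get("taxonomyScope")
    evidence = record.get("proteinNameEvidence")
    return ProkaryoteGene(
        accession=record.get("accession"),
        gene_symbol=record.get("geneSymbol"),
        protein_name=record.get("proteinName"),
        protein_length=record.get("proteinLength"),
        number_of_genome_mappings=record.get("numberOfGenomeMappings"),
        description=record.get("description"),
        ec_number=list(record.get("ecNumber", [])),
        taxonomy_scope=TaxonomyScope(
            tax_id=scope.get("taxId"),
            organism_name=scope.get("organismName"),
        ) if scope else None,
        protein_name_evidence=ProteinNameEvidence(
            accession=evidence.get("accession"),
            category=evidence.get("category"),
            source=evidence.get("source"),
        ) if evidence else None,
    )


# The record from the sample report above.
sample_record = {
    "accession": "WP_001435165.1",
    "geneSymbol": "merC",
    "numberOfGenomeMappings": 15,
    "proteinLength": 137,
    "proteinName": "organomercurial transporter MerC",
    "proteinNameEvidence": {
        "accession": "NF010318.0",
        "category": "HMM",
        "source": "NCBI Protein Cluster (PRK)",
    },
    "taxonomyScope": {"organismName": "Gammaproteobacteria", "taxId": 1236},
}

gene = parse_prokaryote_gene(sample_record)
print(gene.gene_symbol, gene.protein_length, gene.taxonomy_scope.organism_name)
```

Every attribute is optional in this sketch because the sample record itself omits description and ecNumber, so a consumer should not assume that all fields are present.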
InfraspecificNames Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford; boxer |
cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680; Pmale09 |
sex | sex | Sex | string | Male or female | female |
strain | strain | Strain | string | A genetic variant, subtype or culture within a species | SE11 |
LineageOrganism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
name | coming soon | coming soon | string | Scientific name | Coronaviridae |
Organism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606; 2697049 |
organismName | name | Name | string | Scientific name | Homo sapiens; Severe acute respiratory syndrome coronavirus 2 |
commonName | common-name | Common Name | string | Common name | human; pangolin; MERS; SARS2 |
lineage repeated | | | LineageOrganism | Lineage ordered from superkingdom level to increasingly more specific taxonomic entries | |
pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |
infraspecificNames | infraspecific- | Infraspecific Names | InfraspecificNames | | |
ProkaryoteGene.ProteinNameEvidence Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | Accession | NF005932.1 |
category | category | Category | string | Category | HMM |
source | source | Source | string | Source | NCBI Protein Cluster (PRK) |
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | bool | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
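Because the table maps uint32 to Python's arbitrary-precision int, a Python consumer gets no automatic range enforcement for fields such as proteinLength, numberOfGenomeMappings, or taxId. The sketch below shows one way to check that a decoded value fits the unsigned 32-bit range; the helper name `is_uint32` is hypothetical and the check is an optional sanity test, not something the report format requires.

```python
UINT32_MAX = 2**32 - 1  # largest value representable as uint32


def is_uint32(value) -> bool:
    """Return True if value is an integer within the uint32 range."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 0 <= value <= UINT32_MAX


# Values taken from the sample report: proteinLength and taxId.
print(is_uint32(137), is_uint32(1236), is_uint32(-1))
```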
Generated November 25, 2024