Virus report

Virus record metadata

The downloaded virus package contains a virus data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the virus data report file is a hierarchical JSON object that represents a single virus record. The schema of the virus record is defined in the tables below where each row describes a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is VirusAssembly.
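As a minimal sketch of reading this format, the Python below parses a JSON Lines stream one record at a time. The function name `iter_virus_records` is hypothetical and not part of any NCBI tool; the only assumption is that each non-blank line holds one complete JSON object, as described above.

```python
import json
from typing import Iterable, Iterator


def iter_virus_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


# Possible usage, assuming the package was unzipped in the working directory:
# with open("ncbi_dataset/data/data_report.jsonl", encoding="utf-8") as handle:
#     for record in iter_virus_records(handle):
#         print(record["accession"])
```

Reading line by line keeps memory use flat even for large packages, since only one record is held at a time.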

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform virus data reports from JSON Lines to tabular formats.
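For readers who want to see what a JSON Lines to tabular conversion involves, here is a rough Python sketch. It is not the dataformat tool and does not use its mnemonics: the column names and dotted paths in `COLUMNS`, and the helpers `get_path` and `records_to_tsv`, are hypothetical choices made for illustration only.

```python
import csv
import io
import json

# Hypothetical column -> dotted JSON path mapping, chosen for illustration only.
COLUMNS = {
    "Accession": "accession",
    "Length": "length",
    "Host name": "host.organismName",
    "Virus name": "virus.organismName",
    "Location": "location.geographicLocation",
}


def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return '' if a key is missing."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return ""
        value = value[key]
    return value


def records_to_tsv(lines):
    """Convert JSON Lines text (an iterable of lines) to a TSV string."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS.keys())
    for line in lines:
        if line.strip():
            record = json.loads(line)
            writer.writerow([get_path(record, path) for path in COLUMNS.values()])
    return out.getvalue()
```

Fields that are absent from a record become empty cells, so every row has the same number of columns.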

Sample report

{
  "accession": "NC_045512.2",
  "bioprojects": [
    "PRJNA485481"
  ],
  "completeness": "COMPLETE",
  "geneCount": 11,
  "host": {
    "lineage": [
      {
        "name": "cellular organisms",
        "taxId": 131567
      },
      {
        "name": "Eukaryota",
        "taxId": 2759
      },
      {
        "name": "Opisthokonta",
        "taxId": 33154
      },
      {
        "name": "Metazoa",
        "taxId": 33208
      },
      {
        "name": "Eumetazoa",
        "taxId": 6072
      },
      {
        "name": "Bilateria",
        "taxId": 33213
      },
      {
        "name": "Deuterostomia",
        "taxId": 33511
      },
      {
        "name": "Chordata",
        "taxId": 7711
      },
      {
        "name": "Craniata",
        "taxId": 89593
      },
      {
        "name": "Vertebrata",
        "taxId": 7742
      },
      {
        "name": "Gnathostomata",
        "taxId": 7776
      },
      {
        "name": "Teleostomi",
        "taxId": 117570
      },
      {
        "name": "Euteleostomi",
        "taxId": 117571
      },
      {
        "name": "Sarcopterygii",
        "taxId": 8287
      },
      {
        "name": "Dipnotetrapodomorpha",
        "taxId": 1338369
      },
      {
        "name": "Tetrapoda",
        "taxId": 32523
      },
      {
        "name": "Amniota",
        "taxId": 32524
      },
      {
        "name": "Mammalia",
        "taxId": 40674
      },
      {
        "name": "Theria",
        "taxId": 32525
      },
      {
        "name": "Eutheria",
        "taxId": 9347
      },
      {
        "name": "Boreoeutheria",
        "taxId": 1437010
      },
      {
        "name": "Euarchontoglires",
        "taxId": 314146
      },
      {
        "name": "Primates",
        "taxId": 9443
      },
      {
        "name": "Haplorrhini",
        "taxId": 376913
      },
      {
        "name": "Simiiformes",
        "taxId": 314293
      },
      {
        "name": "Catarrhini",
        "taxId": 9526
      },
      {
        "name": "Hominoidea",
        "taxId": 314295
      },
      {
        "name": "Hominidae",
        "taxId": 9604
      },
      {
        "name": "Homininae",
        "taxId": 207598
      },
      {
        "name": "Homo",
        "taxId": 9605
      },
      {
        "name": "Homo sapiens",
        "taxId": 9606
      }
    ],
    "organismName": "Homo sapiens",
    "taxId": 9606
  },
  "isAnnotated": true,
  "isolate": {
    "collectionDate": "2019-12",
    "name": "Wuhan-Hu-1"
  },
  "length": 29903,
  "location": {
    "geographicLocation": "China",
    "geographicRegion": "Asia"
  },
  "maturePeptideCount": 26,
  "nucleotide": {
    "sequenceHash": "A926D55E"
  },
  "proteinCount": 12,
  "releaseDate": "2020-01-13T00:00:00Z",
  "sourceDatabase": "RefSeq",
  "submitter": {
    "affiliation": "National Center for Biotechnology Information, NIH",
    "country": "USA",
    "names": [
      "Wu,F.",
      "Zhao,S.",
      "Yu,B.",
      "Chen,Y.M.",
      "Wang,W.",
      "Song,Z.G.",
      "Hu,Y.",
      "Tao,Z.W.",
      "Tian,J.H.",
      "Pei,Y.Y.",
      "Yuan,M.L.",
      "Zhang,Y.L.",
      "Dai,F.H.",
      "Liu,Y.",
      "Wang,Q.M.",
      "Zheng,J.J.",
      "Xu,L.",
      "Holmes,E.C.",
      "Zhang,Y.Z.",
      "Baranov,P.V.",
      "Henderson,C.M.",
      "Anderson,C.B.",
      "Gesteland,R.F.",
      "Atkins,J.F.",
      "Howard,M.T.",
      "Robertson,M.P.",
      "Igel,H.",
      "Baertsch,R.",
      "Haussler,D.",
      "Ares,M. Jr.",
      "Scott,W.G.",
      "Williams,G.D.",
      "Chang,R.Y.",
      "Brian,D.A.",
      "Chen,Y.-M.",
      "Song,Z.-G.",
      "Tao,Z.-W.",
      "Tian,J.-H.",
      "Pei,Y.-Y.",
      "Zhang,Y.-L.",
      "Dai,F.-H.",
      "Wang,Q.-M.",
      "Zheng,J.-J.",
      "Zhang,Y.-Z."
    ]
  },
  "updateDate": "2020-07-18T00:00:00Z",
  "virus": {
    "lineage": [
      {
        "name": "Viruses",
        "taxId": 10239
      },
      {
        "name": "Riboviria",
        "taxId": 2559587
      },
      {
        "name": "Orthornavirae",
        "taxId": 2732396
      },
      {
        "name": "Pisuviricota",
        "taxId": 2732408
      },
      {
        "name": "Pisoniviricetes",
        "taxId": 2732506
      },
      {
        "name": "Nidovirales",
        "taxId": 76804
      },
      {
        "name": "Cornidovirineae",
        "taxId": 2499399
      },
      {
        "name": "Coronaviridae",
        "taxId": 11118
      },
      {
        "name": "Orthocoronavirinae",
        "taxId": 2501931
      },
      {
        "name": "Betacoronavirus",
        "taxId": 694002
      },
      {
        "name": "Sarbecovirus",
        "taxId": 2509511
      },
      {
        "name": "Severe acute respiratory syndrome-related coronavirus",
        "taxId": 694009
      },
      {
        "name": "Severe acute respiratory syndrome coronavirus 2",
        "taxId": 2697049
      }
    ],
    "organismName": "Severe acute respiratory syndrome coronavirus 2",
    "pangolinClassification": "B",
    "taxId": 2697049
  }
}
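
The nested layout of the sample can be navigated with ordinary dictionary access. The sketch below uses a trimmed copy of the sample record (most fields are dropped and the virus lineage is shortened to three entries); the `summarize` helper and its output keys are hypothetical.

```python
# A trimmed copy of the sample record above; the lineage list is shortened.
record = {
    "accession": "NC_045512.2",
    "completeness": "COMPLETE",
    "host": {"organismName": "Homo sapiens", "taxId": 9606},
    "isolate": {"collectionDate": "2019-12", "name": "Wuhan-Hu-1"},
    "length": 29903,
    "virus": {
        "lineage": [
            {"name": "Viruses", "taxId": 10239},
            {"name": "Riboviria", "taxId": 2559587},
            {"name": "Coronaviridae", "taxId": 11118},
        ],
        "organismName": "Severe acute respiratory syndrome coronavirus 2",
        "pangolinClassification": "B",
        "taxId": 2697049,
    },
}


def summarize(rec):
    """Pull a few nested values out of one virus record.

    .get() with defaults is used because optional fields may be absent.
    """
    virus = rec.get("virus", {})
    host = rec.get("host", {})
    lineage_names = [entry["name"] for entry in virus.get("lineage", [])]
    return {
        "accession": rec.get("accession"),
        "virus": virus.get("organismName"),
        "host": host.get("organismName"),
        "lineage": " > ".join(lineage_names),
    }


summary = summarize(record)
print(summary["lineage"])
```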

VirusAssembly Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | The accession.version of the viral nucleotide sequence. Includes both GenBank and RefSeq accessions | NC_045512.2 |
| isAnnotated | is-annotated | Is Annotated | bool | The viral genome has been annotated by either the submitter (GenBank) or by NCBI (RefSeq) | |
| isolate | isolate- | Isolate | Isolate | | |
| sourceDatabase | sourcedb | Source database | string | Indicates if the source of the viral nucleotide record is from a GenBank submitter or from NCBI-derived curation (RefSeq) | RefSeq; GenBank |
| proteinCount | protein-count | Protein count | uint32 | The total count of annotated proteins including both proteins and polyproteins but not processed mature peptides | |
| host | host- | Host | Organism | Taxon from which the virus sample was isolated | |
| virus | virus- | Virus | Organism | Viral taxon | |
| bioprojects (repeated) | bioprojects | BioProjects | string | Associated BioProject accessions, when available | PRJNA485481 |
| location | geo- | Geographic | VirusAssembly.CollectionLocation | | |
| updateDate | update-date | Update date | string | Date the viral nucleotide accession was last updated in NCBI Virus | |
| releaseDate | release-date | Release date | string | Date the viral nucleotide accession was first released in NCBI Virus | |
| completeness | completeness | Completeness | VirusAssembly.Completeness | | |
| length | length | Length | uint32 | Length of the viral nucleotide sequence | |
| geneCount | gene-count | Gene count | uint32 | Total count of genes annotated on the viral nucleotide sequence | |
| maturePeptideCount | matpeptide-count | Mature peptide count | uint32 | Total count of processed mature peptides annotated on the viral nucleotide sequence | |
| biosample | biosample-acc | BioSample accession | string | Associated BioSample accessions | SAMN15394129 |
| molType | mol-type | Molecule type | string | ICTV (International Committee on Taxonomy of Viruses) viral classification based on nucleic acid composition, strandedness and method of replication | |
| nucleotide | | | SeqRangeSetFasta | The whole genomic nucleotide record of the CDS feature. | |
| purposeOfSampling | purpose-of-sampling | Purpose of Sampling | PurposeOfSampling | | |
| sraAccessions (repeated) | sra-accs | SRA Accessions | string | SRA accessions linked to the GenBank genome | |
| submitter | submitter- | Submitter | VirusAssembly.SubmitterInfo | Name, affiliation, and country of the submitter(s) | |
| labHost | lab-host | Lab Host | string | This sequence is from viruses passaged in this host | |
| isLabHost | is-lab-host | Is Lab Host | bool | If true, this sequence is from viruses passaged in a laboratory | |
| isVaccineStrain | is-vaccine-strain | Is Vaccine Strain | bool | If true, this sequence is derived from a virus used as a vaccine or potential vaccine | |

InfraspecificNames Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford; boxer |
| cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
| ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
| isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680; Pmale09 |
| sex | sex | Sex | string | Male or female | female |
| strain | strain | Strain | string | A genetic variant, subtype or culture within a species | SE11 |

Isolate Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| name | lineage | Lineage | string | BioSample harmonized attribute names https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/ | |
| source | lineage-source | Lineage source | string | Source material from which the viral specimen was isolated | blood; feces; lung |
| collectionDate | collection-date | Collection date | string | The collection date for the sample from which the viral nucleotide sequence was derived | |

LineageOrganism Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
| name | coming soon | coming soon | string | Scientific name | Coronaviridae |

Organism Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606; 2697049 |
| organismName | name | Name | string | Scientific name | Homo sapiens; Severe acute respiratory syndrome coronavirus 2 |
| commonName | common-name | Common Name | string | Common name | human; pangolin; MERS; SARS2 |
| lineage (repeated) | | | LineageOrganism | Lineage ordered from superkingdom level to increasingly more specific taxonomic entries | |
| pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |
| infraspecificNames | infraspecific- | Infraspecific Names | InfraspecificNames | | |

Range Structure

A 1-based range on a sequence record.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| begin | start | Start | uint64 | | |
| end | stop | Stop | uint64 | | |
| orientation | orientation | Orientation | Orientation | | |
| order | order | Order | uint32 | | |
| ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. | |
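
Because Range coordinates are 1-based, they need shifting before they can index a Python string. The sketch below assumes that `end` is inclusive, which this page does not state, so verify it against real data before relying on it. It also passes values through `int()`, since the standard protobuf JSON mapping writes 64-bit integers as strings and `begin`/`end` may therefore arrive as text. The helper name `range_to_slice` is hypothetical.

```python
def range_to_slice(rng):
    """Convert a Range-like dict to a 0-based, end-exclusive Python slice.

    Assumption: begin and end are 1-based and end is inclusive.
    """
    begin = int(rng["begin"])  # int() accepts numbers and numeric strings
    end = int(rng["end"])
    if begin < 1 or end < begin:
        raise ValueError(f"invalid 1-based range: begin={begin}, end={end}")
    return slice(begin - 1, end)


sequence = "ACGTACGT"
subsequence = sequence[range_to_slice({"begin": "2", "end": "4"})]
print(subsequence)
```

Under the inclusive-end assumption, the length of a range is `end - begin + 1`.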

SeqRangeSetFasta Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| seqId | seq-id | Sequence ID | string | Seq_id may include location info in addition to a sequence accession | |
| accessionVersion | accession | Accession | string | Accession and version of the viral nucleotide sequence | |
| title | title | Title | string | | |
| sequenceHash | hash | Hash | string | Unique identifier for identical sequences | |
| range (repeated) | range- | Range | Range | Series of intervals on above accession_version | |

VirusAssembly.CollectionLocation Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| geographicLocation | location | Location | string | Country of virus specimen collection | USA; France |
| geographicRegion | region | Region | string | Region of virus specimen collection | Asia; North America |
| usaState | state | State | string | Two-letter abbreviation of the state of the virus specimen collection (if United States) | NY; VA |

VirusAssembly.SubmitterInfo Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| names (repeated) | names | Names | string | List of submitters or authors of the virus assembly | Jane D; John S |
| affiliation | affiliation | Affiliation | string | The submitter’s organization and/or institution | Centers for Disease Control and Prevention, Respiratory Viruses Branch, Division of Viral Diseases; Public Health Directorate, Communicable Disease Laboratory |
| country | country | Country | string | The country representing the submitter’s affiliation | USA; China |

Orientation Enumeration

| Name | Number | Description |
| --- | --- | --- |
| none | 0 | |
| plus | 1 | |
| minus | 2 | |

PurposeOfSampling Enumeration

| Name | Number | Description |
| --- | --- | --- |
| PURPOSE_OF_SAMPLING_UNKNOWN | 0 | |
| PURPOSE_OF_SAMPLING_BASELINE_SURVEILLANCE | 1 | |

VirusAssembly.Completeness Enumeration

| Name | Number | Description |
| --- | --- | --- |
| UNKNOWN | 0 | |
| COMPLETE | 1 | |
| PARTIAL | 2 | |
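
In the sample report the enumeration appears by name (`"completeness": "COMPLETE"`), while the table above also gives each value a number. A small sketch of handling both forms in Python follows. The class and the `parse_completeness` helper are hypothetical, and treating a missing field as UNKNOWN is an assumption rather than documented behavior.

```python
from enum import IntEnum


class Completeness(IntEnum):
    """Mirror of the VirusAssembly.Completeness names and numbers listed above."""
    UNKNOWN = 0
    COMPLETE = 1
    PARTIAL = 2


def parse_completeness(record):
    """Return the Completeness member for a record.

    Accepts the enum name (as in the sample report) or its number.
    Assumption: a record without the field is treated as UNKNOWN.
    """
    value = record.get("completeness", "UNKNOWN")
    if isinstance(value, int):
        return Completeness(value)
    return Completeness[value]


records = [
    {"accession": "NC_045512.2", "completeness": "COMPLETE"},
    {"accession": "hypothetical-partial", "completeness": "PARTIAL"},
    {"accession": "hypothetical-missing"},
]
# Keep only the records marked as complete genomes.
complete = [r["accession"] for r in records
            if parse_completeness(r) is Completeness.COMPLETE]
print(complete)
```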

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
Generated November 25, 2024