Working with JSON Lines data reports
Here are some frequently asked questions and examples of how to work with the metadata and data reports from NCBI data packages.
NCBI Datasets tools provide data in zip files that we call “data packages.” These data packages contain metadata in one or more data report files in JSON Lines (pronounced “jason-lines”) format (see why and reference documentation).
For all JSON Lines data reports, each line represents a single record. However, the number and type of JSON Lines data reports varies depending on the type of data package. For example, genome data packages may include the following two types of data reports:
- a single genome assembly data report, where each line represents one genome assembly record, and
- one genome sequence data report per genome assembly record, where each line represents one nucleotide sequence record that comprises that assembly
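Because each line is an independent JSON document, a report can be processed one record at a time. The following is a minimal Python sketch of that idea and is not part of the NCBI tooling. The two sample lines are hypothetical and heavily abbreviated; they borrow only the `geneId` and `symbol` field names that appear in the gene data report examples on this page.

```python
import io
import json

# Hypothetical two-record stand-in for a data_report.jsonl file.
SAMPLE = '{"geneId": "1", "symbol": "A1BG"}\n{"geneId": "2", "symbol": "A2M"}\n'


def read_jsonl(handle):
    """Yield one parsed record per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(io.StringIO(SAMPLE)))
print(len(records), [r["symbol"] for r in records])
```

To read a real report, the same function could be given an open file handle instead of the in-memory sample.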
Data report schemas describe each type of data report, including the available fields, with descriptions, examples, and mnemonic terms that can be used with the dataformat command-line tool (CLI).
Here are some frequently asked questions about how to work with JSON Lines data reports.
How do I make the JSON Lines data report more readable?
Make the JSON Lines data report more readable by using jq to pretty-print it. This is one of the many Tools for JSON and JSON Lines. Alternatively, see the question below to generate a table.
First, download a gene data package for a set of NCBI GeneIDs and unzip it.
$ datasets download gene gene-id 1,2,9 --filename genes.zip
Collecting 3 records [================================================] 100% 3/3
Downloading: genes.zip 24.3kB done
$ unzip genes.zip
Archive: genes.zip
inflating: README.md
inflating: ncbi_dataset/data/rna.fna
inflating: ncbi_dataset/data/protein.faa
inflating: ncbi_dataset/data/data_report.jsonl
inflating: ncbi_dataset/data/dataset_catalog.json
Then use jq to pretty-print the data report to make it more readable (and head to show the first 10 lines, as in the example below).
$ jq . ncbi_dataset/data/data_report.jsonl | head --lines=10
{
"annotations": [
{
"annotationName": "NCBI Annotation Release 110",
"annotationReleaseDate": "2022-02-25",
"assemblyAccession": "GCF_000001405.40",
"assemblyName": "GRCh38.p14",
"genomicLocations": [
{
"genomicAccessionVersion": "NC_000019.10",
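Where jq is not available, a comparable result can be approximated with a general-purpose language. This is a small Python sketch under that assumption; the input line is a hypothetical, abbreviated record rather than real report content.

```python
import json

# Hypothetical single line from a JSON Lines report (fields abbreviated).
line = '{"geneId": "1", "symbol": "A1BG", "synonyms": ["A1B", "ABG"]}'

# indent=2 spreads the record over multiple lines, much as `jq .` does.
pretty = json.dumps(json.loads(line), indent=2)
print(pretty)
```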
For a complete pretty-printed gene data report, see the Sample report in the gene data report schema.
How do I convert a JSON Lines data report to a table?
You can generate a table from the JSON Lines data report using dataformat.
First, download a gene data package for a set of NCBI GeneIDs.
$ datasets download gene gene-id 1,2,9 --filename genes.zip
Collecting 3 records [================================================] 100% 3/3
Downloading: genes.zip 24.3kB done
Then, generate a table using dataformat.
$ dataformat tsv gene --fields gene-id,symbol,tax-name,gene-type --package genes.zip
NCBI GeneID Symbol Taxonomic Name Gene Type
1 A1BG Homo sapiens PROTEIN_CODING
2 A2M Homo sapiens PROTEIN_CODING
9 NAT1 Homo sapiens PROTEIN_CODING
Other generic tools may be used for such conversions, but dataformat provides pragmatic handling of more complicated scenarios, such as when a field is multi-valued. For example, a gene may have multiple synonyms:
$ dataformat tsv gene --fields gene-id,symbol,synonyms,tax-name,gene-type --package genes.zip
NCBI GeneID Symbol Synonyms Taxonomic Name Gene Type
1 A1BG A1B,ABG,GAB,HYST2477 Homo sapiens PROTEIN_CODING
2 A2M A2MD,CPAMD5,FWP007,S863-7 Homo sapiens PROTEIN_CODING
9 NAT1 AAC1,MNAT,NATI,NAT-1 Homo sapiens PROTEIN_CODING
With a generic tool such as Miller, you must extract the data package before use, and you must also know how to flatten or unflatten the desired fields:
$ unzip genes.zip
Archive: genes.zip
inflating: README.md
inflating: ncbi_dataset/data/rna.fna
inflating: ncbi_dataset/data/protein.faa
inflating: ncbi_dataset/data/data_report.jsonl
inflating: ncbi_dataset/data/dataset_catalog.json
$ mlr --ijson --opprint --no-auto-flatten cut -f geneId,symbol,synonyms,taxname,type ncbi_dataset/data/data_report.jsonl
geneId symbol synonyms taxname type
1 A1BG ["A1B", "ABG", "GAB", "HYST2477"] Homo sapiens PROTEIN_CODING
2 A2M ["A2MD", "CPAMD5", "FWP007", "S863-7"] Homo sapiens PROTEIN_CODING
9 NAT1 ["AAC1", "MNAT", "NATI", "NAT-1"] Homo sapiens PROTEIN_CODING
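To illustrate what flattening a multi-valued field involves, here is a short Python sketch that joins list values with commas, imitating the comma-separated Synonyms column shown in the dataformat output. It is an assumption-laden illustration, not a description of how dataformat or Miller work internally. The sample records are hypothetical abbreviations that reuse the field names from the Miller example above.

```python
import csv
import io
import json

# Hypothetical records using field names from the Miller example above.
SAMPLE = (
    '{"geneId": "1", "symbol": "A1BG", "synonyms": ["A1B", "ABG", "GAB", "HYST2477"]}\n'
    '{"geneId": "9", "symbol": "NAT1", "synonyms": ["AAC1", "MNAT", "NATI", "NAT-1"]}\n'
)
FIELDS = ["geneId", "symbol", "synonyms"]


def flatten(value):
    """Join list values with commas; pass other values through as text."""
    if isinstance(value, list):
        return ",".join(str(item) for item in value)
    return "" if value is None else str(value)


out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(FIELDS)
for line in SAMPLE.splitlines():
    record = json.loads(line)
    writer.writerow([flatten(record.get(name)) for name in FIELDS])

tsv_text = out.getvalue()
print(tsv_text, end="")
```

Nested objects, such as the annotations array, would need their own flattening rules, which is part of what a purpose-built tool handles for you.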
How do I find metadata for a single gene described in the JSON Lines data report?
Because each line of a JSON Lines data report represents a single gene, you can use grep to get the metadata describing that gene.
After downloading and unzipping a gene data package, use grep to pull out the line matching the desired gene symbol. The result is still valid JSON Lines and can be viewed as a table with dataformat, or pretty-printed as JSON with jq (using head to show the first 10 lines, as in the example below).
$ grep A1BG ncbi_dataset/data/data_report.jsonl |
dataformat tsv gene --fields gene-id,symbol,tax-name,gene-type
NCBI GeneID Symbol Taxonomic Name Gene Type
1 A1BG Homo sapiens PROTEIN_CODING
$ grep A1BG ncbi_dataset/data/data_report.jsonl |
jq . | head --lines=10
{
"annotations": [
{
"annotationName": "NCBI Annotation Release 110",
"annotationReleaseDate": "2022-02-25",
"assemblyAccession": "GCF_000001405.40",
"assemblyName": "GRCh38.p14",
"genomicLocations": [
{
"genomicAccessionVersion": "NC_000019.10",
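Note that grep matches the text anywhere on a line, so in a larger package it may also return records that merely mention the symbol. A stricter alternative is to parse each line and compare the symbol field exactly. The Python sketch below shows one way to do that; the function name `find_by_symbol` and the sample lines are illustrative assumptions, not part of any NCBI tool.

```python
import json

# Illustrative lines; the second would also match `grep A1BG`.
SAMPLE = (
    '{"geneId": "1", "symbol": "A1BG"}\n'
    '{"geneId": "503538", "symbol": "A1BG-AS1"}\n'
)


def find_by_symbol(lines, symbol):
    """Return records whose symbol field equals `symbol` exactly."""
    matches = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("symbol") == symbol:
            matches.append(record)
    return matches


hits = find_by_symbol(SAMPLE.splitlines(), "A1BG")
print(hits)
```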
Generated November 25, 2024