Working with JSON Lines data reports

Here are some frequently asked questions and examples of how to work with the metadata and data reports from NCBI data packages.

NCBI Datasets tools provide data in zip files that we call “data packages.” These data packages contain metadata in one or more data report files in JSON Lines (pronounced “jason-lines”) format.

In every JSON Lines data report, each line represents a single record. However, the number and type of data reports vary with the type of data package. For example, genome data packages may include two types of data reports.

Data report schemas describe each type of data report, including the available fields, with descriptions, examples, and mnemonic terms that can be used with the dataformat command-line tool (CLI).
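
Because each line of a JSON Lines data report is a complete JSON record, a report can be parsed one line at a time with any JSON library. The following is a minimal Python sketch of that idea, not an NCBI tool. The helper name read_jsonl and the two-record sample are assumptions made for illustration; real reports carry many more fields.

```python
import io
import json


def read_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines, such as a trailing newline
            yield json.loads(line)


# Hypothetical two-record sample standing in for a real data report.
sample = io.StringIO(
    '{"geneId": "1", "symbol": "A1BG"}\n'
    '{"geneId": "2", "symbol": "A2M"}\n'
)
records = list(read_jsonl(sample))
print(len(records), records[0]["symbol"])
```

To read a real report, pass an open file instead of the sample, for example open("ncbi_dataset/data/data_report.jsonl") after unzipping a gene data package.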

Here are some frequently asked questions about how to work with JSON Lines data reports.

How do I make the JSON Lines data report more readable?

Make the JSON Lines data report more readable by pretty-printing it with jq, one of many tools for working with JSON and JSON Lines. Alternatively, see the question below on converting a data report to a table.

First, download a gene data package for a set of NCBI GeneIDs and unzip it.

$ datasets download gene gene-id 1,2,9 --filename genes.zip
Collecting 3  records [================================================] 100% 3/3
Downloading: genes.zip    24.3kB done

$ unzip genes.zip
Archive:  genes.zip
  inflating: README.md               
  inflating: ncbi_dataset/data/rna.fna  
  inflating: ncbi_dataset/data/protein.faa  
  inflating: ncbi_dataset/data/data_report.jsonl  
  inflating: ncbi_dataset/data/dataset_catalog.json

Then use jq to pretty-print the data report to make it more readable (and head to show the first 10 lines, as in the example below).

$ jq . ncbi_dataset/data/data_report.jsonl | head --lines=10
{
  "annotations": [
    {
      "annotationName": "NCBI Annotation Release 110",
      "annotationReleaseDate": "2022-02-25",
      "assemblyAccession": "GCF_000001405.40",
      "assemblyName": "GRCh38.p14",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_000019.10",

For a complete pretty-printed gene data report, see the Sample report in the gene data report schema.
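
If jq is not available, a similar pretty-printed view can be produced with Python's standard json module. This is a rough sketch under stated assumptions: the record below is a hypothetical, heavily trimmed stand-in for one line of a gene data report, and only its field names are taken from the report.

```python
import json

# Hypothetical, heavily trimmed record; a real line has many more fields.
line = '{"annotations": [{"assemblyName": "GRCh38.p14"}], "symbol": "A1BG"}'

record = json.loads(line)
pretty = json.dumps(record, indent=2)

# Show at most the first 10 lines, like piping jq output through head.
print("\n".join(pretty.splitlines()[:10]))
```

Unlike jq, this sketch handles a single line; to cover a whole report, apply the same two calls to each line of the file.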

How do I convert a JSON Lines data report to a table?

You can generate a table from the JSON Lines data report using dataformat.

First, download a gene data package for a set of NCBI GeneIDs.

$ datasets download gene gene-id 1,2,9 --filename genes.zip
Collecting 3  records [================================================] 100% 3/3
Downloading: genes.zip    24.3kB done

Then, generate a table using dataformat.

$ dataformat tsv gene --fields gene-id,symbol,tax-name,gene-type --package genes.zip
NCBI GeneID     Symbol  Taxonomic Name  Gene Type
1       A1BG    Homo sapiens    PROTEIN_CODING
2       A2M     Homo sapiens    PROTEIN_CODING
9       NAT1    Homo sapiens    PROTEIN_CODING

Other generic tools may be used for such conversions, but dataformat provides pragmatic handling of more complicated scenarios, such as when a field is multi-valued. For example, a gene may have multiple synonyms:

$ dataformat tsv gene --fields gene-id,symbol,synonyms,tax-name,gene-type --package genes.zip
NCBI GeneID	Symbol	Synonyms	Taxonomic Name	Gene Type
1	A1BG	A1B,ABG,GAB,HYST2477	Homo sapiens	PROTEIN_CODING
2	A2M	A2MD,CPAMD5,FWP007,S863-7	Homo sapiens	PROTEIN_CODING
9	NAT1	AAC1,MNAT,NATI,NAT-1	Homo sapiens	PROTEIN_CODING
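
To illustrate what handling a multi-valued field involves, here is a minimal Python sketch that joins list values with commas while writing tab-separated output. It is not how dataformat is implemented. The records are hypothetical and trimmed, the helper to_cell is an assumption, and the column headers are the raw JSON field names (geneId, symbol, synonyms) rather than the display names dataformat prints.

```python
import csv
import io
import json

# Hypothetical trimmed records using gene data report field names.
lines = [
    '{"geneId": "1", "symbol": "A1BG", "synonyms": ["A1B", "ABG"]}',
    '{"geneId": "2", "symbol": "A2M", "synonyms": ["A2MD", "CPAMD5"]}',
]
fields = ["geneId", "symbol", "synonyms"]


def to_cell(value):
    """Join list values with commas; render a missing value as empty text."""
    if isinstance(value, list):
        return ",".join(str(item) for item in value)
    return "" if value is None else str(value)


out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(fields)
for line in lines:
    record = json.loads(line)
    writer.writerow([to_cell(record.get(name)) for name in fields])

table = out.getvalue()
print(table, end="")
```

Nested fields, such as the entries under annotations, would need their own flattening rule, which is the kind of detail dataformat takes care of for you.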

With a generic tool such as Miller, you must extract the data package first, and you must also know how to flatten or unflatten the desired fields:

$ unzip genes.zip
Archive:  genes.zip
  inflating: README.md               
  inflating: ncbi_dataset/data/rna.fna  
  inflating: ncbi_dataset/data/protein.faa  
  inflating: ncbi_dataset/data/data_report.jsonl  
  inflating: ncbi_dataset/data/dataset_catalog.json

$ mlr --ijson --opprint --no-auto-flatten cut -f geneId,symbol,synonyms,taxname,type ncbi_dataset/data/data_report.jsonl
geneId symbol synonyms                               taxname      type
1      A1BG   ["A1B", "ABG", "GAB", "HYST2477"]      Homo sapiens PROTEIN_CODING
2      A2M    ["A2MD", "CPAMD5", "FWP007", "S863-7"] Homo sapiens PROTEIN_CODING
9      NAT1   ["AAC1", "MNAT", "NATI", "NAT-1"]      Homo sapiens PROTEIN_CODING
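
Some general-purpose languages can read a file inside a zip archive directly, which avoids the separate unzip step. The Python sketch below shows one way to do that with the standard zipfile module. The function name read_report is an assumption, and the archive is a tiny in-memory stand-in with hypothetical content rather than a real genes.zip; only the member path is taken from the unzip listing.

```python
import io
import json
import zipfile

REPORT_PATH = "ncbi_dataset/data/data_report.jsonl"


def read_report(package, member=REPORT_PATH):
    """Parse a JSON Lines report straight out of a zip archive."""
    with zipfile.ZipFile(package) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            return [json.loads(line) for line in text if line.strip()]


# Build a small in-memory stand-in for genes.zip (hypothetical content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(REPORT_PATH, '{"symbol": "A1BG"}\n{"symbol": "A2M"}\n')
buffer.seek(0)

symbols = [record["symbol"] for record in read_report(buffer)]
print(symbols)
```

With a downloaded package, read_report("genes.zip") should work the same way, assuming the report sits at the path shown in the unzip output.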

How do I find metadata for a single gene described in the JSON Lines data report?

Because each line of the data report, in JSON Lines format, represents a single gene, you can use grep to get the metadata describing that gene.

After downloading and unzipping a gene data package, use grep to pull out the line matching the desired gene symbol. The result remains in the form of valid JSON Lines and may be viewed in tabular form via dataformat, or rendered as JSON using jq to pretty-print (and head to show the first 10 lines, as in the example below).

$ grep A1BG ncbi_dataset/data/data_report.jsonl |
      dataformat tsv gene --fields gene-id,symbol,tax-name,gene-type
NCBI GeneID     Symbol  Taxonomic Name  Gene Type
1       A1BG    Homo sapiens    PROTEIN_CODING
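
Note that grep matches the search text anywhere in a line, so it can also return records that only mention the symbol in another field. When that matters, one option is to compare the symbol field exactly. The Python sketch below is an illustration under stated assumptions: the records are hypothetical and trimmed, and find_by_symbol is a made-up helper name.

```python
import json

# Hypothetical trimmed records. The second one mentions A1BG in its
# description, so a plain substring search matches it as well.
lines = [
    '{"geneId": "1", "symbol": "A1BG"}',
    '{"geneId": "2", "symbol": "A2M", "description": "not A1BG"}',
]


def find_by_symbol(jsonl_lines, symbol):
    """Return the original lines whose "symbol" field equals symbol exactly."""
    return [
        line for line in jsonl_lines
        if json.loads(line).get("symbol") == symbol
    ]


matches = find_by_symbol(lines, "A1BG")
substring_hits = [line for line in lines if "A1BG" in line]
print(len(matches), len(substring_hits))
```

Because the helper returns the original lines unchanged, its result is still valid JSON Lines and can be written to a file or piped to other tools.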

$ grep A1BG ncbi_dataset/data/data_report.jsonl |
      jq . | head --lines=10
{
  "annotations": [
    {
      "annotationName": "NCBI Annotation Release 110",
      "annotationReleaseDate": "2022-02-25",
      "assemblyAccession": "GCF_000001405.40",
      "assemblyName": "GRCh38.p14",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_000019.10",

Generated November 25, 2024