Programmatic access to GEO

Introduction

GEO data can be programmatically accessed using a suite of programs called the Entrez Programming Utilities (E-Utils).

E-Utils are a set of server-side programs that provide a stable interface to the search, retrieval, and linking functions of the Entrez system, using a fixed URL syntax. E-Utils are designed to be called from within a computer program that can process their output. Output is provided in XML format.

GEO data are stored in two separate databases:

  • Entrez GEO DataSets: contains descriptive and accession information for all records (db=gds)
  • Entrez GEO Profiles: contains gene annotation and synoptic/visual data for each expression profile (db=geoprofiles)

Three key concepts to keep in mind are:

  1. E-Utils can retrieve only data that are stored within the Entrez system. For GEO databases, only metadata are stored in Entrez. To retrieve full GEO records, complete data tables, or raw data files, a second step is required: construct an FTP URL (see FTP directory structure table) and download the data.
  2. Each Entrez record is identified by a unique integer ID (UID). UIDs are used for both data input and output. Search history parameters (query_key and WebEnv) can also be used to identify previous search results.
  3. Your initial search can be refined using field qualifiers, which filter results by data type, publication date range, and other criteria.

A typical workflow might have the following steps:

  • Use the qualifier fields in Entrez GEO DataSets to fine-tune a search.
  • Construct the appropriate eSearch query in your script or program.
  • Run the query and retrieve the results as UIDs or history parameters (query_key and WebEnv), as needed.
  • Run eSummary, eFetch, or eLink, depending on your needs, to retrieve the final metadata or accessions.
  • If you need full records or supplementary files, use the accession information to construct an FTP URL and download the data.
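The first four steps can be sketched in Python as follows. This is a minimal illustration rather than an official client: the helper names (eutils_url, http_get, search_and_summarize) are hypothetical, and the code assumes the eSearch XML response exposes Id, QueryKey, and WebEnv elements. The fetch argument is injectable so the logic can be exercised without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool, **params):
    """Build an E-Utils URL such as .../esearch.fcgi?db=gds&term=..."""
    return f"{EUTILS}/{tool}.fcgi?{urlencode(params)}"


def http_get(url):
    """Default fetcher: download a URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def search_and_summarize(term, fetch=http_get, retmax=20):
    """Run eSearch with the History server, then eSummary on the stored result."""
    # Steps 2-3: run the search and keep both the UIDs and the history parameters.
    search_xml = fetch(eutils_url("esearch", db="gds", term=term,
                                  retmax=retmax, usehistory="y"))
    root = ET.fromstring(search_xml)
    uids = [node.text for node in root.findall("./IdList/Id")]
    query_key = root.findtext("QueryKey")
    webenv = root.findtext("WebEnv")

    # Step 4: request document summaries for the stored result set.
    summary_xml = fetch(eutils_url("esummary", db="gds",
                                   query_key=query_key, WebEnv=webenv))
    return uids, summary_xml
```

Calling search_and_summarize("GSE[ETYP]") would issue two requests against the live service; passing a stub as fetch makes it possible to check the URL construction offline.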

For more information, check out the complete E-Utils documentation or the NCBI short course Building Customized Data Pipelines Using the Entrez Programming Utilities.

Examples

For most applications, GEO DataSets is the more useful place to construct a search. All the examples below demonstrate searches and retrievals in GEO DataSets.

In each example, the query_key and WebEnv values are for demonstration purposes only. The History server stores these parameters for a limited time, so run the eSearch yourself to generate new query_key and WebEnv values.

Example I: Retrieve a list of Series released within the last 3 months.

Construct and perform an eSearch in db=gds to retrieve Series IDs.

Use the query_key and WebEnv parameters from the eSearch to perform an eSummary.

This retrieves summary documents for all Series records.
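The two requests in Example I might be built as shown here. The search term (an entry-type qualifier combined with a "published last 3 months" filter), the retmax value, and the placeholder history values are assumptions for illustration; substitute the query_key and WebEnv returned by your own eSearch.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Assumed query: Series entries (GSE) restricted by a recent-publication filter.
TERM = 'GSE[ETYP] AND "published last 3 months"[Filter]'

esearch_url = f"{EUTILS}/esearch.fcgi?" + urlencode(
    {"db": "gds", "term": TERM, "retmax": 5000, "usehistory": "y"})

# Placeholders only: real values come from the eSearch response.
query_key, webenv = "1", "YOUR_WEBENV"

esummary_url = f"{EUTILS}/esummary.fcgi?" + urlencode(
    {"db": "gds", "query_key": query_key, "WebEnv": webenv})

print(esearch_url)
print(esummary_url)
```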

Example II: Fetch a document summary text file listing all Saccharomyces cerevisiae experiments released within the last 3 months.

Construct and perform an eSearch in db=gds to retrieve relevant Series and DataSet records.

Use the query_key and WebEnv parameters from the eSearch to perform an eFetch.

This generates a document summary text file listing all Saccharomyces cerevisiae experiments that were released within the last 3 months (Jan 2007 to Mar 2007 in this example).
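A sketch of how the date-restricted query for Example II could be assembled. The function names are hypothetical, and the organism and publication-date qualifiers ([Organism], [PDAT]) are assumed spellings of the GEO DataSets search fields; check them against the field list for db=gds before relying on them.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pdat_range(end_year, end_month, months=3):
    """Return a publication-date range of `months` calendar months ending at end_month."""
    # Count months from year 0 so the start month rolls back across year boundaries.
    start_index = end_year * 12 + (end_month - 1) - (months - 1)
    start_year, start_month0 = divmod(start_index, 12)
    return f"{start_year}/{start_month0 + 1:02d}:{end_year}/{end_month:02d}[PDAT]"


def yeast_search_url(end_year, end_month):
    """eSearch URL for S. cerevisiae records published in the three-month window."""
    term = f'"Saccharomyces cerevisiae"[Organism] AND {pdat_range(end_year, end_month)}'
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": "gds", "term": term, "retmax": 5000, "usehistory": "y"})


def efetch_url(query_key, webenv):
    """eFetch URL that reads the result set stored on the History server."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": "gds", "query_key": query_key, "WebEnv": webenv})
```

For the window described in this example, pdat_range(2007, 3) yields 2007/01:2007/03[PDAT].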

Example III: Retrieve all CEL files corresponding to Affymetrix Platform HG-U133A.

When looking for data relating to a specific array, it is usually safest to use that Platform's GEO accession number, rather than its name. The official version of HG-U133A has accession number GPL96, as determined by a manual search.

Construct and perform an eSearch query in db=gds for all Series records that have Samples relating to GPL96 and have CEL files.

Use the query_key and WebEnv parameters from the eSearch to perform an eSummary.

This returns summary documents for all Series records that contain HG-U133A CEL files.

Extract the Series accession numbers from the eSummary document. You can then use this list of Series accessions to construct URLs for the raw data files.
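The extraction and URL construction could look like the following sketch. It assumes that each DocSum in the eSummary XML carries an Item named "Accession", and that a Series publishes its raw files as a single GSExxx_RAW.tar archive, following the supplementary-file pattern in the FTP table; not every Series necessarily provides that archive. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/geo"


def series_accessions(esummary_xml):
    """Collect GSE accessions from an eSummary document (assumed DocSum/Item layout)."""
    root = ET.fromstring(esummary_xml)
    found = [item.text for item in root.findall("./DocSum/Item[@Name='Accession']")]
    # Keep only well-formed Series accessions; other record types are skipped.
    return [acc for acc in found if acc and re.fullmatch(r"GSE\d+", acc)]


def series_raw_url(accession):
    """Build the FTP URL of the raw-data archive for one Series accession."""
    digits = accession[3:]
    # Range directory: drop the last three digits and append "nnn".
    range_dir = "GSE" + digits[:-3] + "nnn"
    return f"{FTP_ROOT}/series/{range_dir}/{accession}/suppl/{accession}_RAW.tar"
```

Mapping series_raw_url over the output of series_accessions gives one candidate download URL per Series.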

Example IV: Retrieve all PubMed IDs that are associated with rat experiments in GEO.

Construct and perform an eSearch in db=gds to retrieve relevant records.

Use the query_key and WebEnv parameters from the eSearch to perform an eLink to PubMed.

This lists all the PubMed IDs associated with GEO rat experiments.
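As a rough sketch, the eLink request and the extraction of PubMed IDs from its response might be written as below. The search term rat[orgn] and the function names are assumptions, and the parser expects the usual eLink XML layout in which each LinkSetDb block names its target database in DbTo and lists linked records as Link/Id elements.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def rat_search_url():
    """eSearch URL for rat records in GEO DataSets (assumed organism qualifier)."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": "gds", "term": "rat[orgn]", "retmax": 5000, "usehistory": "y"})


def elink_to_pubmed_url(query_key, webenv):
    """eLink URL that maps the stored GEO result set to PubMed records."""
    return f"{EUTILS}/elink.fcgi?" + urlencode(
        {"dbfrom": "gds", "db": "pubmed", "query_key": query_key, "WebEnv": webenv})


def linked_pubmed_ids(elink_xml):
    """Return the PubMed IDs listed in an eLink XML response."""
    root = ET.fromstring(elink_xml)
    ids = []
    for link_set_db in root.iter("LinkSetDb"):
        # Ignore link sets that point at databases other than PubMed.
        if link_set_db.findtext("DbTo") == "pubmed":
            ids.extend(node.text for node in link_set_db.findall("./Link/Id"))
    return ids
```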

More details

  • eSearch responds to a text query with the list of unique identifiers (UIDs) matching the query in a given database, along with the term translations of the query.
  • eSummary responds to a list of UIDs with the corresponding document summaries.
  • eFetch responds to a list of UIDs with the corresponding data records.
  • ePost accepts a list of UIDs, stores the set on the History Server, and responds with the corresponding query key and Web environment.
  • eLink responds to a list of UIDs in a given database with either a list of related IDs in the same database or a list of linked IDs in another Entrez database.
  • eInfo provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.

All GEO data are available for download from the FTP site. The directory structure is organized by type, GEO accession range, GEO accession number, and format. The range subdirectory name is created by replacing the last three digits of the accession number with the letters "nnn". For example:

  • GSM575: /samples/GSMnnn/GSM575/
  • GSM1234: /samples/GSM1nnn/GSM1234/
  • GSM12345: /samples/GSM12nnn/GSM12345/

For more information, please see README.
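The range rule can be expressed as a small helper. This is a sketch: the function name and the prefix-to-directory mapping are inferred from the example paths on this page, not taken from an official client.

```python
import re

# Top-level FTP directory for each accession prefix, as seen in the example paths.
SUBDIRS = {"GSM": "samples", "GSE": "series", "GPL": "platforms", "GDS": "datasets"}


def ftp_directory(accession):
    """Return the FTP directory path for a GEO accession such as GSM1234."""
    match = re.fullmatch(r"(GSM|GSE|GPL|GDS)(\d+)", accession)
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    prefix, digits = match.groups()
    # Replace the last three digits with "nnn"; short accessions keep only the prefix.
    range_dir = f"{prefix}{digits[:-3]}nnn"
    return f"/{SUBDIRS[prefix]}/{range_dir}/{accession}/"


for example in ("GSM575", "GSM1234", "GSM12345"):
    print(example, ftp_directory(example))
```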

Format                            Example
SOFT, by DataSet                  ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS1nnn/GDS1001/soft/GDS1001.soft.gz
SOFT full, by DataSet             ftp://ftp.ncbi.nlm.nih.gov/geo/datasets/GDS1nnn/GDS1001/soft/GDS1001_full.soft.gz
SOFT, by Platform                 ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPLnnn/GPL10/soft/GPL10_family.soft.gz
SOFT, by Series                   ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE1/soft/GSE1_family.soft.gz
MINiML, by Platform               ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPLnnn/GPL10/miniml/GPL10_family.xml.tgz
MINiML, by Series                 ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE1/miniml/GSE1_family.xml.tgz
SeriesMatrix                      ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE1/matrix/GSE1_series_matrix.txt.gz
Supplementary files, by Platform  ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL1nnn/GPL1073/suppl/
Supplementary files, by Series    ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1000/suppl/GSE1000_RAW.tar
Supplementary files, by Sample    ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1nnn/GSM1137/suppl/GSM1137.CEL.gz

Last modified: July 16, 2024