dbGaP Study Submission Guide
You must register your study before submitting data.
Register study --> Prepare files for submission --> Check files before submission --> Submit --> dbGaP curators process --> Receive signal and submit high throughput sequences: BAM, CRAM, FASTQ --> Preview and Approve --> Release
Submission Onboarding
Videos: An Overview of the dbGaP Submission Process
Part 1 - Register Your Study
Part 2 - Submit Your Data
Part 3 - Review and Release Your Study
What's new?
- There are three new videos for an overview of the dbGaP submission process. (June 2024)
- There is a new Login Guide for dbGaP PIs and Submitters for dbGaP Submission System and Submission Portal. (June 2024)
- An enhanced dbGaP Advanced Search is now available for users to filter for third-party annotations of Common Data Elements, dbGaP Collections, sensitivity designations of Genomic Summary Results (GSR), and studies with External Data Sources (EDS). (August 2023)
- We now have 12 dbGaP Collections with more in progress. A dbGaP Collection includes studies or portions of studies that share the same consent group, disease, or funding project. One Data Access Request (DAR) lets you request access to all the studies within a dbGaP Collection at once. To search for dbGaP Collections, visit Advanced Search. For more information, please see the glossary entry "Collections". (July 2023)
- A new Subject Sample Telemetry Report (SSTR) webpage and API are available to search and filter on study level Subject and Sample IDs, consents, summary counts, processing status, and molecular and sequence sample uses. See our blog post for more details. (April 2023)
- Jump to "Previous Updates"
Use the questions below to jump to relevant sections or use your browser's find function to search for keywords.
- 1. What files do I need to submit to dbGaP?
- 2. Where can I download dbGaP Submission Guide Templates to generate the files I need to submit?
- 3. What is the Study Config?
- 4. What is a dbGaP Subject?
- 5. What is a dbGaP Sample?
- 6. What do I need to know about protecting study participants' privacy, HIPAA, and subject de-identification for dbGaP data submissions?
- 7. What is a Phenotype Dataset (DS) File?
- 8. What is a Phenotype Data Dictionary (DD) File?
- 9. How do I create Subject Consent (SC) DS and DD files?
- 10. How do I create Subject Sample Mapping (SSM) DS and DD files?
- 11. How do I create Pedigree DS and DD files?
- 12. What data must be included in the Subject Phenotypes and Sample Attributes?
- 13. How do I create Subject Phenotypes DS and DD files?
- 14. How do I create Sample Attributes DS and DD files?
- 15. How do I submit Medical Images and in what format?
- 16. How do I verify that my DS and DD Files will pass dbGaP's phenotype quality control (QC) tests?
- 17. What type of Study Documents may I submit and in what format?
- 18. What should I know about editing, proofreading, and copyright?
- 19. How do I submit Molecular Data to dbGaP?
- 20. How do I submit High Throughput Sequence data and alignment information?
- 21. How do I submit Copy Number Variation (CNV) data?
- 22. How do I link individual study subjects/samples to samples that have been submitted to NCBI databases: GEO, GenBank, SRA (public)?
- 23. What are Association Analysis Data Files and how should they be formatted?
- 24. Who can submit files to dbGaP?
- 25. Where do I submit my dbGaP files?
- 26. What if there are errors or updates in the data and I need to resubmit?
- 27. What happens once I submit my core data files and phenotype files to the dbGaP database?
- 28. When and what will be released?
- 29. Whom may I contact with questions about my dbGaP data submission?
- 30. How can I submit additional data after my study is released?
- GLOSSARY OF TERMS
- APPENDIX for Data Dictionary (DD) File Descriptions and Specifications
Prepare Files for Submission
1. What files do I need to submit to dbGaP?
When a study is registered by a Genomic Program Administrator (GPA) in the dbGaP Submission System (SS), the GPA indicates what data is expected to be submitted. This may be verified by the Program Officer (PO) who oversees the study funding. The submitter will separately complete a Study Data Outline (SDO) through the Submission Portal (SP). This outline summarizes the data that will be uploaded and released in the current version. All data claimed in the SDO must be submitted. The GPA and PO will be notified if the information does not match between the SS and SP.
File Submission Checklist
All new study versions must complete the Study Data Outline in the Submission Portal in order to assert what data types will be submitted and released for the current study version. Upon completion, a dbGaP study accession (phs######.v#.p#) will be provided.
Complete the Study Config web form. This will populate the public study report page.
For the remaining data, please submit only the files that have been asserted in the Study Data Outline. To determine which files are applicable, go through the File Applicability section immediately following this list.
- Phenotype Dataset (DS) and Data Dictionary (DD) files
- (1) Subject Consent DS and DD
- (1) Subject Sample Mapping (SSM) DS and DD
- (1) Pedigree DS and DD
- (1 or more) Subject Phenotypes DS and DD
- (1 or more) Sample Attributes DS and DD
- (1 or more) Linking Subject/Sample IDs to samples in other NCBI databases DS and DD
- Molecular Data
- Sequence Data
- Association Analysis Data
- Study Documents
- Medical Images
For faster processing time, submit to the dbGaP Submission Portal by uploading all files in one submission. DO NOT submit BAM, CRAM, FASTQ files until notified.
File Applicability
Phenotype Dataset (DS) and Data Dictionaries (DD)
- Studies that have consented subjects must submit a Subject Consent DS and DD.
- Studies that have individual level phenotype data (demographic, clinical, exposure, etc) should submit 1 or more Subject Phenotypes DS and DD.
- Studies that have molecular data (array, methylation, called variants, etc.) and/or high throughput sequence data (BAM, CRAM, FASTQ) must submit a Subject Sample Mapping DS and DD and 1 or more Sample Attributes DS and DD.
- Studies that have self-reported or known genetic relationships and monozygotic twins must submit a Pedigree DS and DD.
- Studies that have individual subject/sample IDs submitted to NCBI databases (GEO, GenBank, or public SRA) should provide a Linking DS and DD between the individual study subject/sample IDs to the other NCBI database sample accessions. If only experiment or project accessions are available, then provide via the Study Config web form and mark "no" for "Subject/Sample ID links to public NCBI databases".
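The applicability rules above can be sketched as a small checklist helper. This is only an illustration: the `StudyTraits` class, its field names, and the `applicable_ds_dd_files` function are hypothetical and not part of any dbGaP tool. The Study Data Outline remains the authoritative statement of what must be submitted.

```python
from dataclasses import dataclass


@dataclass
class StudyTraits:
    """Hypothetical yes/no answers about a study (field names are illustrative)."""
    has_consented_subjects: bool = False
    has_individual_phenotypes: bool = False
    has_molecular_or_sequence_data: bool = False
    has_known_relationships_or_mz_twins: bool = False
    has_ids_in_other_ncbi_databases: bool = False


def applicable_ds_dd_files(traits: StudyTraits) -> list[str]:
    """Return the DS/DD file pairs suggested by the File Applicability list."""
    files = []
    if traits.has_consented_subjects:
        files.append("Subject Consent")
    if traits.has_individual_phenotypes:
        files.append("Subject Phenotypes")
    if traits.has_molecular_or_sequence_data:
        # Molecular or sequence data calls for both of these file pairs.
        files.append("Subject Sample Mapping")
        files.append("Sample Attributes")
    if traits.has_known_relationships_or_mz_twins:
        files.append("Pedigree")
    if traits.has_ids_in_other_ncbi_databases:
        files.append("Linking Subject/Sample IDs")
    return files
```

For example, a study with consented subjects and sequence data but no phenotypes or pedigree would get Subject Consent, Subject Sample Mapping, and Sample Attributes from this helper.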
Molecular Data
Any GWAS, SNP array, imputations, transcriptomic, epigenomic, gene expression, variant calls from WGS, WXS, and targeted sequencing data. This does not include raw sequencing data and alignment information, which is submitted separately.
Sequence Data
Any high throughput sequence data (WGS, WXS, RNA-Seq, etc) in BAM, CRAM, FASTQ formats. Sequence data should be submitted only after both of the following: 1) You have received an email with an attached sequence metadata file containing the registered subject and sample IDs and consents. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. 2) The sequence metadata has been processed and you have received an email to upload sequences.
Association Analyses
Any aggregated genomic level data
Study Documents
Any consent forms, protocols, questionnaires, etc. that correspond to the data.
Medical Images
Any CT scans, eye images, etc.
2. Where can I download dbGaP Submission Guide Templates to generate the files I need to submit?
Download all Submission Templates: dbGaP_Submission_Package_20220812.zip
Download individual Submission Templates: https://ftp.ncbi.nlm.nih.gov/dbgap/dbGaP_Submission_Guide_Templates/Individual_Submission_Templates/
Study Config
3. What is the Study Config?
The Study Config is a web form that collects a description of the study data, methods, and findings, inclusion/exclusion, study history, references, attributions, and terms that will be indexed to enable users to search for your study in dbGaP Advanced Search. The study config must be submitted in order to have a dbGaP study accession (phs######.v#.p#) that can be published in dbGaP and used in journal publications. Here is an example of the study report page populated by the information in the study config: (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1).
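If you track accessions in your own scripts, the phs######.v#.p# pattern can be checked with a regular expression. The sketch below assumes exactly six digits after "phs", as the pattern is written in this guide. The group names `study`, `v`, and `p` are illustrative labels only.

```python
import re

# Pattern assumed from the "phs######.v#.p#" notation used in this guide.
ACCESSION_RE = re.compile(r"phs(?P<study>\d{6})\.v(?P<v>\d+)\.p(?P<p>\d+)")


def parse_accession(accession: str) -> dict[str, int]:
    """Split an accession such as phs000001.v3.p1 into its numeric parts."""
    match = ACCESSION_RE.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a phs######.v#.p# accession: {accession!r}")
    return {name: int(value) for name, value in match.groupdict().items()}
```

With the example study above, `parse_accession("phs000001.v3.p1")` gives study 1, v 3, p 1.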
To fill out the study config, go to your study's dbGaP Submission Portal (https://submit.ncbi.nlm.nih.gov/dbgap/).
- Click on "Create" if newly filling out the study config or click on "Edit" to modify an existing study config.
- Once done, press "Submit" and you will be taken back to the study's Submission Portal page.
- To preview the study config, click on "Preview Study Report Page".
You may edit the study config until the study is released. To edit, go to your study's Submission Portal, click on "Edit" under "Study Config". Once the study is released, please contact your phenotype curator to make edits.
If you would like to see in advance what items will be collected in the web form, open 1_StudyConfig.docx.
Study Participant De-identification
4. What is a dbGaP Subject?
A dbGaP Subject is defined as a single human person/individual/patient that arises from a single germline. Each subject should be submitted with a single, unique, de-identified subject ID. Subjects submitted to dbGaP must be consented to submit to a public database. Subject IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the subject ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SUBJECT_ID in one file and INDIVIDUAL_ID in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP subject ID that will be included in the final dump files along with the submitted subject ID. Subjects that are known to be the same person across dbGaP studies will be assigned the same dbGaP subject ID.
5. What is a dbGaP Sample?
A dbGaP Sample is defined as the ID of the final preps submitted to dbGaP by a genotyping center, a sequencing group, or to an NCBI resource, such as GEO or GenBank. A single subject may be mapped to multiple samples, but a single sample should not be mapped to multiple subjects unless the samples are pooled.* For example, if one subject (SUBJECT_ID) provided one sample, and that sample was processed to generate 2 sequencing runs or 1 sequencing and 1 genotyping array run, the data file would show two rows, both using the same subject ID, but having 2 unique sample IDs.
*Please inquire about pooled samples if applicable. This would only apply to pooled samples that belong to consented subjects. If the samples are pooled from controls that are publicly available, there is no need for marking the pooled samples, and a single sample ID may be assigned.
Each sample should be submitted with a single, unique, de-identified sample ID. Sample IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the sample ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SAMPLE_ID in one file and SAMPLE_NAME in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP sample ID that will be included in the final dump files along with the submitted sample ID.
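The ID rules for subjects and samples can be checked before submission with a few lines of code. The following is a minimal sketch, not dbGaP's own validator. The function name `id_problems` is hypothetical, and treating "dbGaP" as case-insensitive is an assumption.

```python
import re

# Allowed characters per this guide: English letters, Arabic numerals,
# period, hyphen, underscore, at symbol, and pound sign. No spaces.
ALLOWED_ID_RE = re.compile(r"[A-Za-z0-9._@#-]+")


def id_problems(value: str) -> list[str]:
    """Return a list of rule violations for one submitted subject or sample ID."""
    problems = []
    if ALLOWED_ID_RE.fullmatch(value) is None:
        problems.append("contains characters outside the allowed set (or is empty)")
    # Integer IDs should not be zero padded, e.g. "007".
    if value.isascii() and value.isdigit() and len(value) > 1 and value.startswith("0"):
        problems.append("integer ID is zero padded")
    # Assumption: the check for "dbGaP" ignores letter case.
    if "dbgap" in value.lower():
        problems.append('contains "dbGaP"')
    return problems
```

An empty list means the ID passed these particular checks; it does not guarantee that the ID is unique or de-identified.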
6. What do I need to know about protecting study participants' privacy, HIPAA, and subject de-identification for dbGaP data submissions?
To comply with HIPAA, personally identifying information must be removed from all data, e.g. names, cities, dates, telephone numbers, social security numbers, and any other potentially identifying information, characteristic, or code.
A 2-Step de-identification is required for all IDs submitted in dbGaP data files.
Example: Two step removal of identifiers
Step one: Personal Information → Remove identifiers → Create Study person ID
Step two: Study person ID → Create Subject ID submitted to dbGaP.
Subject IDs submitted to dbGaP may be randomly assigned or may be consecutive numbers without any identifying information (i.e., the submitted Subject ID should not be based on the study person ID or any personal identifiers such as subject's birth date, health record number, or name). The same applies to sample IDs.
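One way to carry out step two is to shuffle the study person IDs and hand out consecutive numbers in the shuffled order, so the submitted ID carries no information about the study person ID. This is a sketch under that assumption. The function name `assign_submitted_ids` is hypothetical.

```python
import secrets


def assign_submitted_ids(study_person_ids: list[str]) -> dict[str, int]:
    """Map each study person ID to a consecutive number in random order."""
    unique_ids = sorted(set(study_person_ids))  # one submitted ID per person
    secrets.SystemRandom().shuffle(unique_ids)  # order no longer reflects the source IDs
    return {person_id: number for number, person_id in enumerate(unique_ids, start=1)}
```

The returned mapping links back to the study person ID, so it would normally stay with the study team and only the numbers would appear in submitted files.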
Dates directly tied to an individual cannot be submitted at a resolution finer than a year. In other words, the month and day should be removed and only the year kept. Alternatively, the date can be normalized to days relative to a set point. For the algorithm dbGaP uses to find HIPAA sensitive dates, see glossary entry: HIPAA.
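Both options for handling dates are simple to apply with Python's standard `datetime` module. The sketch below is illustrative only. The helper names are hypothetical, and the choice of reference point (for example an enrollment date) is an assumption left to the study.

```python
from datetime import date


def year_only(d: date) -> int:
    """Option 1: keep only the year of a date tied to an individual."""
    return d.year


def days_from_reference(d: date, reference: date) -> int:
    """Option 2: express a date as days relative to a chosen set point."""
    return (d - reference).days
```

For instance, a diagnosis on 2001-05-17 with a reference point of 2001-05-01 becomes either the year 2001 or 16 days after the reference.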
There may be HIPAA sensitive data inherent to a study, such as cities (ex. Framingham) and small populations (ex. Hutterites) that are shown on dbGaP pages. For ages over 89, the individual level data will only be accessible to Authorized Access users, while the public variable summary will winsorize all ages over 89. For other extreme values, since HIPAA does not specify particular cut-off values, value distribution curves are checked and extreme values are hidden case by case.
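To illustrate what winsorizing ages over 89 means, the sketch below replaces every age above the cap with the cap itself. It shows the general technique only. The exact procedure dbGaP applies to public summaries is not described in this guide, and the function name is hypothetical.

```python
def winsorize_ages(ages: list[float], cap: float = 89) -> list[float]:
    """Replace every age above the cap with the cap; leave other ages as is."""
    return [min(age, cap) for age in ages]
```

With this helper, ages of 34, 89, 93, and 101 become 34, 89, 89, and 89.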
The NIH Data Management and Sharing Policy published a Supplemental Information to the NIH Policy for Data Management and Sharing: Protecting Privacy When Sharing Human Research Participant Data (NOT-OD-22-213).
Phenotype Dataset (DS) and Data Dictionary (DD) Files
This set of files is referred to as phenotype datasets and data dictionaries since this is curated by the phenotype curator.
7. What is a Phenotype Dataset (DS) File?
A Dataset (DS) file is a rectangular table of data values, subject/sample IDs, and variables, to be submitted either in .txt or .xlsx format, with .txt being the preferred format. There are 5 types of datasets required for submission:
- Subject Consent (SC) DS – 1 file only per study. This is a list of subjects (person), their consents, and biological sex.
- Subject Sample Mapping (SSM) DS – 1 file only per study. This is a list of subjects (person) mapped to their samples submitted as molecular data and high throughput sequence data.
- Pedigree DS – 1 file only per study if there are self-reported or known genetic relationships.
- Subject Phenotypes DS – 1 or more files per study. This is person-level phenotypes.
- Sample Attributes DS – 1 or more files per study. This is sample-level attributes.
Required if applicable: Sample Mapping to other NCBI databases (e.g. Trace, GEO, GenBank, public SRA) – 1 or more files per study.
Each column represents a single phenotypic variable. Row # 1 (column headers) of a data file will contain only the variable names.
Each row contains phenotypes of one Subject or attributes of one Sample. Following the first row (column headers), each subsequent row will reflect data of one subject or sample, depending on the type of file.
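The rectangular layout described above (one header row of variable names, then one row per subject or sample) can be read with Python's `csv` module. This sketch assumes the .txt file is tab-delimited, which is an assumption rather than something stated here. The function name `read_ds` is hypothetical.

```python
import csv
import io


def read_ds(text: str) -> tuple[list[str], list[dict[str, str]]]:
    """Parse DS text into (variable names from row 1, one dict per data row)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")  # assumed tab-delimited
    rows = list(reader)
    return list(reader.fieldnames or []), rows
```

For a file on disk, the contents can be read first with `open(path, newline="").read()` and passed to `read_ds`.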
8. What is a Phenotype Data Dictionary (DD) File?
A Data Dictionary (DD) file is a table that defines and describes the variables in the corresponding dataset file (DS). It should be submitted in either .txt or .xlsx format, with .xlsx being the preferred format. Each dataset (DS) file must be submitted with a corresponding DD file. You may review a complete list of data dictionary descriptions and specifications, including those required in your DD file in the APPENDIX.
The required columns and specifications for a DD File are:
Column 1: VARNAME – variable name. Best if the varname reflects the measurements taken (e.g. HDL_am, ALCOHOL_day, TREATMENT_tamoxi). Do not use "dbGaP" in the variable name.
Column 2: VARDESC – variable description. Be specific so that it is clear what you have measured. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" is more informative. Alternatively, submit study documents with details of data collection — dbGaP will link appropriate document sections to variables. For the AFFECTION_STATUS, please fill the disease name in the VARDESC.
Column 3: UNITS – units of measurement. If there are no units, leave the entry blank. If none of the variables have units, the UNITS column may be omitted.
Last set of columns: VALUES – encoded values with definitions to describe the codes used in the DS. Enter a single code=definition pair per cell; do not put compound values in one cell. See VALUES in APPENDIX for full requirement details.
Example:
Last column with header | Leave header blank | Leave header blank | Leave header blank |
---|---|---|---|
VALUES | |||
10=Elementary | 20=High School | 40=College | 4=Graduate School |
1=2-4 drinks per day | 2=5-7 drinks per day | 3=>7 drinks per day |
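Each VALUES cell holds one code and its definition joined by an equals sign. When checking a DD programmatically, split on the first equals sign only, because a definition such as ">7 drinks per day" may follow it directly. This is a minimal sketch. The function name `parse_value_cell` is hypothetical.

```python
def parse_value_cell(cell: str) -> tuple[str, str]:
    """Split one VALUES cell of the form code=definition at the first '='."""
    code, separator, definition = cell.partition("=")
    if not separator or not code.strip() or not definition.strip():
        raise ValueError(f"expected code=definition, got {cell!r}")
    return code.strip(), definition.strip()
```

For example, the cell `3=>7 drinks per day` yields the code `3` and the definition `>7 drinks per day`.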
Study Meta DS and DD Files: Subject Consent, Subject Sample Mapping (SSM), and Pedigree
9. How do I create Subject Consent (SC) DS and DD files?
The Subject Consent (SC) DS contains a comprehensive list of all unique de-identified subject IDs, their assigned consent group, and biological sex value. Open the templates under Phenotype_Data:
2a_SubjectConsent_DS.txt
2b_SubjectConsent_DD.xlsx
The 2 variables required for the DS File are SUBJECT_ID and CONSENT.
Column 1: SUBJECT_ID
The first column must be the de-identified IDs of the subjects. Enter a single de-identified subject ID for each person, and preferably use "SUBJECT_ID" as the subject ID header. A person should not have multiple SUBJECT_IDs. If necessary, you may use another variable name (but be consistent in all study files). Please do not use "dbGaP" in the variable name or the ID itself. See SUBJECT_ID in Glossary for full requirement details. dbGaP will assign a study repository aka namespace to every study. The repository/namespace + submitted SUBJECT_ID will be assigned a dbGaP generated subject ID.
IDs listed in the SUBJECT_ID column must include:
- All consented de-identified subject IDs with submitted phenotype
- All consented de-identified subject IDs with molecular data (e.g. genotypes, high throughput sequences, GEO)
- Unconsented pedigree members used for linking purposes only (without submitted data)
- Unconsented HapMap subjects used as controls or other publicly available controls with unrestricted use in genotype data
Column 2: CONSENT
The second column must be the consents of the subjects. Enter a single consent value for each person using an integer (1,2,3…) encoded in the DD. The DD consents must exactly match the consents registered in the Submission System (SS), including modifiers. If they do not match, we cannot process your study. If you are a submitter and do not have access to the SS, you can see the consent groups in the dbGaP Submission Portal for your study by clicking "View consent group" in the box on the upper right. For questions regarding the registered consent groups and Data Use Limitations (DUL), please contact your GPA. For unconsented pedigree linking members or publicly available controls with unrestricted use (including HapMaps), set CONSENT=0. Aside from the aforementioned controls with unrestricted use, no other samples may belong to CONSENT=0 individuals. See CONSENT in Glossary for full requirement details.
Do not include CONSENT code 0 in the corresponding DD. dbGaP will automatically add the consent 0 code in the DD. It will be listed as "0=Subjects used as genotyping controls and/or pedigree linking members" (no quotes). For all other consent groups > 0, use the format: code=Consent Group's Title (Consent Group's Abbreviation). For example, here is what a study with 2 consent groups might look like in the DD.
Last column with header | Leave header blank |
---|---|
VALUES | |
1=General Research Use (NPU) (GRU-NPU) | 2=Health/Medical/Biomedical (GSO) (HMB-GSO) |
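The consent VALUES format, code=Consent Group's Title (Consent Group's Abbreviation), can be generated from a small lookup so that the DD text stays consistent. The sketch below is illustrative. The function name and input structure are hypothetical, and the titles and abbreviations must still match what is registered in the Submission System.

```python
def consent_dd_values(groups: dict[int, tuple[str, str]]) -> list[str]:
    """Build DD VALUES cells from {code: (title, abbreviation)}."""
    values = []
    for code in sorted(groups):
        if code == 0:
            continue  # code 0 is added by dbGaP and should not be listed
        title, abbreviation = groups[code]
        values.append(f"{code}={title} ({abbreviation})")
    return values
```

Passing the two consent groups from the example above reproduces the two VALUES cells shown in the table.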
Column 3: SEX
Provide the biological sex value of the person listed in the SUBJECT_ID column. To speed up study processing through the dbGaP auto-pipeline, sex values have been restricted to M/Male/1 or F/Female/2 or UNK/Unknown or left empty, and should match the sex values entered into the Pedigree DS if a pedigree DS is applicable. All other values will require a resubmission.
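Because any other sex value requires a resubmission, it can help to scan the SEX column before uploading. This is a minimal sketch. It assumes the accepted values are matched exactly as written above, including letter case, which this guide does not state explicitly. The names are hypothetical.

```python
# Accepted values as listed in this guide; the empty string stands for "left empty".
ACCEPTED_SEX_VALUES = {"M", "Male", "1", "F", "Female", "2", "UNK", "Unknown", ""}


def unaccepted_sex_values(values: list[str]) -> list[str]:
    """Return the distinct SEX values that are not in the accepted set."""
    return sorted({value for value in values if value not in ACCEPTED_SEX_VALUES})
```

An empty result means every value in the column is one of the accepted codes.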
Aliases or Overlapping Subjects between Studies
Include the variables SUBJECT_SOURCE and SOURCE_SUBJECT_ID ONLY IF:
- Your study has subjects that are included in another dbGaP study OR
- Your subjects are available in a public repository with an established namespace (Coriell, NRGR, NINDS, NIMH, etc.)
This will enable dbGaP to assign the same dbGaP generated subject ID for a person and prevent users from double counting the same person downloaded from multiple dbGaP studies. If you are planning to make SUBJECT_ID = SOURCE_SUBJECT_ID for all subjects, in other words, your list of subjects is a complete overlap to the source, please let your phenotype curator know. Rather than submitting 2 additional variables (SUBJECT_SOURCE and SOURCE_SUBJECT_ID) in your Subject Consent files, your curator can assign the same dbGaP labeled study repository as the other dbGaP study or repository.
Column 4 and 5: SUBJECT_SOURCE and SOURCE_SUBJECT_ID (Submit both variables. We are unable to process SOURCE_SUBJECT_ID without a SUBJECT_SOURCE).
SUBJECT_SOURCE: Provide the namespace, such as the name of the public repository or existing dbGaP subject repository.
SOURCE_SUBJECT_ID: Provide the de-identified subject ID used in the source. Follow guidelines for SUBJECT_ID.
- For subjects who have participated in another dbGaP study, the value to use for SUBJECT_SOURCE is indicated under "dbgap_subject_repository," the 6th column of the Subject Sample Telemetry Report (SSTR). The SSTR may be found by first performing a search for the dbGaP accession number of the overlapping study on the dbGaP home page to locate that study's public page. From there, select the link Subject Sample Telemetry Report (SSTR). If you do not know the accession number of the overlapping study or are unsure of what to use as the SUBJECT_SOURCE, please work with your phenotype curator to obtain this information.
- For referencing HapMap subjects from Coriell, the SUBJECT_SOURCE value should be written as "Coriell". The SOURCE_SUBJECT_ID should be written as the de-identified subject ID assigned by Coriell. Please make sure the SEX value of the subject matches the value listed on the Coriell website.
- The SUBJECT_ID and SOURCE_SUBJECT_ID can have identical or different IDs.
- For SUBJECT_IDs that map to more than one existing or public repository, use SUBJECT_SOURCE and SOURCE_SUBJECT_ID for the first set of aliases and create additional columns for SUBJECT_SOURCE2 and SOURCE_SUBJECT_ID2, SUBJECT_SOURCE3 and SOURCE_SUBJECT_ID3, etc. Note, do not include the number "1" in the first set of aliases. For example, if you have SUBJECT_ID 101 who is known as NA1111 (Coriell) and 45678 (NHGRI), for the first alias, use SUBJECT_SOURCE=Coriell and SOURCE_SUBJECT_ID=NA1111; for the second alias, use SUBJECT_SOURCE2=NHGRI and SOURCE_SUBJECT_ID2=45678.
- Avoid a SUBJECT_SOURCE that is very general coupled with a SOURCE_SUBJECT_ID that is a simple integer. For example, SUBJECT_SOURCE=University of California and SOURCE_SUBJECT_ID=1. There is a potential for unintended subject collision; that is, two different people are assigned the same source and ID across studies. There are many University of Californias and there are many studies that use 1 as an ID.
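The column naming for multiple aliases (no number on the first pair, then 2, 3, and so on) can be sketched as follows. The function name `alias_columns` is hypothetical and the code is only an illustration of the naming rule.

```python
def alias_columns(aliases: list[tuple[str, str]]) -> dict[str, str]:
    """Turn [(source, source_subject_id), ...] into alias column values for one subject."""
    columns = {}
    for index, (source, source_id) in enumerate(aliases, start=1):
        suffix = "" if index == 1 else str(index)  # the first pair carries no "1"
        columns[f"SUBJECT_SOURCE{suffix}"] = source
        columns[f"SOURCE_SUBJECT_ID{suffix}"] = source_id
    return columns
```

For SUBJECT_ID 101 in the example above, the two aliases produce SUBJECT_SOURCE=Coriell, SOURCE_SUBJECT_ID=NA1111, SUBJECT_SOURCE2=NHGRI, and SOURCE_SUBJECT_ID2=45678.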
Example of Subject Consent DS File
SUBJECT_ID | CONSENT | SEX | SUBJECT_SOURCE | SOURCE_SUBJECT_ID |
---|---|---|---|---|
1 | 1 | 1 | ||
2 | 1 | 1 | NRGR | 1012 |
3 | 1 | 1 | NINDS | NDS00008 |
4 | 1 | 2 | ||
5 | 1 | 2 | Example Consortium | 1284yA8-B |
6 | 1 | UNK | ||
7 | 1 | UNK | ||
8 | 1 | 1 | Coriell | NA1234 |
9 | 1 | 1 | NLM | 13 |
10 | 1 | 2 | ||
1001 | 0 | 2 | ||
1002 | 0 | 1 |
Example of Subject Consent DD File
VARNAME | VARDESC | TYPE | VALUES | ||
---|---|---|---|---|---|
SUBJECT_ID | Subject ID | string | |||
CONSENT | Consent group as determined by DAC | encoded value | 1=General Research Use (GRU) | ||
SEX | Biological sex | encoded value | 1=Male | 2=Female | UNK=Unknown |
SUBJECT_SOURCE | Source repository where subjects originate | string | |||
SOURCE_SUBJECT_ID | Subject ID used in the Source Repository | string |
10. How do I create Subject Sample Mapping (SSM) DS and DD files?
The SSM is a mapping of SUBJECT_IDs (consented subjects and their phenotype data) to SAMPLE_IDs. This list of SAMPLE_IDs is an assertion of the samples that will be submitted in the molecular data, high throughput sequence data, or linked to an NCBI database: GEO, GenBank, non-human public SRA. Open the templates under Phenotype_Data:
3a_SSM_DS.txt
3b_SSM_DD.xlsx
The required variables are SUBJECT_ID and SAMPLE_ID.
Column 1: SUBJECT_ID
The first column must be the de-identified IDs of the subjects. Only enter SUBJECT_IDs that are linked to SAMPLE_IDs with submitted molecular data, high throughput sequence data, or linked to an NCBI database: GEO, GenBank, non-human public SRA. If a subject does not have these data types, do not include the subject ID. Subjects listed in the SUBJECT_ID column must be consented with CONSENT > 0 or are publicly available controls with unrestricted use (CONSENT=0) in the Subject Consent DS. For SUBJECT_IDs with multiple types of molecular data (e.g. SNP array data, RNA expression data, sequencing data), use multiple rows with identical subject ID, but distinct sample IDs. See SUBJECT_ID in Glossary for full requirement details.
Column 2: SAMPLE_ID
The second column must be the de-identified IDs of the samples. The SAMPLE_IDs in this column must be identical to those used in the molecular data (PLINK, VCFs, etc) and sequence metadata. Different sample runs or aliquots of the same sample should be identified by different SAMPLE_IDs, but the same SUBJECT_IDs. Likewise, intended duplicates should also be identified by different SAMPLE_IDs, but the same SUBJECT_IDs. Sample IDs linking to a public NCBI resource (GEO, GenBank, public SRA) should also be included. The SAMPLE_ID column should not have any repeating IDs. See SAMPLE_ID in Glossary for full requirement details.
Can the SAMPLE_ID be the same as the SUBJECT_ID?
Yes, the SAMPLE_ID can be the same as the SUBJECT_ID. Here are some common scenarios:
- If the study has 1:1 subject and sample IDs, please still submit an SSM listing the SUBJECT_ID and SAMPLE_ID identically.
- If the molecular data uses subject IDs, then treat the subject IDs as sample IDs, listing SUBJECT_ID and SAMPLE_ID identically. Please verify that each person is only assigned a single SUBJECT_ID.
- A person has multiple samples and one of the sample IDs is identical to the subject ID. This is acceptable.
Example of Subject Sample Mapping (SSM) DS File
SUBJECT_ID | SAMPLE_ID |
---|---|
1 | S1 |
2 | S2 |
3 | S3 |
4 | S4 |
5 | S5 |
6 | S6 |
6 | S7 |
7 | S8 |
7 | S9 |
7 | S10 |
8 | S11 |
8 | S12 |
Example of Subject Sample Mapping (SSM) DD File
VARNAME | VARDESC | TYPE | VALUES |
---|---|---|---|
SUBJECT_ID | Subject ID | string | |
SAMPLE_ID | Sample ID | string |
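Two of the SSM rules lend themselves to an automated check: the SAMPLE_ID column should not repeat, and every SUBJECT_ID should also appear in the Subject Consent DS. The sketch below checks only these two rules. The function name and argument names are hypothetical.

```python
from collections import Counter


def ssm_problems(ssm_rows: list[tuple[str, str]], subjects_in_sc: set[str]) -> list[str]:
    """Check (SUBJECT_ID, SAMPLE_ID) rows against two SSM rules from this guide."""
    problems = []
    # Rule 1: SAMPLE_IDs must not repeat.
    sample_counts = Counter(sample_id for _, sample_id in ssm_rows)
    for sample_id, count in sample_counts.items():
        if count > 1:
            problems.append(f"SAMPLE_ID {sample_id} appears {count} times")
    # Rule 2: every SUBJECT_ID must be listed in the Subject Consent DS.
    for subject_id in sorted({subject for subject, _ in ssm_rows} - subjects_in_sc):
        problems.append(f"SUBJECT_ID {subject_id} is not in the Subject Consent DS")
    return problems
```

A subject may appear on several rows, so repeated SUBJECT_IDs are intentionally not flagged.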
11. How do I create Pedigree DS and DD files?
The Pedigree DS lists the genealogical relationships of subjects within a study. If there are no known relationships, this file does not need to be submitted. However, if dbGaP finds that there are possible relationships between subjects after reviewing the genetic data (with the GRAF [Genetic Relationship and Fingerprinting] software), dbGaP will request a pedigree DS or include a README file with the results of IBD and/or dbGaP GRAF. If the IBD or pedigree information should not be released because of data sharing limitations, please let dbGaP know in writing. See GRAF in the Glossary for more information. Open the templates under Phenotype_Data:
4a_Pedigree_DS.txt
4b_Pedigree_DD.xlsx
The required variables are FAMILY_ID, SUBJECT_ID, FATHER, MOTHER, and SEX.
MZ_TWIN_ID is required if applicable.
Column 1: FAMILY_ID
FAMILY_IDs are de-identified and should be the same for members of the same family.
Column 2: SUBJECT_ID
SUBJECT_IDs should include any person with familial relationships relevant to the study. The SUBJECT_ID column should also include FATHER and MOTHER IDs. All SUBJECT_IDs of the pedigree file should be included in the Subject Consent (SC) DS, where the study subjects have CONSENT >=1 and linking pedigree SUBJECT_IDs have CONSENT=0. See SUBJECT_ID in Glossary for full requirement details.
Columns 3 and 4: FATHER and MOTHER
List FATHER IDs in Column 3 and MOTHER IDs in Column 4. FATHER and MOTHER IDs should be unique and de-identified. Each FATHER ID and MOTHER ID should be included in the SUBJECT_ID column of both the Pedigree DS and the Subject Consent (SC) DS. For SUBJECT_IDs that do not have parents, the FATHER and MOTHER IDs should be filled with 0 or left blank. Dummy IDs should be created for the FATHER and MOTHER IDs if no ID is known and it is necessary to indicate sibling or avuncular relationships.
Column 5: SEX
Provide the biological sex value of the person listed in the SUBJECT_ID column. To speed up study processing through the dbGaP auto-pipeline, sex values have been restricted to M/Male/1 or F/Female/2 or UNK/Unknown or left empty, and should match the sex values entered into the Subject Consent DS. All other values will require a resubmission.
Column 6: MZ_TWIN_ID
De-identified monozygotic twin IDs should indicate monozygotic twins and multiples of the same family. The MZ_TWIN_ID column should distinguish sample duplicates from samples of monozygotic twins. Monozygotic twins and multiples should be assigned the same MZ_TWIN_ID, FATHER, and MOTHER values, but different SUBJECT_IDs. For dizygotic twins and all other individuals, the MZ_TWIN_ID column should be left blank. If you wish to identify dizygotic twins, an additional variable may be included in the subject phenotypes DS.
How should I list families with half siblings?
You may list families with half siblings using either example below; Example 1 is preferred. Please remember to include the SEX column and, if applicable, the MZ_TWIN_ID column.
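Example 1:
FAMILY_ID | SUBJECT_ID | FATHER | MOTHER |
---|---|---|---|
1 | A | C | D |
1 | B | C | E |
1 | C | 0 | 0 |
1 | D | 0 | 0 |
1 | E | 0 | 0 |
Example 2:
FAMILY_ID | SUBJECT_ID | FATHER | MOTHER |
---|---|---|---|
1 | A | C | D |
1 | B | C | E |
1 | C | 0 | 0 |
1 | D | 0 | 0 |
2 | E | 0 | 0 |
Example of Pedigree DS File
FAMILY_ID | SUBJECT_ID | FATHER | MOTHER | SEX | MZ_TWIN_ID |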
---|---|---|---|---|---|
100 | 1001 | 0 | 0 | 2 | |
100 | 1002 | 0 | 0 | 1 | |
100 | 1 | 1002 | 1001 | 1 | 1 |
100 | 2 | 1002 | 1001 | 1 | 1 |
101 | 1011 | 0 | 0 | 2 | |
101 | 1012 | 0 | 0 | 1 | |
101 | 3 | 1012 | 1011 | 1 | |
102 | 1022 | 0 | 0 | 2 | |
102 | 1023 | 0 | 0 | 1 | |
102 | 4 | 1023 | 1022 | 2 |
Example of Pedigree DD File
VARNAME | VARDESC | TYPE | VALUES | ||
---|---|---|---|---|---|
FAMILY_ID | Family ID | string | |||
SUBJECT_ID | Subject ID | string | |||
FATHER | Father's Subject ID | string | |||
MOTHER | Mother's Subject ID | string | |||
SEX | Biological sex | encoded value | 1=Male | 2=Female | UNK=Unknown |
MZ_TWIN_ID | Twin ID for monozygotic twins and multiples. An MZ_TWIN_ID is not provided for dizygotic twins or multiples. | string |
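Two pedigree rules can be checked mechanically: every FATHER and MOTHER ID (other than 0 or blank) should also appear in the SUBJECT_ID column, and subjects sharing an MZ_TWIN_ID should share the same FATHER and MOTHER. The sketch below is illustrative only. The function name is hypothetical, and grouping twins by FAMILY_ID plus MZ_TWIN_ID is an assumption.

```python
def pedigree_problems(rows: list[dict[str, str]]) -> list[str]:
    """Check pedigree rows (dicts keyed by column name) against two rules."""
    subject_ids = {row["SUBJECT_ID"] for row in rows}
    problems = []
    # Rule 1: parents must be listed as SUBJECT_IDs unless 0 or blank.
    for row in rows:
        for column in ("FATHER", "MOTHER"):
            parent = row.get(column, "")
            if parent not in ("", "0") and parent not in subject_ids:
                problems.append(
                    f"{column} {parent} of SUBJECT_ID {row['SUBJECT_ID']} is not listed as a SUBJECT_ID"
                )
    # Rule 2: monozygotic twins and multiples share FATHER and MOTHER.
    parents_by_twin_group: dict[tuple[str, str], set[tuple[str, str]]] = {}
    for row in rows:
        twin_id = row.get("MZ_TWIN_ID", "")
        if twin_id:
            key = (row["FAMILY_ID"], twin_id)  # assumed scope: twin IDs within a family
            parents = (row.get("FATHER", ""), row.get("MOTHER", ""))
            parents_by_twin_group.setdefault(key, set()).add(parents)
    for (family_id, twin_id), parent_pairs in parents_by_twin_group.items():
        if len(parent_pairs) > 1:
            problems.append(
                f"MZ_TWIN_ID {twin_id} in family {family_id} has differing FATHER/MOTHER values"
            )
    return problems
```

The rows can come from any reader that yields one dictionary per line of the Pedigree DS, keyed by the column headers.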
Subject Phenotypes and Sample Attributes DS and DD Files
12. What data must be included in the Subject Phenotypes and Sample Attributes?
Metadata around the experiment or study and annotations that are necessary to reproduce any published table or analysis must be included with genomic data submissions. In particular, data pertinent to the interpretation of genomic data -- such as associated phenotype data (e.g. clinical information), exposure data, relevant metadata, and descriptive information (e.g. protocols or methodologies used) -- are expected to be shared. To avoid user questions, make sure to include self-reported RACE and relevant dates (e.g., birth, diagnosis, sample collection) written as years or normalized to a set point in time, along with any phenotypes, measured or collected data that are described in your Study Description. Subject Phenotypes hold data relevant to the individual person. Sample Attributes hold data relevant to the sample derived from the person. For instance, do not list the RACE variable in the Sample Attributes, since RACE is stable for a person across samples. For variables like TREATMENT, if the person was treated only once and data was collected, then TREATMENT could belong in the Subject Phenotypes table. However, if TREATMENT was completed multiple times, and each time a sample was extracted, then it would be better for TREATMENT to be tracked in the Sample Attributes table.
13. How do I create Subject Phenotypes DS and DD files?
The Subject Phenotypes DS file includes measured and/or descriptive traits per individual person. The primary ID in this file is the SUBJECT_ID. Open the templates under Phenotype_Data:
5a_SubjectPhenotypes_DS.txt
5b_SubjectPhenotypes_DD.xlsx
Column 1: SUBJECT_ID
Each SUBJECT_ID needs to be unique and should be linked to only 1 row of data in the DS. All SUBJECT_IDs included in this file must be found in the subject consent (SC) DS with CONSENT > 0. No CONSENT=0 SUBJECT_IDs should appear in the Subject Phenotypes DS. CONSENT=0 subjects are not permitted to have individual level data. See SUBJECT_ID in Glossary for full requirement details.
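The Column 1 rules above can be checked locally before submission. The following is a minimal sketch, not a dbGaP tool: it assumes both DS files are tab-delimited with SUBJECT_ID and CONSENT column headers as in the templates, and the helper names (read_tsv, check_phenotype_subjects) are illustrative only. IDs are compared as text so that differences such as leading zeros are preserved.

```python
import csv
import io
from collections import Counter


def read_tsv(handle):
    """Read a tab-delimited DS file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def check_phenotype_subjects(consent_rows, phenotype_rows):
    """Return SUBJECT_IDs that break the Column 1 rules, grouped by problem."""
    consent = {row["SUBJECT_ID"]: int(row["CONSENT"]) for row in consent_rows}
    counts = Counter(row["SUBJECT_ID"] for row in phenotype_rows)
    return {
        # IDs that never appear in the Subject Consent DS
        "not_in_consent": sorted(s for s in counts if s not in consent),
        # IDs that appear in the Subject Consent DS with CONSENT=0
        "consent_zero": sorted(s for s in counts if consent.get(s) == 0),
        # IDs linked to more than one row of the Subject Phenotypes DS
        "duplicated": sorted(s for s, n in counts.items() if n > 1),
    }


# Made-up example data; real files would be opened with open(path, newline="").
sc = read_tsv(io.StringIO("SUBJECT_ID\tCONSENT\n1\t1\n2\t0\n3\t2\n"))
sp = read_tsv(io.StringIO("SUBJECT_ID\tRACE\n1\tAsian\n2\tAsian\n4\tEuropean\n1\tAsian\n"))
print(check_phenotype_subjects(sc, sp))
```

A repeated SUBJECT_ID is only a problem for a non-longitudinal DS; a longitudinal DS is expected to repeat IDs and relies on a composite UNIQUEKEY instead.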
All other Column Headers: VARNAMES (variable names)
Submit the following types of variables:
- Review section: "What data must be submitted"
- Affection status: Provide the disease or phenotype of cases in the VARDESC for this variable. Do not use this variable if your study does not involve the comparison of cases and controls for singular diseases or phenotypes sharing a common pathological origin.
- Race/ethnicity/ancestry/heritage
- Relevant dates (e.g., birth, diagnosis) written as years or normalized to a set point in time. Do not include month and days directly tied to the person, which are considered HIPAA sensitive. Click here to see the algorithm dbGaP uses to find HIPAA sensitive dates: HIPAA
- Since the SEX variable is already required in the Subject Consent DS, you do not need to resubmit it in the Subject Phenotypes DS. However, if SEX is already part of your data, you do not need to remove it from the Subject Phenotypes DS.
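One way to prepare date variables as described in the list above is to keep only the year, or to replace each date with the number of days elapsed since a study-defined reference point. The sketch below assumes the enrollment date is that reference point. The choice of reference, the function names, and the variable names ENROLL_YEAR and DAYS_TO_DIAGNOSIS are all illustrative, and the code does not reproduce dbGaP's HIPAA date-detection algorithm.

```python
from datetime import date


def year_only(d: date) -> int:
    """Keep the year and drop the month and day."""
    return d.year


def days_since(reference: date, event: date) -> int:
    """Express an event as days elapsed since a reference date for the same subject."""
    return (event - reference).days


# Hypothetical subject: the real dates stay in the local study records only.
enrollment = date(2015, 3, 14)
diagnosis = date(2017, 6, 2)

row = {
    "ENROLL_YEAR": year_only(enrollment),                     # year of enrollment
    "DAYS_TO_DIAGNOSIS": days_since(enrollment, diagnosis),   # days after enrollment
}
print(row)
```

Normalizing to elapsed days keeps the intervals between events usable for analysis while removing the calendar month and day from the submitted data.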
Can I submit multiple subject phenotypes DS files?
You may submit multiple subject phenotypes DS/DD. Subject phenotypes files can be split by race/ethnicity, cohort, collection period, etc. The file name should indicate how the multiple subject phenotypes are split. The primary ID in each subject phenotypes file should be the SUBJECT_ID.
How do I submit data that has been measured serially or longitudinally?
If each SUBJECT_ID has a series of measurements or the data are longitudinal, below are the formatting options for this data:
- The first subject phenotypes DS may include all the variables that are stable through events, e.g., biological sex, race, prior history. A second subject phenotypes DS may include all the variables that change per event or time for a person. For example, a dataset that lists a single SUBJECT_ID multiple times because measures were collected at different events is considered a longitudinal dataset. Each row must then be made unique by a unique (composite) key. Unique keys should have scientific significance and aid in searching for covariate data; do not mark every variable in the dataset as a unique key. In this example, mark an "X" under the UNIQUEKEY column of the corresponding DD for the variables SUBJECT_ID and EVENT. This means that each subject at each particular event has one set of relevant data.
- Alternatively, you could create a single subject phenotypes DS, but have your table stretch in columns, where each event and number is a variable, such as mi_event1, mi_event2, stk_event1, stk_event2, etc., and the value would be binary. In this model, each SUBJECT_ID would only be listed once. You'd also need mi_event1_dayssinceoccurance, weight_@_mi_event_1, etc. We have received both types of submissions. We prefer option 1.
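For the first option, it is worth confirming that the variables marked as UNIQUEKEY really do make every row unique. A minimal sketch, assuming the DS has already been read into a list of dictionaries; the EVENT values and the function name are made up for illustration:

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return composite key values that occur on more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"SUBJECT_ID": "1", "EVENT": "baseline", "WEIGHT": "180.2"},
    {"SUBJECT_ID": "1", "EVENT": "year_2", "WEIGHT": "176.0"},
    {"SUBJECT_ID": "2", "EVENT": "baseline", "WEIGHT": "201.5"},
    {"SUBJECT_ID": "2", "EVENT": "baseline", "WEIGHT": "199.0"},
]

# SUBJECT_ID alone repeats, as expected for longitudinal data.
print(find_duplicate_keys(rows, ["SUBJECT_ID"]))
# SUBJECT_ID + EVENT should be unique; any key printed here needs fixing.
print(find_duplicate_keys(rows, ["SUBJECT_ID", "EVENT"]))
```

An empty list from the second call means the chosen composite key identifies each row.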
Example of a Subject Phenotypes DS File
SUBJECT_ID | AFFECTION_STATUS | RACE | EDUCATION | AGE | AGE_ONSET | HEIGHT | WEIGHT | KRAS |
---|---|---|---|---|---|---|---|---|
1 | 1 | African American | 4 | 35 | 25 | 67 | 180.2 | yes |
2 | 2 | Asian | 20 | 56 | 54 | 67 | 201.5 | no |
3 | 2 | European | 40 | 1000 | 45 | 60 | 160.5 | yes |
4 | 1 | Latin American | 20 | 37 | 35 | 75 | 99.5 | no |
5 | 2 | Asian | 10 | 46 | 40 | 61 | 315.2 | no |
Example of a Subject Phenotypes DD File - this is one table, but has been split into two for viewing purposes. For details about each column header in the DD, see the APPENDIX.
VARNAME | VARDESC | DOCFILE | TYPE | UNITS | MIN | MAX | RESOLUTION | COMMENT1 | COMMENT2 |
---|---|---|---|---|---|---|---|---|---|
SUBJECT_ID | Subject ID | string | |||||||
AFFECTION_STATUS | Case control status of the subject for [please fill in phenotypic term] | Diagnosis.pdf | encoded value | ||||||
RACE | Self-reported race | Main_exam.pdf | string | ||||||
EDUCATION | Level of education | Main_exam.pdf | encoded value | ||||||
AGE | Subject age at enrollment | Diagnosis.pdf | integer, encoded value | years | 0 | >89 | |||
AGE_ONSET | Disease onset age | Diagnosis.pdf | integer | years | 0 | >89 | |||
HEIGHT | Height measured at enrollment | Diagnosis.pdf | decimal | inches | |||||
WEIGHT | Subject's weight | Diagnosis.pdf | decimal, encoded value | pounds | 1 | ||||
KRAS | Somatic mutation in KRAS (Entrez GeneID: 3845) | Cancer.docx | string |
VARIABLE_SOURCE | SOURCE_VARIABLE_ID | VARIABLE_MAPPING | UNIQUEKEY | COLLINTERVAL | ORDER | VALUES | ||||
---|---|---|---|---|---|---|---|---|---|---|
NCI | Subject ID | X | Collected in Exam 1 | |||||||
Collected in Exam 1 | 1=Control | 2=Case | 3=Other | |||||||
MSH | Race | Collected in Exam 1 | ||||||||
MSH | Educational Status | Collected in Exam 1 | 99=NA | 10=Elementary | 20=High School | 40=College | 4=Graduate School | |||
PhenX | PX010101020000 | Identical | Collected in Exam 1 | List | 9999=Missing | 1000=Not assessed | INTEGERS | |||
MSH | Age of Onset | Collected in Exam 1 | ||||||||
MSH | Body Height | Collected in Exam 1 | ||||||||
MSH | Body Weight | Collected in Exam 1, 2, 3 | List | 1000=Not assessed | DECIMALS | 9999=Unknown | ||||
LNC | KRAS gene mutations tested for in Blood or Tissue by Molecular genetics method Nominal | Collected in Exam 3 |
14. How do I create Sample Attributes DS and DD files?
The Sample Attributes DS includes measured and/or descriptive traits per individual sample (not person). A person may be represented by multiple samples. Therefore, the primary id in this file is the SAMPLE_ID. Open the templates under Phenotype_Data:
6a_SampleAttributes_DS.txt
6b_SampleAttributes_DD.xlsx
Column 1: SAMPLE_ID
Only include SAMPLE_IDs that are listed in the subject sample mapping (SSM) DS and belong to SUBJECT_IDs that have CONSENT>0 in the subject consent (SC) DS. SAMPLE_IDs belonging to CONSENT=0 SUBJECT_IDs should not appear in the Sample Attributes DS file. The SAMPLE_ID should use the exact same syntax used for the SAMPLE_ID listed in the SSM. For example, '0AB12' is not the same as 'AB12', nor is '123-1' the same as '123_1'. Each SAMPLE_ID should be represented by 1 row of data in the DS. See SAMPLE_ID in Glossary for full requirement details.
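The SAMPLE_ID rules above involve two lookups: sample to subject through the SSM, then subject to consent through the SC. The sketch below illustrates that chain with exact string comparison, so '0AB12' and 'AB12' are treated as different IDs. It assumes the three files have already been reduced to plain dictionaries and a list; the function name is hypothetical, and treating a subject that is missing from the SC DS like CONSENT=0 is an assumption of this sketch.

```python
def check_sample_ids(consent, ssm, attribute_sample_ids):
    """Group Sample Attributes SAMPLE_IDs by the rule they break.

    consent: dict of SUBJECT_ID -> CONSENT (int), from the SC DS
    ssm: dict of SAMPLE_ID -> SUBJECT_ID, from the SSM DS
    attribute_sample_ids: SAMPLE_ID column of the Sample Attributes DS, in file order
    """
    problems = {"not_in_ssm": [], "consent_zero": [], "duplicated": []}
    seen = set()
    for sample_id in attribute_sample_ids:
        if sample_id in seen:
            problems["duplicated"].append(sample_id)
            continue
        seen.add(sample_id)
        if sample_id not in ssm:
            problems["not_in_ssm"].append(sample_id)
        # Assumption: a subject missing from the SC DS is treated like CONSENT=0.
        elif consent.get(ssm[sample_id], 0) == 0:
            problems["consent_zero"].append(sample_id)
    return problems


# Made-up example data.
consent = {"1": 1, "2": 1, "3": 0}
ssm = {"0AB12": "1", "123-1": "2", "S9": "3"}
sample_ids = ["0AB12", "AB12", "123_1", "S9", "0AB12"]
print(check_sample_ids(consent, ssm, sample_ids))
```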
Columns 2-5: NCBI BioSample variables included in the Sample Attributes DS
The NCBI BioSample database (https://www.ncbi.nlm.nih.gov/biosample/) contains descriptions of biological source materials used in experimental assays. Each of your samples will be assigned a BioSample accession number and will thus be searchable through BioSample. The first three variables below must be included to provide meaningful data for each sample's BioSample entry. HISTOLOGICAL_TYPE should only be included if applicable.
- BODY_SITE – the collection site of the sample (ex. skin, breast, peripheral blood, inner oral cavity). If the sample is from a xenograft, you may rename the variable.
- ANALYTE_TYPE – the analyte type of the sample (ex. DNA, RNA). If the same sample ID was used for both DNA and RNA aliquots, the value should be "DNA/RNA" instead of listing the sample twice. The BioSample database does not allow multiple values for the same sample ID.
- IS_TUMOR – the tumor status of the sample. The values can be binary such as yes/no or encoded 1=yes and 2=no. For non-cancer studies, the values in IS_TUMOR should be "no" or "unknown."
- HISTOLOGICAL_TYPE – the sample's cell or tissue type/subtype (ex. melanocytes, buccal cells, embryonic stem cells, carcinoma, lymphoma, and mixed types). If the histological type is not known or is identical to the BODY_SITE, do not include this variable.
All other Column Headers: VARNAMES (variable names)
Most institutes request all data pertinent to the interpretation of genomic data, such as clinical information, exposure data, and relevant metadata pertaining to the sample. Please note that the template (6a_SampleAttributes_DS.txt) provided is based on a cancer study and the variables listed may be useful for cancer studies. However, if your study is not a cancer study, please do not include the cancer variables. Instead, submit additional sample attribute variables that will provide a greater understanding of the study. For example: sample collection date, sample extraction method and date; batch and center effects, sample plate or well number; sample run date, sample QA results; and sample affection status (ex. psoriatic skin sample vs. non-psoriatic skin sample from a case subject who has psoriasis). Relevant dates (e.g., sample collection date) that are directly tied to a person should be written as years or normalized to a set point in time. Do not include month and days directly tied to the person, which are considered HIPAA sensitive. Click here to see the algorithm dbGaP uses to find HIPAA sensitive dates: HIPAA.
Can I submit multiple sample attributes DS files?
You may submit multiple sample attributes DS/DD. You may split out sample attributes files to separate them by race/ethnicity, cohort, collection period, etc. Each of the sample attributes files should have SAMPLE_ID as the primary id. The BioSample required variables should appear only once per SAMPLE_ID, and the values for the BioSample required variables should not conflict. For example, a SAMPLE_ID cannot be marked as both TUMOR and non-TUMOR. In this case, we would request that an additional SAMPLE_ID be created. If this is not possible, please contact the dbGaP phenotype curator.
How do I submit data that has been measured serially or longitudinally?
If each SAMPLE_ID has a series of measurements or the data are longitudinal, this table may list a SAMPLE_ID multiple times. We treat this as a longitudinal dataset, where SAMPLE_ID + [variable] are the variables that make the row unique. Unique (composite) keys should have scientific significance and aid in searching for covariate data; do not mark every variable in the dataset as a unique key. Mark an "X" under the UNIQUEKEY column for the key variables in the corresponding DD. In this case, we recommend submitting the BioSample required variables in a separate sample attributes DS/DD.
Example of a Sample Attributes DS File - this is one table, but has been split into two for viewing purposes.
SAMPLE_ID | BODY_SITE | ANALYTE_TYPE | IS_TUMOR | HISTOLOGICAL_TYPE | COLLECTION_AGE |
---|---|---|---|---|---|
S1 | Skin | DNA | Y | Melanoma | 25 |
S2 | Lung | RNA | Y | Liposarcoma | 54 |
S3 | Buccal | DNA | N | Buccal cells | 45 |
S4 | Skin | RNA | N | Skin | 35 |
S5 | Skin | RNA | N | Keratinocytes | 40 |
PRIMARY_METASTATIC_TUMOR | PRIMARY_TUMOR_LOCATION | TUMOR_STAGE | TUMOR_GRADE | TUMOR_TREATMENT |
---|---|---|---|---|
Primary | Skin | II | G3 | Chemotherapy and biological therapy |
Primary | Peritoneal cavity | III | G2 | Radiation |
NA | NA | NA | NA | NA |
NA | NA | NA | NA | NA |
NA | NA | NA | NA | NA |
Example of a Sample Attributes DD File - For additional options for the DD, see the APPENDIX.
VARNAME | VARDESC | TYPE | UNITS | MIN | MAX | UNIQUEKEY | VALUES | |
---|---|---|---|---|---|---|---|---|
SAMPLE_ID | Sample ID | string | X | |||||
BODY_SITE | Body site where sample was collected | string | ||||||
ANALYTE_TYPE | Analyte type | string | ||||||
IS_TUMOR | Tumor status | encoded value | Y=Is tumor | N=Is not a tumor | ||||
HISTOLOGICAL_TYPE | Cell or tissue type or subtype of sample | string | ||||||
COLLECTION_AGE | Subject's age at sample collection | integer | years | 0 | >89 | |||
PRIMARY_METASTATIC_TUMOR | Primary tumor, metastasis, or transformed cell line | string | ||||||
PRIMARY_TUMOR_LOCATION | Primary tumor location | string | ||||||
TUMOR_STAGE | Tumor stage of sample | string | ||||||
TUMOR_GRADE | Tumor grade of sample | string | ||||||
TUMOR_TREATMENT | Type of tumor treatment for sample | string |
Medical Images
15. How do I submit Medical Images and in what format?
De-identified medical image files may be submitted only if they meet the following criteria: 1) the images correspond to subjects or samples with phenotype and genomic data that are submitted to dbGaP and 2) the images could not be submitted to an image-specific database or be made publicly available. Since no validation or QC is run on images submitted to dbGaP, we need an attestation, provided as a README, confirming that no PII or HIPAA-sensitive information is included with the image submission. When submitting multiple files, or a single file larger than 1 TB, bundle them in a zip or tar archive.
Note: H&E Visium slides should be submitted to NCBI GEO.
Also, create a mapping of SUBJECT_IDs to the image files. Open the templates under Medical_Images:
SubjectImageMappingDS.txt
SubjectImageMappingDD.xlsx
Column 1: SUBJECT_ID
All SUBJECT_IDs included in this file must be found in the subject consent (SC) DS with CONSENT>0. No CONSENT=0 SUBJECT_IDs should appear in the Subject Image Mapping DS. See SUBJECT_ID in Glossary for full requirement details.
Columns 2-5: IMAGE_TYPE, BODY_SITE, FILENAME, FILE_TYPE
Include the following four variables for image data.
- IMAGE_TYPE – the type of image (ex. CT scan, photograph, MRI).
- BODY_SITE – the body site of the image (ex. brain, chest, eye).
- FILENAME – the filename including the file extension.
- FILE_TYPE – the file type (ex. jpg, dng, tif).
All other Column Headers: VARNAMES (variable names)
Any other relevant information related to the image can be included as additional columns.
Example of a Subject Image Mapping DS File
SUBJECT_ID | IMAGE_TYPE | BODY_SITE | FILENAME | FILE_TYPE |
---|---|---|---|---|
1 | photograph | fundus | fundus01a.jpg | jpg |
1 | photograph | fundus | fundus01b.jpg | jpg |
4 | photograph | fundus | fundus04a.jpg | jpg |
4 | photograph | fundus | fundus04b.jpg | jpg |
6 | CT scan | chest | chest06.tif | tif |
7 | CT scan | chest | chest07.tif | tif |
Example of a Subject Image Mapping DD File
VARNAME | VARDESC | TYPE | VALUES |
---|---|---|---|
SUBJECT_ID | Subject ID | string | |
IMAGE_TYPE | Image type | string | |
BODY_SITE | Body site of image | string | |
FILENAME | Filename including the file extension | string | |
FILE_TYPE | File type | string |
16. How do I verify that my DS and DD Files will pass dbGaP's phenotype quality control (QC) tests?
Go through this list prior to submission. It will help you eliminate the most common formatting and data consistency errors. You can also check your Subject Consent DS, Subject Sample Mapping (SSM) DS, and Pedigree DS against your Genotype data (PLINK and VCF) on your own system using GaPTools.
- All IDs are two-step de-identified.
- Each DS and DD must be submitted as a separate file. Please do not submit multiple worksheets per file.
- Submit tab-delimited .txt and .xlsx files only. Tab-delimited txt files are preferable for the DS. Excel (.xlsx) format is preferable for the DD. The final files provided to Authorized Users of the study will be in the tab-delimited txt format.
- The DS should be a rectangular table. Column headers should not exceed columns of values. Column headers should not be missing. Primary IDs should not be missing for the row. Remove empty rows or columns between data values or above the headers.
- File names should not contain special characters, spaces, hyphens, brackets, periods, or forward (/) or backward slashes (\).
- Check formatting and spelling of the DS and DD. Remove non-ASCII characters, new line feeds or carriage return characters (they may appear as a square or a question mark in a box), and unintended quotes (""").
- Check that "dbGaP" is not used in any of the variable names or the IDs. "dbGaP" is reserved for dbGaP generated items that are included in the study release.
- Variable names in a DS and its corresponding DD must be identical in syntax. For example, "day_ enrollment" is not the same as "day_enrollment" or "Day_Enrollment". These inconsistent names differ in letter case and in an extra space.
- Variable names and variable descriptions need to be distinct within a dataset.
- The same variable name must be used for the ID columns. For example, do not use SUBJECT_ID in a dataset, but Patient_ID in another dataset for the same identifier. If you use SUBJECT_ID as the primary subject ID variable name, then use SUBJECT_ID as the variable name in every dataset that lists out the subjects. Likewise, keep the primary sample ID variable name identical throughout all the datasets.
- All SAMPLE_IDs listed in the Subject Sample Mapping (SSM) dataset must match the SAMPLE_IDs in the molecular data and high throughput sequences. The syntax must be identical. For example, SAMPLE_ID "1034_abc.20" is not the same as SAMPLE_ID "1034-abc.20" or "1034_abc.2".
- Remove HIPAA sensitive data, such as patient's name, doctor's name, months and days from dates directly tied to the subject, etc. Year is acceptable. Click here to see the algorithm dbGaP uses to find HIPAA sensitive dates: HIPAA
- Some HIPAA sensitive data are permissible, such as age > 89 for studies that focus on older populations, or geographic locations, etc. Please work with the dbGaP curator to make sure that the public summaries are correctly hidden.
- Define codes for variable values in the respective DD, entering one code definition per cell. For example, "1=Control" in one cell and "2=Case" in a separate cell. Do NOT enter codes delimited with semicolon or commas in a single cell, like "1=Case; 2=Control."
- Each row of each dataset must be unique (marked by a UNIQUEKEY in the DD). Thus, if a SUBJECT_ID is to appear more than once in a Subject Phenotypes DS, there must be at least one other variable that forms a unique key for each row. The same condition must be met if the same SAMPLE_ID is to be repeated in the Sample Attribute DS.
- Remove duplicate SUBJECT_IDs from Subject Consent and Pedigree DS files and remove rows with duplicate SAMPLE_IDs from the SSM DS.
- Remove completely identical rows and empty rows.
- Check that all subject IDs found in the subject phenotypes DS have CONSENT>0 and all sample IDs in the sample attributes DS belong to subjects that have CONSENT>0. In other words, CONSENT=0 (pedigree linking members and HapMap controls) and unconsented IDs should not be in any individual-level subject phenotypes or sample attributes DS.
- If there are multiple sex variables captured for the same person, verify that all sex values reported among the phenotype component datasets (Subject Consent, Pedigree, and Subject Phenotypes DS) are consistent with the sex determined by the genotypes, unless the conflicting variable indicates self-reported sex (Subject Phenotypes DS only).
- If there are multiple case control variables captured for the same person, verify that all case control values are consistent for the same individuals.
- Double check for data consistency!
Review the descriptions of variables in the APPENDIX for specific instructions on labeling header columns and file-naming conventions. Also read the Glossary for definitions of variables. To see the QC checks that dbGaP completes for each study, see section "What happens once I submit my core data files and phenotype files?".
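Several of the formatting items in the checklist above lend themselves to a quick local script. The sketch below covers four of them: rectangular tables, non-ASCII characters, DS/DD variable name agreement, and one code definition per cell. It is an illustration under stated assumptions, not a reimplementation of dbGaP's QC. All function names are hypothetical, and the pattern used to spot several codes packed into one cell is a heuristic that can produce false positives.

```python
import re

# A semicolon or comma followed by what looks like another "code=" start,
# e.g. "1=Case; 2=Control". Heuristic only.
MULTI_CODE = re.compile(r"[;,]\s*[^=;,\s]+\s*=")


def ragged_rows(lines):
    """Return 1-based line numbers whose tab-separated field count differs from the header's."""
    width = len(lines[0].rstrip("\r\n").split("\t"))
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if len(line.rstrip("\r\n").split("\t")) != width
    ]


def non_ascii_lines(lines):
    """Return 1-based line numbers that contain any non-ASCII character."""
    return [number for number, line in enumerate(lines, start=1) if not line.isascii()]


def varname_mismatches(ds_header, dd_varnames):
    """Compare DS column headers with DD VARNAME entries exactly (case and spaces matter)."""
    ds, dd = set(ds_header), set(dd_varnames)
    return {"only_in_ds": sorted(ds - dd), "only_in_dd": sorted(dd - ds)}


def packed_code_cells(values_cells):
    """Return VALUES cells that appear to hold more than one code definition."""
    return [cell for cell in values_cells if MULTI_CODE.search(cell)]


# Made-up DS content: line 3 is short one column, line 4 has an accented character.
ds_lines = [
    "SUBJECT_ID\tday_ enrollment\tRACE\n",
    "1\t10\tAsian\n",
    "2\t12\n",
    "3\t15\tEurop\u00e9en\n",
]
print(ragged_rows(ds_lines))
print(non_ascii_lines(ds_lines))
print(varname_mismatches(ds_lines[0].rstrip("\n").split("\t"),
                         ["SUBJECT_ID", "day_enrollment", "RACE"]))
print(packed_code_cells(["1=Control", "2=Case", "1=Case; 2=Control"]))
```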
Study Documents
17. What type of Study Documents may I submit and in what format?
Any document that describes study methods and data collection (e.g., protocols, questionnaires, manuals of procedures and operations, consents) and that can be published on the public dbGaP page should be submitted. The preferred file format is PDF, though Word and Excel documents will be accepted. Please submit tabular images in Excel.
The study documents may be annotated by the phenotype curator or submitted with annotations using variable or dataset names. These annotations can be added directly to the document or to a DD under the "DOCFILE" column. The annotations link text segments to corresponding variables and/or datasets. The final annotations will be visible on the public dbGaP pages. Click on the 2 links below to see how to go from the Variable Summary page to the Study Document page and vice versa.
18. What should I know about editing, proofreading, and copyright?
Proofreading and Editing – Please proofread and edit your documents thoroughly before submission — they will be posted to the public dbGaP web pages.
dbGaP will not perform any copyediting or proofreading. Any content changes require submission of a new version of the document. Documents that contain potential HIPAA rule violations will not be processed and need to be resubmitted following redactions.
Copyright – Previously Published Work – If you submit a published work (article, review, book chapter, questionnaires, etc.) for dbGaP posting, please include documentation that authorizes the public posting on the dbGaP website. If you are unsure about the copyright status of a document, contact the publisher or owner of the work.
NIH does not claim copyright of any submitted documents. However, NIH must be given nonexclusive rights to freely distribute all documents on the dbGaP site.
Molecular Data
19. How do I submit Molecular Data to dbGaP?
Do not submit BAM, CRAM, or FASTQ files as "Molecular Data" type to the dbGaP Submission Portal. High throughput human sequence data and alignment information should be submitted through a separate process: High throughput sequencing submission instructions.
Molecular data that is not high throughput sequence data should be submitted to the dbGaP Submission Portal under the section "Other files" with type "Molecular Data". Submit it along with the phenotype data or as early as possible so that it enters a dbGaP genotype curator's queue. Do not submit each file separately; bundle the files instead. To compress and bundle files, zip first, then tar. Do not tar first and then zip, as this will significantly delay processing.
For VCFs, compress the files with bgzip instead of zip, because bgzip's block compression can be used directly with VCFtools and BCFtools. This enables dbGaP to run QC checks quickly and report any errors back to you. For VCF files larger than 300GB, please split by chromosome, then tar the set of VCFs and submit as a single tarball.
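Whether a compressed VCF was written by bgzip rather than plain gzip or zip can be checked from its first bytes. The sketch below relies on the BGZF block layout described in the SAM/BAM format specification, where each block is a gzip member whose extra field carries a "BC" subfield. It is an illustration, not the check dbGaP runs, and the function names are hypothetical.

```python
import gzip
import struct

# The standard 28-byte empty BGZF block, used here only as demo input.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def looks_like_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a gzip member carrying a BGZF 'BC' subfield."""
    # gzip magic (1f 8b) plus deflate compression method (08)
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08":
        return False
    # FEXTRA flag must be set for a BGZF block
    if not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B' (66), SI2='C' (67), SLEN=2
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False


def file_looks_like_bgzf(path) -> bool:
    """Apply the same test to the first bytes of a file on disk."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(1024))


print(looks_like_bgzf(BGZF_EOF))               # BGZF block
print(looks_like_bgzf(gzip.compress(b"vcf")))  # plain gzip member, no extra field
print(looks_like_bgzf(b"PK\x03\x04"))          # zip archive signature
```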
Essential requirement: Sample IDs must be de-identified. Every sample ID found in an individual level Molecular Data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset. See SAMPLE_ID in Glossary for full requirement details. Sample IDs that do not follow the requirements will not be processed. If sample IDs are modified, please also modify the corresponding Sample Attributes dataset.
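For VCFs, the sample IDs to compare against the SSM are the column names that follow the nine fixed columns (#CHROM through FORMAT) on the header line. A minimal sketch of that comparison, assuming the SSM sample IDs are already loaded into a list; the function names and example IDs are illustrative:

```python
import io

FIXED_VCF_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def vcf_sample_ids(handle):
    """Return the sample IDs from the #CHROM header line of a VCF text stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            return line.rstrip("\r\n").split("\t")[FIXED_VCF_COLUMNS:]
        if not line.startswith("#"):
            break  # reached data lines without seeing the header
    raise ValueError("no #CHROM header line found")


def unmapped_samples(vcf_ids, ssm_sample_ids):
    """Return VCF sample IDs that have no exact match in the SSM."""
    known = set(ssm_sample_ids)
    return [sample for sample in vcf_ids if sample not in known]


# Made-up VCF header with two samples.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1034_abc.20\t1034-abc.21\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n"
)
ids = vcf_sample_ids(io.StringIO(vcf_text))
print(unmapped_samples(ids, ["1034_abc.20", "1034_abc.21"]))
```

For a bgzip-compressed file, `gzip.open(path, "rt")` from the standard library can supply the text stream, since BGZF files are valid multi-member gzip files.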
Please include a README with a brief description of the data that you are submitting. It should minimally include genotyping steps, genome build, and technology if applicable.
Common questions and errors:
- The sample ID is ideally the final aliquot used for a sequencing run or well on an array plate. A person with a given subject ID can have many samples.
- If a sample ID is a technical control such as Coriell HapMap sample or a publicly available control, it must be mapped to a subject ID in the Subject Sample Mapping (SSM) dataset and that subject ID must be explicitly marked as CONSENT=0 in the Subject Consent (SC) dataset.
- Single cells or multiplexed single cells should each be given a unique sample ID.
- Sample IDs in sequence derived genotypes (VCFs) must be identical to the sample IDs used in the corresponding sequence data (BAMs).
- Include a File Sample Mapping (FSM) file to map sample IDs to single sample data files.
- Include a README to describe the content of the data files and any QC anomalies, especially if the content is not in one of the formats listed below and fits into the "Other" category.
- Check that files are not truncated.
See the Molecular Data section for guidelines, common errors, dbGaP qc checks, and where to submit molecular data.
Click on the links below to hop to a specific molecular data type:
- Genotype (SNP array in PLINK format and if available, raw data (Illumina .idat or Affymetrix .cel), and genotype reports)
- SNP, CNV, and structural variants derived from sequence data (.vcf)
- Imputation (IMPUTE2, MACH, MINIMAC, SHAPEIT)
- Expression/Epigenetic array or counts (.txt, .tsv)
- Somatic and/or germline mutation annotations (.maf)
- Other (individual and summary level data (.txt or .csv matrix), -omics, single cell, UCSC BED format, etc.)
20. How do I submit High Throughput Sequence data and alignment information?
dbGaP accepts high throughput human sequence data in BAM, CRAM, and FASTQ formats. Choose one data storage option below. Existing studies may have a combination of the options but all new submissions should follow a single option.
- NCBI Data Storage (SRA): Both sequence metadata and sequence files are submitted to NCBI and available for download from NCBI servers OR direct cloud access through Google Cloud and AWS (Amazon).
- Cloud Data Storage (External Data Source including Trusted Partners): The sequence metadata is submitted to NCBI with details of sequence file cloud storage locations. This option requires sponsoring institutes to configure your study with an NIH data repository. Sequence files will be accessed either through the cloud storage provider using dbGaP credentials via Authorized Access or through an NIH data repository platform if available.
Note: Sequence Read Archive (SRA) - Please do not submit individual-level human sequence data directly to the SRA. While SRA brokers the sequence data for dbGaP, the sequence data should be uploaded through the dbGaP pipeline described below. This ensures that individual-level human sequence data is properly tied to consented individuals in a dbGaP study and that users are able to request data through dbGaP's Authorized Access. For non-human sequence data, such as microbiome or 16S rRNA cleaned of human contamination, please work directly with SRA by going to their website. This is colloquially referred to as "public SRA" as the data does not require Authorized Access. If you would like to link publicly available metagenomic sequences free of human sequence contaminants to controlled access subjects or samples in a dbGaP study, jump to public SRA.
Steps to submitting Human Sequence to a dbGaP study
Option 1: NCBI Data Storage (SRA) - sequence data will be submitted to dbGaP
- Update or verify that your study is configured for sequence data submission by selecting yes to #5 "Sequence" in the Submission Portal Study Data Outline.
- Submit Subject Consent (SC) and the Subject Sample Mapping (SSM) files. A dbGaP phenotype curator will validate and load the submitted IDs and consents in the dbGaP database, and each sample ID will be assigned an NCBI BioSample ID (SAMN#). This process instantiates IDs and verifies that sequences submitted for samples belong to consented subjects. This may take a few days.
- The dbGaP Submission Portal sends an email with a sequence metadata spreadsheet attached, with your registered sample IDs already entered.
- Complete and submit the sequence metadata spreadsheet to the dbGaP Submission Portal for only the sequence data you plan to submit for this version of the study. Do not include sequence data that have previously been submitted to the study (for example in an earlier version) in the spreadsheet. Remove sample IDs that do not have sequence data. Take care not to edit the spreadsheet column headers, and use only the controlled vocabulary options in fields with a selection menu, to ensure that the sequence metadata will pass automated checks. Detailed instructions for completing each column in the sequence metadata spreadsheet are at: https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap/#submission-overview.
- You will receive an email in 2-3 business days indicating if your sequence metadata has errors and needs to be resubmitted OR has been validated and loaded.
- Once your sequence metadata spreadsheet has been validated and loaded, you will receive email instructions to upload sequences through ASPERA. You will be provided with a private key to use the asp-dbgap account.
- Once your sequence data has been uploaded, the files will be validated. Specifically, the number of samples, number of files, file names, and md5s must match exactly what was indicated in the sequence metadata spreadsheet. You will be notified within 5 business days of your sequence upload status.
- All sequence data must be processed before a study can be released through Authorized Access.
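Because the upload is validated against the file names and md5s listed in the sequence metadata spreadsheet, it can help to compare both locally before uploading. The sketch below uses Python's hashlib. It assumes you have already extracted a filename-to-md5 mapping from your spreadsheet; the dictionaries, file names, and function names are illustrative only.

```python
import hashlib
import io


def md5_of_stream(handle, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a binary stream without loading it all into memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Compute the md5 hex digest of a file on disk."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def compare_with_metadata(expected, actual):
    """Compare filename -> md5 mappings from the spreadsheet (expected) and from disk (actual)."""
    shared = expected.keys() & actual.keys()
    return {
        "missing_files": sorted(expected.keys() - actual.keys()),
        "unexpected_files": sorted(actual.keys() - expected.keys()),
        "md5_mismatch": sorted(
            name for name in shared if expected[name].lower() != actual[name].lower()
        ),
    }


# Made-up example: in-memory bytes stand in for sequence files.
expected = {"S1.bam": md5_of_stream(io.BytesIO(b"reads-1")), "S2.bam": "0" * 32}
actual = {
    "S1.bam": md5_of_stream(io.BytesIO(b"reads-1")),
    "S3.bam": md5_of_stream(io.BytesIO(b"reads-3")),
}
print(compare_with_metadata(expected, actual))
```

In practice the `actual` mapping would be built by calling `md5_of_file` on each file staged for upload.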
Option 2: Cloud Data Storage (External Data Source including Trusted Partners) - sequence data will not be submitted to dbGaP, rather EDS will provide a cloud location
- This option is only for studies that have an External Data Source (EDS) registered in the dbGaP Submission System and the EDS would like to provide cloud data storage locations that can be linked. Most studies will work independently with their EDS to store data and use dbGaP only for authorization, and will not need this option.
- Update or verify that your study is configured for sequence data submission by selecting yes to #5 "Sequence" in the Submission Portal Study Data Outline.
- Submit Subject Consent (SC) and the Subject Sample Mapping (SSM) files. A dbGaP phenotype curator will validate and load the submitted IDs and consents in the dbGaP database, and each sample ID will be assigned an NCBI BioSample ID (SAMN#). This process instantiates IDs and verifies that sequences submitted for samples belong to consented subjects. This may take a few days.
- The dbGaP Submission Portal sends an email with a sequence metadata spreadsheet attached, with your registered sample IDs already entered and with the additional columns necessary for cloud data submissions.
- Complete and submit the sequence metadata spreadsheet to the dbGaP Submission Portal for only the sequence data you plan to submit for this version of the study. Do not include sequence data that have previously been submitted to the study (for example in an earlier version) in the spreadsheet. Remove sample IDs that do not have sequence data. Take care not to edit the spreadsheet column headers, and use only the controlled vocabulary options in fields with a selection menu, to ensure that the sequence metadata will pass automated checks. Detailed instructions for completing each column in the sequence metadata spreadsheet are at: https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap/#submission-overview. The spreadsheet also has 5 additional columns specific to cloud data storage: active_location_URL, Bases, Reads, coverage, AvgReadLength.
- A sequence curator will verify that files referenced in your sequence metadata can be accessed. You will need to grant access to NCBI operated accounts for this process to occur.
- All sequence data must be processed before a study can be released through Authorized Access.
Do NOT upload sequences until you receive the email confirmation that your sequence metadata spreadsheet has been loaded. Sequence metadata (.xlsx) that is uploaded to the dbGaP Submission Portal typically takes 2-3 business days to process.
Do NOT submit sequence files to the dbGaP Submission Portal ASPERA account (subasp) as the sequence files are destined for the Sequence Read Archive (SRA) that uses dbGaP's controlled access.
Do submit sequence files to the ASPERA account (asp-dbgap) named in the email instructions. Sequence files (BAM, CRAM, FASTQ) typically take 3-5 business days to process, depending on the number and size.
Please split pairs of FASTQ files into subsets that are 250 GB or less when uncompressed. In those cases, additional columns of filetype, filename, and MD5 checksum can be added using the same column titles.
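The MD5 checksum column mentioned above can be filled in with any standard tool. The following is a minimal Python sketch, assuming the sequence files are on local disk. The function name `md5_of_file` and the chunk size are illustrative choices and are not part of any dbGaP or SRA tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file.

    The file is read in chunks (1 MiB by default, an arbitrary choice)
    so that large sequence files do not have to fit in memory.
    """
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("sample_R1.fastq.gz")` (a hypothetical file name) returns the 32-character hexadecimal string to paste into the checksum column.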
Instructions for sequence metadata and data upload can be found here: https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap/.
Contact for questions or status update: [email protected].
Tracking samples
A link to the Subject Sample Telemetry Report (SSTR) will be provided when the IDs and consents have been loaded. The SSTR includes a complete list of subjects, samples, consents, dbGaP assigned IDs and study repository, BioSample variables, and sequence_data_details.
RNA sequences
If RNA sequences will be submitted, please consider also submitting expression or read counts and determine whether they will be submitted to dbGaP or NCBI GEO (public). If the counts are submitted to dbGaP, upload as Molecular Data in the dbGaP Submission Portal. If the counts are submitted to NCBI GEO, submit a linking of subject or sample IDs to GEO accessions (GSM######) following the instructions here. This will enable dbGaP to link to GEO and vice versa.
Whole Genome, Exome, or Targeted sequences
If whole genome, exome, or targeted sequences will be submitted, please consider also submitting derived variant calls (VCFs or MAFs), which are more frequently used. VCFs or MAFs should be uploaded as Molecular Data in the dbGaP Submission Portal. Please make sure that the sample IDs used in the VCFs or MAFs are the same sample IDs listed in your sequence data and Subject Sample Mapping (SSM). If the sample IDs are not found in the files, please create a separate 2-column table to map sample IDs to file names.
May I submit identical sequences (i.e., same file name and md5)?
SRA's system currently will not process duplicate sequences that have the same file name and md5, whether the sequences are submitted to dbGaP SRA or public SRA. If you have a need to submit duplicate sequences to two different dbGaP studies, please directly contact SRA for guidance: [email protected].
21. How do I submit Copy Number Variation (CNV) data?
CNV is coordinated with NCBI dbVar. Individual-level CNV data should be submitted to dbGaP and released via controlled access. Summary-level (probe/primer and other assay and frequency information) copy number variation data should be submitted to dbVar and released by the public dbVar. Please click on dbVar Submission Guide if your study includes CNV data.
22. How do I link individual study subjects/samples to samples that have been submitted to NCBI databases: GEO, GenBank, SRA (public)?
Create a linking DS and DD of SUBJECT_IDs or SAMPLE_IDs to the accessions used in the applicable databases. In the Submission Portal, mark "yes" for "Subject/Sample ID links to public NCBI databases" in the Study Data Outline (SDO). Upload these files as "Phenotype data" if keyed off of the subject IDs OR upload these files as "Sample Attributes" if keyed off of the sample IDs. If only experiment or project accessions are available, then in the SDO, mark "no" for "Subject/Sample ID links to public NCBI databases". Experiment or project accessions and their corresponding URL can be listed in the Study Config web form under "Study Web Links".
GEO: repository of high-throughput gene expression data and hybridization arrays, chips, microarrays. Open the templates under Sample_NCBI_DB_Linking:
SampleGEOLinkingDS.txt
SampleGEOLinkingDD.xlsx
GenBank: genetic sequence database comprising an annotated collection of all publicly available DNA sequences. Open the templates under Sample_NCBI_DB_Linking:
SampleGenBankLinkingDS.txt
SampleGenBankLinkingDD.xlsx
SRA (public): archive of raw sequencing data and alignment information from high-throughput sequencing platforms of non-human data. Open the templates under Sample_NCBI_DB_Linking:
SamplePublicSRALinkingDS.txt
SamplePublicSRALinkingDD.xlsx
Column 1: SUBJECT_ID or SAMPLE_ID
Eliminate extra work. If additional sample IDs need to be created and/or added to the SSM to account for the sample to NCBI database accession mapping, use the subject IDs in the Subject Consent DS instead. Create a mapping of subject IDs to the NCBI database accession: Column 1 will list SUBJECT_IDs and Column 2 will list the corresponding NCBI database accession. Otherwise, use the SAMPLE_ID found in the SSM DS: Column 1 will list SAMPLE_IDs and Column 2 will list the corresponding NCBI database accession.
A sample ID can be listed multiple times if it has multiple accessions (such as GEO accessions) derived from the same sample. See SAMPLE_ID in Glossary for full requirement details.
Column 2: NCBI database accession (i.e. GEO_ACCESSION, GENBANK_ACCESSION, SRA_ACCESSION)
The sample accessions of the various NCBI databases should be linked to the submitted subject or sample IDs. This column should have distinct IDs. GEO_ACCESSIONS begin with GSM#######. GENBANK_ACCESSIONS begin with HM#######. Non-human SRA_ACCESSIONS begin with SAMN########.
Example of GEO Linking DS and DD File using GSM####### accessions
SAMPLE_ID | GEO_ACCESSION |
---|---|
S2 | GSM18467693 |
S2 | GSM18467694 |
S2 | GSM18467695 |
S10 | GSM18467696 |
S10 | GSM18467697 |
S10 | GSM18467698 |

VARNAME | VARDESC |
---|---|
SAMPLE_ID | Sample ID |
GEO_ACCESSION | GEO accession ID (GSM#) |
Example of GenBank Linking DS and DD File using HM####### accessions
SAMPLE_ID | GENBANK_ACCESSION |
---|---|
S2 | HM258784 |
S2 | HM258785 |
S2 | HM258786 |
S10 | HM258787 |
S10 | HM258788 |
S10 | HM258789 |

VARNAME | VARDESC |
---|---|
SAMPLE_ID | Sample ID |
GENBANK_ACCESSION | GenBank accession ID (HM#) |
Example of SRA Linking DS and DD File (non-human sequences that are publicly available) using BioSample SAMN######## accessions
SAMPLE_ID | SRA_ACCESSION |
---|---|
S2 | SAMN2506412 |
S10 | SAMN2506420 |
S13 | SAMN2506432 |
S14 | SAMN2506433 |
S15 | SAMN2506434 |
S16 | SAMN2506435 |

VARNAME | VARDESC |
---|---|
SAMPLE_ID | Sample ID |
SRA_ACCESSION | SRA public sequence accession ID (SAMN#) |
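Before uploading a linking DS, the two columns described above can be checked with a short script. The sketch below is only an illustration. The accession patterns are an assumption derived from the prefixes shown in this guide's examples (GSM, HM, SAMN followed by digits), and the function name `check_linking_rows` is hypothetical. Real accessions may use other prefixes, so treat a reported problem as a prompt to look at the row, not as a definitive error.

```python
import re

# Patterns based on the example accessions in this guide (assumption:
# a fixed prefix followed only by digits; real accessions may vary).
ACCESSION_PATTERNS = {
    "GEO_ACCESSION": re.compile(r"^GSM\d+$"),
    "GENBANK_ACCESSION": re.compile(r"^HM\d+$"),
    "SRA_ACCESSION": re.compile(r"^SAMN\d+$"),
}


def check_linking_rows(rows, accession_column):
    """Return a list of problems found in (id, accession) pairs.

    An ID may repeat (one sample can have several accessions),
    but each accession in Column 2 is expected to be distinct.
    """
    pattern = ACCESSION_PATTERNS[accession_column]
    problems = []
    seen = set()
    for row_number, (record_id, accession) in enumerate(rows, start=1):
        if not record_id:
            problems.append(f"row {row_number}: missing ID")
        if not pattern.match(accession):
            problems.append(f"row {row_number}: unexpected accession {accession!r}")
        if accession in seen:
            problems.append(f"row {row_number}: duplicate accession {accession!r}")
        seen.add(accession)
    return problems
```

For example, passing the first two rows of the GEO table above with `accession_column="GEO_ACCESSION"` returns an empty list, while a row that repeats an accession is reported.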
Association Analyses
23. What are Association Analysis Data Files and how should they be formatted?
Association analyses are Genomic Summary Results (GSR) that do not include individual level data. They are from genomic association studies and include linkage and burden testing on genotypic and phenotypic traits. They vary on trait, variant type, frequency, and analytic method. To facilitate data sharing, we have created a unified guideline for Minimum Information Required for Association Data (MIRAD) listed below. We also accept the newer GWAS-SSF (GWAS Summary Statistics Format) according to Hayhurst, et al., 2023.
MIRAD includes four essential data elements.
- Locus Identifier: The identifier includes the locus ID and location. It includes, but is not limited to, rs#, gene ID, and SV# for SNPs, genes, and structural variants, respectively. These identifiers can be mapped to the current genome build and can evolve with future reference genome assemblies and NCBI annotations.
- Variation summary: This contains information about alleles, allele frequencies, sample size, and genotype counts per sample group within each locus. To limit the ability of unauthorized parties to identify individual participants, data such as counts and frequencies are only accessible to users who have been approved for Authorized Access.
- Statistical significance and Effect size: The p-value and/or FDR come either from univariate testing on variants from a single locus or from burden testing on a set of rare variants from a target region provided by sequencing projects. The effect size includes the odds ratio, regression coefficient, relative risk, etc., for the effect allele. These data not only help users find causal variants and haplotypes, but can also be used to estimate the locus contribution to the heredity of the trait or disease(s).
- Phenotype Definition and Analysis Metadata: Descriptions of the analysis and method, including phenotypic covariates, parameters, and ancestry of participants, are needed to reproduce the result set once the individual data are fully available. The main trait or disease analyzed should be defined using controlled vocabulary in MeSH terms, and study population information and relevant publications (using PMIDs) should be provided.
Reasoning: Sharing these data elements allows other researchers to evaluate supporting evidence and independently verify discoveries with different samples and data models. If individual-level genotypes are inaccessible, researchers can use these elements directly for meta-analysis to increase statistical power or to develop hypotheses. Data such as locus information, effect allele, and effect size can provide valuable information for genomic medicine.
Our practice: Using MIRAD, dbGaP has developed several templates for data submission and genome browser display. You are welcome to join the discussion, make suggestions, and comment on the MIRAD proposal. The dbGaP team is committed to bringing new discoveries to the public and research communities and is happy to work with researchers to promote data sharing within the scientific community.
See the instructions in Association_Analysis.xlsx for Case-Controls (Worksheet 1) or Others (Worksheet 2). Each analysis metadata sheet is given a separate analysis accession (pha#.v#) and will need to have a unique name. If GWAS results are submitted as outputs of the software, please give brief descriptions of the column headers, indicating the linking-columns and/or relationships when several files are involved.
The GSR will be posted on the public FTP site, unless the study investigator and GPA specify that the data is sensitive in the dbGaP Submission System and needs to be restricted under dbGaP Authorized Access. Additionally, there is the option to add a study with analyses to CADA. CADA stands for the Compilation of Aggregate Genomic Data and is a collection of analyses across many dbGaP studies that can be accessed with a single Data Access Request.
Guidance for submitting a large number of analyses
Please make sure that the analyses metadata are consistent, so that processing can be scripted. For curators to quickly review the MeSH and population terms, please submit a separate 3 or 4 column table with:
Column 1: Analyses metadata file name
Column 2: MeSH term https://www.ncbi.nlm.nih.gov/mesh
Column 3: Ancestry, Race, Ethnicity using broad categories (i.e., Asian, Black or African American, Middle Eastern or North African, Native Hawaiian or Other Pacific Islander, White, More than one population, Hispanic or Latino, Not Hispanic or Latino, African, East Asian, West Asian, European, American). Please match the syntax listed here.
Column 4 (Optional): Column 3 in greater detail using free text, i.e., expanding what multiple populations mean or more specificity
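When many analyses are submitted, the review table described above can be generated by script so that it stays consistent. The sketch below is illustrative only. It assumes a tab-delimited layout without a header row and one broad category per analysis, neither of which this guide specifies. The function name `build_review_table` and the dictionary keys are hypothetical.

```python
import csv
import io

# Broad categories copied from the Column 3 description above.
BROAD_CATEGORIES = {
    "Asian", "Black or African American", "Middle Eastern or North African",
    "Native Hawaiian or Other Pacific Islander", "White",
    "More than one population", "Hispanic or Latino", "Not Hispanic or Latino",
    "African", "East Asian", "West Asian", "European", "American",
}


def build_review_table(analyses):
    """Render analyses as tab-delimited text with 3 or 4 columns.

    Each analysis is a dict with keys 'file_name', 'mesh_term',
    'population', and optionally 'detail' (free text for Column 4).
    The fourth column is written only if at least one analysis has it.
    """
    include_detail = any(a.get("detail") for a in analyses)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for analysis in analyses:
        population = analysis["population"]
        if population not in BROAD_CATEGORIES:
            raise ValueError(f"{population!r} is not one of the broad categories")
        row = [analysis["file_name"], analysis["mesh_term"], population]
        if include_detail:
            row.append(analysis.get("detail", ""))
        writer.writerow(row)
    return buffer.getvalue()
```

Rejecting unknown population terms early keeps the syntax uniform across rows, which is the point of submitting the table in the first place.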
Submitting Files
24. Who can submit files to dbGaP?
A dbGaP study must be registered in the dbGaP Submission System before data can be submitted. Please click on "How to Submit" for the overall schema. The study investigator and the person designated by the study investigator (PI Submitter) will be able to submit along with any other individuals they add as a submitter.
25. Where do I submit my dbGaP files?
Submit all files through the dbGaP Submission Portal at https://submit.ncbi.nlm.nih.gov/dbgap/. To safeguard study participants' privacy, dbGaP will not accept individual-level data via email. Once the study is registered, a Submission Portal account is provided to the study investigator and anyone whom the study investigator lists as a submitter. To obtain access to the Submission Portal account, please accept the email invitation as soon as you receive it. The email invitation will expire in 7 days. Once you have accepted, you may submit your files at any time. Individuals with "manager" roles in the dbGaP Submission Portal can also add additional submitters.
Additional guidance for file upload:
- Subject Phenotypes - upload all Subject Phenotypes DS and DD and any linking files to other NCBI databases if keyed off of subjects.
- Sample Attributes - upload all Sample Attributes DS and DD and any linking files to other NCBI databases if keyed off of samples.
- Sequence Metadata - dbGaP will email you a Sequence Metadata file after the subject IDs, sample IDs, and consents have been loaded. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. Upload the Sequence Metadata once you have completed filling out the remaining columns. Once a Sequence Metadata has been validated and loaded into our system, you will no longer be able to replace that file. You will be sent a separate email with instructions to upload your high throughput sequence data (BAM, CRAM, FASTQ). To add additional samples, submit another Sequence Metadata file with only the new samples. If the validation reports errors, you will be able to "Replace" the existing Sequence Metadata file. If you need to remove or modify entries in a validated file, please contact [email protected].
- Other files
- Molecular Data - select type "Molecular Data". No high throughput sequence data (FASTQ, BAM, and CRAM) should be submitted here.
- Study Documents - select type "Document: Phenotype" if the document can be made available on the public webpage. Some READMEs, genotype qc results, etc., are not appropriate for public distribution and should be submitted under type "Molecular Data" instead and packaged for Authorized Access only.
26. What if there are errors or updates in the data and I need to resubmit?
If you must resubmit your files for a new iteration of the current version, please follow these instructions:
- Do not submit individual-level data through email. Resubmit data through the dbGaP Submission Portal (https://submit.ncbi.nlm.nih.gov/dbgap/), so that we have a formal record of your submission.
- Update the Study Data Outline (SDO) in the Submission Portal to indicate new data types or remove incorrectly marked data types.
- Submit only new or updated files. Do not resubmit unchanged files as every submitted file is compared to previously submitted files, which will add significantly to the processing time.
- There are 3 options to update the Phenotype Component in the Submission Portal. Notify your phenotype curator of the changes.
- Option 1: Replace previously submitted datasets (DS) or data dictionaries (DD).
- Option 2: Add new DS and DD pairs. Do not use the "replace" button in the Submission Portal, but rather add pairs of DS and DD.
- Option 3: Delete previously submitted DS and DD pairs. There are no replacements.
- Keep resubmitted phenotype filenames the same or add the date to the existing filename, i.e. yyyymmdd (ex. 20190101). Do not submit filenames used two versions or more ago. dbGaP will crosscheck the latest file submission against the previous submission and report any unexpected changes.
- Double check the submission by going through the phenotype QC checklist of common errors: Quality Control
- There are 2 options to update the Molecular Data in the Submission Portal. Notify your genotype curator of the changes.
- Option 1: Add new data. In the Submission Portal, upload under "Other files" with type "Molecular Data".
- Option 2: Delete previously submitted molecular data.
- To replace, first Delete then Add.
- Replacing a Sequence Metadata file is only possible when the Sequence Metadata has failed validation. Otherwise, only new Sequence Metadata files can be added. New Sequence Metadata files should only include samples with new sequences that will be submitted. To replace or delete previously submitted samples or sequences, contact an SRA curator: [email protected]. The Sequence Metadata will first need to be updated and validated before sequences can be uploaded through ASPERA.
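To follow the instruction above to submit only new or updated files, it can help to compare the files you are about to upload against a local copy of what was submitted before. The sketch below is one possible approach and is not a dbGaP tool. It assumes both sets of files sit in local directories, and it compares file content (by MD5) instead of file names, so that a file whose name only gained a date suffix is still recognized as unchanged. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def new_or_updated_files(previous_dir, current_dir):
    """List file names in current_dir whose content was not in previous_dir.

    Only the top level of each directory is examined.
    """
    previous_hashes = {
        file_md5(path) for path in Path(previous_dir).iterdir() if path.is_file()
    }
    changed = []
    for path in sorted(Path(current_dir).iterdir()):
        if path.is_file() and file_md5(path) not in previous_hashes:
            changed.append(path.name)
    return changed
```

Files returned by `new_or_updated_files` are candidates for resubmission. Files not returned have identical content to something already submitted and can be left out.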
To download a copy of the phenotype component files from the Submission Portal, you must have "Manager" permissions in the Submission Portal. The phenotype component includes Subject Consent, Subject Sample Mapping, Subject Phenotypes, Sample Attributes, Subject/Sample to NCBI linking datasets (DS) and data dictionaries (DD).
- In the box on the upper right of the Submission Portal, click "Download Phenotype Files".
- Select phenotype files.
- Click "Download". Once the download request is initiated, the "Download" button will be disabled until the request expires.
- You will receive two emails when your files are ready to download. This should occur within a day.
- Email 1 "dbGaP: phs00####.v# Phenotype Files Ready for Download"
- Email 2 "dbGaP: phs00####.v# Passphrase"
- In Email 1, click on the "File download URL" link, which will take you to the encrypted TAR file.
- Click on "Save File" to download the encrypted TAR file, i.e., phs00####_v#.tar.gpg. This link will expire after 72 hours. Thereafter, you will need to make a new request.
The TAR file must be decrypted before it can be opened. Follow the Windows or Unix instructions below.
In Windows:
- Go to https://gnupg.org/download/
- Find "gpg4win". Click and download the executable.
- Open Kleopatra, a GUI app for GnuPG
- Select "Decrypt/Verify", and navigate to the downloaded TAR file and open it.
- When prompted, enter the passphrase from Email 2.
- If decryption is successful, the message "Decryption succeeded" will be displayed
- Enter desired output location and click "Save All".
- Go to the output folder to view the decrypted files.
In Unix:
- It is likely that your system already has gpg installed. If not, download and install GnuPG from https://gnupg.org/download/ (GnuPG Binary Releases).
- From command line, run the following command:
gpg --batch --passphrase <decryption_key> -d <downloaded_gpg_file> | tar xf -
- Go to the output directory to view decrypted files.
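The Unix decryption step can also be scripted. The following Python sketch is an alternative to the command above, not a replacement endorsed by dbGaP. It assumes GnuPG 2.1 or later is installed as `gpg`, and it passes the passphrase on standard input (`--passphrase-fd 0` with `--pinentry-mode loopback`) instead of on the command line, where other users of the system could see it. The function names are hypothetical, and the decrypted archive is held in memory, so this is only suitable for modestly sized downloads.

```python
import io
import subprocess
import tarfile


def build_gpg_command(encrypted_path):
    """Build the gpg argument list for decrypting to standard output.

    The passphrase is read from file descriptor 0 (stdin).
    Assumes GnuPG 2.1+, which supports --pinentry-mode loopback.
    """
    return [
        "gpg", "--batch",
        "--pinentry-mode", "loopback",
        "--passphrase-fd", "0",
        "--decrypt", str(encrypted_path),
    ]


def decrypt_and_extract(encrypted_path, passphrase, output_dir):
    """Decrypt a .tar.gpg file and unpack the TAR archive into output_dir."""
    result = subprocess.run(
        build_gpg_command(encrypted_path),
        input=(passphrase + "\n").encode(),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if gpg fails
    )
    with tarfile.open(fileobj=io.BytesIO(result.stdout), mode="r:*") as archive:
        archive.extractall(output_dir)
```

A call such as `decrypt_and_extract("phs00####_v#.tar.gpg", passphrase, "decrypted")` would use the file from Email 1 and the passphrase from Email 2.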
dbGaP Processing and Release
27. What happens once I submit my core data files and phenotype files to the dbGaP database?
dbGaP curators work through the study queue in the order in which studies are submitted to the dbGaP Submission Portal. Study submissions should be complete, which may include all phenotype component files, molecular (non-sequence¹) data, high throughput sequence data, study documents, and analyses. Completed study submissions can be released as soon as:
- dbGaP has finished processing the study;
- If there are high throughput human sequence data, all sequences appear ready/public in the Subject Sample Telemetry Report (SSTR);
- The registration information is consistent with the submitted data and the study registration in the Submission System is marked "Completed by GPA";
- The study investigator or PI assistant has given permission to release the study. If you have additional files to submit for the release, then the study submission is incomplete and will not have priority in the processing queue.
You can track your study's progress through the Study Status Report (SSR).
GaPTools provides pre-validation tools that you can run on your own system to check your data before submitting to dbGaP.
dbGaP will run several quality control (qc) checks upon submission.
- Automated preprocessing checks will immediately be run after submission for studies with PLINK or VCFs, Subject Consents DS, Subject Sample Mapping (SSM) DS, and Pedigree DS. The automated system will email all submitters with results from the five types of files. If one of the five types is resubmitted, the automated system will be re-run. Here is a web page showing errors and warnings that the automated system may detect: https://www.ncbi.nlm.nih.gov/gap/public_utils/messages/.
- Manual and scripted qc checks will be completed by the dbGaP curators of your study. The phenotype curator and genotype curator will separately report back errors detected, since the processing occurs at different times depending on the queue and the errors can be complex within each component.
- Phenotype Curation:
- The phenotype curator coordinates the entire study release and processes the information in the Submission System registration (SS), Submission Portal (SP), Study Config, DS and DD (Subject Consent, SSM, Pedigree, Subject Phenotypes, Sample Attributes, and Sample to NCBI Database Mapping), Study Documents, and Medical Images.
- All individual level data are split by consents.
- The manual portion includes vetting the Study Data Outline, validating consents, and checking for incongruent phenotypic values and summaries.
- Scripted qc checks look for inconsistencies between files and between all dbGaP studies, formatting errors that make loading of the datasets (DS) and data dictionaries (DD) into the dbGaP database impossible, inconsistencies between DS and DD with regard to subject consent, sex, affection status, and potential HIPAA violations.
- See Question 16 for common errors we encounter.
- Genotype Curation:
- The genotype curator processes all molecular data EXCEPT for high throughput sequence data.
- Molecular data may include SNP array, methylation, expression/epigenetic data, CNV, VCF, MAF, imputation, and other formats.
- QC checks include sex checks, pedigree checks, and unintended duplications.
- For data where these checks are not relevant, the data is packaged and split by consents.
- BAM, FASTQ, CRAM files are not processed by the dbGaP genotype curator, but by the sequence pipeline.
- Combined Curation:
- Inconsistencies between molecular data sample IDs and phenotype sample IDs, unintended data duplications, incorrect pedigree information, and subject relationships will be checked further using the dbGaP software GRAF (Genetic Relationship and Fingerprinting).
- The reports and counts in the Subject Sample Telemetry Report (SSTR) will be reviewed.
dbGaP subjects that have genomic data and have been designated "non-sensitive" for release of Genomic Summary Results (GSR) in the dbGaP Submission System will also be analyzed using GRAF-pop and included in the ALFA (Allele Frequency Aggregator) project. Studies may be contacted to correct the submitted data or provide a README if:
- They contain allele frequencies that deviate from the expected range of known allele frequencies for the 12 diverse populations and/or
- The submitted ancestry or population deviates from the computed ancestry for a large number of samples.
Careful adherence to this submission guide and the emailed error reports can eliminate the need for resubmission and quicken the schedule for release.
Splitting Files by Consents
dbGaP will assign dbGaP-generated subject IDs and sample IDs and split the final individual level datasets (both phenotypes and genotypes) for release by consent, with the exception of the three meta study DS (Subject Consent, SSM and Pedigree). Subject IDs that have been marked as aliases will be assigned the same dbGaP subject ID. The dbGaP-generated IDs will appear in the final dump files, NCBI BioSample website, and the Subject Sample Telemetry Report (SSTR).
Prior to posting your study, dbGaP will provide you with access to a preview site of your study that shows study content as it might appear on the final public dbGaP page: https://www.ncbi.nlm.nih.gov/gap. Once all the study components have been processed and you have reviewed the preview site and the SSTR, dbGaP will send an email to request the study investigator's or PI assistant's approval to release the study.
¹Sequence data (e.g. BAM, CRAM, FASTQ) should be submitted only after: 1) you have received an email with an attached sequence metadata file containing the registered subject and sample IDs, and consents. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. 2) The sequence metadata has been processed and you have received an email to upload sequences.
28. When and what will be released?
The release occurs approximately 6-8 weeks following receipt of final datasets that are without error. If there are errors, the processing time will increase. The study registration in the Submission System must be marked "Completed by GPA". Once the study investigator or PI assistant and dbGaP approve of the posting of the study, it will be released in 2-3 business days to the following sites.
Public dbGaP page (https://www.ncbi.nlm.nih.gov/gap) – includes a study report page, public summary phenotype variables and datasets, molecular data summary, study documents, analyses browser, and indexing of various study terms for users to search and filter for studies. When your study becomes publicly available, the URL will appear like https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs00####.v#.p#, where the last part of the URL is the study accession number.
Public FTP site (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/) – features the study manifest (a list of all released files), study configuration (a list of how the study is configured in the Authorized Access system), release notes (summarizes the data that has been released and any changes since the last version), summary statistics of phenotype variables, phenotype data dictionaries, study documents, and analyses aka genomic summary results (truncated, gene-level, and/or summary level).
Authorized Access portal (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) – this is the management portal for individual-level data. This site can be used to submit a data access request, manage access requests, and download approved datasets.
What if I have a paper publication or must meet a specific release date?
If you need to schedule a study release to coincide with a publication (e.g. hold the study until a certain date, try to complete study processing by a certain date), communicate to dbGaP the specific date and/or at least a general time frame as soon as you know it. dbGaP will work with you to accommodate your release schedule whenever possible.
How often can dbGaP release my study?
A dbGaP study can be released quarterly at most. Finalized data, that is data without error, must be submitted 6-8 weeks in advance for qc checks and processing. Please contact us if we need to work out a release schedule.
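The 6-8 week lead time translates into a simple date calculation. The sketch below is only a planning aid under the assumption that the full 8 weeks are needed and that no errors are found. Actual processing time depends on the curation queue. The function name `latest_submission_date` is hypothetical.

```python
from datetime import date, timedelta


def latest_submission_date(target_release, weeks_needed=8):
    """Return the date by which error-free data should be submitted.

    weeks_needed defaults to 8, the upper end of the 6-8 week range
    stated above, to leave as much margin as the guide allows for.
    """
    return target_release - timedelta(weeks=weeks_needed)
```

For a target release of 2 June 2025, `latest_submission_date(date(2025, 6, 2))` gives 7 April 2025.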
What should I do if I need my study accession public before the data has been processed?
If a publication requires that the study is public on the dbGaP page, please let us know and we can release the study report page in advance. The phenotype and molecular data can then be released at a later time.
Can an embargo date be applied?
There are no longer publication embargo dates. See https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing-faqs/. However, if you need dbGaP to postpone a study from release until a certain date, please confer with the PO and GPA assigned to your dbGaP study to agree on a date of release (only weekdays). Once a date has been decided, please email the dbGaP phenotype curator along with the PO and GPA to let us know the agreed upon date.
29. Whom may I contact with questions about my dbGaP data submission?
General dbGaP questions and Authorized Access questions: [email protected]
dbGaP Submission Portal questions: [email protected]
Phenotype and molecular data questions, please contact the assigned study curator(s).
- Phenotype curators: study config, IDs, consents, phenotype data, study documents, medical images, study release schedules, etc.
- Genotype curators: all molecular data EXCEPT for high throughput sequence related questions
- Sequence curators: high throughput sequence data. Contact [email protected]
dbGaP Team Lead: Michael Feolo [email protected]
dbGaP Versions
30. How can I submit additional data after my study is released?
Once your study is released, it is a historical record in the dbGaP database. If you would like to submit new data or update existing data (correct, remove, or add rows or columns of data), you will need to create a new version of your study. This means that the study accession of your study will be updated, e.g., phs000024.v1.p1 to phs000024.v2.p?, where the version number (v#) will increment by one and the participant set number (p#) will increment by one if subjects have been retired or have moved from one consent group to another. If only new subjects have been added, the p# will not be incremented. Once a new version of a study is released, the prior version will no longer be available for download. The new version will encompass all files from the previous version and any newly submitted data.
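The accession rule described above can be expressed in a few lines. This sketch assumes the accession has the form phs######.v#.p# with a six-digit study number, as in the example phs000024.v1.p1. The function name `next_accession` is hypothetical, and the actual accession is always assigned by dbGaP.

```python
import re

# Assumed accession layout, based on the example phs000024.v1.p1.
ACCESSION_RE = re.compile(r"^phs(\d{6})\.v(\d+)\.p(\d+)$")


def next_accession(accession, participant_set_changed):
    """Return the accession expected for the next study version.

    The version number always increments by one. The participant set
    number increments only if subjects were retired or moved between
    consent groups (participant_set_changed=True).
    """
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    study = match.group(1)
    version = int(match.group(2)) + 1
    participant_set = int(match.group(3))
    if participant_set_changed:
        participant_set += 1
    return f"phs{study}.v{version}.p{participant_set}"
```

With only new subjects added, `next_accession("phs000024.v1.p1", False)` yields `phs000024.v2.p1`. If subjects were retired or changed consent groups, the result is `phs000024.v2.p2`.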
For new versions of a study, we ask users to continue to follow the guidelines in this Submission Guide. Repeated formatting errors will increase processing time. More importantly, if the data are inconsistent (for example, IDs do not match the prior version, counts between files do not match, or reported sex values do not match genotyped sex values), each iteration of the new version will take substantially longer to process. Double check the submission by going through the checklist of common errors: Quality Control
If you need a copy of the phenotype component files that were last uploaded to the Submission Portal, please see the download instructions: here. Please note that these files may have been modified for release, so if you use these files for updates, please make sure to incorporate the prior changes. If you are an investigator who would like to access your own data after your study is released, you will need to have your GPA register you as an "Investigator with Streamlined Access".
For Submission Portal issues, email [email protected].
For data specific questions, email the assigned phenotype or genotype curator.
Submission Portal: https://submit.ncbi.nlm.nih.gov/dbgap/
- To create a new version, go to the dbGaP Submission Portal and complete the Study Data Outline (SDO). If you only want to edit the Study Config of the released study, DO NOT complete the SDO, rather contact your phenotype curator.
- Once a new version is created, your Genomic Program Administrator (GPA) will be notified. The GPA will need to complete the registration in the dbGaP Submission System for the new version. Any consent changes should be provided to the GPA and those consents should be reflected in the Submission System as early as possible. Since all processed files are packed by consents, any changes in consents after processing will require your study to be reprocessed and significantly delay your study release. Please also contact your GPA for Acknowledgment Statement changes.
- Phenotype Component
- Do not submit files that have been submitted previously and are unchanged. This will add significant time to your processing.
- Update the Study Config so that it is cumulative and describes all versions of the study.
- The Subject Consent (SC) files, Subject Sample Mapping (SSM) files, and Pedigree files should always be cumulative, e.g., all subject and sample IDs included in version 1 should be included in the version 2 SC, SSM and pedigree files. If a subject or sample ID is not included, dbGaP will mark the subject or sample ID as retired and the data will no longer be available in the new version. High throughput sequences belonging to retired samples will also be removed.
- For Subject Phenotypes and Sample Attributes datasets (DS), all subjects listed must be consented in the SC and all samples must belong to consented subjects in the SSM. dbGaP will not concatenate multiple datasets into a single dataset. When adding data, consider how users might best use the data -- should the data from all versions be in a single Subject Phenotypes DS and single Sample Attributes DS or split into many Subject Phenotypes DS and Sample Attributes DS? For more guidance on whether to update a previously submitted dataset or add a brand new dataset, see "Can I submit multiple subject phenotypes DS files?" and "Can I submit multiple sample attributes DS files?".
- There are 3 options to update the phenotype components in the Submission Portal. Notify your phenotype curator of the changes.
- Option 1: Replace previously released datasets (DS) or data dictionaries (DD). These DS or DD will be cumulative including subjects, samples, and variables from prior versions. Any subjects, samples, or variables removed from these DS and DD will be considered retired. The phenotype table accession (pht#) will remain the same, and the pht version will be incremented.
- Option 2: Add new DS and DD pairs. Previously released subject phenotypes and sample attributes datasets will be kept, and there will be additional DS and DD added. Do not use the "replace" button in the Submission Portal, but rather add pairs of datasets (DS) and data dictionaries (DD). New phenotype table accessions (pht#) will be assigned.
- Option 3: Delete previously released DS and DD pairs. There are no replacements. The phenotype table accession (pht#) will be retired for this version.
- Molecular Data
- Submit molecular data with your phenotype component submission, so that this component enters the genotype curator's queue as soon as possible.
- There are 2 options to update molecular data in the Submission Portal. Notify your genotype curator of the changes.
- Option 1: Add new data. In the Submission Portal, upload under "Other files" with type "Molecular Data".
- Option 2: Delete previously released molecular data.
- To replace previously released molecular data, first Delete then Add. The genotype accession (phg#) version will be incremented.
- If consents have been updated, a genotype curator will re-split the molecular data files according to the new consents, so that you will not need to resubmit for consent updates.
- Review the instructions for the Molecular Data section: here.
- Sequence Data
- Do not submit sequences (BAM, CRAM, FASTQ) until you have been sent a Sequence Metadata file. You will be emailed a Sequence Metadata file if you selected "yes" for Sequence Data in the Study Data Outline (SDO) and once the IDs and consents from the Subject Consent and Subject Sample Mapping datasets have been processed and loaded.
- The Sequence Metadata file should only include samples with new sequences that will be submitted. To replace or delete previously submitted sequences, contact an SRA curator: [email protected]
- Fill out the Sequence Metadata and upload to the Submission Portal. The Sequence Metadata file will be validated, and you will be sent a second email to upload sequences.
- Review the instructions for High Throughput Sequence Data section: here.
- Retain the format and corrections that were made in the previous version following the Submission Guide. Remaking the same changes will take additional time.
- Check that variable names in the Dataset and the matching Data Dictionary are identical in spelling, i.e. have the same number of spaces, same case, etc.
- Check that every variable has a variable description. Check that coded values in the Dataset have code meanings listed in the Data Dictionary.
- Check that the sex of a subject remains consistent throughout a single study. If the sex has been changed as a result of a correction, please let dbGaP know via email.
- Check that the case control status of a subject remains consistent throughout a single study.
- Check that all subjects have been assigned a consent group.
- Check that the existing subject and sample ID mappings remain the same between versions, unless there is an error and an ID needs to be remapped. In case of ID remapping, please let dbGaP know which IDs need to be remapped.
- Check that all samples are mapped to a subject and therefore to a consent group.
- Check that the data files contain the values you expect. Check for truncated values. Compare new files to the final files submitted for the previous version to check for differences and to make sure all changes are intended. If you need more information regarding which files were incorporated into the final release of the previous version, please make a request to dbGaP.
- To help us better understand the new version, please let your phenotype curator know:
- How many new subjects have been added? Note: A subject refers to a person. A person can have many samples.
- How many subjects have been deleted? To protect subject identity, if only 1 subject (person) is being deleted or added and there are no additional changes, we ask that either additional subjects are added or 1 additional subject is retired. This minimizes the possibility of comparing variable summaries between versions and identifying the phenotypes for that 1 person.
- How many subjects have changed consent groups?
- How many samples have been added?
- How many samples have been deleted?
- How many samples have been remapped to different subjects? Please list them.
- Have any samples and subjects been renamed? If yes, provide a 4 column table with the column headers: Old Sample, New Sample, Old Subject, New Subject. If only the samples are being renamed, then provide only the first 2 columns. Submit to the dbGaP Submission Portal under "Other files" with Type "Special".
- Are there updates to the phenotype component? New, replaced, or deleted datasets or variables? Have variables been renamed?
- Are there updates to the molecular data? New, replaced or deleted files? What type of molecular data is being added?
- Are there updates to the sequence data (BAM, CRAM, FASTQ)? What type of sequence data is being added (WGS, WXS, targeted, etc.)?
- Are there any other updates we should be aware of?
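Several of the counts requested above can be drafted by comparing the previous and new SC and SSM files before contacting your curator. The following is a minimal Python sketch, not a dbGaP tool. It assumes each file has already been read into a dictionary (subject ID to consent group for the SC, sample ID to subject ID for the SSM); the function name `compare_versions` and the example IDs are hypothetical.

```python
def compare_versions(old_sc, new_sc, old_ssm, new_ssm):
    """Summarize ID changes between two study versions.

    old_sc / new_sc:   dict of subject ID -> consent group (Subject Consent file)
    old_ssm / new_ssm: dict of sample ID -> subject ID (Subject Sample Mapping file)
    """
    shared_subjects = old_sc.keys() & new_sc.keys()
    shared_samples = old_ssm.keys() & new_ssm.keys()
    return {
        "subjects_added": sorted(new_sc.keys() - old_sc.keys()),
        # IDs missing from the new cumulative files would be treated as retired.
        "subjects_deleted": sorted(old_sc.keys() - new_sc.keys()),
        "subjects_changed_consent": sorted(
            s for s in shared_subjects if old_sc[s] != new_sc[s]
        ),
        "samples_added": sorted(new_ssm.keys() - old_ssm.keys()),
        "samples_deleted": sorted(old_ssm.keys() - new_ssm.keys()),
        "samples_remapped": sorted(
            s for s in shared_samples if old_ssm[s] != new_ssm[s]
        ),
    }


# Made-up example data for two versions of a study.
old_sc = {"S1": "GRU", "S2": "GRU", "S3": "HMB"}
new_sc = {"S1": "GRU", "S2": "HMB", "S4": "GRU"}
old_ssm = {"A1": "S1", "A2": "S2", "A3": "S3"}
new_ssm = {"A1": "S1", "A2": "S1", "A4": "S4"}

report = compare_versions(old_sc, new_sc, old_ssm, new_ssm)
for name, ids in report.items():
    print(f"{name}: {len(ids)} {ids}")
```

Any ID listed under "subjects_deleted" or "samples_deleted" is worth a second look, since an ID left out of the cumulative files by accident would be retired in the new version.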
GLOSSARY OF TERMS
Authorized Access (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) is the management portal for individual-level data. This site can be used to submit a Data Access Request (DAR), manage access requests, and download approved datasets.
NCBI's Allele Frequency Aggregator (ALFA) pipeline computes allele frequencies for variants in dbGaP across approved unrestricted studies and provides the data as open-access to the public through dbSNP. Studies must be registered as GSR insensitive in order to be included in ALFA.
A dbGaP Collection is a virtual study under which other studies are grouped; it has no data of its own. A dbGaP Collection provides streamlined access to data across dbGaP studies or portions of dbGaP studies that share the same consent group, disease, or funding project. Data access for a collection is controlled by a single data access committee. The data in a collection is not harmonized across studies or otherwise altered from the original study. Investigators using data within a dbGaP collection are required to follow the use restrictions and acknowledgement instructions from the original dataset.
To search for dbGaP Collections, go to dbGaP Advanced Search.
To create a new dbGaP Collection, see Special Studies.
Study participant consents are determined by your institution's IRB. When filling out the Submission Certification, the consents should then be matched to the NIH Standard Data Use Limitation consent groups. Each person should belong to a single consent group. If a subject belongs to two or more consent groups in your study, pick the most stringent of those consent groups, so that each person belongs to a single consent group. The consent groups will then be registered in the dbGaP Submission System by your study's Genomic Program Administrator (GPA). If you are the study investigator, you can see the consent groups in the dbGaP Submission System. If you are a submitter, you can see the consent groups in the dbGaP Submission Portal for your study by clicking "View consent group" in the box on the upper right. dbGaP Authorized Access users request access to studies by consent group. For questions regarding the registered consent group and DUL, please contact your GPA.
See the NIH Guidance for Genomic Sharing Plan: https://sharing.nih.gov/genomic-data-sharing-policy/developing-genomic-data-sharing-plans
See NIH Standard Data Use Limitations: https://sharing.nih.gov/genomic-data-sharing-policy/institutional-certifications/completing-an-institutional-certification-form#step-5
A study should be designated with at least one NIH consent group title.
Consent Group Titles | Consent Group Abbreviations | Description |
---|---|---|
General Research Use | GRU | Use of the data is limited only by the terms of the Data Use Certification. |
Health/Medical/Biomedical | HMB | The dataset can only be used for studying health, medical or biomedical conditions, and does not include the study of population origins or ancestry. |
Disease-Specific (Disease/Trait/Exposure) | DS-xxx | The dataset can be used only for research on a specific disease or related condition. |
Additional modifiers can be added if applicable.
Consent Group Limitations | Consent Group Abbreviations | Description |
---|---|---|
IRB approval required | IRB | The requesting institution's IRB or equivalent body must approve the requested use. |
Publication required | PUB | The requestor must share their results with the larger scientific community. |
Collaboration required | COL | The requestor must provide a letter of collaboration with the primary study investigator(s). |
Not-for-profit use only | NPU | The dataset can only be used by not-for-profit organizations. State specifically if the data should not be made available to commercial organizations. |
Methods | MDS | The dataset can be used for methods research and development (e.g., development of statistical software or algorithms). |
Genetic studies only | GSO | The dataset can be used only for genetic studies. |
For example, a study might have two consent groups: 1) General Research Use with IRB approval and Not-for-profit use and 2) General Research Use. Therefore, a subset of the subjects would have the GRU-IRB-NPU designation, while the remaining subjects would be GRU. There should be no overlapping subjects between the two consent groups.
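The rule that consent groups must not share subjects can be checked before submission. Below is a minimal Python sketch, assuming the subject and consent columns have already been read into (subject ID, consent abbreviation) pairs; the function name and the sample rows are hypothetical and not part of any dbGaP tool.

```python
from collections import defaultdict


def subjects_in_multiple_groups(rows):
    """Return {subject_id: sorted consent groups} for subjects listed under more than one group.

    rows: iterable of (subject_id, consent_abbreviation) pairs.
    """
    groups = defaultdict(set)
    for subject_id, consent in rows:
        groups[subject_id].add(consent)
    return {s: sorted(g) for s, g in groups.items() if len(g) > 1}


rows = [
    ("S1", "GRU-IRB-NPU"),
    ("S2", "GRU"),
    ("S3", "GRU"),
    ("S3", "GRU-IRB-NPU"),  # overlap: S3 appears in both consent groups
]
overlaps = subjects_in_multiple_groups(rows)
print(overlaps)
```

An empty result means every subject belongs to exactly one consent group in the rows that were checked.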
Note: "Other" may be selected when it is definitive that no standardized consent group and modifier listed above can be used as the data use limitation of a study. "Other" is not an official designation and should not be used as the Consent Group Title or Abbreviation. The GPA and PI should determine a Consent Group Title and Abbreviation that best represents the data use limitation. Since Abbreviations are used in file names, and file names have character limits, please choose a concise Abbreviation.
What is the DAC? NIH Data Access Committees (DACs) review requests to access data in the Database of Genotypes and Phenotypes (dbGaP). See https://osp.od.nih.gov/scientific-sharing/data-access-request-dar-approvals-and-disapprovals-by-data-access-committee-dac/
DAC Chairs and Emails: https://osp.od.nih.gov/wp-content/uploads/NIH_DACs_Chairs.pdf
Genomic Data Sharing (GDS): https://sharing.nih.gov/genomic-data-sharing-policy
How to make a Data Access Request: https://sharing.nih.gov/accessing-data/accessing-genomic-data/how-to-request-and-access-datasets-from-dbgap
The Signing Official should confirm that the individual listed as the IT Director has a background in computer security, has institutional (and not just departmental) authority, can confirm that your institution has the capacity to protect shared data, and will comply with the NIH Genomic Data Sharing Policy.
dbGaP Data Access and Use Report: https://ncbi.nlm.nih.gov/projects/gap/cgi-bin/DataUseSummary.cgi
See Consents for examples.
Study Accession Number - Once the Study Data Outline (SDO) is completed, a study accession is assigned: phs######.v#.p#. The study accession is a unique, stable, and versioned identifier (ID) that can be used in publications. It is prefixed by "phs," indicating a phenotype study.
The version number (.v#) and participant set number (.p#) do not change during iterations within a release cycle. They change only following a release, after changes have been made to existing data or new data have been added. The study v# is always incremented, while the v# for each component is incremented only when there are changes to that specific component. The p# is incremented when subjects in an existing study set change consent status. The p# is never incremented when only new subjects are added and existing subjects have not changed consents.
Dataset Accession Number - Each phenotype table (SC, SSM, pedigree, subject phenotypes, and sample attributes) is assigned a pht######.v#.
Variable Accession Number - Each variable in a phenotype table (SC, SSM, pedigree, subject phenotypes, and sample attributes) is assigned a phv########.v#.
Document Accession Number - Each study document (e.g. protocols, questionnaires, manuals of procedures and operations) is assigned a phd######.#, where .# is the version number.
Molecular Data Accession Number - Each grouping of molecular data is assigned a phg######.v#.
Analysis Accession Number - Each analysis is assigned a pha#######.v#.
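The study accession pattern and the versioning rules described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the regular expression accepts any number of digits rather than enforcing the digit counts shown, the function names are hypothetical, and "phs999999" is a made-up accession used purely as an example.

```python
import re

# phs<digits>.v<version>.p<participant set>; digit counts are not enforced here.
STUDY_ACCESSION = re.compile(r"phs(\d+)\.v(\d+)\.p(\d+)")


def parse_study_accession(accession):
    """Split a study accession into (number string, version, participant set)."""
    match = STUDY_ACCESSION.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a study accession: {accession!r}")
    number, version, participant_set = match.groups()
    # Keep the number as a string so its zero padding is preserved.
    return number, int(version), int(participant_set)


def next_study_accession(accession, existing_subjects_changed_consent):
    """Accession for the next release: v# always goes up, p# only on consent changes."""
    number, version, participant_set = parse_study_accession(accession)
    if existing_subjects_changed_consent:
        participant_set += 1
    return f"phs{number}.v{version + 1}.p{participant_set}"


print(next_study_accession("phs999999.v2.p1", False))
print(next_study_accession("phs999999.v2.p1", True))
```

The first call models a new version that only adds subjects or data; the second models a new version in which existing subjects changed consent groups.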
Dummy IDs are IDs created by the submitter to fill in unknown mother and father IDs when establishing a sibling relationship in the pedigree file. It is important that the dummy ID for the mother and father ID be unique. It is assumed that the dummy mother ID and father ID are identical for full sibling pairs.
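To see why dummy parent IDs need to be unique, consider how sibships are inferred: subjects sharing the same mother ID and father ID are read as full siblings. The Python sketch below is a simplified illustration, assuming pedigree rows have been reduced to (subject ID, mother ID, father ID) tuples with empty strings for unrecorded parents. The function name, row layout, and IDs are hypothetical.

```python
from collections import defaultdict


def full_sibships(pedigree_rows):
    """Group subjects that share both mother and father ID.

    pedigree_rows: iterable of (subject_id, mother_id, father_id); parents may be dummy IDs.
    Returns a list of sorted subject-ID lists with two or more members.
    """
    by_parents = defaultdict(list)
    for subject_id, mother_id, father_id in pedigree_rows:
        if mother_id and father_id:
            by_parents[(mother_id, father_id)].append(subject_id)
    return [sorted(kids) for kids in by_parents.values() if len(kids) > 1]


rows = [
    ("S1", "DUMMY_M1", "DUMMY_F1"),
    ("S2", "DUMMY_M1", "DUMMY_F1"),  # full sibling of S1: same dummy parents
    ("S3", "DUMMY_M2", "DUMMY_F2"),  # unrelated subject, so different dummy IDs
    ("S4", "", ""),                  # parents not recorded
]
sibships = full_sibships(rows)
print(sibships)
```

If S3 had reused DUMMY_M1 and DUMMY_F1, this grouping would wrongly report S3 as a full sibling of S1 and S2.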
Dump files is the term used to describe the individual-level phenotype data (SC, SSM, pedigree, subject phenotypes, and sample attributes) generated and distributed through controlled access. Dump file names have the study accession (phs), table accession (pht), a short dataset name, and consent designations. Each file has variable accessions and dbGaP-assigned subject IDs and/or sample IDs in addition to data submitted. The SSM dataset dump file also has BioSample IDs.
External Data Source (EDS) is a non-dbGaP entity that is a public or private, national or international organization that is able to meet core NIH standards for establishing data quality and data management service protocols for NIH, based on the programmatic need of an NIH funding Institute or Center (IC). EDS are NIH Institute and Center Supported Repositories. Studies with data in the EDS will require credentialed users to apply for access to the data through dbGaP Authorized Access. For more information, see NIH's Scientific Data Sharing page: https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/repositories-for-sharing-scientific-data.
To create a dbGaP study with an EDS, see Special Studies.
Genomic Program Administrator (GPA)
GPAs work with investigators to facilitate study registration and data submission for controlled-access data repositories. Each NIH IC has designated one or more senior staff as GPAs to support genomic data sharing implementation activities. https://sharing.nih.gov/contacts-and-help#gds_support
https://osp.od.nih.gov/wp-content/uploads/What_are_Genomic_Summary_Results.pdf
GSR Sensitive - This is a designation made by the GPA in the Submission System when a study is registered. The full analysis results are available for download through the dbGaP Authorized Access System upon approval of the Data Access Request (DAR). The publicly available analysis results on the public FTP are truncated (top hits) and have potentially identifiable information (frequencies and direction of effect) redacted.
GSR Insensitive - This is a designation made by the GPA in the Submission System when a study is registered. Both full (unredacted) association results with frequency of alleles/direction of effect and top hits are available on the public FTP site.
GRAF (Genetic Relationship and Fingerprinting) is a C++ program that quickly finds closely related subjects using SNP genotype data. Access GRAF at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi
HIPAA - algorithm to detect HIPAA sensitive dates
- Two 1 or 2-digit numbers and a 2 or 4-digit number, in this order, separated by "/", "-" or ".", e.g., "3/5/1994" or "12-28-03".
- One 4-digit number and two 1 or 2-digit numbers separated by "/", "-" or ".", e.g., "1994.2.13".
- A 1 or 2-digit number and a 4-digit number starting with 19 or 20 separated by "/", e.g., "10/1994" (but not "10.1994").
- A 1 or 2-digit number followed by a "/" and a 2-digit number starting with 0, e.g., "3/04" (but not "3/94").
- A month name or abbreviation and a 1, 2, or 4-digit number, in either order, separated by some non-letter, non-number characters or not separated, e.g., "JAN '93", "FEB64", "May 3rd" (but not "May be 14").
- A 6-digit number is considered to be a potential date value if its first four digits make a valid date in mmdd format (i.e., first two digits read as month and second two as day of the month). For example, 112876 is considered to be a potential date value since 1128 is a valid date (Nov. 28) in mmdd format; neither 231208 nor 113198 is a potential date, since 23/12 and 11/31 are not valid dates in month/day format. If all of the values, or first 10 values, of a variable are 6-digit potential dates, this variable together with its potential date values will be reported by the scripts.
- An 8-digit number is considered to be a potential date value if it makes a valid date in the 20th or 21st century in either mmddyyyy or yyyymmdd format. For example, "19940822" is considered to be a potential date since it can be read as 1994/08/22 (Aug. 22, 1994). "10312005" is a potential date value since it can be read as 10/31/2005 (Oct. 31, 2005). "19080230" is not considered to be a potential date since neither 1908/02/30 nor 19/08/0230 is a valid date in the 20th or 21st century. If all of the values or the first 10 values of a variable are 8-digit numbers of potential date values, the variable will be reported as containing potential HIPAA violations.
In addition to date values, the QC scripts also report data values that look like social security numbers (e.g., "123-45-6789" or "123456789"), phone numbers (e.g., "321-456-7890" or "(301)456-7890"), zip codes (e.g., "MD 20892"), etc. A few cases of this kind of sensitive information have been detected by the QC scripts. However, other cases like names of people are not reported by the QC scripts. A few cases of names of patients and providers have been detected by visual inspection.
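Some of the date rules listed above can be approximated in a few lines of Python for a rough self-check before submission. This is a sketch of the described rules, not the dbGaP QC scripts themselves. Assumptions: only the two separator-based rules and the 6-digit and 8-digit rules are covered, February 29 is accepted for 6-digit values, "20th or 21st century" is taken as years 1900 through 2099, and all names are hypothetical.

```python
import re
from datetime import date

# Separator-based rules, e.g. "3/5/1994", "12-28-03", and "1994.2.13".
DAY_MONTH_YEAR = re.compile(r"(?<!\d)\d{1,2}[/.-]\d{1,2}[/.-](?:\d{4}|\d{2})(?!\d)")
YEAR_FIRST = re.compile(r"(?<!\d)\d{4}[/.-]\d{1,2}[/.-]\d{1,2}(?!\d)")

# Latest valid day per month; February allows 29 (assumption of this sketch).
MAX_DAY = {1: 31, 2: 29, 3: 31, 4: 30, 5: 31, 6: 30,
           7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}


def has_separated_date(value):
    """True if the text contains a date written with '/', '-' or '.' separators."""
    return bool(DAY_MONTH_YEAR.search(value) or YEAR_FIRST.search(value))


def is_six_digit_date(value):
    """True if a 6-digit value starts with a valid mmdd."""
    if not re.fullmatch(r"[0-9]{6}", value):
        return False
    month, day = int(value[:2]), int(value[2:4])
    return month in MAX_DAY and 1 <= day <= MAX_DAY[month]


def _valid_date(year, month, day):
    if not 1900 <= year <= 2099:  # assumed reading of "20th or 21st century"
        return False
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


def is_eight_digit_date(value):
    """True if an 8-digit value is a valid mmddyyyy or yyyymmdd date."""
    if not re.fullmatch(r"[0-9]{8}", value):
        return False
    as_mmddyyyy = _valid_date(int(value[4:]), int(value[:2]), int(value[2:4]))
    as_yyyymmdd = _valid_date(int(value[:4]), int(value[4:6]), int(value[6:]))
    return as_mmddyyyy or as_yyyymmdd


def column_is_flagged(values, looks_like_date):
    """Apply the 'all values, or first 10 values' rule to one variable's values."""
    head = list(values)[:10]
    return bool(head) and all(looks_like_date(v) for v in head)


for example in ["112876", "231208", "113198"]:
    print(example, is_six_digit_date(example))
for example in ["19940822", "10312005", "19080230"]:
    print(example, is_eight_digit_date(example))
```

The printed results follow the worked examples in the rules: 112876, 19940822, and 10312005 are flagged, while 231208, 113198, and 19080230 are not.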
Institutional Certification (Institution Cert)
https://sharing.nih.gov/genomic-data-sharing-policy/institutional-certifications
The IT Director is a person who has the institutional (and not just a department) authority and can confirm that your institution has the capacity to protect shared data, and will comply with NIH Genomic Data Sharing Policy. The IT Director should have a background in computer security and should not be the same person as the PI, any of the collaborators, the Signing Official, or the IRB review board. For example, your Chief Information Officer would be appropriate.
Login
The Login Guide for dbGaP PIs and Submitters for dbGaP Submission System and Submission Portal addresses login through the Authenticator App, NIH smart card, eRA Commons account, and other third-party accounts. There is an FAQ that includes common questions, including merging multiple accounts and webpage error messages.
Parent-Child Study / Umbrella Study and Substudies / Cohort
See Special Studies
The study investigator may designate an individual to be the PI Assistant in the dbGaP Submission System. This individual will have "manager" and "submitter" permissions in the dbGaP Submission Portal and will be the primary contact for dbGaP. This individual will be able to provide final approval for the study release.
A dbGaP Sample is defined as the ID of the final preps submitted to dbGaP by a genotyping center, runs from high throughput sequencing by a sequencing group, or data submitted to an NCBI resource, such as GEO or GenBank. A single subject may be mapped to multiple samples, but a single sample should not be mapped to multiple subjects unless the samples are pooled.* For example, if one subject (SUBJECT_ID) provided one sample, and that sample was processed to generate 2 sequencing runs or 1 sequencing and 1 genotyping array run, the data file would show two rows, both using the same subject ID, but having 2 unique sample IDs.
*Please inquire about pooled samples if applicable. This would only apply to pooled samples that belong to consented subjects. If the samples are pooled from controls that are publicly available, there is no need for marking the pooled samples, and a single sample ID may be assigned.
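The mapping rule above (one subject may have many samples, but a non-pooled sample should map to only one subject) can be checked on a draft SSM file. The following Python sketch assumes the file has been reduced to (subject ID, sample ID) pairs; the function name and the IDs are hypothetical, and pooled samples would need to be handled separately.

```python
from collections import defaultdict


def samples_with_multiple_subjects(ssm_rows):
    """Return {sample_id: sorted subject IDs} for samples mapped to more than one subject.

    ssm_rows: iterable of (subject_id, sample_id) pairs.
    """
    subjects = defaultdict(set)
    for subject_id, sample_id in ssm_rows:
        subjects[sample_id].add(subject_id)
    return {sample: sorted(ids) for sample, ids in subjects.items() if len(ids) > 1}


rows = [
    ("SUBJ1", "SUBJ1_SEQ"),    # one subject, two unique sample IDs: allowed
    ("SUBJ1", "SUBJ1_ARRAY"),
    ("SUBJ2", "SUBJ2_SEQ"),
    ("SUBJ3", "SUBJ2_SEQ"),    # same sample ID under a second subject: flagged
]
conflicts = samples_with_multiple_subjects(rows)
print(conflicts)
```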
Each sample should be submitted with a single, unique, de-identified sample ID. Sample IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the sample ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SAMPLE_ID in one file and SAMPLE_NAME in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP sample ID that will be included in the final dump files along with the submitted sample ID.
Study investigators can be granted "Investigators with streamlined access" in the Submission System where the study is registered. The system will automatically create a project for the submitting investigators in Authorized Access. This project will not be provisioned with a DAR, or require SO or DAC approval. It will also not expire.
The Study Data Outline (SDO) is filled out by the submitter in the Submission Portal and informs dbGaP curators what data is expected to be uploaded and released for the current study version. The SDO is based on the File Submission Checklist and File Applicability. The SDO must be filled out to obtain the study accession (phs#) or create a new version after release. The study accession can be used in publications. After the SDO is submitted, a submitter may begin to create/edit the study config and submit data. To edit the SDO, go to the box on the upper right and click on "Study data outline". A few of the common questions we have after reviewing the SDO are whether VCFs called from sequence data will be submitted and whether expression counts from RNASeq will be submitted.
Study Registration
Submission System (SS) Statuses
Incomplete - This status denotes that the registration in the SS is missing key items, such as names, emails, Institutional Certifications (ICs), consents, acknowledgements, etc.
Awaiting GPA's Approval - This status denotes that the registration in the SS has been filled out, but not yet approved by the GPA. When version 2 and later are created, the system is automatically set to "Awaiting GPA's Approval". The GPA can modify or accept existing entries. Changes to consents can significantly delay the study release.
Review by PI - This status denotes that the registration in the SS is awaiting review by the PI.
Completed by GPA - This status denotes that the registration in the SS has been completed by the GPA. For processing, the admin IC and consents must be finalized. Changes to the consents once processing begins can significantly delay the study release.
Deleted - This status denotes that the study was once registered in the SS but now no longer should be processed for release.
Release Postponed - For substudies that are registered but not to be released with the current version of Parent-Child studies, the status will be marked postponed. We strongly suggest not registering substudies ahead of time. Instead, register them when they can be released with the parent version in which they are registered.
Once the study is released, the SS will take on the Authorized Access Statuses below.
Study Components Statuses
In Queue - The study queue is ordered by submission date. For studies that resubmit, we will prioritize the first iteration. Thereafter, the study will be worked on in the order it has been submitted.
In Process - A curator is completing manual and automated QC checks, validating consents and data values for consistency, correcting errors, adding search features, and packaging files by consents.
Waiting on Submitter - A curator is waiting on submitters to submit data, correct reported issues, modify or submit additional data, or reconcile consents.
Preview - This status is used solely for the phenotype component to denote that the study is on the dbGaP preview site and needs to be reviewed by the submitters.
Completed - This status denotes that a component (phenotypes, molecular data, high throughput sequences) has passed manual and automated QC checks, been validated, and been split by consents.
Postponed - This status denotes that the study is completed and ready for release, but is postponed to accommodate publication schedules and other agreements between the funding agency and the study. No new data or modifications should be submitted during this time.
Authorized Access (AA) Statuses
Released - A released study is available on public pages, through Authorized Access, and public FTP sites. For more details, see: When and what will be released?
Withdrawn - A study may be withdrawn permanently from AA after study release. The study and its Data Access Requests (DARs) are not available, except for admins in some interfaces.
Suspended - A study may be suspended temporarily from AA after study release. The study is not available for PIs, but DAC members can still see existing DARs for this study and approve or reject them.
A Study Status Report (SSR) is used to track the progress of your study processing, and includes contact emails for your phenotype curator, genotype curator, Program Officer (PO), and GPA. There is a link to your study's SSR from the Submission System, Submission Portal, preview site instructions, and preview site.
Subject Sample Telemetry Report (SSTR)
The SSTR is a public study level report that displays loaded subject and sample IDs, consents, summary counts, processing status, and molecular and sequence sample uses. This report is populated at multiple time points when: 1) subject IDs, sample IDs, and consents are loaded, 2) BioSample assigns BioSample IDs, 3) molecular data or sequences are loaded. Note: Sequence data refers to all high throughput sequence data, while Molecular data is all other molecular data except for high throughput sequence data. Submitters are able to track when the sequence metadata has been accepted and see if there are errors with the submitted sequence data or if the sequence data is ready for release or is already public. Submitters should verify that the number of samples with molecular data and sequences matches what they expect the count to be when reviewing the preview site. Once a study is released, dbGaP Advanced Search can be used to select a specific study to access the SSTR link on the study report page.
Subject Sample Telemetry Report (SSTR) API
The SSTR APIs provide programmatic access to public summary and metadata level telemetry for subjects and samples submitted and processed for a dbGaP study. The API responses are in JSON format, and conform to the dbGaP study schema: https://www.ncbi.nlm.nih.gov/gap/sstr/schema/dbgap_study.v1.schema.json. The swagger pages can be found here: https://www.ncbi.nlm.nih.gov/gap/sstr/swagger/
A dbGaP Subject is defined as a single human person/individual/patient that arises from a single germline. Each subject should be submitted with a single, unique, de-identified subject ID. Subject IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the subject ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SUBJECT_ID in one file and INDIVIDUAL_ID in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP subject ID that will be included in the final dump files along with the submitted subject ID. Subjects that are known to be the same person across dbGaP studies will be assigned the same dbGaP subject ID.
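The ID formatting rules (allowed characters, no spaces, no zero padding on integer IDs) apply to both subject IDs and sample IDs and can be checked with a short script. This Python sketch is one possible reading of those rules, not an official validator; the function name and messages are hypothetical.

```python
import re

# English letters, Arabic numerals, period, hyphen, underscore, at symbol, pound sign.
ALLOWED_ID = re.compile(r"[A-Za-z0-9._@#-]+")


def id_problems(value):
    """Return a list of rule violations for one submitted subject or sample ID."""
    problems = []
    if not ALLOWED_ID.fullmatch(value):
        # Also catches spaces and empty IDs.
        problems.append("contains characters outside letters, digits, . - _ @ #")
    if value.isascii() and value.isdigit() and len(value) > 1 and value.startswith("0"):
        problems.append("integer ID is zero padded")
    return problems


for candidate in ["SUBJ_001.a", "42", "0042", "AB 12"]:
    print(repr(candidate), id_problems(candidate))
```

Note that zero padding is only reported for IDs made up entirely of digits; a string ID such as "SUBJ_001.a" passes because the rule is stated for integers.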
The Submission Portal (SP) link is https://submit.ncbi.nlm.nih.gov/dbgap/. Log in using the same email address that was used to accept the SP invitation. See SP Login Instructions: here. The SP is a secure way to upload and track study data to dbGaP. The Study Data Outline (SDO) tracks the type of data that will be submitted per study version. The files accepted in the SP are: Study Config, Subject Consents, Subject Sample Mapping, Pedigree, Subject Phenotypes, Sample Attributes, Documents: Phenotypes, Molecular Data¹, Sequence Metadata, Medical Images, Association Analyses (Genomic Summary Results), special files requested by the study curator, and Exchange Area files. Do not submit sequence data (BAM, CRAM, or FASTQ) through the SP. The SP can be accessed by submitters who have been sent an invitation to submit, and have accepted the invitation within 7 days. Initially, the study investigator and the PI submitter are sent invitations. Any person with a "Manager" role in the SP can add or remove submitters. The SP is not the same as the dbGaP Submission System (SS).
¹Sequence data (e.g. BAM, CRAM, FASTQ) should be submitted only after: 1) you have received an email with an attached sequence metadata file containing the registered subject and sample IDs, and consents. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. 2) The sequence metadata has been processed and you have received an email to upload sequences.
Submission System (SS) aka Registration System
The dbGaP Submission System (SS) is also known as the registration system. The link is https://dbgap.ncbi.nlm.nih.gov/dbgap/ss/dbgapss.cgi?login. The GPA works with the study investigator to determine the following: study principal investigator (PI), study project officer (PO), NIH administration and funding, target data delivery date, target public release date, release type, types of data submission expected, inclusion in CADA (Compilation of Aggregate Genomic Data - a collection of analyses across many dbGaP studies that can be accessed with a single Data Access Request), estimated study participants, SRA submission expected, and PI assistant for study submissions. The GPA will upload the Submission Certification, Institutional Certifications, and Data Use Certification, which specifies the Data Use Limitations (DUL). The DULs form the consent groups that will be used to parse the study data, and also determine which Data Access Requests (DAR) can be approved through dbGaP Authorized Access. BioProjects are created for each new study registered in the SS. The SS is only accessible by the GPA, PO, and PI. The SS is not the same as the dbGaP Submission Portal (SP). To make changes to the registration entry in the Submission System, contact your GPA. If you are a PI and have been given access, but have trouble logging in, see instructions: here.
Submission System (SS) Reference for GPAs
This guide provides an overview of the dbGaP Submission System (SS) and steps to register a study: https://www.ncbi.nlm.nih.gov/gap/docs/gpareference/
A dbGaP Variable is defined as the variable name and associated column of data in a phenotype table (SC, SSM, pedigree, subject phenotypes, and sample attributes). The variable's metadata, such as the variable name, description, units, type, and encoded values are defined in its respective phenotype Data Dictionary file. The variable accession is a phv########.v#.p#, where the version number (.v#) is incremented when changes occur to the data columns (phenotype values) following a release.
APPENDIX for Data Dictionary (DD) File Descriptions and Specifications
(*indicates required)
Column Headers | Description |
---|---|
VARNAME* | Variable name. The VARNAME must not contain backward slashes (\). Do not use "dbGaP" in the variable name. "dbGaP" is reserved for dbGaP generated items. |
VARDESC* | Variable description. The description should be understandable and enable users to replicate the variable. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" provides more context. Alternatively, study documents with detail are also acceptable. |
DOCFILE | Study document name associated with the variable. To list multiple documents, add a semicolon (;) between documents. Please list only study document filenames that are submitted to dbGaP. |
TYPE | Data value type: integer (1,2,3,4,…), encoded value (integers or strings are coded for non-numerical meaning, ex. 1=Control; 2=Case, see VALUES), decimal (0.5,2.5,…), string (African American, Asian, Caucasian, Hispanic, Non-Hispanic). For mixed values (any combination of string, integers, decimals and/or encoded values) in a single data column, list all types present. |
UNITS* | Units of measurement of variable |
MIN | The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value. |
MAX | The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value. |
RESOLUTION | Measurement resolution – the number of decimal places to which a measured value is presented in the data. For example, in 54.321 the resolution is 3. |
COMMENT1, COMMENT2 | Additional information not included in the VARDESC that will further define the variable. If additional comments are needed beyond COMMENT2, insert new columns (COMMENT3, COMMENT4, etc.) before the column "ORDER." |
VARIABLE_SOURCE | Source of controlled vocabularies. Ex. PhenX, MeSH, SNOMED, NCI. If there is no match, leave blank. (Must be submitted as a group with SOURCE_VARIABLE_ID and VARIABLE_MAPPING). |
SOURCE_VARIABLE_ID | A unique identifier from the VARIABLE_SOURCE or a unique text concept/term from various controlled vocabularies. (Must be submitted as a group with VARIABLE_SOURCE and VARIABLE_MAPPING). |
VARIABLE_MAPPING | For example, a variable from the source could be Identical, Related, or Comparable. (Must be submitted as a group with VARIABLE_SOURCE and SOURCE_VARIABLE_ID). |
UNIQUEKEY | Unique key is a combination of variables that is designed to uniquely identify a row in a longitudinal dataset or rows that have repeating SUBJECT_IDs or SAMPLE_IDs. Mark "X" for variables that constitute the unique keys, and leave other values blank. Ex. SUBJECT_ID and VISIT_NUMBER. UNIQUEKEYs can only be used in the subject phenotypes file and some cases of the sample attributes file. The SC, SSM, and pedigree files should never have UNIQUEKEYs marked, since there should be a unique identifier appearing once in each file. |
COLLINTERVAL | Collection interval is the time frame in which the data for the variable or dataset was collected. |
ORDER | The order in which VALUES appear on the variable summary report page. If VALUES of a single variable/column of data are integers or decimals, leave blank. If VALUES are encoded values, string, or mixed, define the order. VALUES can be ordered by Frequency (highest to lowest frequency of VALUES) or by List (user specifies order through placement in VALUES columns). For mixed values within a single variable/column of data, see examples: "age" and "weight" in example file 5b_SubjectPhenotypes_DD.xlsx. |
VALUES* | List of all unique values and/or descriptions of all encoded values, one value per cell. Encoded values are defined as a value and its meaning. For example, if a data file contains a variable named "EDUCATION" and its data values are "1, 2, 3, and 99," these coded values will need to be defined in the data dictionary. The format of an encoded value is VALUE=MEANING. Therefore, in the data dictionary, there should be 4 separate data cells filled out with the following: 1=Completed High School, 2=Completed College, 3=Completed Graduate School, 99=Unknown. The "VALUES" header must be the last column header (farthest right in the table). It should appear only in the column above the first encoded value that is listed. The remaining column header cells should be left blank. The script will identify the first code meanings and continue right until there are no more code meanings. For example, if the variable "SEX" has 3 encoded values: 1=Male, 2=Female and 3=Unknown, the column header "VALUES" will appear only above the cell that contains 1=Male. 1=Male, 2=Female and 3=Unknown will be listed in three separate cells next to each other. The header column cells above "2=Female" and "3=Unknown" should be left blank. |
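Some of the rules in the table above can be checked before submission. The following is a minimal sketch, not the validation that dbGaP curators or GaPTools run. It assumes each Data Dictionary row has already been read into a dictionary keyed by column header, and that the "dbGaP" restriction is case-insensitive. It checks only VARNAME, VARDESC, and the three controlled-vocabulary columns. The function name `check_dd_row` is hypothetical.

```python
from typing import Dict, List

# These three columns must be submitted together or not at all.
MAPPING_GROUP = ("VARIABLE_SOURCE", "SOURCE_VARIABLE_ID", "VARIABLE_MAPPING")


def check_dd_row(row: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one Data Dictionary row."""
    problems = []
    varname = row.get("VARNAME", "").strip()
    if not varname:
        problems.append("VARNAME is required")
    if "\\" in varname:
        problems.append("VARNAME must not contain a backslash")
    # Assumption: the reserved word is matched without regard to case.
    if "dbgap" in varname.lower():
        problems.append('VARNAME must not contain "dbGaP"')
    if not row.get("VARDESC", "").strip():
        problems.append("VARDESC is required")
    filled = [col for col in MAPPING_GROUP if row.get(col, "").strip()]
    if filled and len(filled) != len(MAPPING_GROUP):
        missing = [col for col in MAPPING_GROUP if col not in filled]
        problems.append("incomplete mapping group, missing: " + ", ".join(missing))
    return problems
```

An empty list means the row passed these particular checks; it does not mean the row is ready for submission.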
Example of VALUES:
Last column with header | Leave header blank | Leave header blank | Leave header blank |
---|---|---|---|
VALUES | | | |
10=Elementary | 20=High School | 40=College | 4=Graduate School |
1=2-4 drinks per day | 2=5-7 drinks per day | 3=>7 drinks per day | |
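The VALUES layout can be read back with a few lines of code. The following is a minimal sketch of one way to do it, not the script that dbGaP uses. It assumes the header row and a data row are already available as lists of cell strings, that reading stops at the first blank cell, and that a cell is split on its first "=" so that a meaning such as ">7 drinks per day" stays intact. Cells without "=" are kept as plain values. The names `parse_value_cell` and `read_values` are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_value_cell(cell: str) -> Tuple[str, Optional[str]]:
    """Split a VALUE=MEANING cell on its first '='; plain values get no meaning."""
    code, sep, meaning = cell.strip().partition("=")
    if not sep:
        return code, None
    return code.strip(), meaning.strip()


def read_values(header: List[str], row: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Collect cells from the VALUES column rightward until a blank cell."""
    start = header.index("VALUES")  # raises ValueError if the header is absent
    pairs = []
    for cell in row[start:]:
        if not cell.strip():
            break  # assumption: the first blank cell ends the list
        pairs.append(parse_value_cell(cell))
    return pairs
```

For the row in the example above, the third cell would be read as the code "3" with the meaning ">7 drinks per day", because only the first "=" is treated as the separator.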
Previous Updates
- Updated guidance and templates are available to link subject/sample IDs to samples in NCBI databases: GEO, GenBank, SRA. (January 2022)
- The Study Data Outline has replaced the Study Questionnaire in the Submission Portal to collect pertinent information for study processing and releases. (October 2021)
- High throughput sequence metadata should now be uploaded to the dbGaP Submission Portal under section "Sequence metadata" instead of through email. (May 2021)
- We are offering pre-validation tools (GaPTools) for you to check your data on your own system before submitting to dbGaP. (February 2021)
- The study config can now be filled out online in your study's Submission Portal. (October 2020)
- Automated Preprocessing Validation Checks are being run on all studies submitting PLINK or VCF files. This system will provide feedback within a few days of submission on ID errors and inconsistencies between PLINK, VCF, Subject Consent, SSM, and Pedigree datasets (DS). For active studies pre-dating this new system, curators will work with you to update your files so that this automated check can be run. (July 2019)
- Biological sex is required in the Subject Consent files in order to run the Automated Preprocessing Validation Checks. (July 2019)
- SAMPLE_USE is discontinued from the Subject Sample Mapping files. Please remove before submitting. (April 2018)