Human Variation Sets in VCF Format
There are two sets of VCF format files containing human variations:
- Human variations without clinical assertions that have been mapped to assemblies GRCh37 and GRCh38 are provided by dbSNP at their FTP repository ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/.
- Human variations with clinical assertions that have been mapped to assemblies GRCh37 and GRCh38 are provided by ClinVar at their FTP repository ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/. For descriptions, see ClinVar VCF Files.
Note that both repositories use VCF version 4.1.
dbSNP Files
Table 1 summarizes the files that are generated by dbSNP, with a brief overview of their content, the frequency of updates, and the location of the files for variations mapped to the most recent builds of GRCh37 and GRCh38. File names in Table 1 are linked to more detailed descriptions of each file.
What's new in VCF for dbSNP Build150 (April 2017 release)
b150 General Information and File Access
- Access human_9606_b150_GRCh37p13 and human_9606_b150_GRCh38p7 directly from ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/. Scroll down to human_9606 to see available builds.
- The definition of "common" is based on at least one population out of more than 26 major populations.
- Human b150 supports both the GRCh38p7 and GRCh37p13 assemblies, since the rs numbers in b150 are mapped to both GRCh38p7 and GRCh37p13.
b150 Updates to ClinVar VCF file sets
dbSNP RefSNP data now includes allele frequency data from 1000 Genomes and TOPMED populations. Below are the links to descriptions for the populations used to generate allele frequency data:
- 1000 Genomes Super Population: http://www.1000genomes.org/category/frequently-asked-questions/population. Note: The population super codes for the 1000 Genomes Super Population are as follows: EAS = East Asia; EUR = Europe; AFR = Africa; AMR = The Americas; SAS = South Asia.
dbSNP VCF files exclude:
- Variations listed as microsatellites
- Named variations (i.e. variations without sequence definition)
- Variations not mapped on assembled chromosomes of the reference genome (currently GRCh38), independently of the patch version.
- Variations mapped to more than one location on the reference genome (weight > 1).
Note: The common_no_known_medical_impact.vcf.gz file and the clinvar.vcf.gz file are not mutually exclusive because common variants asserted to be non-pathogenic and obtained through clinical channels appear in both the clinvar.vcf.gz file and the common_no_known_medical_impact.vcf.gz file. In other words, some records for non-pathogenic variations submitted through clinical channels may be common enough to be listed in the common_no_known_medical_impact.vcf.gz file.
Directory Contents
Data Organization Note:
When multiple alleles are present, the following organizational rules apply to all VCF files:
- The data for each INFO tag is presented in the order that the alleles appear in the variant entry. The data for the primary allele is shown first, followed by the data for each alternate allele in the same order that the alternate alleles are presented.
- The INFO tag values for each allele are separated by a semi-colon " ; "
- If a single INFO tag has multiple values for a particular allele, each value for that allele is separated by a comma " , "
When multiple alleles are present, the following organizational rule applies to all VCF files EXCEPT clinvar.vcf.gz:
- If a single INFO tag has no data available for a particular allele, then a dot " . " placeholder represents the value of the INFO tag for that allele so as to maintain data order.
When multiple alleles are present, the following organizational rule applies to the clinvar.vcf.gz file only:
- If a variant's INFO tag has no data available for a particular allele, the data will be omitted. Data order is maintained in the clinvar.vcf.gz file by the CLNALLE tag, since CLNALLE provides an ordered list of the alleles described by the clinical (CLN*) INFO tags that follow it. A user can match the ordered list of alleles in the CLNALLE tag to their corresponding clinical data in the other clinical (CLN*) INFO tags because these clinical data are listed in the same order that the CLNALLE tag lists the alleles. See the example in variation FAQ number 8 for more details about how the CLNALLE tag allows for data matching.
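The matching that CLNALLE enables can be sketched in Python. This is only an illustration under stated assumptions, not ClinVar's own code: it assumes that each CLNALLE entry is an integer index into the allele list (0 for REF, 1 for the first ALT, and so on), which this page does not spell out, and it takes values that have already been split into lists so that no separator is assumed. The function name `match_clinical_values` and the sample record are hypothetical.

```python
def match_clinical_values(ref, alts, clnalle, cln_values):
    """Pair each allele named by CLNALLE with its entry in another CLN* tag.

    Assumption (not stated in this document): CLNALLE entries are integer
    indices into [REF, ALT1, ALT2, ...], so 0 is REF and 1 is the first ALT.
    """
    if len(clnalle) != len(cln_values):
        raise ValueError("CLNALLE and the CLN* tag must have the same number of entries")
    alleles = [ref] + list(alts)
    matched = {}
    for index, value in zip(clnalle, cln_values):
        # Skip indices that do not point at a listed allele.
        if 0 <= index < len(alleles):
            matched[alleles[index]] = value
    return matched


# Hypothetical record: REF=G, ALT=A,C,T, with clinical data for two alleles only.
example = match_clinical_values("G", ["A", "C", "T"], [1, 3], ["5", "2"])
print(example)
```

Because alleles without data are simply omitted, the allele "C" in this made-up record has no entry in the result.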
The VCF Files in the Directory of Human Variation Sets Include the Following:
00-All.vcf.gz
This file is a comprehensive report of short human variations formatted in VCF. It does not include genotypes, population-specific allele frequencies, or any information regarding clinical significance. It also does not include microsatellites or named variations (i.e. variations without sequence definition).
File Updates: This file is updated once per build.
All_papu.vcf.gz
This file is an inventory of all human variations found in the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized/unplaced contigs (papu). We also release companion files for clinical data, named with a “papu” extension to support reporting on these additional, non-primary chromosome locations.
File Updates: This file is updated once per build.
common_all.vcf.gz
This file is an inventory of all "common" human variations that fall within the scope of VCF processing. The "common" category is based on germline origin and a minor allele frequency (MAF) of >=0.01 in at least one major population, with at least two unrelated individuals having the minor allele. This file may contain variations that happen to be both common and have evidence of medical interest. The definition of common may be based on any one of more than 26 major populations.
Note: The populations used to calculate allele frequency may not include the population you are studying.
Important: an allele shown to be "common" in one of the 26 major populations used for this directory may not be common in all populations.
File Updates: This file is updated once per build.
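The "common" test described above can be expressed as a short check. The sketch below is an illustration only, not dbSNP's pipeline: the `PopulationObservation` record, its field names, and the population labels are hypothetical, and it assumes that the per-population MAF and the count of unrelated carriers have already been computed elsewhere.

```python
from dataclasses import dataclass

MAF_THRESHOLD = 0.01  # MAF cutoff from the "common" definition
MIN_CARRIERS = 2      # unrelated individuals that must carry the minor allele


@dataclass
class PopulationObservation:
    population: str          # hypothetical population label
    maf: float               # minor allele frequency in this population
    unrelated_carriers: int  # unrelated individuals with the minor allele


def is_common(germline, observations):
    """Return True if a germline variant passes both tests in at least one population."""
    if not germline:
        return False
    return any(
        obs.maf >= MAF_THRESHOLD and obs.unrelated_carriers >= MIN_CARRIERS
        for obs in observations
    )


observations = [
    PopulationObservation("POP_A", 0.002, 1),
    PopulationObservation("POP_B", 0.015, 3),
]
print(is_common(True, observations))   # POP_B passes both tests
print(is_common(False, observations))  # not of germline origin
```

A single qualifying population is enough, which is why a variant flagged as common here may still be rare in the population you are studying.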
common_and_clinical.vcf.gz
This file is an inventory of all "common" germline variations that have evidence of medical interest. To create this inventory, only those records in common_all.vcf.gz whose CLINSIG values are 4 and greater (records of possible medical interest) are reported in common_and_clinical.vcf.gz. The resulting file, therefore, contains records that are both common and may have medical interest.
Note: A small percentage of records in "common_and_clinical.vcf.gz" are marked as "suspect". We suspect these records to be false positives, due either to the presence of a paralogous sequence in the genome or to evidence suggesting sequencing errors or computational artifacts. You can find these records by looking for the "SSR" tag in the information header of the VCF file.
Important: an allele shown to be "common" in one of the 26 major populations used for this directory may not be common in all populations.
File Updates: This file is updated weekly.
clinvar.vcf.gz
This file contains variations submitted through clinical channels. The variations contained in this file are therefore a mixture of variations asserted to be pathogenic as well as those known to be non-pathogenic (see Note below). The user should note that any variant may have different assertions regarding clinical significance and that this file will contain only those that are the most "pathogenic".
File Updates: This file is updated once per build.
clinvar_papu.vcf.gz
This file is an inventory of all human clinical variations found in the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized/unplaced contigs (papu).
File Updates: This file is updated once per build.
Variation records included in this file fall into the following clinical significance (CLINSIG) categories:
1. Variations with this value for clinical significance are in this file only if the allele frequency is greater than the stated threshold.
2. Variations for which there is not yet an enumerated clinical significance class. These variations are grouped in a clinical significance class called "other", which includes:
- Variations that are found only in somatic cells, with or without a known trait or phenotype. If a variant's source is not asserted during submission, we assume that the source of the variant is germline. Variants submitted with the clinical phrase (clinic_phrase) tag set to "cancer" are reported as somatic.
- Somatic or germline variations that are disease risk factors
- Somatic or germline variations that protect against a disease state (protective variants)
Note: The "common_no_known_medical_impact.vcf.gz" file and the "clinvar.vcf.gz" file are not mutually exclusive since some variants asserted to be non-pathogenic that were obtained through clinical channels appear in both the clinvar.vcf.gz file and the common_no_known_medical_impact.vcf.gz file. In other words, non-pathogenic variations submitted through clinical channels are marked as non-pathogenic and may have allele frequencies consistent with being reported as common.
Note: A small percentage of records in clinvar.vcf.gz are marked as "suspect" because they are suspected to be false positives, due either to the presence of a paralogous sequence in the genome or to evidence suggesting sequencing errors or computational artifacts. You can find these records by looking for the "SSR" tag in the information header of the VCF file.
File Updates: This file is updated weekly.
common_no_known_medical_impact.vcf.gz
This file is an inventory of all "common" germline variations that fall within the scope of VCF processing. To create this inventory, variation records of probable medical interest (records in clinvar.vcf.gz with CLINSIG values of 4 and above) are removed from the "common" variation records file (common_all.vcf.gz).
The file common_no_known_medical_impact.vcf.gz was created to provide users with an up-to-date report of common alleles not known to cause clinical phenotypes. This file can be used to subtract variants (filter) from a set of variant calls, thereby narrowing the list of variations that might warrant further evaluation for clinical significance. Should you wish to filter polymorphisms out of your whole genome/exome sequencing results, use the "common_no_known_medical_impact" file.
The "common_no_known_medical_impact.vcf.gz" file and the "clinvar.vcf.gz" file are not mutually exclusive because some variants asserted to be non-pathogenic that were obtained through clinical channels appear in both the "clinvar.vcf.gz" file and the "common_no_known_medical_impact.vcf.gz" file. Records for non-pathogenic variations that were submitted through clinical channels are marked as non-pathogenic and have allele frequencies consistent with a non-pathogenic status.
File Updates: This file is updated weekly.
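The subtraction use case described for common_no_known_medical_impact.vcf.gz can be sketched as follows. This is a simplified illustration rather than a recommended pipeline: it matches variants on exact (CHROM, POS, REF, ALT) values, so it assumes that both inputs use the same assembly, the same chromosome naming, and the same allele representation. The function names and the sample records are hypothetical.

```python
def variant_keys(lines):
    """Yield (CHROM, POS, REF, ALT) for every alternate allele in VCF data lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            yield chrom, int(pos), ref, allele


def subtract_common(call_lines, common_lines):
    """Return the call records that have at least one allele absent from the common set."""
    common = set(variant_keys(common_lines))
    kept = []
    for line in call_lines:
        keys = list(variant_keys([line]))
        if keys and not all(key in common for key in keys):
            kept.append(line)
    return kept


# Made-up records standing in for the common file and for a set of variant calls.
common_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\trs1\tA\tG\t.\t.\t.\n",
]
calls_vcf = [
    "1\t1000\t.\tA\tG\t50\tPASS\t.\n",
    "1\t2000\t.\tC\tT\t60\tPASS\t.\n",
]
remaining = subtract_common(calls_vcf, common_vcf)
print(remaining)
```

For real files, the line lists could be replaced by file objects opened with `gzip.open(path, "rt")` from the Python standard library. Header lines in the call set are dropped by this sketch, so they would need to be copied separately if the output must be a valid VCF.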
File updates
dbSNP files are updated for every build (approximately once a quarter) or are updated weekly.
Older versions of the "common_no_known_medical_impact.vcf.gz" and "clinvar.vcf.gz" files have the date in "yyyymmdd" format appended to the end of the file name, while a symlink with "-latest" at the end of its file name points to the most recent version.
Older versions of the "common_and_clinical.vcf.gz" file have the date in "yyyymmdd" format appended to the end of the file name, while the "common_and_clinical" file without the appended date points to the most recent version of the file.
Since updates to these files will capture changes to variations submitted to NCBI through clinical channels, users should verify that they are using the most recent version.
General Notes for All Files
- Minor Allele Frequency (MAF) is the allele frequency for the second most frequently seen allele. For example, consider a variation with alleles and allele frequencies as follows:
  Reference Allele = G; frequency = 0.600
  Alternate Allele = C; frequency = 0.399
  Alternate Allele = T; frequency = 0.001
  Based on the MAF guideline mentioned above, the minor allele is "C", so the minor allele frequency (MAF) is 0.399. Allele "T" with frequency 0.001 is considered a rare allele rather than a minor allele.
- The Minor Allele Frequency (MAF) for 1000 Genomes populations was calculated using genotype data from the 1000 Genomes Project [phase 1] total population of 2504 individuals, and is called the "1000G MAF" or "GMAF". You can find the data used for this computation in: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/
- The criteria used to consider a variation "common":
  - Variant has germline origin.
  - Variant has a minor allele frequency (MAF) of >=0.01 in at least one major population, with at least two unrelated individuals having the minor allele.
  - MAF was computed with founder genotypes only. That is, if a variant's minor allele was observed only in a parent and its child, the variant is not considered "common".
- The *.tbi files in the ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF directory are created with Tabix for use with SAMtools. See details at: https://samtools.sourceforge.net/. The command options for Tabix are located at: https://samtools.sourceforge.net/tabix.shtml
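The MAF definition in the first note above (the frequency of the second most frequently seen allele) can be illustrated with a few lines of Python. This is a minimal sketch using the G/C/T frequencies from that note; the function name `minor_allele` is hypothetical.

```python
def minor_allele(frequencies):
    """Return (allele, frequency) for the second most frequently seen allele."""
    if len(frequencies) < 2:
        raise ValueError("need at least two alleles to define a minor allele")
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return ranked[1]


allele, maf = minor_allele({"G": 0.600, "C": 0.399, "T": 0.001})
print(allele, maf)  # C 0.399
```

Ranking by frequency rather than taking the smallest value is what keeps the rare allele "T" from being reported as the minor allele.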
VCF File Headers for "common_all.vcf.gz" and "common_no_known_medical_impact.vcf.gz"
The VCF headers for the "common_all.vcf.gz", "common_no_known_medical_impact.vcf.gz" and "clinvar_yyyymmdd" files are similar to the standard VCF header, but contain the following:
- The INFO tags: dbSNP_POP_IDS, dbSNP_LOC_POP_IDS, and dbSNP_POP_HANDLES all contain information that allows you to retrieve the submitter ID and the local population ID for the population in question. Each of these tags has the following comma separated fields (in order): numeric SNP population IDs; local population ID; submitter identifier (handle). Since the order of these fields is consistent for each tag, you can use the position of the values relative to each other to determine the local population ID and submitter handle for a particular population ID.
- The “Info” column contains an additional field called “POPFREQ”. This field contains the frequency information for each population ID. The frequency information provided in this field follows the format: pid(ns/na):f(c1/c2)[|f(c1/c2)], where:
  pid = population ID
  na = the number of chromosomes in which alleles were observed. This is usually the sample count multiplied by 2 (one for each chromosome).
  ns = the number of samples
  f = the minor allele frequency (MAF). This is actually the frequency of the alternate (ALT) allele for the second most frequently seen allele. If f > 0.5, then the genome allele is the minor allele. An example is rs3091274 on chr 1, where all frequencies are > 0.5.
  c1 = the number of occurrences of the minor allele for the population. For samples where the minor allele is homozygous, the number of occurrences is 2; for heterozygous samples, the number of occurrences is 1; otherwise, the number of occurrences is 0. Example: if the population contains 3 heterozygous samples that have the allele, 1 homozygous sample that has the allele, and 12 samples that don't have the allele, then c1 = 5 (3 alleles + 2 alleles + 0 alleles).
  c2 = sample count for the allele (c1 minus homozygous count). Example: if the population contains 3 heterozygous samples that have the allele, 1 homozygous sample that has the allele, and 12 samples that don't have the allele, so that c1 = 5 (see c1 example above), then c2 = 4 (that is, 5 - 1), since you subtract the homozygous count (in this case 1) from the occurrence count (c1) for the allele.
  f(c1/c2) represents the f, c1 and c2 values for the first alternate allele listed in the ALT column.
  [|f(c1/c2)] represents additional instances of f(c1/c2) that may follow a vertical bar. These additional instances of f(c1/c2) provide the f, c1 and c2 values for alternate alleles when there is more than one alternate allele listed in the ALT column.

  Example 1: In this example, we will use contents of the AAM_GENO_PANEL.vcf.gz file. A description and composition of the population in this file is available at https://www.ncbi.nlm.nih.gov/SNP/snp_viewTable.cgi?pop=4446.
  The POPFREQ value for rs12121577 given in the AAM_GENO_PANEL.vcf file is:
  POPFREQ=248:124:0.0887096774193548(22/22)
  The reference allele given in AAM_GENO_PANEL.vcf for rs12121577 is "C" and the alternate allele is "G". Remember, the values for f, c1 and c2 are those for the alternate (and in this case, minor) allele "G". You can see these features in a sample of the AAM_GENO_PANEL.vcf file available on the web-based poly_clin_readme page.
  Using the POPFREQ format statement na:ns:f(c1/c2)[|f(c1/c2)], the value of each variable in the POPFREQ statement for rs12121577 is the following:
  na = 248
  ns = 124
  f = 0.0887096774193548 (the refSNP page for rs12121577 shows the MAF as .0891 in the Population Diversity section)
  c1 = 22
  c2 = 22
  You can find the frequency of the reference allele ("C") by subtracting the frequency of the alternate allele (0.0887096774193548) from 1:
  1 - 0.0887096774193548 = 0.911290323

  Example 2: In this example, we again use contents of the AAM_GENO_PANEL.vcf.gz file, but examine rs11121815 instead.
  The POPFREQ value for rs11121815 given in the AAM_GENO_PANEL.vcf file is:
  POPFREQ=248:124:0(0/)|0.491935483870968(122/92)|0(0/)
  The reference allele given in AAM_GENO_PANEL.vcf for rs11121815 is "T" and the alternate alleles are "A", "C" and "G". You can see these features in a sample of the AAM_GENO_PANEL.vcf file available on the web-based poly_clin_readme page.
  Using the POPFREQ format statement na:ns:f(c1/c2)[|f(c1/c2)], you can determine the value of each variable for each alternate allele in the POPFREQ statement for rs11121815:
  na = 248
  ns = 124
  For alternate allele "A": f = 0, c1 = 0, c2 = 0
  For alternate allele "C": f = 0.491935483870968 (the refSNP page for rs11121815 shows the MAF as .492 in the Population Diversity section), c1 = 122, c2 = 92
  For alternate allele "G": f = 0, c1 = 0, c2 = 0
  You can find the frequency of the reference allele ("T") by subtracting the MAF (0.491935483870968) from 1:
  1 - 0.491935483870968 = 0.508064516
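The two worked examples can be reproduced with a small parser. The sketch below is an illustration under stated assumptions: it handles only the na:ns:f(c1/c2)[|f(c1/c2)] form seen in the examples (not a leading population ID), it reads an empty count such as the one in "0(0/)" as 0, and the function name `parse_popfreq` and the returned dictionary layout are hypothetical.

```python
import re

# One allele entry such as "0.49(122/92)" or "0(0/)".
ALLELE_RE = re.compile(r"^([0-9.eE+-]+)\((\d*)/(\d*)\)$")


def parse_popfreq(value):
    """Parse a POPFREQ value of the form na:ns:f(c1/c2)[|f(c1/c2)]."""
    if value.startswith("POPFREQ="):
        value = value[len("POPFREQ="):]
    na, ns, per_allele = value.split(":", 2)
    alleles = []
    for chunk in per_allele.split("|"):
        match = ALLELE_RE.match(chunk)
        if match is None:
            raise ValueError(f"unrecognized allele entry: {chunk!r}")
        f, c1, c2 = match.groups()
        alleles.append({
            "f": float(f),
            "c1": int(c1) if c1 else 0,  # assumption: an empty count means 0
            "c2": int(c2) if c2 else 0,
        })
    return {"na": int(na), "ns": int(ns), "alleles": alleles}


parsed = parse_popfreq("POPFREQ=248:124:0(0/)|0.491935483870968(122/92)|0(0/)")
for alt, entry in zip(["A", "C", "G"], parsed["alleles"]):
    print(alt, entry)

# The frequency of the reference allele, following Example 2.
ref_frequency = 1 - parsed["alleles"][1]["f"]
print(round(ref_frequency, 9))
```

The entries are returned in the order of the ALT column, so pairing them with the alternate alleles is a simple `zip`, as in the loop above.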
Disclaimer: Assertions about the phenotypic effects of variants are provided by multiple sources, may have different levels of experimental support, and may be in conflict. NCBI does not independently verify assertions and cannot endorse their accuracy. Information obtained through this resource is not a substitute for professional genetic counseling and is not intended for use as the basis for medical decision making.
Please contact [email protected] if you have any questions or comments.