RefSeq announcements in 2013:
- January 14, 2013: Announcing RefSeq Release 57
- March 14, 2013: Announcing RefSeq Release 58
- May 2, 2013: Announcing RefSeq Release 59
- June 10, 2013: Announcing non-redundant bacterial proteins (WP_ accessions)
- July 25, 2013: Announcing RefSeq Release 60
- September 17, 2013: Announcing RefSeq Release 61
- November 1, 2013: Announcing change in bacterial taxonomy management
- November 14, 2013: Announcing RefSeq Release 62
January 14, 2013: Announcing RefSeq Release 57
This full release incorporates genomic, transcript, and protein data available as of January 8, 2013 and includes 34,169,407 records, 27,845,459 proteins, 3,267,605 RNAs, and sequences from 21,415 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] Variation annotation: This update reflects SNP Build 137, and includes incremental updates for some human records.
[2] The bacterial RefSeq collection expansion is ongoing. This adds more microbial genomes that represent complete or draft assemblies from novel isolates and clinical and population samples. As part of this expansion, bacterial RefSeq genomes will be re-annotated to increase consistency across this dataset. Annotation updates are expected to start in the first quarter of 2013.
March 14, 2013: Announcing RefSeq Release 58
This full release incorporates genomic, transcript, and protein data available as of March 11, 2013 and includes 36,938,203 records, 30,489,893 proteins, 3,345,543 RNAs, and sequences from 22,460 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] Variation annotation: This update reflects SNP Build 137, and includes incremental updates for some human records.
[2] New accessions with the prefix 'NS_' will no longer be issued. This prefix was used for sequence records that were a collection of unordered and unoriented contigs, not a real biological molecule. A single record with this prefix type is retained per a prior agreement with FlyBase.
May 2, 2013: Announcing RefSeq Release 59
This full release incorporates genomic, transcript, and protein data available as of April 29, 2013 and includes 39,040,745 records, 31,593,499 proteins, 3,579,371 RNAs, and sequences from 24,656 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] Variation annotation: This update reflects SNP Build 137, and includes incremental updates for some human records.
June 10, 2013: Announcing non-redundant bacterial proteins (WP_ accessions)
Announcing the addition of a new protein data model and accession prefix (WP_) for bacterial protein records. Additional information is available here: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/announcements/WP-proteins-06.10.2013.pdf
July 25, 2013: Announcing RefSeq Release 60
The July 2013 RefSeq FTP release marks the 10th anniversary of RefSeq comprehensive FTP releases.
This full release incorporates genomic, transcript, and protein data available as of July 19, 2013 and includes 40,913,699 records, 32,504,738 proteins, 4,243,209 RNAs, and sequences from 28,560 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] New /announcements/ directory area: A directory has been added for supplemental documentation of larger announcements that are posted with a release or between releases.
/refseq/release/announcements/
Announcement file names include a date (mm.dd.yyyy). Announcements older than 6 months will be moved to /announcements/archive/
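The naming convention above can be sketched in Python. This is only an illustration: the function names are hypothetical, the date is assumed to sit immediately before the file extension (as in WP-proteins-06.10.2013.pdf), and "6 months" is approximated as 183 days, which the announcement does not specify.

```python
import re
from datetime import date
from typing import Optional

# Assumption: the mm.dd.yyyy date sits immediately before the file extension,
# as in "WP-proteins-06.10.2013.pdf".
_DATE_RE = re.compile(r"(\d{2})\.(\d{2})\.(\d{4})(?=\.[A-Za-z0-9]+$)")


def announcement_date(filename: str) -> Optional[date]:
    """Return the date embedded in an announcement file name, or None."""
    match = _DATE_RE.search(filename)
    if match is None:
        return None
    month, day, year = (int(group) for group in match.groups())
    return date(year, month, day)


def is_archive_candidate(filename: str, today: date, max_age_days: int = 183) -> bool:
    """True if the announcement is older than roughly six months.

    Assumption: six months is approximated as 183 days.
    """
    posted = announcement_date(filename)
    return posted is not None and (today - posted).days > max_age_days


if __name__ == "__main__":
    name = "WP-proteins-06.10.2013.pdf"
    print(announcement_date(name))
    print(is_archive_candidate(name, date(2014, 1, 15)))
```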
[2] Variation annotation: This update includes variation annotations from the dbSNP Build 138 release for the following organisms:
human (tax_id 9606), for assemblies GRCh37.p10, HuRef, CHM1_1.0 (new), and CRA_TCAGchr7v2
chicken (tax_id 9031), for assembly Gallus_gallus_4.0
pig (tax_id 9823), for assembly Sscrofa10.2
rat (tax_id 10116), for assembly Rnor_5.0
zebrafish (tax_id 7955), for assembly Zv9
cow (tax_id 9913), for assemblies Bos_taurus_UMD_3.1 and Btau_4.6.1
dbSNP human build 138 includes new submissions based on the ESP6500 data release from NHLBI GO-ESP; please see the BioProject record PRJNA165957 for more information on this initiative: http://www.ncbi.nlm.nih.gov/bioproject/165957.
[3] Bacterial genomes, new protein data model and accession series (WP): NCBI continues to expand the RefSeq bacterial genomes node to include ALL complete and draft genomes that meet minimum assembly and annotation quality criteria. This means that RefSeq will include more than one genome of the same strain which may be provided through strain population sampling or sequencing to monitor a disease outbreak. NCBI is in the process of re-annotating all bacterial genomes, with the exception of a small number for which annotation is provided by, or in collaboration with, another group (such as E. coli str. K12 substr. MG1655).
Please note that some RefSeq bacterial genomes were recently suppressed. This included two categories:
- Unannotated genomes that had not been processed by NCBI's annotation pipeline yet
- Annotated genomes with identified annotation quality issues
This results in a net decrease in RefSeq bacterial genomic accessions in this release. Many of the suppressed accessions will be reinstated when annotation is provided.
Due to the expanded scope of the RefSeq bacterial node, we anticipated a very large increase in the number of identical (redundant) proteins; therefore, we have introduced a new data model for bacterial proteins whereby we are providing a true non-redundant protein dataset associated with a new accession prefix, 'WP'. Details about the new data model, with examples, were announced between release cycles. More information is available here:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/announcements/WP-proteins-06.10.2013.pdf
This release includes a new supplemental file providing mapping of WP accessions to tax_id and species name, for the subset of WP accessions that are annotated on genomes of different species (for example, WP_000002243.1).
release60.multispecies_WP_accession_to_taxname.txt
[4] Annotation of human and vertebrate transcript records: Recent changes to human and other vertebrate transcript records include:
- removal of exon numbers
- expanded reporting of support evidence, in a structured comment with the header 'Evidence Data'
- (new) reporting gene and transcript attributes, in a structured comment with the header 'RefSeq Attributes'
- removal of mitochondrial localization information from the record DEFINITION line (moved to Attributes)
For a detailed description of these changes, please see:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/announcements/AnnouncingRefSeq-vertebrate-evidence&attributes-07.24.2013.pdf
[5] New RefSeq Category comment:
A subset of RefSeq bacterial genomes are considered to be established Reference genomes (e.g., E. coli K12), or good representative genomes for the species. A new comment identifies these bacterial genomes as belonging to the category 'Reference Genome' or 'Representative Genome'. The comment also explains why the genome is tracked in that category.
For example, see NC_000913.2:
RefSeq Category: Reference Genome
COM: Community selected
PRT: Proteomics
UPR: UniProt Genome
The comment lists one or more reasons why the genome is tracked in that category, including:
FGS: First Genome sequenced
PRT: Proteomics
UPR: UniProt genomes
QfO: species selected by the "Quest For Orthologs" group
PHY: representative member at a phylogenetically interesting position
MOD: Model organism
CCA: Community Consortium Annotation
CALC: Calculated
COM: Community selected reference
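As a rough illustration, the reason codes listed above can be held in a lookup table and matched against a comment block. The parser below is a minimal sketch with a hypothetical function name; it assumes the comment is laid out as one "KEY: value" pair per line, as in the NC_000913.2 example, which may not hold for every record.

```python
from typing import List, Optional, Tuple

# Reason codes as listed in this announcement.
REASON_CODES = {
    "FGS": "First Genome sequenced",
    "PRT": "Proteomics",
    "UPR": "UniProt genomes",
    "QfO": 'species selected by the "Quest For Orthologs" group',
    "PHY": "representative member at a phylogenetically interesting position",
    "MOD": "Model organism",
    "CCA": "Community Consortium Annotation",
    "CALC": "Calculated",
    "COM": "Community selected reference",
}


def parse_refseq_category(comment: str) -> Tuple[Optional[str], List[str]]:
    """Return (category, reason codes) found in a comment block.

    Assumption: one "KEY: value" pair per line.
    """
    category = None
    reasons = []
    for raw_line in comment.splitlines():
        key, sep, value = raw_line.strip().partition(":")
        if not sep:
            continue  # not a "KEY: value" line
        key = key.strip()
        if key == "RefSeq Category":
            category = value.strip()
        elif key in REASON_CODES:
            reasons.append(key)
    return category, reasons


if __name__ == "__main__":
    example = (
        "RefSeq Category: Reference Genome\n"
        "COM: Community selected\n"
        "PRT: Proteomics\n"
        "UPR: UniProt Genome\n"
    )
    category, codes = parse_refseq_category(example)
    print(category, [(code, REASON_CODES[code]) for code in codes])
```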
[6] Change to allow a mixture of known and model accessions for eukaryotic genes:
NCBI calculates genome annotation for many RefSeq eukaryotic genomes and the resulting model transcript and protein sequences are tracked with XM/XR/XP accession prefixes.
We also provide transcript and protein sequence records based on automatic processing of GenBank transcript records and manual sequence analysis; these 'known' RefSeq records are tracked with NM/NR/NG/NP accession prefixes. Previously, we did not allow a mixture of 'model' and 'known' records; thus curation processing would result in removal of X* series accessions and replacement by N* series accessions.
We have changed this policy in order to provide increased annotation of splice variants. RefSeq models are calculated using cDNA, protein, and RNAseq data. There may be good support at the level of each exon pair; however, the long range exon combination represented in the model may not be fully supported. For example, see Gene ID: 100306968.
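The prefix convention described in this item can be sketched as a small classifier. The function names are hypothetical, and the NM_/XM_/XR_ accessions in the usage example are made-up placeholders rather than real records; only the prefix sets come from the text above.

```python
from typing import Iterable

# Prefix sets taken from the description in this announcement.
KNOWN_PREFIXES = {"NM", "NR", "NG", "NP"}
MODEL_PREFIXES = {"XM", "XR", "XP"}


def accession_class(accession: str) -> str:
    """Classify a RefSeq accession as 'known', 'model', or 'other' by prefix."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return "other"  # no underscore, so not in RefSeq prefix format
    prefix = prefix.upper()
    if prefix in KNOWN_PREFIXES:
        return "known"
    if prefix in MODEL_PREFIXES:
        return "model"
    return "other"


def has_known_and_model(accessions: Iterable[str]) -> bool:
    """True if a gene's accessions mix 'known' and 'model' records."""
    classes = {accession_class(acc) for acc in accessions}
    return {"known", "model"} <= classes


if __name__ == "__main__":
    # Placeholder accessions for illustration only.
    gene_records = ["NM_012345.1", "XM_012345.1", "XR_012345.1"]
    print([accession_class(acc) for acc in gene_records])
    print(has_known_and_model(gene_records))
```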
[7] Keyword: The keyword 'RefSeq' has been added to all RefSeq records.
[8] Packaging of WGS master records: As previously announced, instead of providing a single file for each WGS master accession, we are providing a file (per node) that contains all of the WGS master records.
Therefore, files with names like 'microbialNZ_AAEQ.mstr.gbff.gz' will be replaced by a single file named 'microbial.wgs_mstr.gbff.gz'.
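For scripts that still reference the old per-accession files, the renaming can be sketched as below. The pattern is generalised from the single example given here and is an assumption: it presumes a lowercase node name followed by 'NZ_' and a four-letter WGS prefix. The function name is hypothetical.

```python
import re
from typing import Optional

# Assumption: old names look like "<node>NZ_<4 letters>.mstr.gbff.gz",
# generalised from the one example 'microbialNZ_AAEQ.mstr.gbff.gz'.
_OLD_WGS_MASTER = re.compile(r"([a-z_]+)NZ_[A-Z]{4}\.mstr\.gbff\.gz")


def combined_wgs_master_name(old_name: str) -> Optional[str]:
    """Map an old per-accession WGS master file name to the per-node file name.

    Returns None if the name does not match the assumed old pattern.
    """
    match = _OLD_WGS_MASTER.fullmatch(old_name)
    if match is None:
        return None
    node = match.group(1)
    return f"{node}.wgs_mstr.gbff.gz"


if __name__ == "__main__":
    print(combined_wgs_master_name("microbialNZ_AAEQ.mstr.gbff.gz"))
```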
September 17, 2013: Announcing RefSeq Release 61
This full release incorporates genomic, transcript, and protein data available as of September 9, 2013 and includes 41,959,081 records, 33,139,114 proteins, 4,528,216 RNAs, and sequences from 29,414 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] This update reflects SNP Build 138. A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/b138_20130913.rpt
[2] This release includes updated annotation of human assemblies GRCh37.p13, HuRef, CHM1_1.1, and Toronto chromosome 7 CRA_TCAGchr7v2. This is the last annotation update provided for the GRCh37 reference assembly. We anticipate including annotation for the updated human reference assembly, GRCh38, in RefSeq release 62. See the Genome Reference Consortium web site for more details: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/
This annotation update, Human annotation release 105, incorporated RNA-Seq data from the Human Body Map project. Human genes may now include a mixture of 'known' RefSeq transcripts and proteins (with NM_, NR_, NP_ accession prefixes) and 'model' transcripts and proteins (with XM_, XR_, and XP_ accession prefixes). For example, see the Gene report for human gene MACF1 (Gene ID: 23499).
Providing a mixture of known and model human transcripts and proteins enables us to represent many more splice variants and exons. Note that model RefSeq records are a product of NCBI's eukaryotic genome annotation pipeline and are generated using a mixture of transcript, protein, and RNA-Seq alignments; the exon combination represented may be inferred.
[3] This release includes a new supplemental file for non-redundant WP proteins (release##.WP2genomic.mapping.txt) in the /release-catalog/ directory. This file provides a mapping table of WP accessions to the genomic accessions that the WP is annotated on. Note that additional rows are provided when a protein has been annotated on more than one genomic record.
The columns in this file are:
- Protein accession.version
- Protein gi
- Genomic nucleotide accession.version
- Nucleotide gi
- Species-level tax_id
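A minimal sketch of reading this mapping file follows. It assumes the file is tab-delimited, has five columns in the order given above, and may contain comment lines starting with '#'; none of these details are confirmed by the announcement. The names are hypothetical, and the rows in the usage example are made-up placeholders.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, TextIO


class WPMapping(NamedTuple):
    protein_acc: str    # Protein accession.version
    protein_gi: str     # Protein gi
    genomic_acc: str    # Genomic nucleotide accession.version
    nucleotide_gi: str  # Nucleotide gi
    tax_id: str         # Species-level tax_id


def read_wp2genomic(handle: TextIO) -> List[WPMapping]:
    """Read mapping rows. Assumption: tab-delimited, '#' marks comment lines."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 5 or fields[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        rows.append(WPMapping(*fields[:5]))
    return rows


def genomes_per_protein(rows: Iterable[WPMapping]) -> Dict[str, List[str]]:
    """Group genomic accessions by WP accession (one row per annotation)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.protein_acc].append(row.genomic_acc)
    return dict(grouped)


if __name__ == "__main__":
    # Placeholder rows for illustration only, not real identifiers.
    sample = (
        "WP_000000001.1\t101\tNC_000001.1\t201\t562\n"
        "WP_000000001.1\t101\tNC_000002.1\t202\t562\n"
        "WP_000000002.1\t102\tNC_000001.1\t201\t562\n"
    )
    print(genomes_per_protein(read_wp2genomic(io.StringIO(sample))))
```

Grouping by protein accession makes the "additional rows" behaviour visible: a WP accession annotated on several genomic records maps to a list with more than one entry.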
[4] Multispecies WP records: The subset of proteins in the new non-redundant RefSeq dataset, with WP_ accession prefixes, that are annotated on assembled genomes from distinct species have been updated to reflect organism names and NCBI taxonomic IDs for the lowest common taxonomic node.
For example, WP_000289090.1 is annotated on both Escherichia and Shigella bacterial genomes. The record was updated to indicate that the source organism is 'Enterobacteriaceae'.
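The "lowest common taxonomic node" idea can be sketched as the last shared entry of root-to-leaf lineages. This is not NCBI's implementation; the function name is hypothetical and the lineages in the usage example are abbreviated and hand-written for illustration, not fetched from the Taxonomy database.

```python
from typing import Optional, Sequence


def lowest_common_node(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Return the deepest node shared by all root-to-leaf lineages, or None."""
    if not lineages:
        return None
    common = None
    # Walk the lineages in parallel from the root; stop at the first level
    # where they disagree (zip also stops at the shortest lineage).
    for names_at_level in zip(*lineages):
        if all(name == names_at_level[0] for name in names_at_level):
            common = names_at_level[0]
        else:
            break
    return common


if __name__ == "__main__":
    # Abbreviated, hand-written lineages for illustration only.
    escherichia = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                   "Enterobacteriaceae", "Escherichia"]
    shigella = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                "Enterobacteriaceae", "Shigella"]
    print(lowest_common_node([escherichia, shigella]))
```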
[5] The RefSeq bacterial genomes group stopped processing new bacterial genomes using the original PGAAP annotation pipeline. The new pipeline, PGAAP-2.0, has now been moved into production. This pipeline switch was accompanied by a delay in processing approximately 1000 new bacterial genomes for RefSeq. Processing of these genomes, as well as calculating annotation updates for other RefSeq bacterial genomes, is in progress.
November 1, 2013: Announcing change in bacterial taxonomy management
NCBI made an announcement regarding bacterial strain-level TaxID management that affects the RefSeq database. Assigning strain-level TaxIDs will be discontinued in January 2014. Strain and isolate information will be maintained in the BioSample database instead of the Taxonomy database. Additional information is available in the NCBI News Archive.
November 14, 2013: Announcing RefSeq Release 62
This full release incorporates genomic, transcript, and protein data available as of November 10, 2013 and includes 45,971,929 records, 36,036,343 proteins, 5,178,509 RNAs, and sequences from 31,646 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] This update reflects a mixture of SNP builds 138 and 139. A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/b139_20131112.rpt