RefSeq Announcements in 2014
- January 21, 2014: Announcing RefSeq Release 63
- March 12, 2014: Announcing RefSeq Release 64
- April 14, 2014: Reminder of upcoming RefSeq FTP changes
- May 20, 2014: Announcing RefSeq Release 65
- July 15, 2014: Announcing RefSeq Release 66
- August 26, 2014: Major revision of NCBI's genomes FTP site
- September 11, 2014: Announcing RefSeq Release 67
- November 7, 2014: Announcing RefSeq Release 68
January 21, 2014: Announcing RefSeq Release 63
This full release incorporates genomic, transcript, and protein data available as of January 12, 2014, and includes 48,358,066 records, 37,371,278 proteins, 5,760,653 RNAs, and sequences from 33,485 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] This update reflects a mixture of SNP builds 138 and 139. A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq63.snp.rpt
March 12, 2014: Announcing RefSeq Release 64
This full release incorporates genomic, transcript, and protein data available as of March 10, 2014, and includes 45,971,929 records, 37,818,139 proteins, 6,198,996 RNAs, and sequences from 33,693 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
- A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq64.snp.rpt
- RefSeq domain and site features that are provided by the Conserved Domain Database were updated in conjunction with CDD release 3.11. For more information on release 3.11 see: http://www.ncbi.nlm.nih.gov/news/02-19-2014-cdd-v311/
- This release includes annotation for the updated human reference genome assembly, GRCh38. The Genome Reference Consortium released a major update to the human reference genome assembly in late December 2013. In January 2014 this updated assembly, plus two other human genome assemblies (HuRef and CHM1_1.1), were annotated using NCBI's eukaryotic genome annotation pipeline, which integrated information from curated RefSeqs, cDNAs, ESTs, protein alignments, and RNA-Seq data from the Human BodyMap2 project. Results for all three genomes are available as NCBI Annotation release 106. Annotation was calculated for chromosomes, alternate loci, and unplaced or unlocalized scaffolds defined by the GRC.
  GRCh38 annotation highlights include:
  - 41,556 genes and pseudogenes
  - 20,246 protein-coding genes
  - 14,632 genes annotated with alternatively spliced transcripts
  - 69,826 mRNA models
  - 17,857 additional RNAs
  Additional information is available here:
  - Annotation report page: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/106/
  - Annotation pipeline: http://www.ncbi.nlm.nih.gov/books/NBK169439/
  - Map Viewer genome browser: http://www.ncbi.nlm.nih.gov/projects/mapview/map_search.cgi?taxid=9606&build=106.0
April 14, 2014: Reminder of upcoming RefSeq FTP changes
- Directory name change: The 'microbial' directory will be removed. Two new directories, 'archaea' and 'bacteria', will be added.
- WGS management change: WGS accessions will no longer be processed on a per-project (WGS prefix) basis. Instead, these accessions will be processed and packaged the same as non-WGS accessions. This will significantly reduce the number of files in the /complete/ and (new) /archaea/ and /bacteria/ directories. Therefore, there will no longer be a series of files named like 'microbialNZ_*'. Instead, all WGS scaffolds will be found in concatenated files just like all other accession data. We will continue to provide a separate file for the WGS master records.
  Please note that this change in WGS management will also impact the /refseq/daily/ and /refseq/wgs/ directory areas. This impact was not spelled out in previous emails. As WGS accessions are processed the same as other non-WGS accessions, these updates will now appear in the /daily/ update area. WGS master records will continue to be provided separately from other files as they are special meta-data only records. WGS master files will be provided with names like 'rsnc.wgs_mstr.0403.2014.bna.gz' and 'rsnc.wgs_mstr.0403.2014.gbff.gz' (where "wgs_mstr" indicates the type). These files will be provided in the /refseq/wgs/ directory area.
- File name & Content change: This change will occur in both /refseq/daily/ and /refseq/release/release-catalog/ directory areas.
  - *WP2genomic.mapping.gz will be renamed *AutonomousProtein2Genomic.gz
  - *multispecies_WP_accession_to_taxname.gz will be renamed *MultispeciesAutonomousProtein2taxname.gz
Both file names will be modified in order to more accurately reflect the terminology that is being used to refer to the autonomous nonredundant protein dataset that utilizes the 'WP_' accession prefix. In addition, the content of the AutonomousProtein2Genomic file will be expanded to include:
- protein accession.version
- protein gi
- genomic accession.version (on which the autonomous WP protein is annotated)
- genomic gi
- genomic annotated strain-level taxid
- genomic species-level taxid
- genomic BioSample ID if available
- genomic organism name (e.g., species + strain)
May 20, 2014: Announcing RefSeq Release 65
This full release incorporates genomic, transcript, and protein data available as of May 12, 2014, and includes 51,770,174 records, 38,633,935 proteins, 7,051,549 RNAs, and sequences from 36,335 different organisms. Additional information is available in the Release Notes.
Changes since the previous release:
[1] A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq65.snp.rpt
[2] Directory name changes: Two new directories, 'archaea' and 'bacteria' have been added to the release ftp site. They replace the previous 'microbial' directory.
[3] WGS management change: WGS accessions are no longer processed on a per-project (WGS prefix) basis. Instead, these accessions are processed and packaged the same as non-WGS accessions. This reduces the number of files in the /complete/ and (new) /archaea/ and /bacteria/ directories. Therefore, there are no longer files named like 'microbialNZ_*'. Instead, all WGS scaffolds are found in concatenated files just like all other accession data. We continue to provide a separate file for the WGS master records.
Please note that this change in WGS management also impacts the /refseq/daily/ and /refseq/wgs/ directory areas. As WGS accessions are processed the same as other non-WGS accessions, these updates now appear in the /daily/ update area. WGS master records continue to be provided separately from other files as they are special meta-data only records. WGS master files are provided with names like 'rsnc.wgs_mstr.0403.2014.bna.gz' and 'rsnc.wgs_mstr.0403.2014.gbff.gz' (where "wgs_mstr" indicates the type). These files are provided in the /refseq/wgs/ directory area.
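As a rough illustration of the naming scheme described above, the following Python sketch splits a WGS master file name into its parts. The pattern is inferred only from the two example names given in this announcement; reading "0403.2014" as month-day followed by year is an assumption, and `parse_wgs_master_name` is a hypothetical helper, not an NCBI tool.

```python
import re
from datetime import date

# Assumed layout, inferred from the two example names only:
#   <prefix>.wgs_mstr.<MMDD>.<YYYY>.<format>.gz
WGS_MASTER_RE = re.compile(
    r"^(?P<prefix>[^.]+)\.wgs_mstr\.(?P<mmdd>\d{4})\.(?P<year>\d{4})"
    r"\.(?P<fmt>[^.]+)\.gz$"
)


def parse_wgs_master_name(name):
    """Return (prefix, date, format) for a WGS master file name, or None.

    Raises ValueError if the digits do not form a valid calendar date.
    """
    match = WGS_MASTER_RE.match(name)
    if match is None:
        return None
    mmdd = match.group("mmdd")
    stamp = date(int(match.group("year")), int(mmdd[:2]), int(mmdd[2:]))
    return match.group("prefix"), stamp, match.group("fmt")
```

A name that does not follow the assumed layout, such as one of the retired 'microbialNZ_*' files, simply returns None.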
[4] File name & Content change: This occurs in both /refseq/daily/ and /refseq/release/release-catalog/ directory areas.
- *WP2genomic.mapping.gz has been renamed *AutonomousProtein2Genomic.gz
- *multispecies_WP_accession_to_taxname.gz has been renamed *MultispeciesAutonomousProtein2taxname.gz
The modified file names more accurately reflect the terminology that is being used to refer to the autonomous and nonredundant protein dataset that utilizes the 'WP_' accession prefix. In addition, the content of the AutonomousProtein2Genomic file has been expanded to include both species and strain information. The columns provided include:
- protein accession.version
- protein gi
- genomic accession.version (on which the autonomous WP protein is annotated)
- genomic gi
- genomic annotated strain-level taxid
- genomic species-level taxid
- genomic BioSample ID if available
- genomic organism name (e.g., species + strain)
Also see the previous announcement introducing the autonomous nonredundant protein dataset: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/announcements/WP-proteins-06.10.2013.pdf
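For readers who load the AutonomousProtein2Genomic file programmatically, the column list above could be mapped to named fields roughly as follows. This is a minimal sketch: it assumes the file is tab-delimited, that the columns appear in the order listed, and that any header line starts with '#'. None of these details are stated in this announcement, and the field names are illustrative.

```python
import csv
import gzip
from typing import Iterable, Iterator, NamedTuple


class ProteinToGenomic(NamedTuple):
    """One row; field order follows the column list in the announcement."""
    protein_acc: str      # protein accession.version
    protein_gi: str
    genomic_acc: str      # genomic accession.version
    genomic_gi: str
    strain_taxid: str     # genomic annotated strain-level taxid
    species_taxid: str    # genomic species-level taxid
    biosample: str        # may be empty when no BioSample ID is available
    organism: str         # e.g., species + strain


def read_rows(lines: Iterable[str]) -> Iterator[ProteinToGenomic]:
    """Yield one record per data line of (assumed) tab-delimited text."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and '#' header lines (header style is an assumption).
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) != len(ProteinToGenomic._fields):
            raise ValueError(f"expected 8 columns, got {len(fields)}")
        yield ProteinToGenomic(*fields)


def read_gz(path: str) -> Iterator[ProteinToGenomic]:
    """Parse a local gzip-compressed copy of the file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from read_rows(handle)
```

Values are kept as strings so that an empty BioSample column or an unexpected identifier format does not abort the read.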
[5] Change in bacterial strain-level TaxID management: Please refer to the RefSeq-announce email that was sent on November 1, 2013. To summarize - Due to the high volume of bacterial genome submissions, and the trend toward sequencing many isolates from a population, we will discontinue assignment of strain-specific TaxIDs. Instead, strain name data will be managed in the BioSample database. Sequence records will be annotated with both the NCBI TaxID, which identifies the species, and a BioSample ID, which identifies the strain (or breed, cultivar, or isolate).
GenBank submissions processing of bacterial genomes has already transitioned to this new data model (e.g., CP006594.1) and this data detail is being propagated to the corresponding RefSeq genome (e.g., NC_021838.1). Historical GenBank and RefSeq genomes will be retroactively updated to add the BioSample Accession to the DBLINK line over the coming months.
July 15, 2014: Announcing RefSeq Release 66
This full release incorporates genomic, transcript, and protein data available as of July 7, 2014, and includes 58,334,707 records, 43,671,159 proteins, 7,568,770 RNAs, and sequences from 41,263 different NCBI Taxons. Additional information is available in the Release Notes.
Changes since the previous release:
[1] A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq66.snp.rpt
[2] As this release was in the process of being installed, a flat file change was noted that affects the displayed dbXref text for MGI and HGNC links such that the header is displayed twice. For example, /db_xref="MGI:MGI:99954"
This change originates from an INSDC modification to these dbXrefs at the request of MGI and HGNC. The dbXref is intended to display the db source followed by the db value that is passed in the URL. For both MGI and HGNC the passed URL value includes both the DB source term and an integer value. RefSeq uses the INSDC dbXref format and was not aware of this planned change prior to processing the release.
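Parsers that split a dbXref on every colon may mis-handle the doubled form. One way to cope, sketched here in Python with hypothetical helper names, is to split only on the first colon and then strip a repeated source prefix when it is present.

```python
def split_db_xref(value):
    """Split a db_xref value into (database, identifier).

    Only the first colon separates the two parts, so
    'MGI:MGI:99954' yields ('MGI', 'MGI:99954').
    """
    db, sep, identifier = value.partition(":")
    if not sep or not db or not identifier:
        raise ValueError(f"not a db_xref value: {value!r}")
    return db, identifier


def bare_identifier(value):
    """Return the identifier without a repeated database prefix, if any."""
    db, identifier = split_db_xref(value)
    prefix = db + ":"
    if identifier.startswith(prefix):
        return identifier[len(prefix):]
    return identifier
```

With this approach both the older single-prefix form and the doubled form reduce to the same bare value.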
[3] Growth in Bacterial Genomes: NCBI recently refactored the prokaryotic genome annotation pipeline. The pipeline has a much higher throughput rate which better responds to the growing influx of bacterial genome submissions, incorporates regression and additional quality tests, and generates more consistent annotation results. Initial processing has focused on unannotated bacterial genomes submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC) and available in NCBI's GenBank database.
NCBI's bacterial genome annotation pipeline is described here:
http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ http://www.ncbi.nlm.nih.gov/books/NBK174280/
[4] Imminent change to related /genomes/ FTP area: Organism- and assembly-specific reports of sequence and annotation information are available for some RefSeq genomes from: ftp://ftp.ncbi.nlm.nih.gov/genomes/
Three new directories will be available from this site by the end of July:
- all/ - comprehensive reports of GenBank and RefSeq assembled genomes. Content is organized per assembly accession and name. Sequence and annotation data is provided in several formats. Assembly meta-data and details are also provided.
- refseq/ - RefSeq genomes. Content is organized by taxonomic groups similar to the RefSeq release, then by species, then by assembly. Sequence and annotation data is provided in several formats. Assembly meta-data and details are also provided.
- genbank/ - GenBank genomes. Same as above.
Additional details will be provided in NCBI announcements, and in README files in all three directories. Please monitor the primary NCBI News site for additional information (http://www.ncbi.nlm.nih.gov/news/).
[5] NCBI has discontinued assigning tax_ids to strains. Strain information is now managed by the BioSample database resource. Sequence records that represent a strain report the BioSample ID on the DBLINK line and continue to report the strain name in the source feature. A RefSeq processing bug was discovered while processing this release that results in failure to report the strain name in the record DEFINITION line. NZ_JNXV00000000.1 is an example accession that reports the BioSample ID:
LOCUS       NZ_JNXV01000000      6964522 bp    DNA     linear   BCT 03-JUL-2014
DEFINITION  Streptomyces flavotricini, whole genome shotgun sequencing project.
ACCESSION   NZ_JNXV00000000
VERSION     NZ_JNXV00000000.1  GI:662073487
DBLINK      BioProject: PRJNA224116
            BioSample: SAMN02645296
            Assembly: GCF_000715705.1
KEYWORDS    WGS; HIGH_QUALITY_DRAFT; RefSeq.
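Since the BioSample ID now carries the strain identity, pipelines may want to read it from the DBLINK block. The Python sketch below is one possible approach, not an NCBI-provided parser. It assumes the usual flat-file layout, in which the keyword starts in the first column and continuation lines are indented; `parse_dblink` is a hypothetical helper name.

```python
def parse_dblink(flatfile_text):
    """Collect 'Database: value' pairs from the DBLINK block of one record."""
    links = {}
    in_dblink = False
    for line in flatfile_text.splitlines():
        if line.startswith("DBLINK"):
            in_dblink = True
            entry = line[len("DBLINK"):]
        elif in_dblink and line.startswith(" "):
            entry = line  # indented continuation of the DBLINK block
        else:
            in_dblink = False  # any other keyword ends the block
            continue
        name, sep, value = entry.strip().partition(":")
        if sep:
            links[name.strip()] = value.strip()
    return links
```

Applied to the excerpt above, the returned dictionary would hold the BioProject, BioSample, and Assembly identifiers keyed by database name.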
[6] Common BioProject ID on Prokaryotic WGS RefSeq Genomes: Prokaryotic genomes annotated using the re-factored genome annotation pipeline are tracked with a single BioProject ID, namely PRJNA224116. Summary information on the annotated genomes is presented in BioProject: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA224116
August 26, 2014: Major revision of NCBI's genomes FTP site
NCBI has released a major revision of the genomes FTP site, making it easier to navigate and to download both GenBank and RefSeq eukaryotic and prokaryotic genomes. For more information about the changes, please see the NCBI News article: http://www.ncbi.nlm.nih.gov/news/08-26-2014-new-genomes-FTP-live/ and the genome FTP FAQ: /genome/doc/ftpfaq/
We plan to maintain the older content and structure of the preexisting site (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/), which many of you have been using, in parallel with the new structure for 6 months.
Please note that bi-monthly RefSeq releases will continue to be provided. The next RefSeq release will be installed in mid-September. Here is more information to clarify the distinction between the Genomes and RefSeq FTP sites.
The new genomes FTP directories ('all', 'genbank', and 'refseq') provide data based on new or updated prokaryotic or eukaryotic genome assemblies, or updates to whole genome annotations. Over time, this space will include both historical and current assembly and annotation content. This FTP site is oriented on the genome assembly package and corresponds to the content in NCBI's Assembly resource (www.ncbi.nlm.nih.gov/assembly/). The genomes FTP site facilitates access to both GenBank and RefSeq data - including genomic sequence, assembly structure details, accessioned annotated transcripts, and accessioned annotated proteins. RefSeq records that are not part of the genome annotation will not be included here. The root directory is here: ftp://ftp.ncbi.nlm.nih.gov/genomes/
The RefSeq FTP site provides bulk access to the entire RefSeq database or specific sub-sets and does include content that is not available in the new genomes FTP site including: daily updates; RefSeqGene records; viruses; organelles (that are not part of a whole genome submission); the targeted ribosomal RNA project; RefSeq transcript and protein records that are not yet annotated on the corresponding genome; and autonomous non-redundant proteins (WP_ accession prefix) that are not yet directly annotated on a genome. The root directory is: ftp://ftp.ncbi.nlm.nih.gov/refseq/
September 11, 2014: Announcing RefSeq Release 67
This full release incorporates genomic, transcript, and protein data available as of September 8, 2014, and includes 61,277,203 records, 45,166,402 proteins, 8,163,775 RNAs, and sequences from 41,913 distinct NCBI TaxIDs. Additional information is available in the Release Notes.
Changes since the previous release:
[1] A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq67.snp.rpt
[2] Expanded content in *files.installed file: The Release-catalog file 'release*.files.installed' has been modified to add md5chksum information per file. The file format is: md5chksum file_name
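The checksum column makes it possible to verify downloaded release files. The following Python sketch shows one way this might be done. It assumes each line holds the checksum and the file name separated by whitespace, and that the checksum is a hexadecimal MD5 digest; the function names are illustrative.

```python
import hashlib


def parse_installed(lines):
    """Map file name -> expected MD5 hex digest from 'md5chksum file_name' lines."""
    expected = {}
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            checksum, name = parts
            expected[name.strip()] = checksum.lower()
    return expected


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a local file in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, name, expected):
    """True if the local file's MD5 matches the catalog entry for `name`."""
    return expected.get(name) == md5_of(path)
```

A file that is missing from the catalog is reported as not intact, since no expected digest exists to compare against.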
[3] WGS master record count: The release-statistics file 'RefSeq-release*.mmddyyyy.stats.txt' now includes a separate count of WGS master records.
[4] Bacterial genome re-annotation status: Re-annotation of bacterial genomes is in process and we expect to make updated annotation public for a large number of RefSeq bacterial genomes in the next two months. RefSeq release 68 will include this revised data. One of the expected changes will be expanded direct annotation of autonomous RefSeq protein products (with WP_ accession prefixes) and a corresponding removal of YP_ accessioned records.
[5] New Identical Protein Report for web and eUtils: http://www.ncbi.nlm.nih.gov/news/09-09-2014-identical-protein-report-display-setting/
A new display option has been added to the Protein database - the "Identical Protein Report". When viewing an individual record, this display allows you to access a list of all other identical proteins including those submitted as translations to GenBank, as well as RefSeq, UniProtKB/Swiss-Prot, PIR, PDB, and patented protein records.
This report is anticipated to be especially useful for those interested in obtaining current information about the RefSeq genomes that have been annotated with an autonomous protein product (with a WP_ accession prefix). The report page provides mapping between protein and genomic accession.version, as well as information on the species and strain that the autonomous protein is in scope for.
The report can also be obtained using NCBI's programming utilities. For example:
1) www.ncbi.nlm.nih.gov/protein/WP_008440780.1?report=ipg&log$=seqview
2) eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=495716201&rettype=ipg&retmode=text
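A request of the second kind can be assembled in a script. The Python sketch below builds the efetch URL from the same parameters shown in the example (db, id, rettype=ipg, retmode=text). The https scheme and the helper name are assumptions, and the sketch only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Host and path as given in the announcement; the https scheme is assumed.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def identical_protein_report_url(protein_id):
    """Build an efetch URL requesting the Identical Protein Report as text."""
    query = urlencode({
        "db": "protein",
        "id": protein_id,
        "rettype": "ipg",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"
```

The resulting string can be passed to any HTTP client, for instance `urllib.request.urlopen`.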
[6] GI sequence identifiers to be phased out (slowly!) at NCBI: As NCBI considers how best to address the expected continued increase in the volume of submitted sequence data, it is clear that prior practices will need to be re-thought. As an example, imagine 100,000 pathogen-related genomes/samples, each with 5000 proteins, most of which are common to all. We are moving toward solutions that represent each unique protein *once*. The RefSeq autonomous non-redundant protein dataset is an early example of this.
GenBank is considering expanding this data model such that the coding region protein products for each genome will likely continue to be assigned their own Accession.Version identifiers, but (within the NCBI data model) they will simply *reference* the unique proteins. And, the annotated coding region proteins will no longer be issued GIs of their own. This change will affect both GenBank and RefSeq records.
Such a change will likely have a significant impact on NCBI users who utilize GIs in their own information systems and analysis pipelines, so it will not be introduced quickly. You can expect that a great deal of additional detail will be made available via NCBI's various announcement mechanisms.
*This* particular announcement is chiefly intended to provide some advance warning to our users. There _will_ be classes of GenBank sequences that are not assigned GIs in the not-too-distant future. If GIs are central to your operations, then it might be appropriate to begin planning a switch to the use of Accession.Version identifiers instead.
And in fact, NCBI now has several WGS submissions for which GIs have not been assigned, for both the contigs and the scaffolds.
For example: Here are excerpts of the flatfile representation for the first ALWZ02 (second assembly-version of the ALWZ project) contig, and the 'singleton' scaffold which is constructed from it and lacks a GI value:
LOCUS       ALWZ020000001            701 bp    DNA     linear   PLN 28-MAY-2013
DEFINITION  Picea glauca contig316_0, whole genome shotgun sequence.
ACCESSION   ALWZ020000001 ALWZ020000000
VERSION     ALWZ020000001.1
DBLINK      BioProject: PRJNA83435
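For systems that plan to key on Accession.Version rather than GI, a small helper such as the following Python sketch could split the identifier into its two parts. It assumes the version is the integer after the final period; the function name is hypothetical.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version number)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdecimal():
        raise ValueError(f"not an accession.version: {identifier!r}")
    return accession, int(version)
```

Storing the accession and the version separately makes it straightforward to detect when a record has been updated to a newer version.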
November 7, 2014: Announcing RefSeq Release 68
This full release incorporates genomic, transcript, and protein data available as of November 3, 2014, and includes 66,078,114 records, 46,968,574 proteins, 9,069,704 RNAs, and sequences from 49,312 distinct NCBI TaxIDs. Additional information is available in the Release Notes.
Changes since the previous release:
[1] A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq68.snp.rpt
[2] New file: A file named RELEASE_NUMBER has been added to the /refseq/release/ directory. The content of this file is the integer release number.
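A mirror script might use this file to decide whether a new release needs to be fetched. The Python sketch below assumes the file contains only the integer, possibly followed by a newline; how the file's text is retrieved is left to the caller, and both function names are hypothetical.

```python
def parse_release_number(text):
    """Return the release number held in the RELEASE_NUMBER file's text."""
    value = text.strip()
    if not value.isdecimal():
        raise ValueError(f"unexpected RELEASE_NUMBER content: {text!r}")
    return int(value)


def needs_update(local_release, remote_text):
    """True when the published release is newer than the local mirror."""
    return parse_release_number(remote_text) > local_release
```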
[3] Bacterial genomes: NCBI's refactored Prokaryotic Genome Annotation Pipeline is now in full production mode. Since the last release, new bacterial genomes from over 8,000 distinct NCBI Taxonomy IDs were added to the RefSeq database. This also resulted in a large increase in annotated bacterial plasmids.
In addition, approximately 10,000 RefSeq bacterial genomes have been re-annotated and are included in this release.
As bacterial genomes are annotated, they are converted to the new RefSeq protein data model. Thus, a large number of 'YP_' and 'XP_' accessions have been removed. All pipeline-annotated bacterial genomes refer to a nonredundant RefSeq protein with a 'WP_' accession prefix. We will provide a mapping file from the suppressed YP accessions to the replacement WP_ proteins when the re-annotation process has completed. We anticipate this will be available by December 2014.
General information on the Prokaryotic Genome Annotation Pipeline is available here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ http://www.ncbi.nlm.nih.gov/books/NBK174280/