Updating Information on GenBank Genome Records

You can update your existing GenBank prokaryotic and eukaryotic genome records at any time using the different file types described below. If you are updating multiple records, please send a list of all accessions to be updated at the top of your request. Save the update file types as plain text and email them to us at [email protected].

You can also request a change in the release date of your genomes, depending on their status. Send an email request with the genome accessions to us at [email protected].

If you submitted to our collaborators at ENA or DDBJ, please see their instructions for update formats.

See these instructions for updating GenBank records that are not prokaryotic or eukaryotic genomes.

See SRA for information about updating SRA submissions.

Update Formats

Changing the release date
Updating Publication Information
Source Information
A new assembly of the genome (sequence update)
Changing text or adding new qualifiers for existing features
Adding or removing a few features
Re-annotating existing records (adding or removing or moving many features)

Changing the release date

Your genome will be released on the day that you selected during the submission or when its accession or description is publicly available, whichever is first. You can request a change in the release date, depending on the status. If needed, send an email request with the genome accessions to us at [email protected].

If you are requesting the release of a genome because a manuscript has been accepted for publication or is published, please also provide the publication information so that we can include that on the genome.

Note that release of the genome will automatically trigger the release of its BioProject and BioSample. However, the reverse is not true; the release of a BioProject or BioSample will not automatically trigger the release of associated data.

Updating Publication Information

[a] If the PMID or DOI is publicly available, please send the information as a tab-delimited table as follows:

    acc. num.   PMID 
    CP002501    29980901
    JAARTP00000000  29985341

    acc. num.   DOI
    CP002501        10.1000/xyz456
    JAARTP00000000  https://doi.org/10.1000/xyz123doi

[b] For all other updates, please provide the revised information in a tab-delimited table. You must replace any non-ASCII characters (for example, characters with accents and umlauts) with the appropriate English letters.

For example, a table that revises both the authors and the title:

    acc. num.       authors               title
    ARTP00000000    J. A. Smith           Analysis of the ABCD genome
    CP002341        X. P. Weng, J. Doe    Comparison of gut genomes
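One way to check a table for non-ASCII characters before sending it is sketched below in Python. It is an illustration only, not an NCBI tool. The function name to_ascii is hypothetical, and the approach (Unicode NFKD decomposition, then dropping the accent marks) only handles letters that decompose into a base letter plus an accent. Characters with no such decomposition are replaced by "?" and reported so that you can fix them by hand. Whether an umlaut such as u-umlaut should become "u" or "ue" is your choice; this sketch produces "u".

```python
import unicodedata

def to_ascii(text):
    """Return (ascii_text, leftovers).

    Accented letters lose their accents; characters that cannot be
    converted become "?" and are listed in leftovers.
    """
    out, leftovers = [], []
    for ch in text:
        # NFKD splits an accented letter into base letter + combining mark.
        decomposed = unicodedata.normalize("NFKD", ch)
        kept = "".join(c for c in decomposed if ord(c) < 128)
        if kept:
            out.append(kept)
        else:
            out.append("?")        # placeholder to be fixed by hand
            leftovers.append(ch)
    return "".join(out), leftovers

# Sample with u-umlaut, n-tilde, and o-slash (written as escapes).
sample = "J. M\u00fcller, A. Pe\u00f1a, S\u00f8ren"
text, leftovers = to_ascii(sample)
print(text)
print(len(leftovers), "character(s) need manual replacement")
```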

The complete list of revised author names should be provided in the following format: first_initial middle_initial surname, etc. For example:

    acc. num.       authors    
    ARTP00000000    J. A. Smith    
    CP002341        X. P. Weng, J. Doe
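The author format above can be produced from full names with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and it assumes given names are separated by spaces (hyphenated or compound names would need extra handling).

```python
def format_author(given_names, surname):
    """Turn ('John Adam', 'Smith') into 'J. A. Smith'."""
    initials = " ".join(f"{name[0].upper()}." for name in given_names.split())
    return f"{initials} {surname}".strip()

def author_list(authors):
    """Join (given_names, surname) pairs into one comma-separated cell."""
    return ", ".join(format_author(given, surname) for given, surname in authors)

print(author_list([("Xiao Ping", "Weng"), ("Jane", "Doe")]))
```

The printed string can be pasted into the authors column of the table.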

Source Information

Send updates to the source information (e.g., strain, cultivar, geo_loc_name, specimen_voucher) in a multi-column tab-delimited table, and we will update the genome and its BioSample. An example table is:

    acc. num.       strain  geo_loc_name
    XXXX00000000    82      USA
    XXXY00000000    ABC     Canada
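If the source updates are already held in a script or spreadsheet export, a multi-column table like the one above could be generated as follows. This is a sketch under assumptions: the function name and the dictionary layout are illustrative, and a qualifier missing for an accession is simply written as an empty cell.

```python
import csv
import io

def source_update_table(updates, qualifiers):
    """Build tab-delimited text from {accession: {qualifier: value}}."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["acc. num."] + list(qualifiers))
    for accession, values in updates.items():
        # Missing qualifiers become empty cells.
        writer.writerow([accession] + [values.get(q, "") for q in qualifiers])
    return buf.getvalue()

updates = {
    "XXXX00000000": {"strain": "82", "geo_loc_name": "USA"},
    "XXXY00000000": {"strain": "ABC", "geo_loc_name": "Canada"},
}
print(source_update_table(updates, ["strain", "geo_loc_name"]), end="")
```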

A new assembly of the genome (sequence update)

If you have updated the sequence and the chromosomes of the genome are still in multiple pieces, then create a new genome file as you did before, following the instructions. Submit a new genome submission in the Genome Submission Portal and select 'yes' that it is an update to an existing genome at the prompt. Include the WGS accession XXXX00000000 in the box. Be sure to provide the BioProject and BioSample IDs from the original genome. Choose option 2 (wgs genomes) on the Files tab and finish the submission. The sequences will be assigned new accession numbers and the master accession will increment to the next version, e.g., XXXX01000000 would update to XXXX02000000. See more information about WGS accession numbers.
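The version increment described above can be illustrated with a short Python sketch. The regular expression is an assumption based on the accession shapes shown on this page (a 4- or 6-letter prefix, a 2-digit version, then zeros); the function name is hypothetical, and the actual accessions are assigned by GenBank staff, not computed by submitters.

```python
import re

# Assumed shape of a WGS master accession: prefix, 2-digit version, zeros.
MASTER = re.compile(r"^([A-Z]{4}|[A-Z]{6})(\d{2})(0{6,})$")

def next_master_accession(accession):
    """Return the master accession with its version raised by one."""
    match = MASTER.match(accession)
    if match is None:
        raise ValueError(f"not a WGS master accession: {accession!r}")
    prefix, version, zeros = match.groups()
    new_version = int(version) + 1
    if new_version > 99:
        raise ValueError("version does not fit in two digits")
    return f"{prefix}{new_version:02d}{zeros}"

print(next_master_accession("XXXX01000000"))
```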

If the chromosomes of the genome are now each in a single sequence, then create a new genome file as per the instructions. Submit a new genome submission in the Genome Submission Portal and select 'yes' that it is an update to an existing genome at the prompt. Include the previous genome accession (e.g., CP000001 or XXXX00000000) in the box. Be sure to provide the BioProject and BioSample IDs from the original genome. Choose option 1 on the Files tab and finish the submission.

If you are including annotation, then see the prokaryotic or eukaryotic annotation instructions.

Changing text or adding new qualifiers for existing features

Use a tab-delimited table for simple updates to existing features (e.g., changing product names or adding EC_numbers to existing CDS features). The first row in the table would be the headers, with subsequent rows for each qualifier being modified or added. The first column is the accession or contig name, the second is locus_tag, and subsequent columns are the qualifiers being changed. For example:

    Accession       Locus_tag   gene_name   CDS_product           CDS note                    gene note
    XXXX01000001    Abc_xxxx    lacZ        beta-galactosidase                                present in multiple copies
    XXXX01000010    Abc_xxxy                helicase              required for replication

Also indicate whether blank cells mean 'delete what is present' or 'no change'. A blank cell in the CDS_product can never mean 'delete' since that is a required field. You only need to include the features that are changing. New product names will need to follow the protein naming conventions; see the prokaryotic and eukaryotic annotation guidelines.
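Because a blank CDS_product cell can never mean 'delete', it can help to list such rows before sending a table in which blanks are meant as deletions. The Python sketch below does that. It is illustrative only: the function name is hypothetical, the column names are taken from the example table above, and the second sample row has been altered to leave its product blank.

```python
import csv
import io

def blank_product_rows(table_text, column="CDS_product"):
    """Return the locus_tags of rows whose product cell is empty."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        row["Locus_tag"]
        for row in reader
        if not (row.get(column) or "").strip()
    ]

table = (
    "Accession\tLocus_tag\tgene_name\tCDS_product\tCDS note\tgene note\n"
    "XXXX01000001\tAbc_xxxx\tlacZ\tbeta-galactosidase\t\tpresent in multiple copies\n"
    "XXXX01000010\tAbc_xxxy\t\t\trequired for replication\t\n"
)
print(blank_product_rows(table))
```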

NOTE: You cannot add new features (e.g., a new CDS or gene), or remove or relocate existing features, this way. If you want to make those changes, then see the instructions below.

Adding or removing a few features

If you are only adding a few new features, then you could send a small 5-column Feature Table .tbl file that has only the new features. However, if there are many changes, then follow the instructions below for re-annotating existing records. For more information about this table format see the prokaryotic or eukaryotic annotation instructions.
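As a rough illustration of what a small .tbl file looks like, the Python sketch below writes one. It rests on assumptions that should be checked against the annotation instructions: the layout used here (a ">Feature SeqID" line, then tab-separated start, stop, and feature key, then qualifier lines indented by three tabs) is a simplified reading of the 5-column format, and the SeqID, coordinates, and qualifier values are placeholders.

```python
def feature_table(seq_id, features):
    """Build 5-column feature table text.

    features: list of (start, stop, key, {qualifier: value}) tuples.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        lines.append(f"{start}\t{stop}\t{key}")
        for name, value in qualifiers.items():
            # Qualifier rows leave the first three columns empty.
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"

new_features = [
    (1, 900, "gene", {"locus_tag": "ABC_6000"}),
    (1, 900, "CDS", {"product": "hypothetical protein",
                     "protein_id": "gnl|WGS:XXXX|ABC_6000"}),
]
print(feature_table("contig1", new_features), end="")
```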

If you are only removing a few features, send us a list of the locus_tags for the features that need to be removed.

We will let you know if we find any issues when the update file is processed, e.g., if a CDS overlaps an rRNA feature. Email the files to [email protected] and include the WGS accession in the request.

Re-annotating existing records (adding or removing or moving features)

If the genome has been released with annotation and you want to update the annotation (but not change the sequences), create a new annotated submission as you did originally, and submit the update via the Genome Submission Portal. We will replace all of the existing annotation with the annotation in the new file.

The FASTA header in the update must include both the contig identifier (SeqID) used in the original submission and the contig accession number. The correct format for the identifiers of a WGS genome in such an update is:

gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx
gnl|WGS:XXXXXX|SeqID|gb|XXXXXX01xxxxxx

where XXXX (or XXXXXX) is the WGS accession prefix and XXXX01xxxxxx (or XXXXXX01xxxxxx) is the contig accession number. A file containing the contig identifiers and accession numbers was posted to the Submission Portal when the genome was released. The file is named xxxx_accs, where xxxx is the WGS accession prefix. Let us know if you need a copy of this file.
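The identifier format above can be assembled from a contig's SeqID and accession number, for instance with the Python sketch below. The function name is hypothetical, and the pattern used to extract the prefix (4 or 6 capital letters followed by digits) is an assumption based on the examples on this page. How the SeqID and accession pairs are read from the xxxx_accs file is left out, because the layout of that file is not described here.

```python
import re

# Assumed contig accession shape: 4- or 6-letter prefix, then digits.
CONTIG = re.compile(r"^([A-Z]{4}|[A-Z]{6})\d{2}\d+$")

def wgs_fasta_id(seq_id, contig_accession):
    """Return gnl|WGS:PREFIX|SeqID|gb|ACCESSION for one contig."""
    match = CONTIG.match(contig_accession)
    if match is None:
        raise ValueError(f"unexpected contig accession: {contig_accession!r}")
    prefix = match.group(1)
    return f"gnl|WGS:{prefix}|{seq_id}|gb|{contig_accession}"

# A FASTA header line for a contig originally submitted as "contig_7".
print(">" + wgs_fasta_id("contig_7", "ABCD01000007"))
```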

Please see the submission guide for instructions about how to generate a submission. In addition, if you are including annotation, be sure to read the prokaryotic or eukaryotic annotation guidelines.

In the standard situation for WGS genomes, annotation is not tracked from the previous version to the new version. The locus_tag prefix always remains the same; however, the locus_tags would need to be unique in the new annotation, both within the update and compared to the previous annotation. A simple way to ensure uniqueness is to use a different number of digits after the underscore in the locus_tag. For example, if the registered locus_tag prefix is ABC and the previous annotation has 4 digits after the underscore (ABC_0001), then the new annotation could have 5 digits (ABC_00001). Similarly, the protein_ids must be unique compared to the previous assembly. By using the locus_tag in the protein_id, this uniqueness could be maintained, for example:

gnl|WGS:XXXX|ABC_00001

where XXXX is the accession number prefix of the project.
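The digit-width approach above can be sketched in Python as follows. The function names are hypothetical. Note that a different width alone does not guarantee uniqueness in every case (a 4-digit series that ran past 9999 would already contain 5-digit numbers), so the sketch also compares the new tags against the old ones explicitly.

```python
def locus_tags(prefix, count, digits):
    """Generate prefix_0001 style tags with the given number of digits."""
    return [f"{prefix}_{i:0{digits}d}" for i in range(1, count + 1)]

def protein_id(wgs_prefix, locus_tag):
    """Build a protein_id that reuses the locus_tag as its identifier."""
    return f"gnl|WGS:{wgs_prefix}|{locus_tag}"

previous = set(locus_tags("ABC", 3, 4))   # ABC_0001 ... in the old annotation
updated = locus_tags("ABC", 3, 5)         # ABC_00001 ... in the update

clashes = previous.intersection(updated)
if clashes:
    raise ValueError(f"locus_tags reused from previous annotation: {sorted(clashes)}")

print(updated[0], protein_id("XXXX", updated[0]))
```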

Alternatively, you could track the annotation from the previous version to this update. Note that this is not required. Track both the locus_tags and protein_ids so that they are included when the gene/CDS is retained in the new annotation, even if the nucleotide location is modified slightly (e.g., the start codon is being extended upstream). To track the proteins, the protein_ids must have the format:

gnl|WGS:XXXX|SeqID|gb|accession_number

where XXXX is the accession number prefix of the project, SeqID is the protein SeqID (column 1 of the p2g file) and accession_number is the protein accession number (column 2 of the p2g file). You should have received a p2g file with the release letter for the genome. We can send this file again if you need it.

If you are adding a new protein, it would not have a protein accession number. You would need to use a new locus_tag that was not in the previous annotation and you would need to give the new protein a unique identifier (usually the same as the new locus_tag). For example, if you used ABC_6000 as the new locus_tag, you could use:

gnl|WGS:XXXX|ABC_6000
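The choice between the tracked format and the new-protein format can be made per protein, as in the Python sketch below. It is a sketch under assumptions: the function names are hypothetical, the p2g file is assumed to be whitespace-delimited with the bare protein SeqID in column 1 and the protein accession in column 2, and the accession shown is made up for the example.

```python
def load_p2g(lines):
    """Map protein SeqID (column 1) to protein accession (column 2)."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping

def protein_id(wgs_prefix, seq_id, p2g):
    """Tracked format if the protein is in the p2g file, else new format."""
    accession = p2g.get(seq_id)
    if accession is None:
        return f"gnl|WGS:{wgs_prefix}|{seq_id}"
    return f"gnl|WGS:{wgs_prefix}|{seq_id}|gb|{accession}"

p2g = load_p2g(["ABC_0001\tAAA00001"])     # made-up protein accession
print(protein_id("XXXX", "ABC_0001", p2g)) # retained protein
print(protein_id("XXXX", "ABC_6000", p2g)) # newly added protein
```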

Please include a summary of the expected protein fates (new proteins, same proteins, changed proteins, removed proteins) so we will know what to expect.
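A summary of protein fates could be drafted by comparing the previous and updated annotations, for example with the Python sketch below. The definitions it uses are assumptions made for illustration, not official ones: a protein is "same" if its locus_tag and translation are unchanged, "changed" if the locus_tag is kept but the translation differs, "new" if the locus_tag appears only in the update, and "removed" if it appears only in the previous annotation.

```python
def protein_fates(old, new):
    """Classify proteins given {locus_tag: translation} for old and new."""
    fates = {"new": [], "same": [], "changed": [], "removed": []}
    for tag, translation in new.items():
        if tag not in old:
            fates["new"].append(tag)
        elif old[tag] == translation:
            fates["same"].append(tag)
        else:
            fates["changed"].append(tag)
    fates["removed"] = [tag for tag in old if tag not in new]
    return {fate: sorted(tags) for fate, tags in fates.items()}

old = {"ABC_0001": "MKT", "ABC_0002": "MLS", "ABC_0003": "MAA"}
new = {"ABC_0001": "MKT", "ABC_0002": "MGLS", "ABC_6000": "MPQ"}
for fate, tags in protein_fates(old, new).items():
    print(f"{fate}: {len(tags)} {tags}")
```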

If you are modifying an existing protein (for example, just moving the start codon), then use the same locus_tag and protein_id that are in the previous annotation. The protein will also keep its protein accession number.

If you find that two adjacent proteins should be combined into a single protein and part of the translation stays the same, then choose one set of identifiers (locus_tag, protein_id, protein accession) from the previous annotation to use for the new protein, preferably the one that had the similar translation, and remove the other identifiers (or you could add the removed locus_tag to the /old_locus_tag qualifier and include a note explaining that two proteins were combined).

If you are completely changing a protein (for example, changing the reading frame) such that the new translation is completely different, then it would be a new protein with a new locus_tag and a new protein_id, and it would be assigned a new accession upon release into GenBank.

If you do remove a protein, then do not reuse its locus_tag, protein_id, or protein accession for a different protein. The identifiers are meant to represent a single unique feature and should not be moved to different proteins.

Please contact us at [email protected] if you have questions about generating the submission files, or about details of annotation.
