Validation Error Explanations for Genomes
This page has explanations for individual errors that are commonly found during processing of prokaryotic and eukaryotic genomes, along with suggestions to fix them. Write to [email protected] if you do not know how to correct the error in your submission.
Explanations of discrepancy report problems that are reported in the "discrep" file can be found at https://www.ncbi.nlm.nih.gov/genbank/asndisc#fatal and https://www.ncbi.nlm.nih.gov/genbank/asndisc.examples/
Remember that annotation is not required for genome submissions, and that you can request NCBI's Prokaryotic Genome Annotation Pipeline for your prokaryotic genome submissions. For more information about annotation, see the Prokaryotic Genome Annotation Guidelines or Eukaryotic Genome Annotation Guidelines.
Error List
- SEQ_FEAT_BadCharInAuthorLastName
- SEQ_FEAT_BadCharInAuthorName
- SEQ_DESCR_BadCollectionCode
- SEQ_DESCR_BadCollectionDate
- SEQ_DESCR.BadGeoLocNameCode
- SEQ_FEAT.BadProteinName: Unknown or hypothetical protein should not have EC number
- SEQ_INST.BadProteinStart
- SEQ_DESCR.BadVoucherID
- SEQ_DESCR.BioSourceMissing
- SEQ_FEAT.EcNumberProblem
- SEQ_FEAT.FeatureBeginsOrEndsInGap
- SEQ_FEAT.GenCodeInvalid
- SEQ_FEAT.GenCodeMismatch
- SEQ_FEAT.IllegalDbXref
- SEQ_DESCR.InconsistentMolInfoTechnique
- SEQ_INST.InternalNsInSeqRaw
- SEQ_FEAT.InternalStop
- SEQ_FEAT.InvalidInferenceValue: unrecognized database
- SEQ_FEAT.InvalidQualifierValue: rRNA has no name
- SEQ_DESCR.LatLonGeoLocName
- SEQ_DESCR.LatLonFormat
- SEQ_DESCR.LatLonProblem
- SEQ_DESCR.LatLonRange
- SEQ_DESCR.LatLonValue
- SEQ_FEAT.MisMatchAA
- GENERIC.MissingPubInfo: No submission citation anywhere on this entire record
- GENERIC.MissingPubInfo: Submission citation affiliation has no state
- GENERIC.MissingPubInfo
- GENERIC.MissingPubRequirement
- SEQ_FEAT.MissingTrnaAA
- SEQ_DESCR.NoOrgFound
- SEQ_DESCR.NoPubFound
- SEQ_DESCR.NoSourceDescriptor
- SEQ_FEAT.NoStop
- SEQ_FEAT.OnlyGeneXrefs
- SEQ_FEAT.PartialProblem PartialLocation: Start does not include first/last residue of sequence
- SEQ_FEAT.ShortIntron
- SEQ_INST.ShortSeq
- SEQ_FEAT.StartCodon
- SEQ_INST.StopInProtein
- SEQ_INST.TerminalNs
- SEQ_FEAT.TransLen
- SEQ_FEAT.UnknownFeatureQual: orig_protein_id
- SEQ_DESCR.UnstructuredVoucher
- SEQ_DESCR.UnwantedCompleteFlag
- SEQ_FEAT.WrongQualOnImpFeat
- SEQ_DESCR.WrongVoucherType
SEQ_FEAT_BadCharInAuthorLastName
Explanation : An author name has illegal characters.
Suggestion : Check the last names (family names) in the sequence and publication references. Use only plain ASCII text for the names. The last name should NOT contain symbols, numbers, accents, umlauts, characters with diacritical marks, and should NOT end in punctuation. Note that names with internal punctuation such as "St. John" or "D'Abaco" will validate.
examples:
incorrect: Henry Jones., Carlos Méndez, Xu 1Weng
corrected: Henry Jones, Carlos Mendez, Xu Weng
The terminal period, the accented character, and the leading number in these family names each cause an error. The error can be corrected by removing the symbols, diacritical marks, numbers, or punctuation.
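The rule above can be approximated with a quick pre-check before submission. This is only a sketch: the function name and the exact set of permitted internal punctuation (space, period, apostrophe, hyphen) are assumptions based on the examples on this page, not the validator's actual implementation.

```python
import re

# Assumption: plain ASCII letters, with optional internal spaces, periods,
# apostrophes, or hyphens. The name must start and end with a letter.
_NAME_PATTERN = re.compile(r"^[A-Za-z](?:[A-Za-z .'\-]*[A-Za-z])?$")


def last_name_looks_valid(name: str) -> bool:
    """Return True if the name passes the approximate ASCII-only check."""
    return bool(_NAME_PATTERN.match(name))


for candidate in ["Jones.", "Méndez", "1Weng", "Jones", "St. John", "D'Abaco"]:
    print(candidate, last_name_looks_valid(candidate))
```

Names that fail this check are worth reviewing by hand; a name that passes may still be rejected for reasons this sketch does not cover.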
SEQ_FEAT_BadCharInAuthorName
Explanation : An author name has illegal characters.
Suggestion : Check the first names (given names) in the sequence and publication references. Use only plain ASCII text for the names. The names should NOT contain symbols, numbers, accents, umlauts, characters with diacritical marks, and should NOT end in punctuation. Note that names with internal punctuation such as "St. John" or "D'Abaco" or "Doe-Smith" are okay.
examples:
incorrect: J#ane Doe, José Perez, 1Xu Weng
corrected: Jane Doe, Jose Perez, Xu Weng
The use of symbols and numbers causes an error. The error can be corrected by removing the symbols, characters with diacritical marks, numbers, or punctuation.
SEQ_DESCR.BadCollectionCode
Explanation: The culture collection is not in the list of registered institutes, or is in the wrong format, or there are multiple culture-collections in a single qualifier.
Suggestion: See the description for the proper format and list of allowed institutes, https://www.insdc.org/controlled-vocabulary-culturecollection-qualifier. Include only the culture-collection from which the sample was obtained. If the sample was deposited into multiple culture-collections, then present each culture-collection in a separate qualifier. If the culture collection is not in the list of allowed institutes, write to us with details of the culture collection. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
Note that culture-collection should be used for microbial sequences, while specimen-voucher should be used for plants and animals only. However, do not use specimen-voucher to describe host information for a microbial sequence submission.
SEQ_DESCR_BadCollectionDate
Explanation: The collection date is not in the required format.
Suggestion: Correct the collection-date source modifier so the date is in the correct format. For example, a collection-date should be formatted like this: DD-MMM-YYYY, where the month is the three-letter code in English. Alternatively, the ISO 8601 standard may be used; see descriptions and examples on the INSDC Feature Table page. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
Examples of correctly formatted collection-dates:
01-Jul-1999
Nov-2010
2008
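As an illustration, the DD-MMM-YYYY family of formats shown in these examples could be pre-checked as follows. This is a minimal sketch with a hypothetical function name. It does not cover the ISO 8601 alternative, so valid ISO 8601 dates return False here, and it does not verify that the day exists in the given month.

```python
import re

# English three-letter month codes, as required by the format above.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Accepts DD-Mmm-YYYY, Mmm-YYYY, or YYYY.
_DATE_PATTERN = re.compile(r"^(?:(?:(\d{2})-)?([A-Z][a-z]{2})-)?(\d{4})$")


def matches_dd_mmm_yyyy(value: str) -> bool:
    """Return True for dates such as 01-Jul-1999, Nov-2010, or 2008."""
    match = _DATE_PATTERN.match(value)
    if match is None:
        return False
    day, month, _year = match.groups()
    if month is not None and month not in MONTHS:
        return False
    if day is not None and not 1 <= int(day) <= 31:
        return False
    return True


for candidate in ["01-Jul-1999", "Nov-2010", "2008", "July 1, 1999"]:
    print(candidate, matches_dd_mmm_yyyy(candidate))
```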
SEQ_DESCR_BadGeoLocNameCode
Explanation: The geographic location name (geo_loc_name), up to the first colon, is not on the approved list of geographic location names.
Suggestion: Correct the geo_loc_name source modifier with a geographic location name on the approved geographic location name list and verify the geo_loc_name value is correctly formatted. If you want to include more specific location information, you must place the approved geographic location name first, followed by a colon and then the additional information. The geo_loc_name has a specific format and must be formatted as follows:
<approved geographic location name>: <region or specific area>
Examples:
Iceland
Canada: Vancouver
Atlantic Ocean: Charlie Gibbs Fracture Zone
If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
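The "up to the first colon" rule can be sketched as below. The APPROVED_NAMES set is a deliberately tiny, hypothetical stand-in that contains only the names used in the examples above; a real check would load the full approved list.

```python
# Hypothetical stand-in for the approved geographic location name list.
APPROVED_NAMES = {"Iceland", "Canada", "Atlantic Ocean"}


def geo_loc_name_is_approved(value: str, approved=APPROVED_NAMES) -> bool:
    """Check the part of geo_loc_name before the first colon against the list."""
    name, _sep, _detail = value.partition(":")
    return name.strip() in approved


print(geo_loc_name_is_approved("Canada: Vancouver"))
print(geo_loc_name_is_approved("Vancouver, Canada"))
```

The second value fails because the specific area comes first and no approved name precedes a colon.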
SEQ_FEAT.BadProteinName: Unknown or hypothetical protein should not have EC number
Explanation: The product name is "hypothetical protein" and there is an EC number.
Suggestion: If this really is a hypothetical protein, simply remove the EC number. If the EC number is correct, use that to provide a valid product name.
SEQ_DESCR.BadVoucherID
Explanation: The voucher is missing a specific identifier.
Suggestion: Correct the format of the culture-collection or specimen-voucher source modifiers. The culture-collection or specimen-voucher is missing the identifier. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
The culture-collection must be formatted like this: <institution-code>:[<collection-code>:]<culture id>. The institution code and culture ID are required, the collection-code is optional. The institution code must be valid. See the description for the proper format and list of all allowed institutions.
An example culture-collection is: CBS:1234
Culture-collection should be used for microbial sequences, while specimen-voucher should be used for plants and animals only. Do not use specimen-voucher to describe host information for a microbial sequence submission. The specimen-voucher is not required to be structured.
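To make the structure of the culture-collection value concrete, here is a small parsing sketch. The function name is hypothetical, it assumes the culture ID itself contains no colon, and it does not check whether the institution code is on the list of allowed institutions.

```python
def parse_culture_collection(value: str):
    """Split <institution-code>:[<collection-code>:]<culture id> into its parts.

    Returns (institution, collection_or_None, culture_id) or raises ValueError.
    """
    parts = [part.strip() for part in value.split(":")]
    if len(parts) == 2:
        institution, collection, culture_id = parts[0], None, parts[1]
    elif len(parts) == 3:
        institution, collection, culture_id = parts
    else:
        raise ValueError(f"expected 2 or 3 colon-separated fields: {value!r}")
    # The institution code and culture ID are required; an empty
    # collection-code between two colons is also treated as an error here.
    if not institution or not culture_id or collection == "":
        raise ValueError(f"missing institution code or culture ID: {value!r}")
    return institution, collection, culture_id


print(parse_culture_collection("CBS:1234"))
```

A value such as "CBS" (no identifier) raises ValueError in this sketch, which corresponds to the missing-identifier situation described above.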
SEQ_DESCR_BioSourceMissing
Explanation: The biological source of this sequence has not been described correctly. A submission must have a source descriptor that covers the entire molecule. Please add the source information.
Suggestion: Provide an organism name for each sequence in your submission.
SEQ_FEAT.EcNumberProblem
Explanation: Apparent EC number in protein title. A product name includes a value that looks like an EC number, e.g.: "L-pipecolate oxidase (1.5.3.7)".
Suggestion: Remove the EC number from the product name and field it in the EC_number qualifier. If it is something else, e.g. a TC number, then move it to a note.
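A product name can be scanned for EC-like values before submission. The pattern below is an assumption about what "looks like an EC number" (four dot-separated fields, where later fields may be "-"); the validator's own test may differ, and a match may turn out to be something else, such as a TC number.

```python
import re

# Assumed shape of an EC-like value: digits, then three more fields
# that are either digits or "-", separated by periods.
_EC_LIKE = re.compile(r"(?<![\w.])\d+\.(?:\d+|-)\.(?:\d+|-)\.(?:n?\d+|-)(?![\w.])")


def find_ec_like(product_name: str) -> list:
    """Return every EC-like substring found in a product name."""
    return _EC_LIKE.findall(product_name)


print(find_ec_like("L-pipecolate oxidase (1.5.3.7)"))  # ['1.5.3.7']
print(find_ec_like("hypothetical protein"))            # []
```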
SEQ_FEAT.FeatureBeginsOrEndsInGap
Explanation: A feature begins or ends in a gap.
Suggestion: Remove the feature or adjust its location to be partial and abut the gap, whichever is appropriate.
SEQ_FEAT.GenCodeInvalid and SEQ_FEAT.GenCodeMismatch
Explanation: The genetic code seems to be invalid or incorrect.
Suggestion: If the organism is a prokaryote, then include -j "[gcode=11]" in the command line to force the use of the prokaryotic genetic code. If the organism is not a prokaryote, then you can ignore this error and we will address it during processing.
SEQ_FEAT.IllegalDbXref
Explanation: The database in the db_xref has the wrong abbreviation or is not one of the allowed databases.
Suggestion: If the database that you are using is not one of the allowed databases, then change the db_xref to a note. (However, do not use GI as a db_xref because that is an internal technical database.)
SEQ_DESCR.InconsistentMolInfoTechnique
Explanation: A WGS accession appears to be present but the wgs technique is not set.
Suggestion: You can ignore this error and we will address it during processing. However, you can quiet the error yourself if you wish by including -j "[tech=wgs]" in the command line.
SEQ_INST.InternalNsInSeqRaw
Explanation: A sequence has a run of 100 or more Ns, which is most likely a gap, not a run of ambiguous bases.
Suggestion: Label the runs of Ns as assembly_gaps. Choose a smaller length threshold (e.g. 1 or 10) so that runs of Ns are converted to assembly_gap features with the appropriate linkage evidence. Do not simply remove internal Ns.
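To see where such runs occur in a sequence, something like the following sketch can help. The function name is hypothetical; it reports 1-based inclusive coordinates and only locates the runs, it does not create assembly_gap features.

```python
import re


def find_n_runs(sequence: str, min_len: int = 100):
    """Return (start, end) for each run of at least min_len Ns (1-based, inclusive)."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


contig = "ACGT" + "N" * 100 + "ACGT"
print(find_n_runs(contig))  # [(5, 104)]
```

Passing a smaller min_len, such as 10, lists the shorter runs that would be picked up by a lower conversion threshold.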
SEQ_FEAT.InternalStop and SEQ_INST.StopInProtein
Explanation: The InternalStop and StopInProtein errors are produced when there is an internal stop codon within the CDS.
Suggestion: The problem could be the genetic code, the location of the CDS, the reading frame of the CDS, or that the CDS cannot produce an error-free translation. Use the correct genetic code to get the correct translations. For example, include [gcode=11] for prokaryotic genome submissions. If the genetic code is correct, then adjust the CDS location, if possible. If the CDS is partial at its 5' end, then you might need to add a codon_start qualifier with a value of 2 or 3 to shift the reading frame one or two bases, respectively. If the CDS does not have an error-free translation, then add the /pseudo qualifier to the gene to indicate that the CDS cannot be translated.
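The effect of codon_start on a 5' partial CDS can be illustrated with a toy translation. This is a sketch, not how the validator works: the helper names are hypothetical, the sequence is made up, and the amino acid assignments used are the ones shared by the standard code and genetic code 11.

```python
BASES = "TCAG"
# Amino acids in TCAG codon order (same assignments for codes 1 and 11).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str, codon_start: int = 1) -> str:
    """Translate from the given codon_start (1, 2, or 3); unknown codons become X."""
    seq = cds.upper()[codon_start - 1:]
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def internal_stop_positions(protein: str):
    """Return 1-based positions of stops other than a terminal one."""
    body = protein[:-1] if protein.endswith("*") else protein
    return [i + 1 for i, aa in enumerate(body) if aa == "*"]


partial_cds = "TAAACCGGGTTAA"  # made-up 5' partial CDS
print(translate(partial_cds), internal_stop_positions(translate(partial_cds)))
print(translate(partial_cds, 2), internal_stop_positions(translate(partial_cds, 2)))
```

In this made-up case the default frame gives "*TGL" with an internal stop at position 1, while codon_start=2 gives "KPG*" with no internal stop.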
SEQ_FEAT.InvalidQualifierValue: rRNA has no name
Explanation: rRNA features must have a product name.
Suggestion: Use the appropriate full product name for each rRNA feature, e.g. "16S ribosomal RNA"
SEQ_FEAT.InvalidInferenceValue: unrecognized database
Explanation: The database in the structured inference qualifier is not one of the expected ones.
Suggestion: See the instructions for evidence qualifiers and use one of listed acronyms. If the database that you are referring to is not on the list, then consider including the information as a /note, rather than an /inference.
SEQ_DESCR_LatLonGeoLocName
Explanation: lat_lon and geographic location (geo_loc_name) disagree
Suggestion: The latitude-longitude (lat-lon) value provided does not map to the source geographic location (geo_loc_name) provided, so correct or remove the lat-lon values and/or geo_loc_name source modifiers. Provide lat-lon in decimal degrees with the compass direction (for example: 39.7 N 42.1 W) and check that the lat-lon coordinates map to the geographic location you have provided. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
SEQ_DESCR_LatLonFormat
Explanation: The format of lat-lon should be dd.dd N|S ddd.dd E|W.
Suggestion: Correct the latitude-longitude (lat-lon) source modifier with lat-lon coordinates in decimal degree format with the compass directions. For example: 39.7 N 42.1 W. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
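The decimal-degree format with compass directions can be checked with a pattern like the one below. This is a sketch under the assumption that a single space separates the four fields; the function name is hypothetical and no range checking is done here.

```python
import re

# Assumed layout: <latitude> <N|S> <longitude> <E|W>, decimal degrees.
_LAT_LON = re.compile(r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([EW])$")


def parse_lat_lon(value: str):
    """Return (lat, 'N'|'S', lon, 'E'|'W'), or None if the format does not match."""
    match = _LAT_LON.match(value.strip())
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    return float(lat), ns, float(lon), ew


print(parse_lat_lon("39.7 N 42.1 W"))  # (39.7, 'N', 42.1, 'W')
print(parse_lat_lon("-39.7 42.1"))     # None
```

Signed values without compass directions, or degrees-minutes-seconds notation, do not match and would need to be converted first.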
SEQ_DESCR_LatLonProblem
Explanation: There is a problem with the lat-lon modifier provided.
Suggestion: Correct or remove the latitude-longitude (lat-lon) values in the source modifiers. Provide lat-lon in decimal degrees and include the compass direction (for example, 39.7 N 42.1 W). Longitude values range from 0 to 180E or 0 to 180W. Latitude values range from 0 to 90 N or 0 to 90 S. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
SEQ_DESCR_LatLonRange
Explanation: Latitude or longitude is out of range.
Suggestion: Correct or remove the latitude-longitude (lat-lon) values in the source modifiers. Provide lat-lon in decimal degrees and include the compass direction (for example, 39.7 N 42.1 W). Longitude values range from 0 to 180E or 0 to 180W. Latitude values range from 0 to 90 N or 0 to 90 S. Numbers outside of these ranges will cause errors. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
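The ranges stated above translate directly into a small check. This sketch uses a hypothetical function name and takes the unsigned numeric values; the hemisphere is carried by the compass letters and is not examined here.

```python
def lat_lon_range_problems(latitude: float, longitude: float):
    """Return a list of range problems; an empty list means both values are in range."""
    problems = []
    if not 0 <= latitude <= 90:
        problems.append("latitude must be between 0 and 90")
    if not 0 <= longitude <= 180:
        problems.append("longitude must be between 0 and 180")
    return problems


print(lat_lon_range_problems(39.7, 42.1))  # []
print(lat_lon_range_problems(95.0, 42.1))  # ['latitude must be between 0 and 90']
```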
SEQ_DESCR_LatLonValue
Explanation: Latitude or longitude values appear to be in the wrong hemisphere or swapped.
Suggestion: Correct or remove the latitude-longitude (lat-lon) values in the source modifiers. The lat-lon value for the record does not agree with the source geographic location (geo_loc_name) provided. Based on the source geographic location, the lat-lon value appears to have the incorrect hemisphere or is swapped. Check the coordinates and compass direction and provide the correct values. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
SEQ_FEAT.MisMatchAA
Explanation: The conceptual translation does not match the provided translation.
Suggestion: Make the CDS partial if it does not begin at the start codon (and extend to end of the sequence for incomplete prokaryotic sequence). Set the genetic code of prokaryotes ([gcode=11]) to get the correct translations.
GENERIC.MissingPubInfo: No submission citation anywhere on this entire record
Explanation: There is no submitter block.
Suggestion: Include the template when you create the .sqn submission file. You can create a template here: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ .
GENERIC.MissingPubInfo: Submission citation affiliation has no state
Explanation: The country is USA, but the state is not included in the affiliation in the submitter block.
Suggestion: Include the state in the template file (for .sqn submissions) or your submission portal profile (for fasta submissions).
GENERIC_MissingPubInfo
Explanation: The publication is missing essential information, such as title or authors.
Suggestion: Check the references. Provide author names, a title, and select the publication status (unpublished, in press, or published). If the title is published or is in press, provide additional information including publication year, journal, volume, and pages, where applicable.
GENERIC_MissingPubRequirement
Explanation: The REFERENCE that includes the submitter information is missing.
Suggestion: Make the template file and call it with the -t argument in the command line: -t template.sbt
SEQ_FEAT.MissingTrnaAA
Explanation: The amino acid that the tRNA carries is not included.
Suggestion: Include the amino acid as the product of the tRNA. If the amino acid of a tRNA is unknown, use tRNA-Xxx as the product. See prokaryotic examples and eukaryotic examples.
SEQ_DESCR.NoOrgFound
Explanation: No organism name is included.
Suggestion: Include the organism information when creating the .sqn file. When running table2asn (or tbl2asn), the organism information can be included in the definition lines of the fasta files or in the command line with -j.
SEQ_DESCR.NoPubFound
Explanation: There is no submitter block or other reference.
Suggestion: Include the template when you create the .sqn submission file. You can create a template here: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ .
SEQ_DESCR.NoSourceDescriptor
Explanation: There is no source information included.
Suggestion: Include the source by including the information in the fasta headers OR the -j argument in the command line. See the available source modifiers. For a genome you only need to include the organism and strain (for microbes) or organism and breed/ecotype/cultivar and isolate for plants and animals because the information in the BioSample will be added to the genome.
SEQ_FEAT.NoStop
Explanation: The CDS is not marked as partial at its 3′ end and does not end with a stop codon.
Suggestion: Extend the CDS to the stop codon, or mark the 3′ end as partial (and extend the CDS to the end of the sequence for prokaryotic sequences), or add the /pseudo qualifier to the gene to indicate that the CDS cannot be translated.
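Whether a CDS nucleotide sequence ends with a stop codon can be checked with a few lines. This is a sketch with a hypothetical function name; the stop codon set assumes the standard code or genetic code 11, and other genetic codes use different stops.

```python
# Stop codons in the standard genetic code and in genetic code 11.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def ends_with_stop(cds: str, codon_start: int = 1) -> bool:
    """Return True if the in-frame CDS is a whole number of codons ending in a stop."""
    seq = cds.upper()[codon_start - 1:]
    return len(seq) >= 3 and len(seq) % 3 == 0 and seq[-3:] in STOP_CODONS


print(ends_with_stop("ATGAAATAA"))  # True
print(ends_with_stop("ATGAAAGGT"))  # False
```

A CDS that fails this check is a candidate for extending to the stop codon, marking as 3′ partial, or flagging the gene as pseudo, as described above.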
SEQ_FEAT.OnlyGeneXrefs
Explanation: Features, such as CDS, refer to genes but there are no corresponding gene features.
Suggestion: Include gene features with a unique locus_tag on each gene.
SEQ_FEAT.PartialProblem PartialLocation: Start does not include first/last residue of sequence
Explanation: Since prokaryotes have very little splicing, their features need to be complete or to extend to the end of the sequence and be partial. In eukaryotes this error can be ignored if the partial is at an intron/exon boundary.
Suggestion: Extend the feature one or a few bases to the end of the sequence. If the feature is complete, remove the partial symbols. If this is only a fragment or a nonfunctional gene, change the feature's location to be complete and add the /pseudo qualifier to the gene.
SEQ_FEAT.ShortIntron
Explanation: The CDS contains an intron shorter than 11bp, which is generally not biologically correct and is usually included to adjust for a frameshift in the sequence.
Suggestion: If the gene is frameshifted but not a pseudogene, then annotate a single gene feature across the entire span and include a pseudo qualifier to indicate that the gene is broken and cannot be translated as expected. In addition, you could include a brief note explaining the problem. If the gene is an actual pseudogene, then add the pseudogene qualifier and the appropriate TYPE to the single gene feature. Alternatively, you can include "-c s" in the table2asn command line, in which case the CDS will have a translation but it will also have the qualifier /artificial_location="low-quality sequence region" and the protein definition line will be prefaced with "LOW QUALITY PROTEIN:"
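Short introns can be located from the CDS intervals before validation. The sketch below uses a hypothetical function name and assumes plus-strand intervals given as 1-based inclusive (start, end) pairs; the 11 bp threshold is taken from the explanation above.

```python
def short_introns(exons, min_len: int = 11):
    """Return (intron_start, intron_end, length) for introns shorter than min_len."""
    ordered = sorted(exons)
    found = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        length = next_start - prev_end - 1
        if 0 < length < min_len:
            found.append((prev_end + 1, next_start - 1, length))
    return found


print(short_introns([(1, 100), (103, 200)]))  # [(101, 102, 2)]
print(short_introns([(1, 100), (201, 300)]))  # []
```

A 1 or 2 bp "intron" such as the first case usually signals that the intervals were split to compensate for a frameshift.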
SEQ_INST.ShortSeq
Explanation: This warning is triggered by proteins that are shorter than ten amino acids.
Suggestion: This is probably fine and will not cause problems with your submission, but you should investigate and decide whether you think these really exist. If there are lots of them and they are just short ORF calls by the annotation tool, then we recommend that you remove them unless you think that they are real.
SEQ_FEAT.StartCodon and SEQ_INST.BadProteinStart
Explanation: The StartCodon and BadProteinStart errors are produced when the CDS is not marked as partial at its 5′ end and does not begin with a start codon.
Suggestion: Use the correct genetic code to get the correct translations. For example, include [gcode=11] for prokaryotic genome submissions. Other possible fixes include: extend the CDS to the start codon, or mark the 5′ end as partial (and extend the CDS to the end of the sequence for prokaryotic sequences), or add the /pseudo qualifier to the gene to indicate that the CDS cannot be translated.
SEQ_INST.TerminalNs
Explanation: There are Ns at the beginning or end of the sequence.
Suggestion: Remove Ns from the beginning and end of the sequence or indicate that the sequence is circular, if that is applicable.
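Trimming terminal Ns is straightforward; a minimal sketch with a hypothetical function name follows. It leaves internal Ns untouched and reports how many bases were removed from each end, which matters if feature coordinates have to be shifted afterwards.

```python
def trim_terminal_ns(sequence: str):
    """Return (trimmed_sequence, n_removed_from_start, n_removed_from_end)."""
    leading = len(sequence) - len(sequence.lstrip("Nn"))
    trailing = len(sequence) - len(sequence.rstrip("Nn"))
    return sequence.strip("Nn"), leading, trailing


print(trim_terminal_ns("NNACGTNACGTN"))  # ('ACGTNACGT', 2, 1)
```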
SEQ_FEAT.TransLen
Explanation: The length of the protein does not match the provided protein length
Suggestion: Recreate the .sqn file and if the error persists, send your file to us with a description of how you created it and a request to help fix the error.
SEQ_FEAT.UnknownFeatureQual: orig_protein_id
Explanation: An older version of table2asn_GFF was used to convert a GFF file that had pseudo=true or pseudogene=true
Suggestion: Download the current version of table2asn_GFF and use it to make your submission.
SEQ_DESCR.UnstructuredVoucher
Explanation: The voucher needs to be structured as "<institution-code>:[<collection-code>:]<culture id>".
Suggestion: Correct the format of the culture-collection source modifier. The institution code and culture ID are required, the collection-code is optional. Follow the formatting instruction in the explanation. The culture collection must have a valid institution code followed by a colon and the culture ID. See the list of allowed institutes. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
For example CBS:1234
In this example, CBS is the institution code and 1234 is the culture ID. There must be a colon between the institution code and the culture ID.
SEQ_DESCR.UnwantedCompleteFlag
Explanation: The sequence is listed as complete, but there is missing information elsewhere in the record
Suggestion: You can ignore this error when you have submitted a complete chromosome or plasmid or organelle.
SEQ_FEAT.WrongQualOnImpFeat
Explanation: The feature has an illegal qualifier
Suggestion: Find the legal qualifiers for each feature in the Feature Table.
SEQ_DESCR_WrongVoucherType
Explanation: The institution (or institution: collection) code normally uses a different bio material/culturecollection/specimen voucher type.
Suggestion: In the source modifiers, use the source modifier "culture-collection" instead of "specimen-voucher" or vice versa. For example, if you provided the source modifiers in a tab-delimited table, edit the table so the column header "culture-collection" is used in place of "specimen-voucher" and upload the revised table. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at [email protected], and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
Note that culture-collection should be used for microbial sequences, while specimen-voucher should be used for plants and animals only. Do not use specimen-voucher to describe host information for a microbial sequence submission.