Genome Notes

Genome Notes, displayed on individual Genome web pages, are applied to genome assemblies based on analyses performed by NCBI. They alert users that an assembly may not be suitable in particular cases. Most assemblies with warnings and other comments are excluded from the RefSeq collection. Atypical assembly genome notes are shown as a warning at the top of genome pages, while all genome notes are provided in the Genome Notes section. Genomes in the atypical category in the list below can be excluded from the Genome Table by checking the “Exclude atypical genomes” checkbox (for example, human).

Atypical Assemblies

Note: These may be excluded from the Genome Table by checking the “Exclude atypical genomes” checkbox.

  • Chimeric—The genome assembly contains sequences from two different organisms that are joined together.

  • Contaminated—The genome assembly contains sequences from other organisms, cloning vectors, linkers, adapters, or primers. See Contamination Screening for more information about how we assign genomes as contaminated.

  • Fragmented assembly—A prokaryotic assembly with contig L50 above 500, contig N50 below 5,000, or with more than 2,000 contigs is considered fragmented.

  • Genome length too large—The total ungapped sequence length of the assembly is more than 1.5 times that of the average for the genomes in the assembly resource from the same species, more than 15 Mbp, or is otherwise suspiciously long.

  • Genome length too small—The total ungapped sequence length of the assembly is less than half that of the average for the genomes in the same species, less than 300 Kbp, or is otherwise suspiciously short.

  • Hybrid—Genome assembly sequences are from a hybrid between different species, strains, or isolates.

  • Low quality sequence—Long stretches of the sequence have a high proportion of ambiguous bases, are low complexity, or provide some other indication that the sequence quality is low.

  • Misassembled—Alignment to related genome assemblies or other evidence indicates the genome assembly is likely to contain errors.

  • Partial—The genome assembly contains a sequence for only part of the DNA found in a typical cell, e.g., one chromosome out of twenty.

  • Sequence duplications—The genome assembly contains one or more large duplications.

  • Unverified source organism—Quality analysis demonstrates the taxonomic assignment of the genome assembly is incorrect.
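
The numeric criteria in the list above (fragmented assembly, genome length too large, genome length too small) can be sketched as simple threshold checks. This is only an illustration, not NCBI's implementation: the function and parameter names are hypothetical, the "otherwise suspiciously long/short" clauses are not modeled because the source gives no rule for them, and Mbp and Kbp are assumed to mean 1,000,000 and 1,000 base pairs.

```python
from typing import Optional


def is_fragmented(contig_l50: int, contig_n50: int, contig_count: int) -> bool:
    """Fragmented if contig L50 > 500, contig N50 < 5,000, or > 2,000 contigs."""
    return contig_l50 > 500 or contig_n50 < 5_000 or contig_count > 2_000


def length_note(ungapped_bp: int,
                species_mean_bp: Optional[float] = None) -> Optional[str]:
    """Return the genome-length note that applies, or None.

    species_mean_bp is the average ungapped length for the species
    (hypothetical input); when it is None only the absolute limits are used.
    """
    too_large = ungapped_bp > 15_000_000  # more than 15 Mbp
    too_small = ungapped_bp < 300_000     # less than 300 Kbp
    if species_mean_bp is not None:
        too_large = too_large or ungapped_bp > 1.5 * species_mean_bp
        too_small = too_small or ungapped_bp < 0.5 * species_mean_bp
    if too_large:
        return "Genome length too large"
    if too_small:
        return "Genome length too small"
    return None
```

For example, under these assumptions an 8 Mbp assembly for a species averaging 5 Mbp would be noted as too large, because 8 Mbp exceeds 1.5 times 5 Mbp.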

Assemblies Derived from Atypical Source Material

  • Derived from metagenome—The genomic sequence was assembled from metagenomic sequencing rather than a pure culture. A small number of these genomes are included in RefSeq when estimated to be free of contaminants, for species with fewer than 50 non-MAGs, and with Taxonomy check status “OK” or “Inconclusive” with best match status “below-threshold match”.

  • Derived from single cell—The source material for the assembly was amplified from a single cell, resulting in concerns about genome sequence accuracy.

  • From large multi-isolate project—The assembly is one of over 100 assemblies for multiple isolates of the same species generated by the same project. Typically, these are pathogen surveillance projects.

  • Genus undefined—The lineage does not include a genus, and therefore the precise taxonomic placement is uncertain. An exception is made for symbionts.

  • Metagenome—The assembly is derived from a sample consisting of a mixture of unidentified organisms.

  • Missing strain identifier—The prokaryote assembly lacks both strain and isolate identifiers in the appropriate field. Exceptions are made for symbionts and phytoplasmas.

  • Mixed culture—The genome assembly is derived from a co-culture of multiple organisms.

  • Not used as type—The assembly is derived from a type specimen, but it does not meet the criteria for a type-strain assembly that can be used in ANI analysis.

Assemblies for which PGAP Produces Atypical Annotation Results

  • Abnormal gene to sequence ratio—The NCBI PGAP predicts too many or too few genes. The gene-to-sequence ratio is calculated as the number of genes of any type per kb of sequence. The typical range of the ratio is 0.8 to 1.2, and anything outside the 0.5 to 1.5 range is considered abnormal.

  • Annotation fails completeness check—The percent completeness, as estimated by the “CheckM” run on the NCBI PGAP protein predictions, is too low. For species with more than 1,000 genomes, the cutoff is three standard deviations below the average completeness for the species. For species with 100-1,000 genomes, the cutoff is 90% or three standard deviations below the average for the species, whichever is smaller.

  • Annotation fails MAG completeness check—The percent completeness, as estimated on the NCBI PGAP protein predictions by the “CheckM” run, is below 90%. This applies to metagenome-assembled genomes (MAGs) only.

  • Low gene count—The number of genes predicted by NCBI PGAP is much lower than expected when compared to other high-quality genome assemblies for the same species.

  • Many frameshifted proteins—The percentage of protein-coding genes with frameshifts, as determined by NCBI PGAP, is more than three standard deviations from the species average or 5% of annotated genes, whichever is larger, if the genome is for a species with more than 10 genomes; or the percentage of protein-coding genes with frameshifts is above 30%.

  • Missing rRNA genes—The NCBI PGAP failed to find at least one copy each of the 5S, 16S, and 23S rRNA genes. This applies to complete genomes only.

  • Missing tRNA genes—The NCBI PGAP failed to find tRNA genes with anticodons for two or more of the expected 20 amino acids. This applies to complete genomes only.

  • RefSeq annotation failed—The annotated genome assembly does not meet RefSeq standards for reasons other than those listed in this article, typically related to metadata errors or inconsistencies.
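
Three of the annotation criteria above are stated as numeric rules and can be sketched in code: the gene-to-sequence ratio, the completeness check, and the frameshift check. The following is a rough illustration under stated assumptions, not the PGAP implementation. All function and parameter names are hypothetical. "Three standard deviations from the species average" is read here as above the average for frameshifts, and the 30% frameshift limit is assumed to apply regardless of how many genomes the species has. No completeness rule is given for species with fewer than 100 genomes, so the sketch never flags them.

```python
def gene_to_sequence_ratio(gene_count: int, sequence_bp: int) -> float:
    """Genes of any type per kb of sequence."""
    return gene_count / (sequence_bp / 1000)


def is_abnormal_gene_ratio(gene_count: int, sequence_bp: int) -> bool:
    """Abnormal when the ratio falls outside the 0.5 to 1.5 range."""
    ratio = gene_to_sequence_ratio(gene_count, sequence_bp)
    return ratio < 0.5 or ratio > 1.5


def fails_completeness(completeness_pct: float, species_genomes: int,
                       species_mean: float, species_sd: float) -> bool:
    """Completeness check for non-MAG assemblies."""
    cutoff = species_mean - 3 * species_sd
    if species_genomes > 1000:
        return completeness_pct < cutoff
    if species_genomes >= 100:
        # 90% or mean - 3 SD, whichever is smaller
        return completeness_pct < min(90.0, cutoff)
    return False  # assumption: no rule stated below 100 genomes


def has_many_frameshifts(frameshift_pct: float, species_genomes: int,
                         species_mean: float, species_sd: float) -> bool:
    """Frameshift check on the percentage of protein-coding genes."""
    if frameshift_pct > 30.0:  # assumed to apply to any species
        return True
    if species_genomes > 10:
        # mean + 3 SD or 5%, whichever is larger
        return frameshift_pct > max(species_mean + 3 * species_sd, 5.0)
    return False
```

As a usage example, an assembly of 4,000,000 bp with 4,000 predicted genes has a ratio of 1.0 genes per kb, which sits inside the typical 0.8 to 1.2 range and would not be flagged.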

Generated November 25, 2024