Selecting Reference Genomes

Prokaryotes

For each defined species with assemblies included in RefSeq, one assembly is designated as the ‘reference’. Reference genomes provide a compact, normalized, and taxonomically diverse view of the RefSeq collection that can be used for the taxonomic identification and characterization of novel sequences. References are assigned only to species with formal names accepted under the International Code of Nomenclature of Prokaryotes, which is governed by the International Committee on Systematics of Prokaryotes, or to species whose names have conventions for formal use. The latter include Candidatus species and names that are only “effectively published”, i.e., published somewhere other than the journal of record, the International Journal of Systematic and Evolutionary Microbiology. No references are selected for undefined species such as ‘Vibrio sp.’.

Among species in scope, the following assemblies are taken into consideration in the selection of references:

  • Live RefSeq assemblies - not superseded by newer assemblies or suppressed due to quality or taxonomic misassignment concerns
  • Assemblies that pass Average Nucleotide Identity (ANI) criteria - an assembly passes if it 1) is from type material, or 2) is not from type material but either a) matches the type material assemblies for its species at above 70% coverage, or b) in the absence of type material for the species, is not flagged as a mismatch, meaning a match at or above the ANI thresholds, with at least 70% query and subject coverage, to a type assembly from a different species.
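
The eligibility rules above can be sketched as a simple filter. This is a minimal illustration, not RefSeq's implementation: the class and field names are hypothetical, and it assumes that an assembly must satisfy both bullets (live status and the ANI criteria) to be considered.

```python
from dataclasses import dataclass

MIN_COVERAGE = 70.0  # percent, taken from the text above


@dataclass
class AniStatus:
    # All field names are hypothetical, chosen for this illustration only.
    is_live: bool                    # not superseded or suppressed
    from_type_material: bool
    species_has_type_assembly: bool
    matches_own_type: bool           # ANI match to the species' own type assembly
    coverage_to_own_type: float      # percent coverage of that match
    mismatch_to_other_type: bool     # ANI match to a type of a different species
    query_coverage: float            # percent, for the mismatch above
    subject_coverage: float          # percent, for the mismatch above


def is_eligible(a: AniStatus) -> bool:
    """Return True if the assembly may be considered as a reference."""
    if not a.is_live:
        return False
    if a.from_type_material:
        return True
    if a.species_has_type_assembly:
        # Rule 2a: must match the species' type material at above 70% coverage.
        return a.matches_own_type and a.coverage_to_own_type > MIN_COVERAGE
    # Rule 2b: no type material for the species, so only a well-supported
    # match to another species' type disqualifies the assembly.
    confirmed_mismatch = (
        a.mismatch_to_other_type
        and a.query_coverage >= MIN_COVERAGE
        and a.subject_coverage >= MIN_COVERAGE
    )
    return not confirmed_mismatch
```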

A reference genome is chosen among eligible assemblies based on the criteria below, in order of importance. Criteria that are lower on the list are only used if assemblies are judged equal based on higher-ranked criteria:

  1. Manual selection – a few references are selected based on community input, biological features or other a priori knowledge about the assembly.
  2. Magnitude of deviation from the mean assembly length for the species – assemblies with the lowest integral number of standard deviations from the species average assembly length are preferred. This ensures that assemblies that are significantly longer or shorter than others for the species are not chosen.
  3. CheckM completeness – assemblies are binned by their completeness as determined by CheckM, and assemblies in a higher bin are preferred. The bins, in order of preference, are 98-100, 95-98, 90-95, 85-90, 70-85, 50-70, and under 50 percent.
  4. Magnitude of count of pseudo CDSs – assemblies with the lowest rounded natural log of pseudo CDSs are preferred.
  5. Presence of a plasmid - assemblies containing plasmid sequences are preferred.
  6. Magnitude of count of scaffolds – assemblies with the lowest rounded log base 10 scaffold count are preferred.
  7. Species reference – the current reference is preferred.
  8. Magnitude of deviation from the mean gene count for the species - assemblies with the lowest integral number of standard deviations from the species average count of genes are preferred. This ensures that assemblies that have significantly more or fewer genes than others for the species are not chosen.
  9. Absolute count of pseudo CDSs - assemblies with fewer pseudo CDSs are preferred.
  10. Type strain status
  11. Release date (tie-breaker)
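
One way to picture this ordered list is as a lexicographic sort key, where each criterion is binned so that small differences count as ties and fall through to the next criterion. The sketch below is illustrative only: the `Assembly` fields and helper names are hypothetical, truncation is assumed for the "integral number of standard deviations", bin boundaries are assumed to be inclusive at the lower end, and counts of zero are clamped to one before taking logarithms. Criteria 10 and 11 are left out because the text does not state which direction is preferred.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev

CHECKM_FLOORS = (98, 95, 90, 85, 70, 50)  # lower bounds of the bins in criterion 3


@dataclass
class Assembly:
    # Hypothetical record; field names are not RefSeq field names.
    accession: str
    length: int
    checkm_completeness: float
    pseudo_cds: int
    has_plasmid: bool
    scaffolds: int
    is_current_reference: bool
    genes: int
    manually_selected: bool = False


def sd_bin(value, values):
    """Whole number of standard deviations between value and the mean."""
    sd = pstdev(values)
    if sd == 0:
        return 0
    return int(abs(value - mean(values)) / sd)


def checkm_bin(completeness):
    """Rank of the completeness bin; 0 is the best (98 to 100)."""
    for rank, floor in enumerate(CHECKM_FLOORS):
        if completeness >= floor:
            return rank
    return len(CHECKM_FLOORS)


def rank_key(a, peers):
    """Tuple ordered by criteria 1 to 9; smaller tuples are preferred."""
    lengths = [p.length for p in peers]
    genes = [p.genes for p in peers]
    return (
        not a.manually_selected,                  # 1. manual selection
        sd_bin(a.length, lengths),                # 2. length deviation
        checkm_bin(a.checkm_completeness),        # 3. CheckM completeness bin
        round(math.log(max(a.pseudo_cds, 1))),    # 4. rounded ln of pseudo CDSs
        not a.has_plasmid,                        # 5. plasmid present
        round(math.log10(max(a.scaffolds, 1))),   # 6. rounded log10 of scaffolds
        not a.is_current_reference,               # 7. current reference
        sd_bin(a.genes, genes),                   # 8. gene count deviation
        a.pseudo_cds,                             # 9. absolute pseudo CDS count
    )


def pick_reference(peers):
    """Choose the preferred assembly among eligible assemblies of one species."""
    return min(peers, key=lambda a: rank_key(a, peers))
```

Because Python compares tuples element by element, a later criterion only matters when every earlier one is tied, which mirrors the rule that lower-ranked criteria are used only when assemblies are judged equal on higher-ranked ones.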

Reference genomes are updated several times a year to take into account newly added assemblies to RefSeq, changes in the NCBI Taxonomy, modified taxonomic assignments, and recently discovered contamination.

Eukaryotes

For eukaryotes, a reference genome is computationally or manually selected from among the best genomes available for each species, using the following selection hierarchy:

  1. The genome is not overtly contaminated or from an unverified species
  2. The genome is not overtly partial
  3. Genomes included in the current RefSeq set are preferred if available for the species. Selection of a particular genome for RefSeq is manual and considers community usage and genome quality. For species with more than one RefSeq genome (e.g., human), one is manually selected as the reference genome. In exceptional cases there may be more than one reference genome selected for some species (e.g., dog and dingo, two subspecies of Canis lupus).
  4. Highest contig N50, binned based on rounded log10 values
  5. Prefer genomes with gapless chromosomes
  6. Prefer genomes with a full set of assembled chromosomes
  7. Prefer genomes with at least one assembled chromosome
  8. Prefer genomes with at least some sequences assigned to chromosomes (aka unlocalized scaffolds)
  9. Highest unbinned N50 value (either contig or scaffold)
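
The eukaryote hierarchy can be sketched in the same way, with the first two items acting as filters and the rest as an ordered preference. This is a rough illustration under stated assumptions: the `EukAssembly` fields are hypothetical, manual selection is not modeled, and the "unbinned N50" in item 9 is assumed to be the larger of the contig and scaffold N50 values.

```python
import math
from dataclasses import dataclass


@dataclass
class EukAssembly:
    # Hypothetical record; field names are chosen for this illustration only.
    accession: str
    contaminated_or_unverified: bool
    partial: bool
    in_refseq: bool
    contig_n50: int
    scaffold_n50: int
    gapless_chromosomes: bool
    full_chromosome_set: bool
    assembled_chromosomes: int
    has_chromosome_assigned_seqs: bool


def euk_key(a):
    """Tuple ordered by items 3 to 9; larger tuples are preferred."""
    return (
        a.in_refseq,                                  # 3. RefSeq genomes first
        round(math.log10(max(a.contig_n50, 1))),      # 4. binned contig N50
        a.gapless_chromosomes,                        # 5. gapless chromosomes
        a.full_chromosome_set,                        # 6. full chromosome set
        a.assembled_chromosomes >= 1,                 # 7. at least one chromosome
        a.has_chromosome_assigned_seqs,               # 8. some chromosome assignment
        max(a.contig_n50, a.scaffold_n50),            # 9. unbinned N50 (assumed max)
    )


def pick_euk_reference(candidates):
    """Apply items 1 and 2 as filters, then rank the remaining genomes."""
    usable = [
        a for a in candidates
        if not a.contaminated_or_unverified and not a.partial
    ]
    return max(usable, key=euk_key) if usable else None
```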

Generated November 25, 2024