SPDI - NCBI Variation Notation for Variants with Known Breakpoints
NCBI Variation Services use a new notation for variations called Sequence Position Deletion Insertion, or SPDI (Holmes et al., 2019). The notation represents an observed variant sequence as a deleted sequence and an inserted sequence at a given position in a reference sequence.
SPDI is a minor generalization of the traditional "ref" and "alt" notation. The "ref" corresponds to the deleted sequence and the "alt" corresponds to the inserted sequence. The term "deleted sequence" does not assert that the mechanism behind the variant was a deletion followed by an insertion. It only specifies that the same variant sequence would be observed if this deletion, followed by this insertion, were applied to the reference sequence.
The SPDI notation is written as four fields delimited by colons, S:P:D:I, where:
- S = SequenceId
- P = Position, a 0-based coordinate for where the deleted sequence starts
- D = DeletedSequence, the sequence for the deletion; can be empty
- I = InsertedSequence, the sequence for the insertion; can be empty
The SPDI notation represents variation as deletion of a sequence (D) at a given position (P) in a reference sequence (S), followed by insertion of a replacement sequence (I) at that same position. Position 0 indicates a deletion that starts immediately before the first nucleotide, position 1 indicates a deletion that starts between the first and second residues, and so on. Either the deleted or the inserted sequence can be empty, resulting in a pure insertion or a pure deletion, respectively. For two-stranded molecules, the deleted and inserted sequences are always written on the positive strand.
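As a minimal sketch of the four-field layout described above, the following Python splits an SPDI string on colons. The names `Spdi` and `parse_spdi` are hypothetical helpers, not part of any NCBI library, and the sketch assumes the sequence identifier itself contains no colon. "Seq1" is a made-up identifier.

```python
from typing import NamedTuple


class Spdi(NamedTuple):
    """Hypothetical container for the four SPDI fields."""
    sequence_id: str
    position: int      # 0-based start of the deleted sequence
    deleted: str       # may be empty (pure insertion)
    inserted: str      # may be empty (pure deletion)


def parse_spdi(text: str) -> Spdi:
    """Split 'S:P:D:I' into its fields; assumes no colon inside the sequence id."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-delimited fields, got {len(parts)}")
    sequence_id, position, deleted, inserted = parts
    if not sequence_id:
        raise ValueError("SequenceId must not be empty")
    if not (position.isascii() and position.isdigit()):
        raise ValueError("Position must be a non-negative integer")
    return Spdi(sequence_id, int(position), deleted, inserted)


print(parse_spdi("Seq1:4:A:G"))
print(parse_spdi("Seq1:5::C"))   # empty deleted field: a pure insertion
```

Keeping the empty string for an empty field, rather than converting it to `None`, mirrors the notation: an empty D or I is still a valid sequence of length zero.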
Variation Services only support variants where the coordinates of both the upstream and downstream breakpoints are known (e.g. single nucleotide change, deletions at precise coordinates). Such variants can be encoded precisely using the SPDI notation.
SPDI notation works for both nucleotide and protein variants. For nucleotide variants, it uses upper-case IUPAC nucleic acid notation. For protein variants, it uses upper-case IUPAC single-letter amino acid notation, with the extension that * can represent a translated terminator codon. A frameshift can be represented as a large delins, but that form is hard to read, hard to produce, awkward to store, and typically not very useful. Instead, you can represent a frameshift as ProteinSequenceId:PositionOfShift:FirstRefAA:FirstAltAA:fs.
Examples
Seq1
For the following examples, we will use an imagined, short double-stranded DNA sequence with identifier "Seq1" and the nucleotides "GCTGATG" on its positive strand and "CGACTAC" on the negative strand.
Substitution Variant
A substitution of the 5th nucleotide on Seq1 from A to G is represented as:
Seq1:4:A:G
This represents the observed variant sequence "GCTGGTG".
Specifying the sequence literal, instead of just its length, helps address off-by-one issues, enables determination of the type of variant without reference to an external listing of the reference sequence, and is easier for humans to read in most cases. SPDI can also be written without the deleted sequence, using just the deletion length: SequenceId:Position:DeletionLength:InsertedSequence.
This format is shorter and can be easier to read when the deletion sequence is large. The above substitution can then be written:
Seq1:4:1:G
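To make the "delete, then insert at the same position" rule concrete, here is a small sketch that applies a variant to a reference string. The function name `apply_spdi` is hypothetical, and treating an all-digit third field as a deletion length is an assumption made for this illustration. When a sequence literal is given, the sketch also checks it against the reference, which is how the literal form can catch off-by-one mistakes.

```python
def apply_spdi(reference: str, position: int, deleted: str, inserted: str) -> str:
    """Return the variant sequence obtained by deleting, then inserting, at position."""
    # Assumption: an all-digit third field is a deletion length, otherwise a literal.
    is_length = deleted.isascii() and deleted.isdigit()
    del_len = int(deleted) if is_length else len(deleted)
    end = position + del_len
    if position < 0 or end > len(reference):
        raise ValueError("deleted interval falls outside the reference")
    # A literal must match the reference; a bare length cannot be checked.
    if not is_length and reference[position:end] != deleted:
        raise ValueError("deleted sequence does not match the reference")
    return reference[:position] + inserted + reference[end:]


SEQ1 = "GCTGATG"  # positive strand of the example sequence Seq1

print(apply_spdi(SEQ1, 4, "A", "G"))  # GCTGGTG (sequence-literal form)
print(apply_spdi(SEQ1, 4, "1", "G"))  # GCTGGTG (deletion-length form)
```

With the literal form, a wrong position such as `apply_spdi(SEQ1, 5, "A", "G")` raises an error because the reference has T there, whereas the length form would silently produce a different variant.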
Deletion Variant
When a nucleotide deletion occurs, nothing is inserted. In SPDI syntax, InsertedSequence is simply an empty string. Thus, the deletion of the same A nucleotide as above is represented as:
Seq1:4:1:
Insertion Variant
When a nucleotide insertion occurs, nothing is deleted. In SPDI syntax, the DeletedSequence literal is simply an empty string. Thus, an insertion of C after the above A nucleotide is represented as:
Seq1:5::C
Indel Variant
When a set of nucleotides is replaced by another set, the SPDI is very similar to a substitution variant. Replacing AT with CCC is represented as:
Seq1:4:AT:CCC
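The four variant types above can be told apart from the deleted and inserted fields alone. The sketch below shows one way to do that and prints the resulting sequence for each example. The `classify` and `splice` helpers are hypothetical, the category names simply follow the headings in this section, and treating only a one-for-one replacement as a "substitution" is an assumption. The deletion is written here with the sequence literal (Seq1:4:A:) rather than the length form.

```python
SEQ1 = "GCTGATG"  # positive strand of the example sequence Seq1


def classify(deleted: str, inserted: str) -> str:
    """Name the variant type from the D and I fields (sequence literals only)."""
    if deleted == inserted:
        return "no change"
    if not inserted:
        return "deletion"
    if not deleted:
        return "insertion"
    if len(deleted) == 1 and len(inserted) == 1:
        return "substitution"
    return "indel"


def splice(reference: str, position: int, deleted: str, inserted: str) -> str:
    """Remove len(deleted) residues at position and put inserted in their place."""
    return reference[:position] + inserted + reference[position + len(deleted):]


for spdi in ("Seq1:4:A:G", "Seq1:4:A:", "Seq1:5::C", "Seq1:4:AT:CCC"):
    _, position, deleted, inserted = spdi.split(":")
    variant = splice(SEQ1, int(position), deleted, inserted)
    print(f"{spdi:15} {classify(deleted, inserted):13} {variant}")
```

Run against Seq1, the four examples yield GCTGGTG, GCTGTG, GCTGACTG, and GCTGCCCG respectively.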
Frameshift Variant
Consider the transcript Trans1 = "ATG CCC GGC TAA AAT AAA", which translates to the protein Prot1 = "MPG*". The insertion Trans1:5::A causes a frameshift in the protein and can be written as:
Prot1:1:P:P:fs
It is also valid (though less useful in many contexts) to write this as a delins:
Prot1:1:PG*:PRLK*
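The frameshift example can be checked by translating the transcript before and after the insertion. The sketch below builds the standard genetic code table and derives both protein-level forms. It assumes that PositionOfShift is the 0-based index of the codon containing the nucleotide change, which is consistent with the example above but is not stated as a general rule here. The `translate` helper is hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(transcript: str) -> str:
    """Translate from the first base up to and including the first stop codon."""
    protein = ""
    for start in range(0, len(transcript) - 2, 3):
        residue = CODON_TABLE[transcript[start:start + 3]]
        protein += residue
        if residue == "*":
            break
    return protein


trans1 = "ATGCCCGGCTAAAATAAA"
position, inserted = 5, "A"                      # Trans1:5::A
mutated = trans1[:position] + inserted + trans1[position:]

ref_protein = translate(trans1)                  # MPG*
alt_protein = translate(mutated)                 # MPRLK*

# Assumption: the shift is reported at the codon that contains the insertion.
shift = position // 3
print(f"Prot1:{shift}:{ref_protein[shift]}:{alt_protein[shift]}:fs")
print(f"Prot1:{shift}:{ref_protein[shift:]}:{alt_protein[shift:]}")
```

The first line printed is the compact frameshift form and the second is the equivalent delins, which grows with the length of the shifted reading frame.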
Sequence Interval
Notation for stranded intervals is similar, represented as SequenceId:Position:Strand:Length. It uses the positive strand for sequences with only one strand. The interval containing the 5th through 7th residues (ATG) of Seq1 on the positive strand would be encoded as:
Seq1:4:+:3
The empty interval before the 5th nucleotide on the negative strand (i.e. between the C and the T) would be:
Seq1:4:-:0
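A stranded interval can be resolved to bases in much the same way as a variant. The sketch below parses SequenceId:Position:Strand:Length and slices the positive-strand sequence of Seq1. The function name `interval_bases` is hypothetical. Returning the reverse complement for the negative strand is an assumption: the notation only identifies the interval and does not say in which orientation its bases should be read out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def interval_bases(reference: str, interval: str) -> str:
    """Return the bases covered by a SequenceId:Position:Strand:Length interval."""
    _, position, strand, length = interval.split(":")
    start = int(position)
    end = start + int(length)
    if strand not in ("+", "-"):
        raise ValueError("Strand must be '+' or '-'")
    if start < 0 or end > len(reference):
        raise ValueError("interval falls outside the reference")
    bases = reference[start:end]
    if strand == "-":
        # Assumption: negative-strand bases are reported as the reverse complement.
        bases = bases.translate(COMPLEMENT)[::-1]
    return bases


SEQ1 = "GCTGATG"  # positive strand of the example sequence Seq1

print(repr(interval_bases(SEQ1, "Seq1:4:+:3")))  # 'ATG'
print(repr(interval_bases(SEQ1, "Seq1:4:-:0")))  # '' (empty interval)
print(repr(interval_bases(SEQ1, "Seq1:4:-:3")))  # 'CAT'
```

Coordinates are the same on both strands in this sketch: the position always counts from the start of the positive strand, and the strand only affects how the covered bases are reported.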
Resources and Tutorials using SPDI:
- Webinar (Video)
- Python Tutorial
- Jupyter Notebooks