Transcriptome Shotgun Assembly Sequence Database
What is the Transcriptome Shotgun Assembly (TSA) Database?
TSA is an archive of computationally assembled transcript sequences derived from primary data such as ESTs and reads from next-generation sequencing technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from other GenBank records because the assemblies have no physical counterparts.
How Do TSA Sequence Records Differ from Other GenBank/EMBL/DDBJ Records?
The display of a TSA sequence is similar to other International Nucleotide Sequence Database Collaboration (INSDC) records, but includes the following:
- The label 'TSA:' at the beginning of each Definition Line.
- DBLINK, with links to:
  - BioProject
  - BioSample
  - Sequence Read Archive
- Keywords: TSA; Transcriptome Shotgun Assembly
- Assembly data
- A comment describing the assembly, if it was produced by a multi-step process.
Each TSA project is assigned a stable 4-letter TSA accession prefix, which does not change as the project is updated. In addition to the TSA accession prefix, the transcript identifiers carry a two-digit version number corresponding to a specific TSA project update. Finally, each individual assembly is assigned a unique accession number made up of the TSA accession prefix, the version number, and a six-digit number that identifies the individual assembly. For instance, if a TSA project's assigned accession number is XXXX00000000, then that project's first transcript version would be XXXX01000000, and the first assembly of that version would be XXXX01000001. When a project is reassembled, the new assemblies are submitted as the 02 version of the TSA project. No linkage or relationship is expected between the old and new assemblies, and the new assemblies are given new accession numbers beginning with XXXX02000001. The 01 transcripts are suppressed when the 02 transcripts are released.
An example of a TSA master record is GAAA00000000.
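The accession layout described above can be sketched in Python. This is only an illustration of the scheme as this page states it (4-letter prefix, 2-digit version, 6-digit assembly number); the names `TsaAccession`, `parse_tsa_accession`, and `format_tsa_accession` are hypothetical and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple

# Assumed layout, taken from the description above:
# 4 letters (project prefix) + 2 digits (version) + 6 digits (assembly number).
_TSA_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


class TsaAccession(NamedTuple):
    prefix: str   # stable project prefix, e.g. "GAAA"
    version: int  # 0 for the master record, 1 for the first assembly set, ...
    serial: int   # 0 for project-level records, 1..n for individual assemblies

    @property
    def is_master(self) -> bool:
        # The master record has both the version and the serial set to zero.
        return self.version == 0 and self.serial == 0


def parse_tsa_accession(accession: str) -> TsaAccession:
    """Split an accession such as 'GAAA01000001' into its three parts."""
    match = _TSA_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not in the assumed TSA accession layout: {accession!r}")
    prefix, version, serial = match.groups()
    return TsaAccession(prefix, int(version), int(serial))


def format_tsa_accession(prefix: str, version: int, serial: int) -> str:
    """Build an accession string from its parts, zero-padding the numbers."""
    return f"{prefix}{version:02d}{serial:06d}"
```

For example, `parse_tsa_accession("GAAA00000000").is_master` is true, and `format_tsa_accession("GAAA", 2, 1)` gives the first accession of a reassembled (02) project.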
Nucleotide sequences must conform to the following standards
- Submitted sequences must be assembled from data experimentally determined by the submitter.
- Sequences must be screened for vector contamination, and any vector/linker sequence must be removed. This includes the removal of NextGen sequencing primers.
- Sequences should be greater than 200 bp in length.
- Ambiguous bases should not make up more than 10% of the total sequence length, and there should be no more than 14 n's in a row.
- Sequence gaps of known length may be present and annotated with the assembly_gap feature if there is sufficient evidence for the linkage between the sequences. See the TSA Submission Guide for more information about adding assembly_gap features.
- Gaps cannot be of unknown length.
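The length and ambiguity rules above lend themselves to a quick local check before submission. The following Python sketch is an illustration only, not the validation NCBI performs. It assumes that "ambiguous" means any character other than A, C, G, or T, and it does not account for annotated assembly_gap regions or vector screening; the function name `check_tsa_sequence` is hypothetical.

```python
import re
from typing import List

MIN_LENGTH_EXCLUSIVE = 200     # sequences should be greater than 200 bp
MAX_AMBIGUOUS_FRACTION = 0.10  # at most 10% ambiguous bases
MAX_N_RUN = 14                 # at most 14 consecutive n's


def check_tsa_sequence(seq: str) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    seq = seq.upper()
    problems = []

    if len(seq) <= MIN_LENGTH_EXCLUSIVE:
        problems.append("too short")

    # Assumption: every non-ACGT character counts as an ambiguous base.
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    if seq and ambiguous / len(seq) > MAX_AMBIGUOUS_FRACTION:
        problems.append("too many ambiguous bases")

    # Length of the longest uninterrupted run of N (0 if there is none).
    longest_n_run = max((len(m.group()) for m in re.finditer(r"N+", seq)), default=0)
    if longest_n_run > MAX_N_RUN:
        problems.append("run of n's too long")

    return problems
```

A sequence that passes this check may still be rejected for other reasons, such as residual vector or primer sequence, which this sketch does not detect.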
Requirements
- Raw reads should be submitted to SRA prior to submitting your transcriptome. The SRA run accession(s) (SRRXXXXXX) and associated BioProject (PRJNAXXXXXX) and BioSample(s) (SAMNXXXXXX) are required for TSA submission.
- Assembly Data Structured Comment. This information is input directly in the Submission Portal dialogs.
- If a multi-step assembly was performed, a description of the assembly process should be provided in the COMMENT section.
- If annotation is provided, the product names should follow the International Protein Nomenclature Guidelines.
- The keyword 'Targeted' and feature annotation should be included for all targeted subsets of transcriptome data. See Targeted vs. Non-targeted TSA Studies for more information.
- Annotation must be biologically valid.
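Before starting a submission, it can help to confirm that the three required identifiers are on hand and look plausible. The Python sketch below checks only the prefixes shown in the list above (SRR, PRJNA, SAMN) followed by digits; the number of digits is not fixed here, identifiers issued with other prefixes would be rejected by this sketch, and the field names and the function `missing_or_malformed` are hypothetical.

```python
import re

# Assumed patterns, based only on the examples given above.
_PATTERNS = {
    "sra_run": re.compile(r"^SRR\d+$"),
    "bioproject": re.compile(r"^PRJNA\d+$"),
    "biosample": re.compile(r"^SAMN\d+$"),
}


def missing_or_malformed(submission: dict) -> list:
    """Return the names of required fields that are absent or do not match."""
    bad = []
    for field, pattern in _PATTERNS.items():
        values = submission.get(field) or []
        if isinstance(values, str):
            values = [values]  # allow a single identifier or a list of them
        if not values or not all(pattern.match(v) for v in values):
            bad.append(field)
    return bad
```

An empty result means only that the identifiers are present and well formed; it does not confirm that they exist or that they belong to the same study.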
How to Submit to TSA
All TSA submissions must be made through the TSA Submission Portal. Submission details can be found in the TSA submission guide.
How to Update an Existing TSA Submission
See Update TSA Records. Contact [email protected] with any additional questions.
How to Search for TSA Sequences
- You can search Entrez Nucleotide using the following terms: tsa-master [prop] AND 'Genus species' [orgn]
- For example: tsa-master [prop] AND Nitella mirabilis [orgn]
- The public submissions are available through the WGS/TSA browser.
- The sequences can be downloaded from the NCBI FTP GenBank site.
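The search term above can also be assembled programmatically. This Python sketch builds the term and a query URL for the NCBI E-utilities `esearch` endpoint against the `nuccore` (Nucleotide) database. It only constructs the URL and makes no network request; the helper names and the default `retmax` value are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tsa_master_term(organism: str) -> str:
    """Build the Entrez term that selects TSA master records for an organism."""
    return f"tsa-master [prop] AND {organism} [orgn]"


def tsa_master_search_url(organism: str, retmax: int = 20) -> str:
    """Return an esearch URL for the term; urlencode handles the escaping."""
    params = {
        "db": "nuccore",  # the Entrez Nucleotide database
        "term": tsa_master_term(organism),
        "retmax": retmax,  # maximum number of IDs to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Calling `tsa_master_term("Nitella mirabilis")` reproduces the example term above, and the URL from `tsa_master_search_url` can be opened in a browser or fetched with any HTTP client.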
What Should Not Be Submitted to TSA
- Assemblies from sequences not directly sequenced by the submitter.
- Clone-based assemblies. These should be submitted to GenBank.
- A single assembly from multiple organisms.
- Subsets of a transcriptome study, unless they are part of a targeted study. See the TSA submission guide for more information about submitting a targeted study.