Version 2.0: Towards High-resolution Generic Orthologs & Functions

Motivations

Seq2Fun version 1 was developed for high-speed functional quantification, based on KEGG Orthologs (KO), of RNA-seq data from organisms without genome references. It is very useful for medium-level functional summaries (e.g. KO abundance) and the subsequent KEGG pathway analysis. However, the KO-based system has two main limitations:

  • Limited coverage: only a proportion of all protein-coding genes are assigned to KOs. For example, of the ~19,648 protein-coding genes in human, 14,964 (76.16%) have a KO assigned, while 4,684 (23.84%) do not. Of the ~26,584 protein-coding genes in zebrafish, 16,322 (61.40%) have a KO assigned and 10,262 (38.60%) do not. Even fewer of the assigned KOs are placed in pathways. Functional quantification restricted to KO-assigned genes cannot capture the unassigned genes, so part of the transcriptome is lost. This incomplete information can hamper efforts to answer biological questions using RNA-seq.
  • Limited annotation: version 1 uses a low-resolution gene annotation database (KO). It would be valuable to conduct functional analysis at a higher resolution, such as the gene level. In addition to KO analysis, GO analysis is another powerful approach for functional analysis of RNA-seq data. Neither is available in version 1.

Goals

We therefore developed Seq2Fun version 2 to address these weaknesses, with the following new features:

  1. high coverage of the whole transcriptome, quantifying all protein-coding genes
  2. high-resolution gene quantification at the generic ortholog level
  3. more functional summaries and analyses, such as KO, GO and their enrichment
  4. targeted gene assembly for phylogenetic analysis, primer design and reference construction

Solutions

  1. Use all protein-coding genes for database construction. This ensures that gene quantification is conducted on all protein-coding genes and that as much gene information as possible is retained, which provides the opportunity to gain comprehensive and deep functional insight from RNA-seq data.
  2. Use gene orthologs to annotate mapped reads. The database consists of all protein-coding genes from diverse organisms. One read can map to many genes from many organisms, and these genes most likely belong to one ortholog. Using orthologs therefore ensures that each mapped read is annotated to one common feature, such as an ortholog ID or a gene symbol derived from the ortholog.
    We downloaded all protein-coding genes from 687 eukaryotic organisms using KEGGREST. These organisms are grouped into ~12 phylogenetic groups (e.g. mammals, birds, fishes, etc.). See DATABASE. We employed OrthoFinder to identify ortholog groups for these phylogenetic groups.
  3. Annotate orthologs with KO and GO. KO and GO are very important for high-level functional summary and analysis, and give users the opportunity to investigate functional changes. We therefore annotate each protein/gene with the KEGG database for KO and with GOA for GO. Based on the ortholog-KO-GO map, KO and GO abundances can be extracted from the ortholog abundance table, and downstream analyses such as pathway and GO enrichment can be performed.
  4. Develop functions that extract annotated reads for gene assembly. For organisms without genome references, it can be important to obtain gene sequences from RNA-seq data via de novo assemblers. It is well known that de novo assemblers are extremely slow and computationally intensive. We have therefore developed functions that output the annotated reads and extract the reads mapped to each annotated gene, so that each target gene can be assembled by other assemblers such as SOAPdenovo-Trans. Target gene assembly is extremely fast because far fewer reads are involved, and the assembled contigs can be used for downstream analyses such as phylogenetic analysis, primer design and reference construction.
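
The data flow described in solutions 2 to 4 can be illustrated with a small Python sketch. This is not Seq2Fun's implementation: every identifier, the toy data, and the rule that a read whose hits span more than one ortholog is left unannotated are assumptions made only for illustration. It shows a read that hits genes from several organisms being collapsed to one ortholog, ortholog counts being rolled up to KO and GO through a map, and read IDs being grouped per ortholog as input for a targeted assembly.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs (placeholder IDs, not real KEGG/GO/OrthoFinder identifiers).
read_hits = {                      # read ID -> genes the read mapped to
    "read1": ["orgA:g1", "orgB:g7"],
    "read2": ["orgA:g1"],
    "read3": ["orgC:g3", "orgA:g1"],
    "read4": ["orgA:g2", "orgB:g9"],
    "read5": ["orgA:g99"],         # gene without an ortholog assignment
}
gene_to_ortholog = {
    "orgA:g1": "OG1", "orgB:g7": "OG1", "orgC:g3": "OG1",
    "orgA:g2": "OG2", "orgB:g9": "OG2",
}
ortholog_to_ko = {"OG1": ["KO_a"], "OG2": ["KO_b"]}
ortholog_to_go = {"OG1": ["GO_x", "GO_y"], "OG2": ["GO_x"]}


def annotate_read(genes, gene_to_ortholog):
    """Return the one ortholog shared by a read's hits, else None.

    Assumption: reads with no ortholog, or with hits in several
    orthologs, are simply left unannotated in this sketch.
    """
    orthologs = {gene_to_ortholog[g] for g in genes if g in gene_to_ortholog}
    return next(iter(orthologs)) if len(orthologs) == 1 else None


def group_reads(read_hits, gene_to_ortholog):
    """Group read IDs by ortholog (e.g. to feed each group to an assembler)."""
    groups = defaultdict(list)
    for read_id, genes in read_hits.items():
        ortholog = annotate_read(genes, gene_to_ortholog)
        if ortholog is not None:
            groups[ortholog].append(read_id)
    return dict(groups)


def summarize(ortholog_counts, ortholog_to_terms):
    """Roll ortholog counts up to functional terms (KO or GO)."""
    totals = Counter()
    for ortholog, count in ortholog_counts.items():
        for term in ortholog_to_terms.get(ortholog, ()):
            totals[term] += count
    return totals


reads_by_ortholog = group_reads(read_hits, gene_to_ortholog)
ortholog_counts = Counter({og: len(ids) for og, ids in reads_by_ortholog.items()})
ko_counts = summarize(ortholog_counts, ortholog_to_ko)
go_counts = summarize(ortholog_counts, ortholog_to_go)
```

In this toy example the three reads that hit `orgA:g1` and its counterparts in other organisms all count towards `OG1`, and a GO term shared by two orthologs accumulates the counts of both. In practice, the read IDs in `reads_by_ortholog` would be used to pull the matching sequences out of the FASTQ file before assembly.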

Seq2Fun version1 and 2

Figure 1: Seq2Fun version 1 and 2.