Version 2.0: From KEGG Orthologs to Generic Orthologs & Gene Ontologies

Motivations

Seq2Fun version 1 was developed for high-speed functional quantification of RNA-seq data, based on KEGG Orthologs (KOs), for organisms without reference genomes. It is well suited to medium-level functional summaries (e.g. KO abundance) and subsequent KEGG pathway analysis. However, the KO-based system has two key limitations:

  • Limited coverage: the genes assigned to KOs are only a fraction of all protein-coding genes. For example, of the ~19,648 protein-coding genes in human, 14,964 (76.16%) have a KO assigned and 4,684 (23.84%) do not. Of the ~26,584 protein-coding genes in zebrafish, 16,322 (61.40%) have a KO assigned and 10,262 (38.60%) do not. Even fewer of the assigned KOs belong to pathways. Functional quantification based only on KO-assigned genes cannot capture the unassigned genes, so part of the transcriptome is lost. This incomplete information can hamper efforts to answer biological questions using RNA-seq.
  • Limited annotation: version 1 uses a low-resolution gene annotation database (KO), and it would be valuable to conduct functional analysis at a higher resolution than the KO level. In addition to KO analysis, GO analysis is another powerful approach for functional analysis of RNA-seq data, and it is not available in version 1.
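The coverage figures above follow directly from the gene counts. As a quick check, the short Python sketch below recomputes them from the numbers quoted in the text; the variable and function names are illustrative only and are not part of Seq2Fun.

```python
# Counts quoted in the text: (protein-coding genes, genes with a KO assigned).
counts = {
    "human": (19_648, 14_964),
    "zebrafish": (26_584, 16_322),
}


def ko_coverage(total, with_ko):
    """Return (percent of genes with a KO, number of genes without a KO)."""
    return round(100 * with_ko / total, 2), total - with_ko


coverage = {species: ko_coverage(*c) for species, c in counts.items()}

for species, (pct, missing) in coverage.items():
    print(f"{species}: {pct:.2f}% with KO, {missing} genes without")
```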

Goals

We therefore developed Seq2Fun version 2 to address these limitations, with the following new features:

  1. high coverage of the whole transcriptome, quantifying all protein-coding genes
  2. high-resolution gene quantification at the generic ortholog level
  3. more functional summaries and analyses, such as KO, GO and their enrichment
  4. targeted gene assembly for phylogenetic analysis, primer design and reference construction

Solutions

  1. Using all protein-coding genes. This ensures that gene quantification is conducted on all protein-coding genes and that as much gene information as possible is retained, which gives the opportunity to gain comprehensive and deep functional insight from RNA-seq data.
  2. Using gene orthologs to annotate mapped reads. The database used here consists of all protein-coding genes from diverse organisms. One read can map to many genes from many organisms, and these genes most likely belong to one ortholog. Annotating by ortholog therefore ensures that each mapped read is assigned to one common feature, such as an ortholog ID or a gene symbol.
    Two reference ortholog resources are used to construct the databases. The first is NCBI gene_orthologs.gz. It consists of 14,072,810 protein sequences representing 6,279,509 genes and 30,392 orthologs from ~436 eukaryotic species.
    The second is JustOrthologs. It consists of 1,675,415 genes representing 513,290 ortholog groups from 741 eukaryotic species.
    The difference between the two is that the former focuses only on animals, while the latter contains all other groups, such as plants and fungi.
  3. Annotating orthologs with KO and GO. KO and GO are very important for high-level functional summary and analysis, and give users the opportunity to investigate functional changes. We therefore annotate each protein/gene with the KEGG database for KO and with GOA for GO. Of the 14,072,810 protein sequences, 5,871,017 are annotated to 11,358 KOs, while 1,567,627 are annotated to 98,727 GO groups and 20,110 GOs. Based on the ortholog-KO-GO map, KO and GO abundances can be extracted from the ortholog abundance table, and downstream analyses such as pathway and GO enrichment can then be performed.
  4. Developing functions for annotated read extraction and gene assembly. For organisms without reference genomes, it can be important to obtain gene sequences from RNA-seq data via de novo assemblers. De novo assemblers are well known to be extremely slow and computationally intensive. We have therefore developed functions that first output the annotated reads, then extract the reads mapped to each annotated gene, and finally perform targeted gene assembly. Targeted gene assembly is extremely fast because far fewer reads are involved, and the assembled contigs can be used for downstream analyses such as phylogenetic analysis, primer design and reference construction.
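The ideas in points 2 and 3 can be illustrated with a small Python sketch: each read's gene hits are collapsed to a single ortholog, and the resulting ortholog counts are then rolled up to KO counts through an ortholog-to-KO map. This is only a toy illustration, not Seq2Fun's implementation. All gene, ortholog and KO identifiers below are made up, and the rule of skipping reads whose hits span more than one ortholog is an assumption chosen for simplicity.

```python
from collections import Counter

# Hypothetical toy data; none of these identifiers come from the real databases.
gene_to_ortholog = {
    "human_geneA": "orth1",
    "zebrafish_geneA": "orth1",
    "mouse_geneB": "orth2",
}
ortholog_to_ko = {"orth1": {"K00001"}, "orth2": {"K00002"}}

# Each read may hit several genes from several organisms.
read_hits = {
    "read1": ["human_geneA", "zebrafish_geneA"],  # many genes, one ortholog
    "read2": ["mouse_geneB"],
    "read3": ["human_geneA", "mouse_geneB"],      # hits span two orthologs
    "read4": ["zebrafish_geneA"],
}


def annotate_read(genes):
    """Return the single ortholog shared by all hit genes, else None."""
    orthologs = {gene_to_ortholog[g] for g in genes if g in gene_to_ortholog}
    return next(iter(orthologs)) if len(orthologs) == 1 else None


# Ortholog abundance table: one count per unambiguously annotated read.
ortholog_counts = Counter()
for genes in read_hits.values():
    ortholog = annotate_read(genes)
    if ortholog is not None:  # assumption: ambiguous reads are skipped
        ortholog_counts[ortholog] += 1

# Roll ortholog abundance up to KO abundance via the ortholog-to-KO map.
ko_counts = Counter()
for ortholog, n in ortholog_counts.items():
    for ko in ortholog_to_ko.get(ortholog, ()):
        ko_counts[ko] += n

print(dict(ortholog_counts))
print(dict(ko_counts))
```

The same roll-up would apply to GO terms by swapping in an ortholog-to-GO map in place of `ortholog_to_ko`.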

Seq2Fun version 1 and 2

Figure 1: Seq2Fun version 1 and 2.