Version 1.0: Ultrafast KEGG Ortholog Detection Directly from Short Reads


The biggest challenges in a conventional RNA-seq workflow are de novo transcriptome assembly and annotation, which are time-consuming and computationally expensive. Running an assembler and an annotation tool usually requires a high-performance server with more than 100 GB of RAM and dozens of CPUs, which are not always available to labs that focus on non-model organisms, and the analysis can take several days or even weeks to finish.

In addition, most research questions in non-model organisms focus only on protein-coding genes and their underlying pathways. Protein-coding genes are known to make up only a small proportion of the whole transcriptome. Furthermore, only a fraction of these protein-coding genes is assigned to KEGG pathways: for example, only 33.67% (8438), 29.48% (4933) and 23.86% (6050) of protein-coding genes are involved in KEGG pathways for mouse, chicken and zebrafish, respectively. Therefore, many genes in annotation databases can be considered unrelated or uninformative for many non-model organisms.

Furthermore, reconstructing transcripts first and then searching a protein database to identify their homologs is not straightforward. The many intermediate steps, and the programming skills required throughout the complex pipeline of the conventional workflow, pose additional challenges for many researchers.


The need for multiple software tools, the high computational cost and the long run times motivated us to devise a straightforward, assembly-free, all-in-one tool to tackle these challenges. Because the majority of RNA-seq reads come from mRNA and are intron-free, we propose translating RNA-seq reads directly into all possible amino acid (AA) sequences across the six reading frames (ORFs), and comparing them against a protein database consisting of only protein-coding genes (orthologs) to identify their possible functional homologs.
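To make the idea of six-frame translation concrete, here is a minimal Python sketch. It is an illustration only, not Seq2Fun's actual implementation: the function names are hypothetical, it assumes the standard genetic code, and it marks any codon containing a non-ACGT base (such as N) as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Translate a read in three forward and three reverse frames.

    Keys are hypothetical frame labels: '+1'..'+3' and '-1'..'-3'.
    """
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translations("ATGGCCAAATAG").items():
        print(frame, peptide)
```

Each of the six peptides can then be searched against the protein database. Because a short read yields only short peptides, a practical tool would likely also filter out fragments that are too short to match reliably.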

Therefore, we developed Seq2Fun, an ultra-fast, assembly-free, all-in-one tool based on a modern data structure, the full-text index in minute space (FM-index) built on the Burrows-Wheeler transform (BWT), for functional quantification of RNA-seq reads from non-model organisms without transcriptome assembly or a reference genome.
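For readers unfamiliar with the FM-index, the following Python sketch shows the core idea: build the BWT of a text, then count exact occurrences of a pattern with backward search. It is a teaching example under simplifying assumptions, not Seq2Fun's code: the function names are hypothetical, the suffix array is built naively by sorting suffixes, the occurrence table is stored uncompressed, and only exact matches are counted.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (C table, occurrence table, BWT length).

    Assumes '$' does not occur in `text` and sorts before every other symbol.
    """
    text += "$"
    # Naive suffix array: fine for a demo, far too slow for a real database.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' at i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)

    # C[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(bwt)


def count_occurrences(pattern, C, occ, n):
    """Count exact occurrences of `pattern` using backward search."""
    lo, hi = 0, n  # half-open interval of matching rows in the sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    # A made-up amino acid string standing in for a protein database.
    C, occ, n = build_fm_index("MKVLAAGIVKVL")
    print(count_occurrences("KVL", C, occ, n))
```

The search loop runs once per pattern character regardless of the size of the indexed text, which is what makes this kind of index attractive for matching large numbers of translated reads against a protein database.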

Seq2Fun takes raw RNA-seq reads directly as input, performs a quality check, runs a translated search, and finally generates an ortholog abundance table.

Figure 2: The workflow of the direct translated search. It skips de novo transcriptome assembly and conducts a translated search directly on each read.