Background

RNA-seq is a powerful tool for answering many biological questions. While most RNA-seq data have been collected and analyzed in model organisms, they are increasingly collected in non-model organisms, including many species of environmental and/or economic importance, to answer basic questions such as which genes are up- or down-regulated and which pathways change under different conditions. In most cases, these organisms either lack a reference genome or have only a low-quality one, which poses a great challenge for RNA-seq data analysis.

The conventional workflow for RNA-seq data analysis in organisms without a reference genome typically includes multiple steps and many software tools. For example, raw RNA-seq reads first undergo quality control, which removes low-quality bases, low-quality reads and reads that are too short; this is followed by error correction before the reads are submitted to an assembler, such as Trinity, for de novo transcriptome assembly; the quality of the assembled transcriptome is then checked before it is passed to an annotation tool, such as BLAST2GO, to identify homologs of the transcripts; once the reconstructed transcripts are annotated, downstream analyses such as identification of differentially expressed genes (DEGs) and pathway enrichment can be performed.

Figure 1: Conventional RNA-seq workflow for organisms without a reference genome. The workflow has four main steps: 1) raw read quality processing; 2) de novo transcriptome assembly; 3) transcriptome annotation; 4) abundance table generation.

Motivations

The biggest challenges in the conventional RNA-seq workflow are de novo transcriptome assembly and annotation, which are time-consuming and computationally expensive. For example, running an assembler and an annotation tool usually requires a high-performance server with more than 100 GB of RAM and dozens of CPUs, which is not always available to labs that focus on non-model organisms, and a run can take several days or even weeks to finish.

In addition, most research questions in non-model organisms focus only on protein-coding genes and their underlying pathways. Protein-coding genes assigned to KEGG pathways make up only a proportion of the whole gene set: for example, only 33.67% (8438), 29.48% (4933) and 23.86% (6050) of protein-coding genes are involved in KEGG pathways for mouse, chicken and zebrafish, respectively. Therefore, many genes in annotation databases can be considered unrelated or uninformative for many non-model organisms.

Furthermore, reconstructing transcripts first and then searching for their homologs in a protein database is not straightforward, and the many intermediate steps and the programming skills required by the conventional pipeline pose additional challenges for many researchers.

Solutions

The need for multiple software tools, the high computational cost and the long run times motivated us to design a straightforward, assembly-free, all-in-one tool to tackle these challenges. Because the majority of RNA-seq reads come from mRNA and are intron-free, we propose translating RNA-seq reads directly into all possible amino acid (AA) sequences across the six open reading frames (ORFs), and searching them against a protein database consisting only of KEGG genes to identify their likely functional homologs.
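The translation step described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code, maps any codon containing an ambiguous base to "X", and uses hypothetical function names (`reverse_complement`, `translate`, `six_frame_translate`); it is not Seq2Fun's actual implementation, which may, for example, split translations at stop codons or filter short fragments.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(read):
    """Translate a read in three forward and three reverse frames."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translate("ATGGCCTAA").items():
        print(frame, peptide)
```

Each of the six peptides would then be a candidate query against the protein database; stop codons appear as "*" in this sketch.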

Therefore, we developed Seq2Fun, an ultra-fast, assembly-free, all-in-one tool for functional quantification of RNA-seq reads from non-model organisms, without requiring transcriptome assembly or a reference genome. It is based on a modern data structure, the full-text index in minute space (FM-index), which is built on the Burrows-Wheeler transform (BWT).
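To give a rough idea of how a BWT-based FM-index supports fast searching, the following toy Python sketch builds an index over a short protein string and counts exact occurrences of a pattern by backward search. The function names (`build_fm_index`, `count_occurrences`) and the example sequence are made up for illustration, the suffix array is built naively, and nothing here reflects how Seq2Fun actually constructs or queries its index.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence table) for text."""
    text += "$"  # sentinel, assumed to sort before every other symbol
    # Naive suffix array: fine for a toy string, far too slow for a
    # real protein database.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[s]: number of symbols in text strictly smaller than s.
    counts = Counter(text)
    c_table, total = {}, 0
    for sym in sorted(counts):
        c_table[sym] = total
        total += counts[sym]
    # occ[s][i]: number of occurrences of s in bwt[:i].
    occ = {sym: [0] * (len(bwt) + 1) for sym in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (ch == sym)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count exact matches of pattern using backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("MKVLAAGMKV")
    print(count_occurrences(index, "MKV"))
```

The key property is that the search cost depends on the pattern length rather than on the size of the indexed text, which is what makes this family of indexes attractive for querying many short reads.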

Seq2Fun takes raw RNA-seq reads directly as input, performs quality control, then runs a translated search, and finally generates a KO abundance table, as well as tables of hit pathways, hit species and read-level KO assignments.
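As a rough illustration of what a KO abundance table contains, the sketch below counts, for each sample, how many reads were assigned to each KO. The input layout, the function name `ko_abundance_table`, the sample names and the KO identifiers are all placeholders chosen for this example; the actual file formats written by Seq2Fun are not described here.

```python
from collections import Counter


def ko_abundance_table(read_hits_by_sample):
    """Build a KO-by-sample count table.

    read_hits_by_sample maps a sample name to an iterable of KO ids,
    one entry per read that received a KO assignment (assumed layout).
    """
    counts = {s: Counter(kos) for s, kos in read_hits_by_sample.items()}
    samples = sorted(counts)
    all_kos = sorted(set().union(*counts.values()))
    # A Counter returns 0 for KOs that a sample never hit.
    table = {ko: [counts[s][ko] for s in samples] for ko in all_kos}
    return samples, table


if __name__ == "__main__":
    hits = {
        "control": ["K00001", "K00001", "K00002"],  # placeholder KO ids
        "treated": ["K00002", "K00003"],
    }
    samples, table = ko_abundance_table(hits)
    print("KO", *samples, sep="\t")
    for ko, row in table.items():
        print(ko, *row, sep="\t")
```

A table of this shape (KOs as rows, samples as columns) is the usual starting point for downstream steps such as differential abundance and pathway analysis.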

Figure 2: The workflow of direct translated search. It skips de novo transcriptome assembly and conducts a translated search directly on each read.