Case study: RNA-sequencing of liver tissue from double-crested cormorant (DCCO) embryos

This short tutorial below demonstrates how to run Seq2Fun. We use a RNA-seq dataset from a real non-model organism double-crested cormorant (DCCO), treated with ethinyl estradiol (EE2) as a show case.

Experimental design

1. Description of experiment

DCCO embryos were exposed via egg injection to EE2, a synthetic estrogen that is the active substance in some forms of birth control. Livers were harvested after 14 days exposure and immediately frozen in liquid nitrogen for total RNA extraction. Total RNA was sent to Genome Quebec (Montreal, Quebec, Canada), to build sequencing library with TruSeq RNA Library Prep Kit (San Diego, California, United States) before submitted to Illumina NovaSeq 6000 (San Diego, California, United States) for 100 bp PE reads sequencing.

2. Experimental samples

Each sample was subsampled with 5 million reads, just for demonstration purpose.
The samples can be download from here

Group Chemicals (dose) Number of samples Number of reads
High EE2 (31.9 mg/ml) 5 25,000,000
Medium EE2 (2.3 mg/ml) 5 25,000,000
Control DMSO 5 25,000,000

3. Database used

Group Number of proteins Number of Orthologs Number of species Database name
Birds 2,189,089 15,537 91 birds.tar.gz
Running Seq2Fun

1. Preparing sample.txt file

This file consists of 4 columns and must be separated by tab '\t'. The first column is the prefix name of each sample, and the second is the forward reads file, the last column is the reverse reads file. If you have a single-end (SE) reads, remove the reverse reads (second) column.

A1.CE2-S1	A1.CE2-S1_R1.fastq.gz	A1.CE2-S1_R1.fastq.gz	control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz A3.CE1-M2_R1.fastq.gz middle
A4.CE1-H5 A4.CE1-H5_R1.fastq.gz A4.CE1-H5_R1.fastq.gz high
B1.CE2-S2 B1.CE2-S2_R1.fastq.gz B1.CE2-S2_R1.fastq.gz control
B3.CE1-M3 B3.CE1-M3_R1.fastq.gz B3.CE1-M3_R1.fastq.gz middle
C1.CE2-S3 C1.CE2-S3_R1.fastq.gz C1.CE2-S3_R1.fastq.gz control
C3.CE1-M4 C3.CE1-M4_R1.fastq.gz C3.CE1-M4_R1.fastq.gz middle
D1.CE2-S4 D1.CE2-S4_R1.fastq.gz D1.CE2-S4_R1.fastq.gz control
D3.CE1-M5 D3.CE1-M5_R1.fastq.gz D3.CE1-M5_R1.fastq.gz middle
E1.CE2-S5 E1.CE2-S5_R1.fastq.gz E1.CE2-S5_R1.fastq.gz control
E3.CE1-H1 E3.CE1-H1_R1.fastq.gz E3.CE1-H1_R1.fastq.gz high
F3.CE1-H2 F3.CE1-H2_R1.fastq.gz F3.CE1-H2_R1.fastq.gz high
G3.CE1-H3 G3.CE1-H3_R1.fastq.gz G3.CE1-H3_R1.fastq.gz high
H2.CE1-M1 H2.CE1-M1_R1.fastq.gz H2.CE1-M1_R1.fastq.gz middle
H3.CE1-H4 H3.CE1-H4_R1.fastq.gz H3.CE1-H4_R1.fastq.gz high

2. Running Seq2Fun to quantify RNA-seq reads.

Using profiling mode to produces 3 main tables 1). s2f_id/ortholog abundance table for all samples, and s2f_id/ortholog abundance table for each sample, 2). s2f_id/ortholog abundance table for submitting to networkanalyst, 3). s2f_id/ortholog annotation table for submitting to networkanalyst, 4). a html report summarizing all samples.

S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt -w 8 --profiling -V --outputMappedCleanReads --outputReadsAnnoMap 
                        
Results

1.0 An important html report to summarize all samples(please check first) (All_samples.html).

1.1 s2f id abundance for all the samples table (All_samples_s2fid_abundance_table.txt).

This table has s2f id, sample names and annotation separated by '\t'. (how many reads have assigned to the s2f id), the full name of the assigned s2f id annotation.

#NAME	A1.CE2-S1	A3.CE1-M2	A4.CE1-H5	B1.CE2-S2	B3.CE1-M3	C1.CE2-S3	C3.CE1-M4	D1.CE2-S4	D3.CE1-M5	E1.CE2-S5	E3.CE1-H1	F3.CE1-H2	G3.CE1-H3	H2.CE1-M1	H3.CE1-H4	annotation
#CLASS:XX control middle high control middle control middle control middle control high high high middle high -
s2f_10 596 723 689 326 721 728 831 459 407 394 616 633 499 823 681 K13524|GO:0003824;GO:0003867;GO:0005739;GO:0008483;GO:0009448;GO:0009450;GO:0016740;GO:0030170;GO:0032144;GO:0034386;GO:0042135;GO:0042802;GO:0047298;GO:0048148;|ABAT|4-aminobutyrate aminotransferase
s2f_100 6 17 22 1 22 7 27 5 10 6 13 11 5 17 24 K04137|GO:0004930;GO:0004935;GO:0004937;GO:0005886;GO:0007165;GO:0007186;GO:0016020;GO:0016021;GO:0071875;|ADRA1D|adrenoceptor alpha 1d
s2f_1000 164 107 103 66 148 143 140 104 94 129 92 91 105 123 147 K00344|GO:0008270;GO:0016491|CRYZ|crystallin zeta
s2f_10000 39 52 56 51 77 46 68 45 58 42 72 77 61 79 60 K10105|GO:0006464;GO:0009249|LIPT1|lipoyltransferase 1
s2f_10001 156 323 283 211 324 134 457 214 439 202 368 362 335 309 259 K14565|GO:0001094;GO:0001650;GO:0005634;GO:0005654;GO:0005730;GO:0005732;GO:0005829;GO:0015030;GO:0030515;GO:0031428;GO:0032040;GO:0042254;GO:0048254;GO:0051117;GO:0070761;|NOP58|nop58 ribonucleoprotein
...

1.3 s2f id abundance for each sample table (A1.CE2-S1_s2fid_abundance.txt).

This table has s2f id, Reads_count (how many reads have assigned to the s2f id) and annotation separated by '\t'.

#s2f_id	Reads_cout	annotation
s2f_10 596 K13524|GO:0003824;GO:0003867;GO:0005739;GO:0008483;GO:0009448;GO:0009450;GO:0016740;GO:0030170;GO:0032144;GO:0034386;GO:0042135;GO:0042802;GO:0047298;GO:0048148;|ABAT|4-aminobutyrate aminotransferase
s2f_100 6 K04137|GO:0004930;GO:0004935;GO:0004937;GO:0005886;GO:0007165;GO:0007186;GO:0016020;GO:0016021;GO:0071875;|ADRA1D|adrenoceptor alpha 1d
s2f_1000 164 K00344|GO:0008270;GO:0016491|CRYZ|crystallin zeta
s2f_10000 39 K10105|GO:0006464;GO:0009249|LIPT1|lipoyltransferase 1
s2f_10001 156 K14565|GO:0001094;GO:0001650;GO:0005634;GO:0005654;GO:0005730;GO:0005732;GO:0005829;GO:0015030;GO:0030515;GO:0031428;GO:0032040;GO:0042254;GO:0048254;GO:0051117;GO:0070761;|NOP58|nop58 ribonucleoprotein
...
2. The following tables and figures are from (All_samples.html).

This table summarizes the general information of your samples.
Rarefaction curve (2d) for all samples.
Rarefaction curve (3d) for all samples.
Reads quality (3d) for all samples.
Downstream analysis: Pathways enrichment analysis (for Seq2Fun version 1)
We provide two options for downstream analysis, such as identification of differentially expressed KOs, pathway enrichment analysis, et al.

1. Online website-based analysis tool NetworkAnalyst
All_sample_ko_abundance_table_submit2networkanalyst.txt, this will be used to submit to NetworkAnalyst.

2. Based on the KO abundance table generated by Seq2Fun (version 1, from verion 2, we provide the s2f id abundance table), differentially expressed KOs are identified by limma (R package), and enriched pathways are analysized (GSEA) by HTSanalyzeR (R package).
We provide the R script (click here to download) to conduct pathway enrichment.
Please note that the following figure is generated by HTSanalyzeR, which may no longer be installed on current R version.
Therefore, we have developed an alternative R script using fgsea for pathway analysis and we suggest users to try the casestudy_fgsea.R.

Significantly enriched pathways computed using GSEA for DCCO.
The absolute values of KO log2FCs were used for GSEA. Pathways were considered significant if the adjusted p-value was < 0.05. The log2FC of all genes in the enriched pathways are indicated with the vertical grey lines. The distribution for each pathway is colored according the pathway’s adjusted p-value.