Case study: RNA-sequencing of liver tissue from double-crested cormorant (DCCO) embryos

This short tutorial below demonstrates how to run Seq2Fun. We use a RNA-seq dataset from a real non-model organism double-crested cormorant (DCCO), treated with ethinyl estradiol (EE2) as a show case.

Experimental design

1. Description of experiment

DCCO embryos were exposed via egg injection to EE2, a synthetic estrogen that is the active substance in some forms of birth control. Livers were harvested after 14 days exposure and immediately frozen in liquid nitrogen for total RNA extraction. Total RNA was sent to Genome Quebec (Montreal, Quebec, Canada), to build sequencing library with TruSeq RNA Library Prep Kit (San Diego, California, United States) before submitted to Illumina NovaSeq 6000 (San Diego, California, United States) for 100 bp PE reads sequencing.

2. Experimental samples

Each sample was subsampled with 5 million reads, just for demonstration purpose.
The samples can be download from here

Group Chemicals (dose) Number of samples Number of reads
High EE2 (31.9 mg/ml) 5 25,000,000
Medium EE2 (2.3 mg/ml) 5 25,000,000
Control DMSO 5 25,000,000

3. Database used

Group Number of proteins Number of Orthologs Number of species Database name
Birds 482,205 24,076 31 birds.tar.gz
Running Seq2Fun

1. Preparing sample.txt file

This file consists of 4 columns and must be separated by tab '\t'. The first column is the prefix name of each sample, and the second is the forward reads file, the last column is the reverse reads file. If you have a single-end (SE) reads, remove the reverse reads (second) column.

A1.CE2-S1	A1.CE2-S1_R1.fastq.gz	A1.CE2-S1_R1.fastq.gz	control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz A3.CE1-M2_R1.fastq.gz middle
A4.CE1-H5 A4.CE1-H5_R1.fastq.gz A4.CE1-H5_R1.fastq.gz high
B1.CE2-S2 B1.CE2-S2_R1.fastq.gz B1.CE2-S2_R1.fastq.gz control
B3.CE1-M3 B3.CE1-M3_R1.fastq.gz B3.CE1-M3_R1.fastq.gz middle
C1.CE2-S3 C1.CE2-S3_R1.fastq.gz C1.CE2-S3_R1.fastq.gz control
C3.CE1-M4 C3.CE1-M4_R1.fastq.gz C3.CE1-M4_R1.fastq.gz middle
D1.CE2-S4 D1.CE2-S4_R1.fastq.gz D1.CE2-S4_R1.fastq.gz control
D3.CE1-M5 D3.CE1-M5_R1.fastq.gz D3.CE1-M5_R1.fastq.gz middle
E1.CE2-S5 E1.CE2-S5_R1.fastq.gz E1.CE2-S5_R1.fastq.gz control
E3.CE1-H1 E3.CE1-H1_R1.fastq.gz E3.CE1-H1_R1.fastq.gz high
F3.CE1-H2 F3.CE1-H2_R1.fastq.gz F3.CE1-H2_R1.fastq.gz high
G3.CE1-H3 G3.CE1-H3_R1.fastq.gz G3.CE1-H3_R1.fastq.gz high
H2.CE1-M1 H2.CE1-M1_R1.fastq.gz H2.CE1-M1_R1.fastq.gz middle
H3.CE1-H4 H3.CE1-H4_R1.fastq.gz H3.CE1-H4_R1.fastq.gz high

2. Running Seq2Fun to quantify RNA-seq reads.

Using profiling mode to produces 3 main tables 1). s2f_id/ortholog abundance table for all samples, 2). s2f_id/ortholog abundance table for submitting to networkanalyst, 3). s2f_id/ortholog annotation table, 4). a html report summarizing all samples.

S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt -w 8 --profiling -V --outputMappedCleanReads --outputReadsAnnoMap 

1.1 An important html report to summarize all samples(please check it first) (Seq2Fun_summary_all_samples.html).

1.2 s2f id abundance for all the samples table (S2fid_abundance_table_all_samples.txt).

This table has s2f id, sample names and annotation separated by '\t'. (how many reads have assigned to the s2f id), the full name of the assigned s2f id annotation.

#NAME	A1.CE2-S1	A3.CE1-M2	A4.CE1-H5	B1.CE2-S2	B3.CE1-M3	C1.CE2-S3	C3.CE1-M4	D1.CE2-S4	D3.CE1-M5	E1.CE2-S5	E3.CE1-H1	F3.CE1-H2	G3.CE1-H3	H2.CE1-M1	H3.CE1-H4	annotation
#CLASS:XX control middle high control middle control middle control middle control high high high middle high -
s2f_7617 158 338 475 130 329 178 342 185 479 160 312 429 241 309 345 K04198|GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100|EDNRB|endothelin receptor type B
s2f_7570 10 9 9 8 1 8 23 8 23 9 9 7 15 7 12 K07437|GO:0004497;GO:0005506;GO:0016491;GO:0016705;GO:0020037;GO:0046872|CYP26A1|cytochrome P450 26A1
s2f_14150 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 K05254|GO:0005179;GO:0005576;GO:0007165;GO:0016608|GHRL|appetite-regulating hormone
s2f_6367 22 43 48 50 50 31 55 29 38 16 52 45 40 40 45 K03370|GO:0005794;GO:0006486;GO:0008373;GO:0016020;GO:0016021;GO:0016740;GO:0016757;GO:0097503|ST3GAL5|lactosylceramide alpha-2,3-sialyltransferase isoform X1
s2f_7077 42 81 44 63 61 61 63 66 60 83 46 65 72 74 51 U|GO:0000775;GO:0000776;GO:0005694|CFDP1|craniofacial development protein 1
s2f_12441 3 4 8 4 2 1 3 0 13 2 3 7 6 4 5 K15680|GO:0016491|HSD11B1L|hydroxysteroid 11-beta-dehydrogenase 1-like protein isoform X1
2. The following tables and figures are from (Seq2Fun_summary_all_samples.html).

This table summarizes the general information of your samples.
Rarefaction curve (2d) for all samples.
Rarefaction curve (3d) for all samples.
Reads quality (3d) for all samples.
Downstream analysis: Pathways enrichment analysis (for Seq2Fun version 1)
We provide two options for downstream analysis, such as identification of differentially expressed KOs, pathway enrichment analysis, et al.

1. Online website-based analysis tool ExpressAnalyst
S2fid_abundance_table_all_samples_submit_2_networkanalyst.txt, this will be used to submit to ExpressAnalyst.

2. Based on the KO abundance table generated by Seq2Fun (version 1, from verion 2, we provide the s2f id abundance table), differentially expressed KOs are identified by limma (R package), and enriched pathways are analysized (GSEA) by HTSanalyzeR (R package).
We provide the R script (click here to download) to conduct pathway enrichment.
Please note that the following figure is generated by HTSanalyzeR, which may no longer be installed on current R version.
Therefore, we have developed an alternative R script using fgsea for pathway analysis and we suggest users to try the casestudy_fgsea.R.

Significantly enriched pathways computed using GSEA for DCCO.
The absolute values of KO log2FCs were used for GSEA. Pathways were considered significant if the adjusted p-value was < 0.05. The log2FC of all genes in the enriched pathways are indicated with the vertical grey lines. The distribution for each pathway is colored according the pathway’s adjusted p-value.