Case study: RNA-sequencing of liver tissue from double-crested cormorant (DCCO) embryos

This short tutorial below demonstrates how to run Seq2Fun. We use a RNA-seq dataset from a real non-model organism double-crested cormorant (DCCO), treated with ethinyl estradiol (EE2) as a show case.

Experimental design

1. Description of experiment

DCCO embryos were exposed via egg injection to EE2, a synthetic estrogen that is the active substance in some forms of birth control. Livers were harvested after 14 days exposure and immediately frozen in liquid nitrogen for total RNA extraction. Total RNA was sent to Genome Quebec (Montreal, Quebec, Canada), to build sequencing library with TruSeq RNA Library Prep Kit (San Diego, California, United States) before submitted to Illumina NovaSeq 6000 (San Diego, California, United States) for 100 bp PE reads sequencing.

2. Experimental samples

Each sample was subsampled with 5 million reads, just for demonstration purpose.
The samples can be download from here

Group Chemicals (dose) Number of samples Number of reads
High EE2 (33 mg/ml) 4 20,000,000
Medium EE2 (3.3 mg/ml) 5 25,000,000
Control DMSO 5 25,000,000

3. Database used

Group Number of proteins Number of KOs Number of species Database name
Birds 87,530 4,177 24 birds.tar.gz
Running Seq2Fun

1. Preparing sample.txt file

This file consists of 4 columns and must be separated by tab '\t'. The first column is the prefix name of each sample, and the second is the forward reads file, the last column is the reverse reads file. If you have a single-end (SE) reads, remove the reverse reads (second) column. It looks like this:

A1.CE2-S1-LT	A1.CE2-S1-LT_R1.fastq.gz	A1.CE2-S1-LT_R2.fastq.gz    control
A2.CE2-M4-LT A2.CE2-M4-LT_R1.fastq.gz A2.CE2-M4-LT_R2.fastq.gz medium
B1.CE2-S2-LT B1.CE2-S2-LT_R1.fastq.gz B1.CE2-S2-LT_R2.fastq.gz control
B2.CE2-M5-LT B2.CE2-M5-LT_R1.fastq.gz B2.CE2-M5-LT_R2.fastq.gz medium
C1.CE2-S3-LT C1.CE2-S3-LT_R1.fastq.gz C1.CE2-S3-LT_R2.fastq.gz control
D1.CE2-S4-LT D1.CE2-S4-LT_R1.fastq.gz D1.CE2-S4-LT_R2.fastq.gz control
D2.CE2-H2-LT D2.CE2-H2-LT_R1.fastq.gz D2.CE2-H2-LT_R2.fastq.gz high
E1.CE2-S5-LT E1.CE2-S5-LT_R1.fastq.gz E1.CE2-S5-LT_R2.fastq.gz control
E2.CE2-H3-LT E2.CE2-H3-LT_R1.fastq.gz E2.CE2-H3-LT_R2.fastq.gz high
F1.CE2-M1-LT F1.CE2-M1-LT_R1.fastq.gz F1.CE2-M1-LT_R2.fastq.gz medium
F2.CE2-H4-LT F2.CE2-H4-LT_R1.fastq.gz F2.CE2-H4-LT_R2.fastq.gz high
G1.CE2-M2-LT G1.CE2-M2-LT_R1.fastq.gz G1.CE2-M2-LT_R2.fastq.gz medium
G2.CE2-H5-LT G2.CE2-H5-LT_R1.fastq.gz G2.CE2-H5-LT_R2.fastq.gz high
H1.CE2-M3-LT H1.CE2-M3-LT_R1.fastq.gz H1.CE2-M3-LT_R2.fastq.gz medium

2. Running Seq2Fun to quantify RNA-seq reads.

Seq2Fun has two output modes: comparative and profiling mode (default).
The comparative mode produces only the KO abundance table, while the profiling mode produces 4 tables 1). KO abundance table for all samples, and KO abundance table for each sample, 2). hit pathway, 3). hit species, 4). reads KO table, 5). a html report summarizing all samples. and 6) a html report summarizing these tables summarizing these tables for each sample.

S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds_cdhit99_proteins.fmi --genemap S2F_HOME/database/birds/birds_protein_ko_species_cdhit99.txt -w 8 --profiling -V --outputMappedCleanReads --outputReadsKOMap 
                        
Results

1.0 An important html report to summarize all samples(please check first) (All_samples.html).

1.1 KO abundance for all the samples table (KO_abundance.txt)

This table has KO id, sample names and KO name separated by '\t'. (how many reads have assigned to the homology KO), the full name of the assigned KO. It looks like this:

#Name   A1.CE2-S1-LT    A2.CE2-M4-LT    B1.CE2-S2-LT    B2.CE2-M5-LT    C1.CE2-S3-LT    D1.CE2-S4-LT    D2.CE2-H2-LT    E1.CE2-S5-LT    E2.CE2-H3-LT    F1.CE2-M1-LT    F2.CE2-H4-LT    G1.CE2-M2-LT    G2.CE2-H5-LT    H1.CE2-M3-LT    KO_name
#Class control medium control medium control control high control high medium high medium high medium class_info
K00002 118 96 386 131 147 141 106 129 120 98 148 117 136 121 AKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
K00006 629 604 235 648 664 506 628 499 670 455 838 615 579 521 GPD1; glycerol-3-phosphate dehydrogenase (NAD+) [EC:1.1.1.8]
K00008 971 755 715 770 1122 770 1058 1010 1023 829 1055 1351 954 1139 SORD, gutB; L-iditol 2-dehydrogenase [EC:1.1.1.14]
K00010 17 18 31 17 17 14 15 17 17 17 15 9 20 25 iolG; myo-inositol 2-dehydrogenase D-chiro-inositol 1-dehydrogenase [EC:1.1.1.18 1.1.1.369]
K00011 276 292 1581 303 315 336 290 305 353 263 316 279 290 296 AKR1B; aldehyde reductase [EC:1.1.1.21]
K00012 130 94 609 112 111 134 127 344 103 193 163 382 116 299 UGDH, ugd; UDPglucose 6-dehydrogenase [EC:1.1.1.22]
...

1.2 KO abundance table for each sample table (A1.CE2-S1_ko_abundance.txt)

This table has three columns separated by '\t', KO_id, from highest to lowest reads_count (how many reads have assigned to the homology KO), the full name of the assigned KO. It looks like this:

KO_id	reads_count	KO_name
K14736 64798 TF; transferrin
K03231 32934 EEF1A; elongation factor 1-alpha
K00522 30426 FTH1; ferritin heavy chain [EC:1.16.3.2]
K00134 29734 GAPDH, gapA; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
K08757 24860 APOA1; apolipoprotein A-I
K02261 22467 COX2; cytochrome c oxidase subunit 2
... ... ...

A figure based on this table is also produced in the html report (A1.CE2-S1_functional.html).

2. Hit pathway table (A1.CE2-S1_pathway_hits.txt).

This table has five columns separated by '\t', pathway_id, pathway_name, KO_id (which hit KOs are mapped to this pathway), KO_count and KO_name. It looks like this:

pathway_id	pathway_name                    KO_id	KO_count	KO_name
map00010 Glycolysis_/_Gluconeogenesis K00002 118 AKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
map00010 Glycolysis_/_Gluconeogenesis K00016 12019 LDH, ldh; L-lactate dehydrogenase [EC:1.1.1.27]
map00010 Glycolysis_/_Gluconeogenesis K00121 573 frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
map00010 Glycolysis_/_Gluconeogenesis K00128 3721 ALDH; aldehyde dehydrogenase (NAD+) [EC:1.2.1.3]
map00010 Glycolysis_/_Gluconeogenesis K00129 13 E1.2.1.5; aldehyde dehydrogenase (NAD(P)+) [EC:1.2.1.5]
map00010 Glycolysis_/_Gluconeogenesis K00134 29734 GAPDH, gapA; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
map00010 Glycolysis_/_Gluconeogenesis K00149 4504 ALDH9A1; aldehyde dehydrogenase family 9 member A1 [EC:1.2.1.47 1.2.1.3]
... ... ... ... ...

A table summarizing the the top-hit pathways sorted by percentage of mapped KOs in the pathway is generated in the html report (A1.CE2-S1_functional.html).

Pathway ID	Percentage of mapped KOs in pathway (%) Name
map00533 100 (14/14) Glycosaminoglycan_biosynthesis_-_keratan_sulfate
map04340 97.4359 (38/39) Hedgehog_signaling_pathway
map04530 97.0297 (98/101) Tight_junction
map00563 96.1538 (25/26) Glycosylphosphatidylinositol_(GPI)-anchor_biosynthesis
map04136 95.6522 (22/23) Autophagy_-_other
map00510 95.5556 (43/45) N-Glycan_biosynthesis
map04370 94.8718 (37/39) VEGF_signaling_pathway
map04110 94.5455 (104/110) Cell_cycle
... ... ...

A barplot showing the top-hit pathways based on the abovementioned table is generated in the html report (A1.CE2-S1_functional.html).

3. Hit species table (A1.CE2-S1_species_hits.txt).

This table has two columns separated by '\t', species name and number of KOs (sorted by descending order) assigned to this species.

species                     number_of_KOs
Nipponia_nippon 1470
Egretta_garzetta 1200
Pygoscelis_adeliae 1187
Columba_livia 1057
Athene_cunicularia 1048
Apteryx_mantelli_mantelli 840
Empidonax_traillii 820
Falco_cherrug 779
Gallus_gallus 658
Anas_platyrhynchos 564
Anser_cygnoides_domesticus 542
Falco_peregrinus 510
... ...

A barplot show the top-hit species based on this table is generated in the html report (A1.CE2-S1_functional.html).

4. Reads KO table (A1.CE2-S1_reads_ko.txt).

This table has three columns separated by '\t', reads_id, KO_id (which homology KO assigned) and KO_name.

reads_id                                KO_id	KO_name
A00266:275:HLFTWDSXX:2:1101:10013:26537 K14736 TF; transferrin
A00266:275:HLFTWDSXX:2:1101:10122:16235 K14736 TF; transferrin
A00266:275:HLFTWDSXX:2:1101:10122:17174 K03883 ND5; NADH-ubiquinone oxidoreductase chain 5 [EC:7.1.1.2]
A00266:275:HLFTWDSXX:2:1101:10122:2174 K00602 purH; phosphoribosylaminoimidazolecarboxamide formyltransferase / IMP cyclohydrolase [EC:2.1.2.3 3.5.4.10]
A00266:275:HLFTWDSXX:2:1101:10140:28510 K00799 GST, gst; glutathione S-transferase [EC:2.5.1.18]
A00266:275:HLFTWDSXX:2:1101:10149:34225 K06238 COL6A; collagen, type VI, alpha
A00266:275:HLFTWDSXX:2:1101:10185:19382 K03883 ND5; NADH-ubiquinone oxidoreductase chain 5 [EC:7.1.1.2]
A00266:275:HLFTWDSXX:2:1101:10212:16423 K11188 PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin [EC:1.11.1.7 1.11.1.15 3.1.1.-]
A00266:275:HLFTWDSXX:2:1101:10212:36777 K08737 MSH6; DNA mismatch repair protein MSH6
A00266:275:HLFTWDSXX:2:1101:10294:9236 K02938 RP-L8e, RPL8; large subunit ribosomal protein L8e
A00266:275:HLFTWDSXX:2:1101:10321:31735 K14736 TF; transferrin
A00266:275:HLFTWDSXX:2:1101:10339:16297 K06171 NCSTN; nicastrin
A00266:275:HLFTWDSXX:2:1101:10375:24972 K04660 DCN; decorin
A00266:275:HLFTWDSXX:2:1101:10402:14121 K02132 ATPeF1A, ATP5A1, ATP1; F-type H+-transporting ATPase subunit alpha
A00266:275:HLFTWDSXX:2:1101:10420:21198 K03231 EEF1A; elongation factor 1-alpha
A00266:275:HLFTWDSXX:2:1101:10438:27179 K00149 ALDH9A1; aldehyde dehydrogenase family 9 member A1 [EC:1.2.1.47 1.2.1.3]
A00266:275:HLFTWDSXX:2:1101:1045:22498 K00255 ACADL; long-chain-acyl-CoA dehydrogenase [EC:1.3.8.8]
A00266:275:HLFTWDSXX:2:1101:1045:33144 K00134 GAPDH, gapA; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
A00266:275:HLFTWDSXX:2:1101:10465:27508 K05692 ACTB_G1; actin beta/gamma 1
... ... ...

The number of mapped reads and table of raw reads are summarized in the html report (A1.CE2-S1_functional.html).

5. Functional html report: A1.CE2-S1_functional.html

5.1 This Functional html report reports summary of the four tables.

Number of KOs annotated         	3689
Number of top-hit species 24
Number of top-hit pathways 401
Number of mapped reads / total reads 1362106 / 5000000 (27.242119%)

5.2 A rarefaction curve to show the depth of reads against number of KOs mapped.

This figure infers 2 pieces of information: the sequence depth of the sample and how many reads are needed to saturate the gain of KOs.
If the sequencing depth reaches plateau, that means with the number of reads continue to be increased, the gain of KOs becomes minimal.
If you want a 3 to 5 times speed up, you can using --reads_to_process to specify the number of reads you want to process.

6. Reads quality check html report (A1.CE2-S1_reads_quality.html).

This table summarizes the general information of your samples.
Rarefaction curve (2d) for all samples.
Rarefaction curve (3d) for all samples.
Rarefaction curve (3d) for all samples.
Downstream analysis: Pathways enrichment analysis
We provide two options for downstream analysis, such as identification of differentially expressed KOs, pathway enrichment analysis, et al.

1. Online website-based analysis tool NetworkAnalyst
Seq2Fun generates two KO abundance tables:
All_sample_KO_abundance_table.txt and All_sample_KO_abundance_table_submit2networkanalyst.txt, the later will be used to submit to NetworkAnalyst.

2. Based on the KO abundance table generated by Seq2Fun, differentially expressed KOs are identified by limma (R package), and enriched pathways are analysized (GSEA) by HTSanalyzeR (R package).
We provide the R script (click here to download) to conduct pathway enrichment.
Please note that the following figure is generated by HTSanalyzeR, which may no longer be installed on current R version.
Therefore, we have developed an alternative R script using fgsea for pathway analysis and we suggest users to try the casestudy_fgsea.R.

Significantly enriched pathways computed using GSEA for DCCO.
The absolute values of KO log2FCs were used for GSEA. Pathways were considered significant if the adjusted p-value was < 0.05. The log2FC of all genes in the enriched pathways are indicated with the vertical grey lines. The distribution for each pathway is colored according the pathway’s adjusted p-value.