Table of Contents

  1. Full options
  2. Running Seq2Fun
  3. Input files
  4. Batch sample processing
  5. Output files
  6. Protein database for translated search
  7. Protein_KO_organism mapping table
  8. Single sample profiling
  9. MEM or GREEDY modes in amino acid sequences search in the database
  10. Minimum matching length of amino acid sequence with reference protein sequences
  11. Allowed amino acid mismatches (Greedy mode only)
  12. Minimum matching score (Greedy mode only)
  13. Keep all translated amino acid fragments (not recommend)
  14. Change codon table
  15. Trim the first 6 bases (optional)
  16. Custom built database
  17. Selected pathway analysis
  18. Automatically trimming polyA tail in reads
  19. Minimum length for merging overlapped PE read pair (PE reads only)
  20. Maximum cutoff length of translated peptides
  21. Output mapped reads
  22. General guidelines for parameter settings
  23. Submit KO abundance to NetworkAnalyst for downstream analysis
1. Seq2Fun full usage options

Use seq2fun or seq2fun --help to show the full usage options

  options:
// input/output
-s, --sampletable (recommended) sample table must consist of 3 columns (sample prefix name (sample01), forward reads name (sample01_R1.fq.gz), group info (control) for single-reads or 4 columns (sample prefix name (sample01), forward reads (sample01_R1.fq.gz), reverse reads (sample01_R2.fq.gz), group info (control) for paired-end reads. The columns must be separated by tab
-i, --in1 read1 input file name
-I, --in2 read2 input file name
-X, --prefix (not recommended) prefix name for output files, eg: sample01
--outputMappedCleanReads enable output mapped clean reads into fastq.gz files, by default is false, using --outputMappedCleanReads to enable it
--outputReadsKOMap enable output mapped clean reads-KO map into .gz files, by default is false, using --outputReadsKOMap to enable it
// Homology search;
-D, --genemap gene/protein KO species map
--profiling by default it is off. If this option is specified, 4 levels of output files will be generated, ko abundance table, hit pathway table, hit species table and ko reads mapping tableby default it is off. If this option is specified, 4 levels of output files will be generated, ko abundance table, hit pathway table, hit species table and ko reads mapping table
// translated search
-d, --tfmi fmi index of Protein database
-K, --mode searching mode either tGREEDY or tMEM (maximum exactly match). By default greedy
-E, --mismatch number of mismatched amino acid in sequence comparison with protein database with default value 2
-j, --minscore minimum matching score of amino acid sequence in comparison with protein database with default value 80
-J, --minlength minimum matching length of amino acid sequence in comparison with protein database with default value 19 for GREEDY and 13 for MEM model
-m, --maxtranslength maximum cutoff of translated peptides, it must be no less than minlength, with default 60
--allFragments enable this function will force Seq2Fun to use all the translated AA fragments with length > minlength. This will slightly help to classify reads contain the true stop codon and start codon; This could have limited impact on the accuracy for comparative study and enable this function will slow down the Seq2Fun. by default is false, using --allFragments to enable it"
--codontable select the codon table (same as blastx in NCBI), we provide 21 codon tables from NCBI. By default is the codontable1 (Standard Code), the complete codon table can be seen below.
--dbDir specify dir for internal databases such as ko_fullname.txt
//selected pathways
-Z, --pathway list of selected pathways for target pathways analysis
-z, --genefa the gene/protein sequences fasta file for retrieving proteins in selected pathways to construct database
// threading
-w, --thread worker thread number, default is 2
-V, --verbose enable verbose
--debug enable debug
--phred64 indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads
// adapter
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as adapter_sequence
--adapter_fasta specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file
--detect_adapter_for_pe by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data
//polyA tail
--no_trim_polyA by default, ployA tail will be trimmed. If this option is specified, polyA trimming is disabled
// trimming
-f, --trim_front1 trimming how many bases in front for read1, default is 0
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0
-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings
-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings
// polyG tail trimming
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
// polyX tail trimming
-x, --trim_poly_x enable polyX trimming in 3' ends
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default
// cutting by quality
--cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise
--cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise
-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop
-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4
-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20)
--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified
--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified
--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified
--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified
--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified
--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified
// quality filtering
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality <=Q15 is qualified
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40%
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5
-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement
// length filtering
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
-l, --length_required reads shorter than length_required will be discarded, default is 60
--length_limit reads longer than length_limit will be discarded, default 0 means no limitation
// low complexity filtering
--no_low_complexity_filter disable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1])
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required
// filter by indexes --filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical
// base correction in overlapped regions of paired end data
-c, --disable_correction disenable base correction in overlapped regions (only for PE data), default is enabled");
-v, --overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default
--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default
--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%
// umi
-u, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none
--umi_len if the UMI is in read1/read2, its length should be provided
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default
--umi_skip if the UMI is in read1/read2, Seq2Fun can skip several bases following UMI, default is 0
// overrepresented sequence analysis
-p, --overrepresentation_analysis enable overrepresented sequence analysis
-P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20
// deprecated options
--cut_by_quality5 DEPRECATED, use --cut_front instead
--cut_by_quality3 DEPRECATED, use --cut_tail instead
--cut_by_quality_aggressive DEPRECATED, use --cut_right instead
--discard_unmerged DEPRECATED, no effect now, see the introduction for merging
We have followed the codon tables from NCBI;
Seq2Fun NCBI
codontable1 The Standard Code (transl_table=1)
codontable2 The Vertebrate Mitochondrial Code (transl_table=2)
codontable3 The Yeast Mitochondrial Code (transl_table=3)
codontable4 The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (transl_table=4)
codontable5 The Invertebrate Mitochondrial Code (transl_table=5)
codontable6 The Ciliate, Dasycladacean and Hexamita Nuclear Code (transl_table=6)
codontable9 The Echinoderm and Flatworm Mitochondrial Code (transl_table=9)
codontable10 The Euplotid Nuclear Code (transl_table=10)
codontable12 The Alternative Yeast Nuclear Code (transl_table=12)
codontable13 The Ascidian Mitochondrial Code (transl_table=13)
codontable14 The Alternative Flatworm Mitochondrial Code (transl_table=14)
codontable16 Chlorophycean Mitochondrial Code (transl_table=16)
codontable21 Trematode Mitochondrial Code (transl_table=21)
codontable22 Scenedesmus obliquus Mitochondrial Code (transl_table=22)
codontable24 Rhabdopleuridae Mitochondrial Code (transl_table=24)
codontable26 Pachysolen tannophilus Nuclear Code (transl_table=26)
codontable27 Karyorelict Nuclear Code (transl_table=27)
codontable29 Mesodinium Nuclear Code (transl_table=29)
codontable30 Peritrich Nuclear Code (transl_table=30)
codontable31 Blastocrithidia Nuclear Code (transl_table=31)
codontable33 Cephalodiscidae Mitochondrial UAA-Tyr Code (transl_table=33)

2. Running Seq2Fun

Back to top

In order to make it clear and just for demonstration purpose, after downloading and installation of seq2fun, we first refer the path to seq2fun installation directory (e.g. /home/peng/Software/seq2fun) to "S2F_HOME".
This "S2F_HOME" contains several directories, e.g., bin contains executable files seq2fun mkbwt mkfmi, therefore the location of executable file seq2fun is S2F_HOME/bin/seq2fun
Similarly, we create a new directory "database" in S2F_HOME, which is used to store databases of different groups of organisms. After downloading the database, we put them in the database directory, e.g., mammals, birds, fishes. Each database contains .fmi, protein_ko_species_cdhit99.txt files. Here, the database path is S2F_HOME/database/. Otherwise, you can put the database anywhere that is clear to you.

2.1. a typical run for a single sample of pair-end (PE) reads with comparative mode, which will only generate a KO abundance table for each sample (total KO abundance table will not be produced, using -s batch mode to produce it) (not recommended):

S2F_HOME/bin/seq2fun -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R2.fastq.gz --tfmi S2F_HOME/database/birds/birds_cdhit99_proteins.fmi --genemap S2F_HOME/database/birds/birds_protein_ko_species_cdhit99.txt --prefix A1.CE2-S1_R1 -w 8
                            
or for batch sample processing, and it will produce total KO abundance table for all samples, and KO abundant table for each sample (recommended)
S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds_cdhit99_proteins.fmi --genemap S2F_HOME/database/birds/birds_protein_ko_species_cdhit99.txt -w 8
                            
a sample.txt is a table consisting of four columns must be separated by tab '\t'. The first column is the prefix name of sample, the second one is filename of forward reads, the third one is the filename of reverse reads and last one is the class info of that sample.
A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   A1.CE2-S1_R2.fastq.gz   control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz A3.CE1-M2_R2.fastq.gz medium
... ... ...
or a sample.txt with path for output and input.
/path/to/A1.CE2-S1   /path/to/A1.CE2-S1_R1.fastq.gz   /path/to/A1.CE2-S1_R2.fastq.gz   control
/path/to/A3.CE1-M2 /path/to/A3.CE1-M2_R1.fastq.gz /path/to/A3.CE1-M2_R2.fastq.gz medium
... ... ...
2.2. a typical run for a single sample of pair-end (PE) reads with profiling mode, which will generate 5 levels of tables of a KO abundance, hit pathway, hit species, reads ko, as well as html reports summarizing these tables and rarefaction curve, as well as reads quality checks for all samples and each sample.
S2F_HOME/bin/seq2fun -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R2.fastq.gz --tfmi S2F_HOME/database/birds/birds_cdhit99_proteins.fmi --genemap S2F_HOME/database/birds/birds_protein_ko_species_cdhit99.txt --prefix A1.CE2-S1 -w 8 --profiling
                            
or for batch sample processing
S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds_cdhit99_proteins.fmi --genemap S2F_HOME/database/birds/birds_protein_ko_species_cdhit99.txt -w 8 --profiling
                            

3. Input files

Back to top

Seq2Fun supports both paired-end(PE) and single-end(SE) reads from various sequencing platforms e.g., Illumina. There is no reads length limitation. It supports both fastq and fastq.gz file format.

3.1 For PE reads using -i or --in1 and -I or --in2 for forward and reverse reads, respectively.
e.g.: -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R1.fastq.gz

3.2 For SE reads using -i or --in1 for forward reads.
e.g.: -i A1.CE2-S1_R1.fastq.gz

4. Batch sample processing (highly recommended)

Back to top

It is highly recommended to use bach processing mode. Batch sample processing not only avoids repeatedly loading and deleting database when processing many samples, but also produces KO abundance table consisting all samples.

Using -s or --sampletable table to specify the file containing reads files and output file.
e.g.: -s sample.txt

4.1 For PE reads using -i or --in1 and -I or --in2 for forward and reverse reads, respectively.
For PE reads, sample_table must consist of four columns using tab ("\t") as a separator. The first column is prefix name of sample, the second is the filename of forward reads, the third one is the filename of reverse reads and last one is the sample class info.

A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   A1.CE2-S1_R2.fastq.gz   control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz A3.CE1-M2_R2.fastq.gz medium
... ... ...
4.2 For SE reads, sample_table must consist of three columns using the tab ("\t") as a separator. The first column is the prefix name of sample and the second one is the filename of forward reads, and last one is the sample group info
A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz medium
... ...

5. Output files

Back to top

By default, Seq2Fun generate only 1 main output text file KO abundance tables for all samples and each sample.
If you are using --profiling mode, there are 5 main output text files: KO abundance table, hit pathway table and hit species table, reads ko table (reads annotation table); and two html reports containing tables and figures of summary these tables, as well as reads quality check.

5.1 If you use input files from 2.1
Using -X or --prefix to specify the prefix name of output files
e.g.: -X A1.CE2-S1

5.2 If you use batch sample process mode from 3.1, the last column in sample.txt is the prefix name of output files
Using -s or --sampletable to specify file containing both input and output files.
e.g.: -s sample.txt

5.3.1. An output file of KO abundance table for all samples. It is only generated by batch mode

#Name   A1.CE2-S1-LT    A2.CE2-M4-LT    B1.CE2-S2-LT    B2.CE2-M5-LT    C1.CE2-S3-LT    D1.CE2-S4-LT    D2.CE2-H2-LT    E1.CE2-S5-LT    E2.CE2-H3-LT    F1.CE2-M1-LT    F2.CE2-H4-LT    G1.CE2-M2-LT    G2.CE2-H5-LT    H1.CE2-M3-LT    KO_name
#Class control medium control medium control control high control high medium high medium high medium class_info
K00002 118 96 386 131 147 141 106 129 120 98 148 117 136 121 AKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
K00006 629 604 235 648 664 506 628 499 670 455 838 615 579 521 GPD1; glycerol-3-phosphate dehydrogenase (NAD+) [EC:1.1.1.8]
K00008 971 755 715 770 1122 770 1058 1010 1023 829 1055 1351 954 1139 SORD, gutB; L-iditol 2-dehydrogenase [EC:1.1.1.14]
K00010 17 18 31 17 17 14 15 17 17 17 15 9 20 25 iolG; myo-inositol 2-dehydrogenase D-chiro-inositol 1-dehydrogenase [EC:1.1.1.18 1.1.1.369]
K00011 276 292 1581 303 315 336 290 305 353 263 316 279 290 296 AKR1B; aldehyde reductase [EC:1.1.1.21]
K00012 130 94 609 112 111 134 127 344 103 193 163 382 116 299 UGDH, ugd; UDPglucose 6-dehydrogenase [EC:1.1.1.22]
...
5.3.2. An output file of KO abundance (A1.CE2-S1_ko_abundance.txt) is a 3 columns table separated by '\t', with KO id, count and KO name.See also 4.7 html report
KO_id	reads_count	KO_name
K14736 64798 TF; transferrin
K03231 32934 EEF1A; elongation factor 1-alpha
K00522 30426 FTH1; ferritin heavy chain [EC:1.16.3.2]
K00134 29734 GAPDH, gapA; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
K08757 24860 APOA1; apolipoprotein A-I
K02261 22467 COX2; cytochrome c oxidase subunit 2
... ... ...
5.4. (Using --profiling mode) An output file of hit pathway (A1.CE2-S1_pathway_hits.txt) is a 5 columns table separated by '\t', with pathway ID, pathway name, KO ID, KO count and KO name.See also 4.7 html report
pathway_id	pathway_name                    KO_id	ko_count    KO_name
map00010 Glycolysis_/_Gluconeogenesis K00002 118 AKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
map00010 Glycolysis_/_Gluconeogenesis K00016 12019 LDH, ldh; L-lactate dehydrogenase [EC:1.1.1.27]
map00010 Glycolysis_/_Gluconeogenesis K00121 573 frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
map00010 Glycolysis_/_Gluconeogenesis K00128 3721 ALDH; aldehyde dehydrogenase (NAD+) [EC:1.2.1.3]
map00010 Glycolysis_/_Gluconeogenesis K00129 13 E1.2.1.5; aldehyde dehydrogenase (NAD(P)+) [EC:1.2.1.5]
map00010 Glycolysis_/_Gluconeogenesis K00134 29734 GAPDH, gapA; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
map00010 Glycolysis_/_Gluconeogenesis K00149 4504 ALDH9A1; aldehyde dehydrogenase family 9 member A1 [EC:1.2.1.47 1.2.1.3]
... ... ... ... ...
5.5. (Using --profiling mode) An output file of hit species (A1.CE2-S1_species_hits.txt) is a 2 columns table separated by '\t', with species name and number of KOs assigned to this species.See also 4.7 html report
species                         number_of_KOs
Nipponia_nippon 2381
Columba_livia 2258
Egretta_garzetta 2254
Pygoscelis_adeliae 2174
Athene_cunicularia 2098
Apteryx_mantelli_mantelli 2044
Gallus_gallus 1972
Empidonax_traillii 1969
Falco_cherrug 1928
... ...
5.6. (Using --profiling mode) An output file of reads KO (A1.CE2-S1_reads_ko.txt) is a 3 columns table separated by '\t', with read ID, KO ID and KO name. See also 4.7 html report
reads_id                                KO_id	KO_name
A00266:275:HLFTWDSXX:2:1101:10013:26537 K14736 TF; transferrin
A00266:275:HLFTWDSXX:2:1101:10122:16235 K14736 TF; transferrin
A00266:275:HLFTWDSXX:2:1101:10122:17174 K03883 ND5; NADH-ubiquinone oxidoreductase chain 5 [EC:7.1.1.2]
A00266:275:HLFTWDSXX:2:1101:10122:2174 K00602 purH; phosphoribosylaminoimidazolecarboxamide formyltransferase / IMP cyclohydrolase [EC:2.1.2.3 3.5.4.10]
A00266:275:HLFTWDSXX:2:1101:10140:28510 K00799 GST, gst; glutathione S-transferase [EC:2.5.1.18]
A00266:275:HLFTWDSXX:2:1101:10149:34225 K06238 COL6A; collagen, type VI, alpha
A00266:275:HLFTWDSXX:2:1101:10185:19382 K03883 ND5; NADH-ubiquinone oxidoreductase chain 5 [EC:7.1.1.2]
A00266:275:HLFTWDSXX:2:1101:10212:16423 K11188 PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin [EC:1.11.1.7 1.11.1.15 3.1.1.-]
A00266:275:HLFTWDSXX:2:1101:10212:36777 K08737 MSH6; DNA mismatch repair protein MSH6
... ... ...
5.7. Html report for functional summary
if using --sampletable and --profiling, there will be two full html reports for summarizing tables for all samples as well as for each sample.
All_samples.html A1.CE2-S1_functional.html

6. Protein database for translated search

Back to top

Translating each read into amino acid sequences and subsequently searching them in the protein database to find their homologies. If homologies identified, the homology protein's ID will be assigned to read
Seq2Fun provides various pre-built databases of protein sequences. The size of the protein database has a huge impact on the speed of Seq2Fun.
To speed up and avoid un-informative protein sequences in the database, we only include protein sequences involved in KEGG pathways.
The protein database was built from fasta file of protein sequences based on burrows wheeler transform (BWT) and FM index for fast search. Download the pre-built database from here.
Or you can build your own protein database (see xx).

Using -d or --tfmi to specify protein database
e.g.: -d birds_cdhit99_proteins.fmi

7. Protein_KO_organism mapping table

Back to top

Each assigned read has protein IDs attached, this is used for search its corresponding KO from protein_KO_organism map.
For most cases, one read could be annotated with multiple protein ids, which usually are homology proteins from same or different species and share the same KO id.
Occasionally, one read could be annotated with multiple KO ids. In this case, only KO id with the highest frequency is kept. e.g.: read A00266:275:HLFTWDSXX:2:1101:10013:26537 mapped to K03841 and K01689 for 10 and 5 times, respectively. Only K03841 was kept for read A00266:275:HLFTWDSXX:2:1101:10013:26537. Also, see 4.3.
Protein_KO_organism table is a mapping table linking Protein id to KO id and species name.

Using -D or --genemap to specify protein_ko table file
e.g.: -D bird_protein_ko_species_cdhit99.txt

It contains 3 columns separated by '\t'. The first column is protein id, it must be the same protein id in protein database from 5. The second column is KO ID, corresponding to its protein ID, and last is the species name used for hit species table.

aam_106481982	K00844	Apteryx_mantelli_mantelli
aam_106481985 K05494 Apteryx_mantelli_mantelli
aam_106481990 K12852 Apteryx_mantelli_mantelli
aam_106481998 K14525 Apteryx_mantelli_mantelli
aam_106482001 K03172 Apteryx_mantelli_mantelli
aam_106482002 K00907 Apteryx_mantelli_mantelli
aam_106482005 K05768 Apteryx_mantelli_mantelli
aam_106482007 K03766 Apteryx_mantelli_mantelli
aam_106482010 K03172 Apteryx_mantelli_mantelli
aam_106482012 K03994 Apteryx_mantelli_mantelli
... ... ...

8. Single sample profiling

Back to top

It is known that it is very difficulty to obtain larger number of samples for lots of environmental species, and sometimes only one sample for each experimental group. In this case, it is very difficult to conduct comparative study without enough statistical power
To cope with this challenge, --profiling mode is implemented in Seq2Fun. Using this mode, 4 main levels output files KO abundance table, hit pathway table, hit species table and read KO table are generated. An additional html report summarizing and visualizing these tables is also generated.
A rarefaction curve is also generated in html report to show the sequence depth against number of KOs. This figure infers 1). the sequence depth is enough or not. It is useful when considering the difficulty to obtaining enough samples or hight-quality RNA for environmental species. 2). If you have too much reads and you want a faster processing, you can use --reads_to_process to specify the number of reads to process.
By default profiling mode is off, using --profiling to turn it on, it is ~20% slower than non-profiling mode and could consume a little bit more memory at the last stage of generating tables
e.g.: --profiling

see more details about the output files in 4.3 - 4.7.

9. MEM or GREEDY modes in amino acid sequences search in the database

Back to top

Seq2Fun translates clean read into amino acid sequences with all six possible open reading frames, followed by searching and comparing with reference protein sequences in database. Seq2Fun offers two searching modes of maximum exact matches (MEM) and GREEDY. MEM mode is a fast running mode and implemented as that translated amino acid sequences must be perfectly matched with reference protein sequences without any mismatches. Any mismatched amino acid sequence and its corresponding read will be discarded. It is designed for organisms who have reference sequence in the database. However, due to the evolutionary distance between the target organism and reference organisms as well as sequencing error, there could be mismatches between amino acid sequences translated from reads and reference protein sequences. Using MEM mode could result in the various number of unmapped reads and rapid sensitivity decay depending on the phylogenetic distance between organisms as well as the number of sequencing errors. Therefore, Greedy mode is introduced to allow mismatches between query and subject sequences to overcome the evolutionary divergences and sequencing errors. Greedy mode is designed for organisms who do not have reference sequences in database.
The default mode is Greedy mode. Or using -K or --mode to specify mode.
e.g.: -K greedy
The MEM mode can be turn on e.g.: -K mem

10. Minimum matching length of amino acid sequence with reference protein sequences

Back to top

The minimum matching length determines cut off length of how long the amino acid sequence matches with reference protein sequences. It is used for both MEM and GREEDY modes. Our results in the tested datasets show that minlength generally has complicated impacts in Greedy mode with the hightest values of coefficient of determination R square obtained with minlength ranging from 19 to 28. Therefore, the default value of minlength is set to 19, which means that the matching amino acid length must be no less than 19 aa. However, it is should be carefully tested on different organisms especially when your organisms have large evolutionary distances with organisms in the database. Decreasing the minlength would help overcome the large sequence divergence between query and subject sequences. But it would tradeoff specificity, resulting in some false positive matches. Using -J or --minlength to set minimum matching length with defulat value 19
e.g.: -J 19

11. Allowed amino acid mismatches (Greedy mode only)

Back to top

Seq2Fun employs GREEDY model to overcome evolutionary distance and sequencing error. It is implemented with setting the number of mismatches. The number of mismatches depends on the nature of protein itself, evolutionary distance and sequence error. Our results in the tested datasets show that minlength generally has negative impacts in Greedy mode. Therefore, the default value of minlength is set to 2, that means allowing two amino acid mismatches. However, it is should be carefully tested on different organisms especially when your organisms have large evolutionary distances with organisms in the database. Increasing the number of mismatches would help overcome the large sequence divergence between query and subject sequences. But it would tradeoff specificity, resulting in some false positive matches. As there are more than 20 amino acids, increasing the number of mismatches could result in exponential increasing searches in the database (also dependent on composition in the database) and would significantly slow down the search speed.This number of mismatches will translate into the affection of abundance values of KO in the output table.
Using -E or --mismatch to specify number of amino acid mismatches with default value of 2.
e.g.: -E 2

12. Minimum matching score (Greedy mode only)

Back to top

Matching score based on blosum62 is used to assess how closely the translated amino acid sequence with mutations match with reference protein sequences. Just as number of the mismatched amino acids, it could trades off sensitivity and precision. Decreasing the value of the minimum matching score will increase sensitivity with a decreasing precision. It could have affect on the runtime as it servers as filter to filter out low blosum62 score AA fragments. Therefore, higher minimum matching score could results in a speed gain. Based on our assessment in three datasets, the minimum matching score has very limited affect on the overall gene abundance quantification. Thus, we set the default minimum matching score to 80.
Using -j or --minscore to specify minimum matching score
e.g.: -j 80

13. Keep all the translated amino acid fragments(not recommended)

Back to top

After read translation into all possible amino acid fragments using the six reading frames, only the top longest fragments are select for homolog search. This filtering will greatly reduce the number of false translations if there is no true stop codon in that read, which in turn can increase the speed of Seq2Fun. This hard filtering will probably filter out some reads containing true stop codon, but it has very limited on the downstream analysis such as fold change in comparative studies.
We offer the option to keep all the fragments, but their length must be > minimum matching length.
Using --allFragments keep all the fragments
e.g.: --allFragments

14. Select codon table

Back to top

Select the codon table (same as blastx in NCBI), we provide 33 codon tables from 'https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG31'. Be default is the Standard Code

Standard1
VertebrateMitochondria2
YeastMitochondrial3
MoldProtozoanCoelenterateMitochondrialMycoplasmaSpiroplasma4
InvertebrateMitochondrial5
CiliateDasycladaceanHexamitaNuclear6
EchinodermFlatwormMitochondrial9
EuplotidNuclear10
AlternativeYeastNuclear12
AscidianMitochondrial13
InvertebrateMitochondiral5
ChlorophyceanMitochondrial16
TrematodeMitochondrial21
ScenedesmusobliquusMitochondrial22
RhabdopleuridaeMitochondrial24
PachysolentannophilusNuclear26
KaryorelictNuclear27
MesodiniumNuclear29
PeritrichNuclear30
BlastocrithidiaNuclear31
CephalodiscidaeMitochondrial33
Using --codontable to specify the codon table; "Standard1" is the default
e.g.: --codontable VertebrateMitochondria2

15. Trim the fist 6 base of read

Back to top

The fist 6 base of a read could have higher mismatch rates due to the random-hexamer priming, we advise to trim the fist 6 bases of a read
Using --trim_front1 and/or --trim_front2 for single or pair-end reads
e.g.: --trim_front1 6 --trim_front2 6

16. Custom built database

Back to top

Seq2Fun fully supports customer built database. Seq2Fun needs two input files, FM index file, which is built from protein sequences and used for database; and protein_ko_species table, which is used for mapping protein ID in database to KO ID as well as for hit species table. To build your own database, you must supply a protein fasta file
e.g.: birds.fasta

>gga_419912
MEGNEGDPPGAHRRRDSHGRRPSREINAEDQVAQETEEVFRSYTFYRYQQEREEGGAEVPMDPEIMEIQQELGSTGSQVGRRLAIIGDDINKRYDAEFRHMLKSLQPTKENAYEYFTTIASSLFDSGINWGRVIALLGFGYCMAIHVYQQGITGFLRRIARYVTEFMLRNRIARWIAQQGGWVAALDLDNVYMKYMLVVLALVMVGHLVVRRFFRP
>gga_427365
MAGDGGVYRLRCSLLGHEQDVRGITRGLFPEGGFVSVSRDRTARLWTPDGLNRGFTEMHCMRGHSNFVSCVCIIPPSDLYPRGLIATGGNDHNICVFTVDNTAPLYVLKGHTNTVCSLSSGKFGTLLSGSWDTTCKVWLNDRCMMTLQGHTAAIWAVKILPEQGLMLTGSADKTIKLWKAGRCERTYAGHEDCVRGLAILSEMEFLSCSNDASVRRWHISGECLHVYYGHTNYIYSVSVFPHCKDFITTGEDRSLRIWKQGECVQTIRLPAQSVWCCCVLDNDDIVVGASDGIIRVFTKSLERTASAEEIQAFENELSQASIDPKTGDLGDINADDLPGKEHLKDPGMRDGQTRLIKDNGKVEAYQWSVSEGRWIKIGDVVGSSGGTQQTSGKVLFEGKEYDYVFTIDVNESGPSYKLPYNISDDPWLTAYNFLQKNDLNPMFLDQVAKFIIDNTKGQAPLNTNTEFSDPFTGAGRYVPGSSSLSNTAPAADPFTGMGAYQSAKAENIYFPKKDAVTFDQANPTQILGKLKELNGNAAEEQKLTEDDLIILEKLLSATCNTSAEMPTAQQIQTLWKAINWPEDIVFPALDILRLSVRHPIVNENFCGEKAHVQFIHLLLKFLNPQGKPANQLLALRALCNCFVSQAGQKLLMEQRDEIMTQAIEVKSGNNKNVHIALATLTLNYAVCLHKVNNIEGKAQCLSVISTVMEVVKDLEAVFRLLVALGTLISDDKNAVQLAKSLGVDSQIKKYVSVSEPAKVKECCRFVLNLL
...
Before building the database, it is recommended to remove the identical protein sequences.
There are several benefits to do so, including reducing the memory usage and increasing the speed.
You can use CD-HIT or SeqKit to do this job.
Please be noted that the protein sequences must consist of the standarad 20 amino acid (ACDEFGHIKLMNPQRSTVWY) in the uppercase.
S2F_HOME/bin/mkbwt -n 8 -a ACDEFGHIKLMNPQRSTVWY -o brids_proteins birds.fasta;
S2F_HOME/bin/mkfmi brids_proteins;
                            
It generate a fmi file named brids_proteins.fmi. See 5. Protein database for translated search.

To use the database, you must prepare a mapping file containing protein ID and its corresponding KO ID, as well as species name, separated by "\t".
e.g.: birds_protein_KO_organism.txt.
Please be noted that the protein ID in the first column must be unique and identical with protein ID in birds.fasta, and must be corresponding to KO ID in the second column.
gga_419912  K14021  Gallus_gallus
gga_427365 K14018 Gallus_gallus
... ... ...
If you want to add genes to the existing databases, please download the corresponding database from here.
Prepare your additional fasta file and mapping file (linking gene id to KO id and species) as abovementioned, then cancatenate the two fasta files and mapping files before building the database.

17. Selected pathway analysis

Back to top

Using database containing all the KEGG pathways tends to produce some pathways unrelated to the research questions. In addition, full pathway analysis could compromise statistical power in some degree. Seq2Fun allows users to select a list of pathways that are closely related to their research interests. Moreover, reducing number of pathways for analysis would accelerate Seq2Fun.

To use selected pathways, you must provide a txt file consisting your selected pathways, which can only be selected from file pathway_name_453.txt; this pathway_name_453.txt contains all 453 pathways from KEGG database.

Here is an example of selected pathway file, e.g. pathway_list.txt; it must consist only one column.

map00010:Glycolysis_/_Gluconeogenesis
map00020:Citrate_cycle_(TCA_cycle)
map00030:Pentose_phosphate_pathway
map00040:Pentose_and_glucuronate_interconversions
map00051:Fructose_and_mannose_metabolism
map00052:Galactose_metabolism
map00053:Ascorbate_and_aldarate_metabolism
map00061:Fatty_acid_biosynthesis
map00062:Fatty_acid_elongation
map00071:Fatty_acid_degradation

Second, the protein sequences fasta file must be provided, in which protein sequences in the selected pathways will be retrieved for database construction. It can be downloaded from our prebuild database directory, e.g. birds_cdhit99.fasta in birds.tar.gz. Also see Download database in INSTALLATION.

Here is an example of birds_cdhit99.fasta; each id must be like this.

>gga_419912
MEGNEGDPPGAHRRRDSHGRRPSREINAEDQVAQETEEVFRSYTFYRYQQEREEGGAEVPMDPEIMEIQQELGSTGSQVGRRLAIIGDDINKRYDAEFRHMLKSLQPTKENAYEYFTTIASSLFDSGINWGRVIALLGFGYCMAIHVYQQGITGFLRRIARYVTEFMLRNRIARWIAQQGGWVAALDLDNVYMKYMLVVLALVMVGHLVVRRFFRP
>gga_427365
MAGDGGVYRLRCSLLGHEQDVRGITRGLFPEGGFVSVSRDRTARLWTPDGLNRGFTEMHCMRGHSNFVSCVCIIPPSDLYPRGLIATGGNDHNICVFTVDNTAPLYVLKGHTNTVCSLSSGKFGTLLSGSWDTTCKVWLNDRCMMTLQGHTAAIWAVKILPEQGLMLTGSADKTIKLWKAGRCERTYAGHEDCVRGLAILSEMEFLSCSNDASVRRWHISGECLHVYYGHTNYIYSVSVFPHCKDFITTGEDRSLRIWKQGECVQTIRLPAQSVWCCCVLDNDDIVVGASDGIIRVFTKSLERTASAEEIQAFENELSQASIDPKTGDLGDINADDLPGKEHLKDPGMRDGQTRLIKDNGKVEAYQWSVSEGRWIKIGDVVGSSGGTQQTSGKVLFEGKEYDYVFTIDVNESGPSYKLPYNISDDPWLTAYNFLQKNDLNPMFLDQVAKFIIDNTKGQAPLNTNTEFSDPFTGAGRYVPGSSSLSNTAPAADPFTGMGAYQSAKAENIYFPKKDAVTFDQANPTQILGKLKELNGNAAEEQKLTEDDLIILEKLLSATCNTSAEMPTAQQIQTLWKAINWPEDIVFPALDILRLSVRHPIVNENFCGEKAHVQFIHLLLKFLNPQGKPANQLLALRALCNCFVSQAGQKLLMEQRDEIMTQAIEVKSGNNKNVHIALATLTLNYAVCLHKVNNIEGKAQCLSVISTVMEVVKDLEAVFRLLVALGTLISDDKNAVQLAKSLGVDSQIKKYVSVSEPAKVKECCRFVLNLL
...

To use the selected pathways, you must specify --pathway pathway_list.txt --genefa birds_cdhit99.fasta --genemap birds_protein_ko_species_cdhit99_no_chicken.txt

the whole command could like this
S2F_HOME/bin/seq2fun -w 8 -i left.fastq.gz -I right.fastq.gz --prefix sample01 --pathway pathway_list.txt --genefa birds_cdhit_99_no_chichen_new.fasta -D birds_protein_ko_species_cdhit99_no_chicken.txt

It will produce a temp dir selected_pathway_database, which contains 4 files, selected_pathway_protein_aas.pep.fasta is the protein sequences in selected pathways, which is retrieved from e.g. birds_cdhit99.fasta, proteins.sa, proteins.bwt and proteins.fmi, which is the database used for Seq2Fun.

You can also use customized database for selected pathways. See 13.Custom built database

18. Automatically trimming polyA tail in reads

Back to top

Reads with polyA tail could affect protein comparison, resulting in a high artificial matching score. Removing or trimming reads containing polyA tail could increase precision.
By default, trimming polyA tail is activated.
If you want to disable this function
using --no_trim_polyA

19. Minimum length for merging overlapped PE read pair (PE reads only)

Back to top

The length of query amino acid sequences could have heavily impact the comparison with reference protein sequences. It generally believes that the longer of query sequences, the higher precision of protein annotation. Therefore, merging overlapped PE read into longer read could yield longer translated amino acid sequences, which could reduce number of ill mapped reads and boost annotation precision. The default value of minimum length for merging PE read pair is 30 bp if the read pair is overlapped.
Using -v or --overlap_len_require to specify miminum length of overlapped PE read pair
e.g: -v 30

20. Maximum cutoff length of translated peptides

Back to top

Seq2Fun only keeps the longest AA fragment for identification of homologies in protein database, this will filter out lots of false fragments and accelerate Seq2Fun. In some cases, only keeping at most the longest fragments could filter out a proportion of fragments that originated from the merged PE reads if start and stop codons are present in the middle of the reads. Therefore, we recap the maximum cut-off length of peptide fragment to be 60 aa (by default, user can change this cut-off), which will prevent the filtering out of some true peptide fragments from the merged reads.
Using -m or --maxtranslength to specify maximum cutoff length of translated peptides.
Note: --maxtranslength must be larger than the --minlength (minimum matching length of amino acid sequence) in 9.
e.g: -m 60

21. Output mapped reads

Back to top

For non-model organisms, it could be important to obtain the assembled genes/contigs of interest.
Seq2Fun offers a feature to capture and label each mapped clean read with KO tag and output as fastq.gz files. The clean reads can be directly used to assembled into genes/contigs using various de novo assemblers such as Trinity, et al. Because of only a small proportion (~30%) of total reads are mapped and retrieved, the de novo assembly for each genes is very fast and usually can be finished within several minutes.
Various downstream analysis such as designing primers, comparable analysis, on these assembled genes/contigs can be applied, and this will greatly compensate the information loss by Seq2Fun compared with the conventional de novo assemblers.

Using --outputMappedCleanReads to enable this function.

Forward mapped clean reads file A1.CE2-S1-LT_mapped_R1.fastq.gz:
@A00266:275:HLFTWDSXX:2:2608:9625:19977 1:N:0:TTACCGAC+CGAATACG K02912
CCAGGAACTTTCTGAACCCCGTGGGCAGCATGTGCTTTGTCTTCTTATTGCTGCCGTAGCCAATGTTGGGCATCAGGATCTGACCCTTGAACCGCCTGCGA
+
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00266:275:HLFTWDSXX:2:1101:7437:1000 1:N:0:TTACCGAC+CGAATACG K00016
CGACTCTCCCCTTCTTGCTGACGAACTCCTGCAGTTACTACCACAATCTTGGAGTTGGCTGTGACAGCATAGTCTTTGTC
+ FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Reverse mapped clean reads file A1.CE2-S1-LT_mapped_R2.fastq.gz:
@A00266:275:HLFTWDSXX:2:2608:9625:19977 2:N:0:TTACCGAC+CGAATACG K02912
TGTGAAACCTAAAATCGTCAAGAAGAGGACCAAGAAGTTCATCCGCCATCAGTCGGATCGTTATGTCAAGATCAAGCGCAACTGGCGTAAACCGAGAGGTA
+
FFFFFFFFFFF,FFFFF:FFFFFF:FFFFFFFFFF,FFFFFFFFFFFF:FFFFF,FFFFFFFFFFFF,FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00266:275:HLFTWDSXX:2:1101:7437:1000 2:N:0:TTACCGAC+CGAATACG K00016
GACAAAGACTATGCTGTCACAGCCAACTCCAAGATTGTGGTAGTAACTGCAGGAGTTCGTCAGCAAGAAGGGGAGAGTCG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

22. General guideline for parameter settings

Back to top

Several parameters drives the trade-offs between sensitivity and precision. The default parameter setting is obtained from the parameter sensitivity test on four organisms: mouse, chicken, zebrafish and roundworm. If your studied species has no or limited number of close-related reference species in the database. You can tune parameters to obtain a better balance between sensitivity and precision.
For example: 1) increasing the number of mismatches (--mismatch) from default 2 to 3 or 4; 2) or decreasing the minimum matching length (--minlength) from default 19; 3) or decreasing the minimum BLOSUM62 score (--minscore) from default 80; 4) or decreasing the maximum length cutoff of the translated amino acid sequences for overlapped paired-end reads from default 60; will increase the mapping reads or the mapping chance for the highly divergent homologs.
It is worth mentioning that this parameter adjustment could slow down the Seq2Fun and/or increase the false positives with various degrees.

23. Submit KO abundance to NetworkAnalyst for downstream analysis (optional)

Back to top

After obtaining KO abundance table, we provide an online website-based analysis tool to conduct downstream analysis such as identification of differentially expressed KOs, pathway enrichment analysis, et al.
Submit KO abundance table to our website-based tool NetworkAnalyst (optional)
Seq2Fun generates two KO abundance tables:
All_sample_KO_abundance_table.txt and All_sample_KO_abundance_table_submit2networkanalyst.txt, the later will be used to submit to NetworkAnalyst.