Table of Contents

  1. Full options
  2. Running Seq2Fun
  3. Input files
  4. Batch sample processing
  5. Output files
  6. Protein database for translated search
  7. Protein annotation mapping table
  8. Single sample profiling
  9. MEM or GREEDY modes in amino acid sequences search in the database
  10. Minimum matching length of amino acid sequence with reference protein sequences
  11. Allowed amino acid mismatches (Greedy mode only)
  12. Minimum matching score (Greedy mode only)
  13. Keep all translated amino acid fragments (not recommend)
  14. Change codon table
  15. Trim the first 6 bases (optional)
  16. Custom built database
  17. Automatically trimming polyA tail in reads
  18. Minimum length for merging overlapped PE read pair (PE reads only)
  19. Maximum cutoff length of translated peptides
  20. Output mapped reads
  21. General guidelines for parameter settings
  22. Submit gene abundance tables to ExpressAnalyst for downstream analysis
  23. Extract mapped reads for target genes
  24. Target gene de novo assemble
1. Seq2Fun full usage options

Use seq2fun or seq2fun --help to show the full usage options

  options:
// input/output
-s, --sampletable (recommended) sample table must consist of 3 columns (sample prefix name (sample01), forward reads name (sample01_R1.fq.gz), group info (control) for single-reads or 4 columns (sample prefix name (sample01), forward reads (sample01_R1.fq.gz), reverse reads (sample01_R2.fq.gz), group info (control) for paired-end reads. The columns must be separated by tab
-i, --in1 read1 input file name
-I, --in2 read2 input file name
-X, --prefix (not recommended) prefix name for output files, eg: sample01
--outputMappedCleanReads enable output mapped clean reads into fastq.gz files, by default is false, using --outputMappedCleanReads to enable it
--outputReadsAnnoMap enable output mapped clean reads - gene map into .gz files, by default is false, using --outputReadsAnnoMap to enable it
// Homology search;
-D, --genemap gene annotation map file
--profiling by default it is off. If this option is specified, rarefaction curve of reads against number of ortholog will be generated
// translated search
-d, --tfmi fmi index of Protein database
-K, --mode searching mode either tGREEDY or tMEM (maximum exactly match). By default greedy
-E, --mismatch number of mismatched amino acid in sequence comparison with protein database with default value 2
-j, --minscore minimum matching score of amino acid sequence in comparison with protein database with default value 80
-J, --minlength minimum matching length of amino acid sequence in comparison with protein database with default value 19 for GREEDY and 13 for MEM model
-m, --maxtranslength maximum cutoff of translated peptides, it must be no less than minlength, with default 60
--allFragments enable this function will force Seq2Fun to use all the translated AA fragments with length > minlength. This will slightly help to classify reads contain the true stop codon and start codon; This could have limited impact on the accuracy for comparative study and enable this function will slow down the Seq2Fun. by default is false, using --allFragments to enable it"
--codontable select the codon table (same as blastx in NCBI), we provide 21 codon tables from NCBI. By default is the codontable1 (Standard Code), the complete codon table can be seen below.
--dbDir specify dir for internal databases such as ko_fullname.txt
//selected pathways
-Z, --pathway list of selected pathways for target pathways analysis
-z, --genefa the gene/protein sequences fasta file for retrieving proteins in selected pathways to construct database
// threading
-w, --thread worker thread number, default is 2
-V, --verbose enable verbose
--debug enable debug
--phred64 indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads
// adapter
-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped
--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as adapter_sequence
--adapter_fasta specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file
--detect_adapter_for_pe by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data
//polyA tail
--no_trim_polyA by default, ployA tail will be trimmed. If this option is specified, polyA trimming is disabled
// trimming
-f, --trim_front1 trimming how many bases in front for read1, default is 0
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0
-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation
-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings
-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings
-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings
// polyG tail trimming
-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
// polyX tail trimming
-x, --trim_poly_x enable polyX trimming in 3' ends
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default
// cutting by quality
--cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise
--cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise
-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop
-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4
-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20)
--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified
--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified
--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified
--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified
--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified
--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified
// quality filtering
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality <=Q15 is qualified
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40%
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5
-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement
// length filtering
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
-l, --length_required reads shorter than length_required will be discarded, default is 60
--length_limit reads longer than length_limit will be discarded, default 0 means no limitation
// low complexity filtering
--no_low_complexity_filter disable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1])
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required
// filter by indexes --filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical
// base correction in overlapped regions of paired end data
-c, --disable_correction disenable base correction in overlapped regions (only for PE data), default is enabled");
-v, --overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default
--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default
--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%
// umi
-u, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none
--umi_len if the UMI is in read1/read2, its length should be provided
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default
--umi_skip if the UMI is in read1/read2, Seq2Fun can skip several bases following UMI, default is 0
// overrepresented sequence analysis
-p, --overrepresentation_analysis enable overrepresented sequence analysis
-P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20
// deprecated options
--cut_by_quality5 DEPRECATED, use --cut_front instead
--cut_by_quality3 DEPRECATED, use --cut_tail instead
--cut_by_quality_aggressive DEPRECATED, use --cut_right instead
--discard_unmerged DEPRECATED, no effect now, see the introduction for merging
We have followed the codon tables from NCBI;
Seq2Fun NCBI
codontable1 The Standard Code (transl_table=1)
codontable2 The Vertebrate Mitochondrial Code (transl_table=2)
codontable3 The Yeast Mitochondrial Code (transl_table=3)
codontable4 The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (transl_table=4)
codontable5 The Invertebrate Mitochondrial Code (transl_table=5)
codontable6 The Ciliate, Dasycladacean and Hexamita Nuclear Code (transl_table=6)
codontable9 The Echinoderm and Flatworm Mitochondrial Code (transl_table=9)
codontable10 The Euplotid Nuclear Code (transl_table=10)
codontable12 The Alternative Yeast Nuclear Code (transl_table=12)
codontable13 The Ascidian Mitochondrial Code (transl_table=13)
codontable14 The Alternative Flatworm Mitochondrial Code (transl_table=14)
codontable16 Chlorophycean Mitochondrial Code (transl_table=16)
codontable21 Trematode Mitochondrial Code (transl_table=21)
codontable22 Scenedesmus obliquus Mitochondrial Code (transl_table=22)
codontable24 Rhabdopleuridae Mitochondrial Code (transl_table=24)
codontable26 Pachysolen tannophilus Nuclear Code (transl_table=26)
codontable27 Karyorelict Nuclear Code (transl_table=27)
codontable29 Mesodinium Nuclear Code (transl_table=29)
codontable30 Peritrich Nuclear Code (transl_table=30)
codontable31 Blastocrithidia Nuclear Code (transl_table=31)
codontable33 Cephalodiscidae Mitochondrial UAA-Tyr Code (transl_table=33)

2. Running Seq2Fun

Back to top

In order to make it clear and just for demonstration purpose, after downloading and installation of seq2fun, we first refer the path to seq2fun installation directory (e.g. /home/peng/Software/seq2fun) to "S2F_HOME".
This "S2F_HOME" contains several directories, e.g., bin contains executable files seq2fun mkbwt mkfmi seqtract transembler, therefore the location of executable file seq2fun is S2F_HOME/bin/seq2fun
Similarly, we create a new directory "database" in S2F_HOME, which is used to store databases of different groups of organisms. After downloading the database, we put them in the database directory, e.g., mammals, birds, fishes. Each database contains .fmi, protein_ko_go_species.txt files. Here, the database path is S2F_HOME/database/. Otherwise, you can put the database anywhere that is clear to you.

2.1. a typical run for a single sample of pair-end (PE) reads with comparative mode, which will only generate s2f_id/ortholog bundance tables for each sample (total s2f_id/ortholog abundance tables will not be produced, using -s batch mode to produce it) (not recommended):

S2F_HOME/bin/seq2fun -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R2.fastq.gz --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt --prefix A1.CE2-S1_R1 -w 8
                            
or for batch sample processing, and it will produce total s2f_id/ortholog abundance tables for all samples, and s2f_id/ortholog abundant tables for each sample (recommended)
S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt -w 8
                            
a sample.txt is a table consisting of four columns must be separated by tab '\t'. The first column is the prefix name of sample, the second one is filename of forward reads, the third one is the filename of reverse reads and last one is the class info of that sample.
A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   A1.CE2-S1_R2.fastq.gz   control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz A3.CE1-M2_R2.fastq.gz medium
... ... ...
or a sample.txt with path for output and input.
/path/to/A1.CE2-S1   /path/to/A1.CE2-S1_R1.fastq.gz   /path/to/A1.CE2-S1_R2.fastq.gz   control
/path/to/A3.CE1-M2 /path/to/A3.CE1-M2_R1.fastq.gz /path/to/A3.CE1-M2_R2.fastq.gz medium
... ... ...
2.2. a typical run for a single sample of pair-end (PE) reads with profiling mode, which will generate several tables of a s2f_id/ortholog abundance, reads s2f_id/ortholog, as well as html reports summarizing these tables and rarefaction curve, as well as reads quality checks for all samples and each sample.
S2F_HOME/bin/seq2fun -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R2.fastq.gz --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt --prefix A1.CE2-S1 -w 8 --profiling
                            
or for batch sample processing
S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt -w 8 --profiling
                            

3. Input files

Back to top

Seq2Fun supports both paired-end(PE) and single-end(SE) reads from various sequencing platforms e.g., Illumina. There is no reads length limitation. It supports both fastq and fastq.gz file format.

3.1 For PE reads using -i or --in1 and -I or --in2 for forward and reverse reads, respectively.
e.g.: -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R1.fastq.gz

3.2 For SE reads using -i or --in1 for forward reads.
e.g.: -i A1.CE2-S1_R1.fastq.gz

4. Batch sample processing (highly recommended)

Back to top

It is highly recommended to use bach processing mode. Batch sample processing not only avoids repeatedly loading and deleting database when processing many samples, but also produces KO/GO abundance tables consisting all samples.

Using -s or --sampletable table to specify the file containing reads files and output file.
e.g.: -s sample.txt

4.1 For PE reads using -i or --in1 and -I or --in2 for forward and reverse reads, respectively.
For PE reads, sample_table must consist of four columns using tab ("\t") as a separator. The first column is prefix name of sample, the second is the filename of forward reads, the third one is the filename of reverse reads and last one is the sample class info.

A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   A1.CE2-S1_R2.fastq.gz   control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz A3.CE1-M2_R2.fastq.gz medium
... ... ...
4.2 For SE reads, sample_table must consist of three columns using the tab ("\t") as a separator. The first column is the prefix name of sample and the second one is the filename of forward reads, and last one is the sample group info
A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   control
A3.CE1-M2 A3.CE1-M2_R1.fastq.gz medium
... ...

5. Output files

Back to top

By default, Seq2Fun generate 3 main output text files of s2f id, KO and GO abundance tables for all samples and each sample.
If you are using --profiling mode, there are 6 main output text files: s2f, KO and GO abundance tables, hit pathway table and hit species table, reads ko table (reads annotation table); and two html reports containing tables and figures of summary these tables, as well as reads quality check.

5.1 If you use input files from 2.1
Using -X or --prefix to specify the prefix name of output files
e.g.: -X A1.CE2-S1

5.2 If you use batch sample process mode from 3.1, the last column in sample.txt is the prefix name of output files
Using -s or --sampletable to specify file containing both input and output files.
e.g.: -s sample.txt

5.3.1. An output file of s2f_id/ortholog abundance table for all samples. It is only generated by batch mode

#NAME	A1.CE2-S1	A3.CE1-M2	A4.CE1-H5	B1.CE2-S2	B3.CE1-M3	C1.CE2-S3	C3.CE1-M4	D1.CE2-S4	D3.CE1-M5	E1.CE2-S5	E3.CE1-H1	F3.CE1-H2	G3.CE1-H3	H2.CE1-M1	H3.CE1-H4	annotation
#CLASS:XX control middle high control middle control middle control middle control high high high middle high -
s2f_7617 158 338 475 130 329 178 342 185 479 160 312 429 241 309 345 K04198|GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100|EDNRB|endothelin receptor type B
s2f_7570 10 9 9 8 1 8 23 8 23 9 9 7 15 7 12 K07437|GO:0004497;GO:0005506;GO:0016491;GO:0016705;GO:0020037;GO:0046872|CYP26A1|cytochrome P450 26A1
s2f_14150 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 K05254|GO:0005179;GO:0005576;GO:0007165;GO:0016608|GHRL|appetite-regulating hormone
s2f_6367 22 43 48 50 50 31 55 29 38 16 52 45 40 40 45 K03370|GO:0005794;GO:0006486;GO:0008373;GO:0016020;GO:0016021;GO:0016740;GO:0016757;GO:0097503|ST3GAL5|lactosylceramide alpha-2,3-sialyltransferase isoform X1
s2f_7077 42 81 44 63 61 61 63 66 60 83 46 65 72 74 51 U|GO:0000775;GO:0000776;GO:0005694|CFDP1|craniofacial development protein 1
s2f_12441 3 4 8 4 2 1 3 0 13 2 3 7 6 4 5 K15680|GO:0016491|HSD11B1L|hydroxysteroid 11-beta-dehydrogenase 1-like protein isoform X1
...
5.3.2. An output file of s2f_id/ortholog abundance table for each sample.
#s2f_id Reads_cout      annotation
s2f_7617 158 K04198|GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100|EDNRB|endothelin receptor type B
s2f_7570 10 K07437|GO:0004497;GO:0005506;GO:0016491;GO:0016705;GO:0020037;GO:0046872|CYP26A1|cytochrome P450 26A1
s2f_6367 22 K03370|GO:0005794;GO:0006486;GO:0008373;GO:0016020;GO:0016021;GO:0016740;GO:0016757;GO:0097503|ST3GAL5|lactosylceramide alpha-2,3-sialyltransferase isoform X1
s2f_7077 42 U|GO:0000775;GO:0000776;GO:0005694|CFDP1|craniofacial development protein 1
s2f_12441 3 K15680|GO:0016491|HSD11B1L|hydroxysteroid 11-beta-dehydrogenase 1-like protein isoform X1
s2f_12567 13 K09784|GO:0005154;GO:0007165;GO:0008083;GO:0016020;GO:0016021|EREG|proepiregulin
s2f_7711 6 K04357|GO:0005509;GO:0007165;GO:0008083;GO:0016020;GO:0016021|EGF|pro-epidermal growth factor
...
5.7. Html report for functional summary
if using --sampletable and --profiling, there will be a full html reports for summarizing tables for all samples.
Seq2Fun_summary_all_samples.html

6. Protein database for translated search

Back to top

Translating each read into amino acid sequences and subsequently searching them in the protein database to find their homologies. If homologies identified, the homology protein's ID will be assigned to read
Seq2Fun provides various pre-built databases of protein sequences. The size of the protein database has an impact on the speed of Seq2Fun.
The protein database was built from fasta file of protein sequences based on burrows wheeler transform (BWT) and FM index for fast search. Download the pre-built database from DATABASES page.
Or you can build your own protein database (see 16).

Using -d or --tfmi to specify protein database
e.g.: -d birds.fmi

7. Gene annotation table

Back to top

Each assigned read has protein IDs attached, this is used for search its corresponding s2f_id/ortholog from gene annotation map.
For most cases, one read could be annotated with multiple protein ids, which usually shares the same s2f_id/ortholog.
Occasionally, one read could be annotated with multiple ortholog ids. In this case, only ortholog id with the highest frequency is kept. e.g.: read A00266:275:HLFTWDSXX:2:1101:10013:26537 mapped to s2fid_1 and s2fid_2 for 10 and 2 times, respectively. Only s2fid_1 was kept for read A00266:275:HLFTWDSXX:2:1101:10013:26537. Also, see 4.3.
Gene annotation table is a mapping table linking Protein id to s2f_id/ortholog, KO/GO id, gene symbol, description and species name.

Using -D or --genemap to specify protein_ko_go table file
e.g.: -D bird_protein_ko_go_species.txt

It contains 7 columns separated by '\t'. The first column is protein id, it must be the same protein id in protein database from 5. The second column is s2f_id/ortholog, followed by KO id and GO set ID corresponding to its protein ID, and then are the gene symbol and description, as well as species name.

gga:408082      s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Gallus gallus (chicken)
pcoc:116239860 s2f_7617 K04198 GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100 EDNRB endothelin receptor type B Phasianus colchicus (Ring-necked pheasant)
mgp:100546485 s2f_7617 K04198 GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100 EDNRB endothelin receptor type B Meleagris gallopavo (turkey)
cjo:107308043 s2f_7617 K04198 GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100 EDNRB endothelin receptor type B Coturnix japonica (Japanese quail)
nmel:110408339 s2f_7617 K04198 GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100 EDNRB endothelin receptor type B Numida meleagris (helmeted guineafowl)
apla:101790262 s2f_7617 K04198 GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100 EDNRB endothelin receptor type B Anas platyrhynchos (mallard)
acyg:106043935 s2f_7617 K04198 GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100 EDNRB endothelin receptor type B Anser cygnoides domesticus (swan goose)
tgu:100228255 s2f_7617 K04198 GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100 EDNRB endothelin receptor type B Taeniopygia guttata (zebra finch)
... ... ...

8. Single sample profiling

Back to top

It is known that it is very difficulty to obtain larger number of samples for lots of environmental species, and sometimes only one sample for each experimental group. In this case, it is very difficult to conduct comparative study without enough statistical power
To cope with this challenge, --profiling mode is implemented in Seq2Fun. Using this mode, 4 main levels output files KO abundance table, hit pathway table, hit species table and read KO table are generated. An additional html report summarizing and visualizing these tables is also generated.
A rarefaction curve is also generated in html report to show the sequence depth against number of KOs. This figure infers 1). the sequence depth is enough or not. It is useful when considering the difficulty to obtaining enough samples or hight-quality RNA for environmental species. 2). If you have too much reads and you want a faster processing, you can use --reads_to_process to specify the number of reads to process.
By default profiling mode is off, using --profiling to turn it on, it is ~20% slower than non-profiling mode and could consume a little bit more memory at the last stage of generating tables
e.g.: --profiling

see more details about the output files in 4.3 - 4.7.

9. MEM or GREEDY modes in amino acid sequences search in the database

Back to top

Seq2Fun translates clean read into amino acid sequences with all six possible reading frames, followed by searching and comparing with reference protein sequences in database. Seq2Fun offers two searching modes of maximum exact matches (MEM) and GREEDY. MEM mode is a fast running mode and implemented as that translated amino acid sequences must be perfectly matched with reference protein sequences without any mismatches. Any mismatched amino acid sequence and its corresponding read will be discarded. It is designed for organisms who have reference sequence in the database. However, due to the evolutionary distance between the target organism and reference organisms as well as sequencing error, there could be mismatches between amino acid sequences translated from reads and reference protein sequences. Using MEM mode could result in the various number of unmapped reads and rapid sensitivity decay depending on the phylogenetic distance between organisms as well as the number of sequencing errors. Therefore, Greedy mode is introduced to allow mismatches between query and subject sequences to overcome the evolutionary divergences and sequencing errors. Greedy mode is designed for organisms who do not have reference sequences in database.
The default mode is Greedy mode. Or using -K or --mode to specify mode.
e.g.: -K greedy
The MEM mode can be turn on e.g.: -K mem

10. Minimum matching length of amino acid sequence with reference protein sequences

Back to top

The minimum matching length determines cut off length of how long the amino acid sequence matches with reference protein sequences. It is used for both MEM and GREEDY modes. Our results in the tested datasets show that minlength generally has complicated impacts in Greedy mode with the hightest values of coefficient of determination R square obtained with minlength ranging from 19 to 28. Therefore, the default value of minlength is set to 19, which means that the matching amino acid length must be no less than 19 aa. However, it is should be carefully tested on different organisms especially when your organisms have large evolutionary distances with organisms in the database. Decreasing the minlength would help overcome the large sequence divergence between query and subject sequences. But it would tradeoff specificity, resulting in some false positive matches. Using -J or --minlength to set minimum matching length with defulat value 19
e.g.: -J 19

11. Allowed amino acid mismatches (Greedy mode only)

Back to top

Seq2Fun employs GREEDY model to overcome evolutionary distance and sequencing error. It is implemented with setting the number of mismatches. The number of mismatches depends on the nature of protein itself, evolutionary distance and sequence error. Our results in the tested datasets show that minlength generally has negative impacts in Greedy mode. Therefore, the default value of minlength is set to 2, that means allowing two amino acid mismatches. However, it is should be carefully tested on different organisms especially when your organisms have large evolutionary distances with organisms in the database. Increasing the number of mismatches would help overcome the large sequence divergence between query and subject sequences. But it would tradeoff specificity, resulting in some false positive matches. As there are more than 20 amino acids, increasing the number of mismatches could result in exponential increasing searches in the database (also dependent on composition in the database) and would significantly slow down the search speed.This number of mismatches will translate into the affection of abundance values of KO in the output table.
Using -E or --mismatch to specify number of amino acid mismatches with default value of 2.
e.g.: -E 2

12. Minimum matching score (Greedy mode only)

Back to top

Matching score based on blosum62 is used to assess how closely the translated amino acid sequence with mutations match with reference protein sequences. Just as number of the mismatched amino acids, it could trades off sensitivity and precision. Decreasing the value of the minimum matching score will increase sensitivity with a decreasing precision. It could have affect on the runtime as it servers as filter to filter out low blosum62 score AA fragments. Therefore, higher minimum matching score could results in a speed gain. Based on our assessment in three datasets, the minimum matching score has very limited affect on the overall gene abundance quantification. Thus, we set the default minimum matching score to 80.
Using -j or --minscore to specify minimum matching score
e.g.: -j 80

13. Keep all the translated amino acid fragments(not recommended)

Back to top

After read translation into all possible amino acid fragments using the six reading frames, only the top longest fragments are select for homolog search. This filtering will greatly reduce the number of false translations if there is no true stop codon in that read, which in turn can increase the speed of Seq2Fun. This hard filtering will probably filter out some reads containing true stop codon, but it has very limited on the downstream analysis such as fold change in comparative studies.
We offer the option to keep all the fragments, but their length must be > minimum matching length.
Using --allFragments keep all the fragments
e.g.: --allFragments

14. Select codon table

Back to top

Select the codon table (same as blastx in NCBI), we provide 33 codon tables from 'https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG31'. Be default is the Standard Code

Standard1
VertebrateMitochondria2
YeastMitochondrial3
MoldProtozoanCoelenterateMitochondrialMycoplasmaSpiroplasma4
InvertebrateMitochondrial5
CiliateDasycladaceanHexamitaNuclear6
EchinodermFlatwormMitochondrial9
EuplotidNuclear10
AlternativeYeastNuclear12
AscidianMitochondrial13
InvertebrateMitochondiral5
ChlorophyceanMitochondrial16
TrematodeMitochondrial21
ScenedesmusobliquusMitochondrial22
RhabdopleuridaeMitochondrial24
PachysolentannophilusNuclear26
KaryorelictNuclear27
MesodiniumNuclear29
PeritrichNuclear30
BlastocrithidiaNuclear31
CephalodiscidaeMitochondrial33
Using --codontable to specify the codon table; "Standard1" is the default
e.g.: --codontable VertebrateMitochondria2

15. Trim the fist 6 base of read

Back to top

The fist 6 base of a read could have higher mismatch rates due to the random-hexamer priming, we advise to trim the fist 6 bases of a read
Using --trim_front1 and/or --trim_front2 for single or pair-end reads
e.g.: --trim_front1 6 --trim_front2 6

16. Custom built database

Back to top

Seq2Fun fully supports customer built database. Seq2Fun needs two input files, FM index file, which is built from protein sequences and used for database; and protein annotation table, which is used for mapping protein ID to s2f_id/KO/GO/gene/species. To build your own database, you must supply a protein fasta file
e.g.: birds.fasta

>gga:419912
MEGNEGDPPGAHRRRDSHGRRPSREINAEDQVAQETEEVFRSYTFYRYQQEREEGGAEVPMDPEIMEIQQELGSTGSQVGRRLAIIGDDINKRYDAEFRHMLKSLQPTKENAYEYFTTIASSLFDSGINWGRVIALLGFGYCMAIHVYQQGITGFLRRIARYVTEFMLRNRIARWIAQQGGWVAALDLDNVYMKYMLVVLALVMVGHLVVRRFFRP
>gga:427365
MAGDGGVYRLRCSLLGHEQDVRGITRGLFPEGGFVSVSRDRTARLWTPDGLNRGFTEMHCMRGHSNFVSCVCIIPPSDLYPRGLIATGGNDHNICVFTVDNTAPLYVLKGHTNTVCSLSSGKFGTLLSGSWDTTCKVWLNDRCMMTLQGHTAAIWAVKILPEQGLMLTGSADKTIKLWKAGRCERTYAGHEDCVRGLAILSEMEFLSCSNDASVRRWHISGECLHVYYGHTNYIYSVSVFPHCKDFITTGEDRSLRIWKQGECVQTIRLPAQSVWCCCVLDNDDIVVGASDGIIRVFTKSLERTASAEEIQAFENELSQASIDPKTGDLGDINADDLPGKEHLKDPGMRDGQTRLIKDNGKVEAYQWSVSEGRWIKIGDVVGSSGGTQQTSGKVLFEGKEYDYVFTIDVNESGPSYKLPYNISDDPWLTAYNFLQKNDLNPMFLDQVAKFIIDNTKGQAPLNTNTEFSDPFTGAGRYVPGSSSLSNTAPAADPFTGMGAYQSAKAENIYFPKKDAVTFDQANPTQILGKLKELNGNAAEEQKLTEDDLIILEKLLSATCNTSAEMPTAQQIQTLWKAINWPEDIVFPALDILRLSVRHPIVNENFCGEKAHVQFIHLLLKFLNPQGKPANQLLALRALCNCFVSQAGQKLLMEQRDEIMTQAIEVKSGNNKNVHIALATLTLNYAVCLHKVNNIEGKAQCLSVISTVMEVVKDLEAVFRLLVALGTLISDDKNAVQLAKSLGVDSQIKKYVSVSEPAKVKECCRFVLNLL
...
Please be noted that the protein sequences must consist of the standarad 20 amino acid (ACDEFGHIKLMNPQRSTVWY) in the uppercase.
S2F_HOME/bin/mkbwt -n 8 -a ACDEFGHIKLMNPQRSTVWY -o brids_proteins birds.fasta;
S2F_HOME/bin/mkfmi brids_proteins;
                            
It generate a fmi file named brids.fmi. See 5. Protein database for translated search.

To use the database, you must obtain the ortholog among your protein ids, we recommend to use OrthoFinder to identify ortholog groups for these phylogenetic groups. For example. s2f_0000000 represent one ortholog. Then prepare a mapping file containing protein ID and its corresponding s2f_id/ortholog, KO, GO, gene symbol and description, as well as species name, and the last column is the core gene frequency (N. of organisms in that orthogroup / total N of organisms in the species group (eg. birds database) separated by "\t". The core gene frequency will be used for additional mapping assessment.
Please be noted that the same ortholog groups could contain different KO, GO, gene symbol/description, only the feature with highest frequency will be kept. The number of protein id and id in protein sequences used for FM index construction and annotation map must be identical.

e.g.: birds_annotation.txt.
Please be noted that the protein ID in the first column must be unique and identical with protein ID in birds.fasta, and must be corresponding to s2f_ID, KO/GO, in the second column.
aam:106485833	s2f_0000000	U	GO:0003676;GO:0015074	GIN1	endogenous retrovirus group k member 18 pol protein-like	Apteryx mantelli mantelli (brown kiwi)	0.9394
phi:102109513 s2f_0000001 K01317 GO:0004252;GO:0006508;GO:0008233;GO:0008236;GO:0016787 CORIN acrosin-like Pseudopodoces humilis (Tibetan ground-tit) 0.9394
etl:114070275 s2f_0000004 K17854 GO:0004497;GO:0005506;GO:0016020;GO:0016021;GO:0016491;GO:0016705;GO:0016712;GO:0020037;GO:0046872 CYP2AC7 cytochrome p450 2j2-like Empidonax traillii (willow flycatcher) 0.9394
tgu:116809310 s2f_0569608 U U U fibroin heavy chain-like isoform x1 Taeniopygia guttata (zebra finch) 0.0303
tgu:115498598 s2f_0569609 U U U uncharacterized protein loc115498598 Taeniopygia guttata (zebra finch) 0.0303
... ... ...

17. Automatically trimming polyA tail in reads

Back to top

Reads with polyA tail could affect protein comparison, resulting in a high artificial matching score. Removing or trimming reads containing polyA tail could increase precision.
By default, trimming polyA tail is activated.
If you want to disable this function
using --no_trim_polyA

18. Minimum length for merging overlapped PE read pair (PE reads only)

Back to top

The length of query amino acid sequences could have heavily impact the comparison with reference protein sequences. It generally believes that the longer of query sequences, the higher precision of protein annotation. Therefore, merging overlapped PE read into longer read could yield longer translated amino acid sequences, which could reduce number of ill mapped reads and boost annotation precision. The default value of minimum length for merging PE read pair is 30 bp if the read pair is overlapped.
Using -v or --overlap_len_require to specify miminum length of overlapped PE read pair
e.g: -v 30

19. Maximum cutoff length of translated peptides

Back to top

Seq2Fun only keeps the longest AA fragment for identification of homologies in protein database, this will filter out lots of false fragments and accelerate Seq2Fun. In some cases, only keeping at most the longest fragments could filter out a proportion of fragments that originated from the merged PE reads if start and stop codons are present in the middle of the reads. Therefore, we recap the maximum cut-off length of peptide fragment to be 60 aa (by default, user can change this cut-off), which will prevent the filtering out of some true peptide fragments from the merged reads.
Using -m or --maxtranslength to specify maximum cutoff length of translated peptides.
Note: --maxtranslength must be larger than the --minlength (minimum matching length of amino acid sequence) in 7.
e.g: -m 60

20. Output mapped reads

Back to top

For non-model organisms, it could be important to obtain the assembled genes/contigs of interest.
Seq2Fun offers a feature to capture and label each mapped clean read with KO/GO tag and output as fastq.gz files. The clean reads can be directly used to assembled into genes/contigs using various de novo assemblers such as Trinity, et al. Because of only a small proportion (~30%) of total reads are mapped and retrieved, the de novo assembly for each genes is very fast and usually can be finished within several minutes.
Various downstream analysis such as designing primers, comparable analysis, on these assembled genes/contigs can be applied, and this will greatly compensate the information loss by Seq2Fun compared with the conventional de novo assemblers.

Using --outputMappedCleanReads to enable this function.

Forward mapped clean reads file A1.CE2-S1-LT_mapped_R1.fastq.gz:
@A00266:275:HLFTWDSXX:2:1467:22182:12414 1:N:0:TTACCGAC+CGAATACG s2f_148
GCGTTTGATGAGGTTGATAAGCAATTTCATGTTCCCTAAATCCCGTTCATCAGAAGAAGCAGTGCACAATTTTTCATCGAAGCTGAACATATCAGGGTCAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00266:275:HLFTWDSXX:2:1354:31304:33943 1:N:0:TTACCGAC+CGAATACG s2f_3047
AGCTGCTTTGGCATCTTCTGGAGAGCTGAAGTCCACAAACCCAAACCCTTTAGATGATCCAGTGTCTCTGTCTGTGACTATTCTAGCACTTATAGAGCCTT
+
FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

21. General guideline for parameter settings

Back to top

Several parameters drives the trade-offs between sensitivity and precision. The default parameter setting is obtained from the parameter sensitivity test on four organisms: mouse, chicken, zebrafish and roundworm. If your studied species has no or limited number of close-related reference species in the database. You can tune parameters to obtain a better balance between sensitivity and precision.
For example: 1) increasing the number of mismatches (--mismatch) from default 2 to 3 or 4; 2) or decreasing the minimum matching length (--minlength) from default 19; 3) or decreasing the minimum BLOSUM62 score (--minscore) from default 80; 4) or decreasing the maximum length cutoff of the translated amino acid sequences for overlapped paired-end reads from default 60; will increase the mapping reads or the mapping chance for the highly divergent homologs.
It is worth mentioning that this parameter adjustment could slow down the Seq2Fun and/or increase the false positives with various degrees.

22. Submit s2f id abundance to ExpressAnalyst for downstream analysis (optional)

Back to top

After obtaining s2f_id/ortholog abundance table, we provide an online website-based analysis tool to conduct downstream analysis such as identification of differentially expressed genes/ortholog, pathway enrichment and GO analysis, et al.
Submit s2f id abundance and annotation tables to our website-based tool ExpressAnalyst (optional)
S2fid_abundance_table_all_samples_submit_2_expressanalyst.txt will be used to submit to ExpressAnalyst.

23. SeqTract full usage options

Back to top

SeqTract is used to extract mapped reads from output mapped fastq.gz files from Seq2Fun for target gene assemble, which is useful for downstream analysis such as designing primers, phylogenetic analysis, et al. SeqTract can extract reads annotated with s2f_id, but it is recommended to use only ko id for the extraction.
Use seqtract or seqtract --help to show the full usage options

  options:
--sampleTable input table for fastq files with annotation from Seq2Funs output, eg: one column for SE samples or two columns (separated by '\t') of samples for PE sample
--geneTable input table of target genes for sequences extraction, eg: one columns of s2f_ids
--outputDir output folder for extract fastq files, default is current working dir (string [=.])
--undetermined the file name of undetermined data, default is Undetermined (string [=Undetermined])
--compression compression level for gzip output (1 - 9), default is 6 (int [=6])
-w, --thread worker thread number, default is 2 (int [=2])
23.1. A typical seqtract run
S2F_HOME/bin/seqtract --sampleTable sample.txt --geneTable gene.txt --outputDir output -w 8 -V
                            
23.2. --sampleTable sample.txt consisting of one column of forward read files for SE reads or two columns of forward and reverse read files (must be separated by '\t') for PE reads of
A1.CE2-S1-LT_mapped_R1.fastq.gz A1.CE2-S1-LT_mapped_R2.fastq.gz
A2.CE2-M4-LT_mapped_R1.fastq.gz A2.CE2-M4-LT_mapped_R2.fastq.gz
B1.CE2-S2-LT_mapped_R1.fastq.gz B1.CE2-S2-LT_mapped_R2.fastq.gz
B2.CE2-M5-LT_mapped_R1.fastq.gz B2.CE2-M5-LT_mapped_R2.fastq.gz
C1.CE2-S3-LT_mapped_R1.fastq.gz C1.CE2-S3-LT_mapped_R2.fastq.gz
D1.CE2-S4-LT_mapped_R1.fastq.gz D1.CE2-S4-LT_mapped_R2.fastq.gz
D2.CE2-H2-LT_mapped_R1.fastq.gz D2.CE2-H2-LT_mapped_R2.fastq.gz
E1.CE2-S5-LT_mapped_R1.fastq.gz E1.CE2-S5-LT_mapped_R2.fastq.gz
E2.CE2-H3-LT_mapped_R1.fastq.gz E2.CE2-H3-LT_mapped_R2.fastq.gz
F1.CE2-M1-LT_mapped_R1.fastq.gz F1.CE2-M1-LT_mapped_R2.fastq.gz
F2.CE2-H4-LT_mapped_R1.fastq.gz F2.CE2-H4-LT_mapped_R2.fastq.gz
G1.CE2-M2-LT_mapped_R1.fastq.gz G1.CE2-M2-LT_mapped_R2.fastq.gz
G2.CE2-H5-LT_mapped_R1.fastq.gz G2.CE2-H5-LT_mapped_R2.fastq.gz
H1.CE2-M3-LT_mapped_R1.fastq.gz H1.CE2-M3-LT_mapped_R2.fastq.gz
... ...
23.3. --geneTable gene.txt consisting of one column of annotated feature s2f_id
s2f_10
s2f_100
s2f_1000
...
23.4. --outputDir output is specified for the extracted fastq.gz files
s2f_10_R1.fastq.gz
s2f_10_R2.fastq.gz
s2f_100_R1.fastq.gz
s2f_100_R2.fastq.gz
s2f_1000_R1.fastq.gz
s2f_1000_R2.fastq.gz
...

24. Target gene de novo assemble

Back to top

You can use transcriptome assemblers such as SOAPdenovo-Trans to conduct target gene assemble.