Seq2Fun - Manual

Full options
Running Seq2Fun
Input files
Batch sample processing
Output files
Protein database for translated search
Protein annotation mapping table
Single sample profiling
MEM or GREEDY modes in amino acid sequences search in the database
Minimum matching length of amino acid sequence with reference protein sequences
Allowed amino acid mismatches (Greedy mode only)
Minimum matching score (Greedy mode only)
Keep all translated amino acid fragments (not recommend)
Change codon table
Trim the first 6 bases (optional)
Custom built database
Automatically trimming polyA tail in reads
Minimum length for merging overlapped PE read pair (PE reads only)
Maximum cutoff length of translated peptides
Output mapped reads
General guidelines for parameter settings
Submit gene abundance tables to ExpressAnalyst for downstream analysis
Extract mapped reads for target genes
Target gene de novo assemble

1. Seq2Fun full usage options

Use seq2fun or seq2fun --help to show the full usage options

  options:

  // input/output

  -s, --sampletable                 (recommended) sample table must consist of 3 columns (sample prefix name (sample01), forward reads name (sample01_R1.fq.gz), group info (control) for single-reads or 4 columns (sample prefix name (sample01), forward reads (sample01_R1.fq.gz), reverse reads (sample01_R2.fq.gz), group info (control) for paired-end reads. The columns must be separated by tab 

  -i, --in1                         read1 input file name

  -I, --in2                         read2 input file name

  -X, --prefix                      (not recommended) prefix name for output files, eg: sample01

      --outputMappedCleanReads      enable output mapped clean reads into fastq.gz files, by default is false, using --outputMappedCleanReads to enable it

      --outputReadsAnnoMap            enable output mapped clean reads - gene map into .gz files, by default is false, using --outputReadsAnnoMap to enable it

  // Homology search;

   -D, --genemap                    gene annotation map file

       --profiling                  by default it is off. If this option is specified, rarefaction curve of reads against number of ortholog will be generated

    
  // translated search

   -d, --tfmi                       fmi index of Protein database

   -K, --mode                       searching mode either tGREEDY or tMEM (maximum exactly match). By default greedy 

   -E, --mismatch                   number of mismatched amino acid in sequence comparison with protein database with default value 2

   -j, --minscore                   minimum matching score of amino acid sequence in comparison with protein database with default value 80

   -J, --minlength                  minimum matching length of amino acid sequence in comparison with protein database with default value 19 for GREEDY and 13 for MEM model

   -m, --maxtranslength             maximum cutoff of translated peptides, it must be no less than minlength, with default 60

	--allFragments		    enable this function will force Seq2Fun to use all the translated AA fragments with length > minlength. This will slightly help to classify reads contain the true stop codon and start codon; This could have limited impact on the accuracy for comparative study and enable this function will slow down the Seq2Fun. by default is false, using --allFragments to enable it"

	--codontable		    select the codon table (same as blastx in NCBI), we provide 21 codon tables from NCBI. By default is the codontable1 (Standard Code), the complete codon table can be seen below.

        --dbDir			    specify dir for internal databases such as ko_fullname.txt


  //selected pathways

   -Z, --pathway                    list of selected pathways for target pathways analysis

   -z, --genefa                     the gene/protein sequences fasta file for retrieving proteins in selected pathways to construct database

    
  // threading

  -w, --thread                      worker thread number, default is 2

    
  -V, --verbose                     enable verbose

      --debug                       enable debug


      --phred64                     indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)

      --reads_to_process            specify how many reads/pairs to be processed. Default 0 means process all reads


  // adapter

  -A, --disable_adapter_trimming    adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled

  -a, --adapter_sequence            the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped

      --adapter_sequence_r2         the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as adapter_sequence

      --adapter_fasta               specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file

      --detect_adapter_for_pe       by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data

    
  //polyA tail

      --no_trim_polyA               by default, ployA tail will be trimmed. If this option is specified, polyA trimming is disabled


  // trimming

  -f, --trim_front1                 trimming how many bases in front for read1, default is 0

  -t, --trim_tail1                  trimming how many bases in tail for read1, default is 0

  -b, --max_len1                    if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation

  -F, --trim_front2                 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings

  -T, --trim_tail2                  trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings

  -B, --max_len2                    if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings


  // polyG tail trimming

  -g, --trim_poly_g                 force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

      --poly_g_min_len              the minimum length to detect polyG in the read tail. 10 by default

  -G, --disable_trim_poly_g         disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data


  // polyX tail trimming

  -x, --trim_poly_x                 enable polyX trimming in 3' ends

      --poly_x_min_len              the minimum length to detect polyX in the read tail. 10 by default


  // cutting by quality

      --cut_front                   move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise

      --cut_tail                    move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise

  -r, --cut_right                   move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop

  -W, --cut_window_size             the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4

  -M, --cut_mean_quality            the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20)

      --cut_front_window_size       the window size option of cut_front, default to cut_window_size if not specified

      --cut_front_mean_quality      the mean quality requirement option for cut_front, default to cut_mean_quality if not specified

      --cut_tail_window_size        the window size option of cut_tail, default to cut_window_size if not specified

      --cut_tail_mean_quality       the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified

      --cut_right_window_size       the window size option of cut_right, default to cut_window_size if not specified

      --cut_right_mean_quality      the mean quality requirement option for cut_right, default to cut_mean_quality if not specified


  // quality filtering

  -Q, --disable_quality_filtering   quality filtering is enabled by default. If this option is specified, quality filtering is disabled

  -q, --qualified_quality_phred     the quality value that a base is qualified. Default 15 means phred quality <=Q15 is qualified

  -u, --unqualified_percent_limit   how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40%

  -n, --n_base_limit                if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5

  -e, --average_qual                if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement


  // length filtering

  -L, --disable_length_filtering    length filtering is enabled by default. If this option is specified, length filtering is disabled

  -l, --length_required             reads shorter than length_required will be discarded, default is 60

      --length_limit                reads longer than length_limit will be discarded, default 0 means no limitation


  // low complexity filtering

      --no_low_complexity_filter    disable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1])

  -Y, --complexity_threshold        the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required


  // filter by indexes
      --filter_by_index1            specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line

      --filter_by_index2            specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line

      --filter_by_index_threshold   the allowed difference of index barcode for index filtering, default 0 means completely identical


  // base correction in overlapped regions of paired end data

  -c, --disable_correction          disenable base correction in overlapped regions (only for PE data), default is enabled");

  -v, --overlap_len_require         the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default

      --overlap_diff_limit          the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default

      --overlap_diff_percent_limit  the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%


  // umi

  -u, --umi                         enable unique molecular identifier (UMI) preprocessing

      --umi_loc                     specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none

      --umi_len                     if the UMI is in read1/read2, its length should be provided

      --umi_prefix                  if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default

      --umi_skip                    if the UMI is in read1/read2, Seq2Fun can skip several bases following UMI, default is 0


  // overrepresented sequence analysis

  -p, --overrepresentation_analysis enable overrepresented sequence analysis

  -P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20



  // deprecated options

     --cut_by_quality5              DEPRECATED, use --cut_front instead

     --cut_by_quality3              DEPRECATED, use --cut_tail instead

     --cut_by_quality_aggressive    DEPRECATED, use --cut_right instead

     --discard_unmerged             DEPRECATED, no effect now, see the introduction for merging


We have followed the codon tables from NCBI;

                       
                            
                                Seq2Fun
								NCBI
                            

                            
                                codontable1
                                The Standard Code (transl_table=1)
                            
						   
                            
                                codontable2
                                The Vertebrate Mitochondrial Code (transl_table=2)
                            
						   
                            
                                codontable3
                                The Yeast Mitochondrial Code (transl_table=3)
                            
						   
                            
                                codontable4
                                The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (transl_table=4)
                            
						   
                            
                                codontable5
                                The Invertebrate Mitochondrial Code (transl_table=5)
                            
						   
                            
                                codontable6
                                The Ciliate, Dasycladacean and Hexamita Nuclear Code (transl_table=6)
                            
						   
                            
                                codontable9
                                The Echinoderm and Flatworm Mitochondrial Code (transl_table=9)
                            
						   
                            
                                codontable10
                                The Euplotid Nuclear Code (transl_table=10)
                            
						   
                            
                                codontable12
                                The Alternative Yeast Nuclear Code (transl_table=12)
                            
						   
                            
                                codontable13
                                The Ascidian Mitochondrial Code (transl_table=13)
                            
						   
                            
                                codontable14
                                The Alternative Flatworm Mitochondrial Code (transl_table=14)
                            
						   
                            
                                codontable16
                                Chlorophycean Mitochondrial Code (transl_table=16)
                            
						   						   
                            
                                codontable21
                                Trematode Mitochondrial Code (transl_table=21)
                            
						   						   						   
                            
                                codontable22
                                Scenedesmus obliquus Mitochondrial Code (transl_table=22)
                            
						   						   						   
                            
                                codontable24
                                Rhabdopleuridae Mitochondrial Code (transl_table=24)
                            
						   						   						   
                            
                                codontable26
                                Pachysolen tannophilus Nuclear Code (transl_table=26)
                            
						   						   						   
                            
                                codontable27
                                Karyorelict Nuclear Code (transl_table=27)
                            
						   						   						   
                            
                                codontable29
                                Mesodinium Nuclear Code (transl_table=29)
                            
						   						   						   
                            
                                codontable30
                                Peritrich Nuclear Code (transl_table=30)
                            	
						   
                            
                                codontable31
                                Blastocrithidia Nuclear Code (transl_table=31)
                            	
						   
                            
                                codontable33
                                Cephalodiscidae Mitochondrial UAA-Tyr Code (transl_table=33)

Seq2Fun	NCBI
codontable1	The Standard Code (transl_table=1)
codontable2	The Vertebrate Mitochondrial Code (transl_table=2)
codontable3	The Yeast Mitochondrial Code (transl_table=3)
codontable4	The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (transl_table=4)
codontable5	The Invertebrate Mitochondrial Code (transl_table=5)
codontable6	The Ciliate, Dasycladacean and Hexamita Nuclear Code (transl_table=6)
codontable9	The Echinoderm and Flatworm Mitochondrial Code (transl_table=9)
codontable10	The Euplotid Nuclear Code (transl_table=10)
codontable12	The Alternative Yeast Nuclear Code (transl_table=12)
codontable13	The Ascidian Mitochondrial Code (transl_table=13)
codontable14	The Alternative Flatworm Mitochondrial Code (transl_table=14)
codontable16	Chlorophycean Mitochondrial Code (transl_table=16)
codontable21	Trematode Mitochondrial Code (transl_table=21)
codontable22	Scenedesmus obliquus Mitochondrial Code (transl_table=22)
codontable24	Rhabdopleuridae Mitochondrial Code (transl_table=24)
codontable26	Pachysolen tannophilus Nuclear Code (transl_table=26)
codontable27	Karyorelict Nuclear Code (transl_table=27)
codontable29	Mesodinium Nuclear Code (transl_table=29)
codontable30	Peritrich Nuclear Code (transl_table=30)
codontable31	Blastocrithidia Nuclear Code (transl_table=31)
codontable33	Cephalodiscidae Mitochondrial UAA-Tyr Code (transl_table=33)

2. Running Seq2Fun

In order to make it clear and just for demonstration purpose, after downloading and installation of seq2fun, we first refer the path to seq2fun installation directory (e.g. /home/peng/Software/seq2fun) to "S2F_HOME".
This "S2F_HOME" contains several directories, e.g., bin contains executable files seq2fun mkbwt mkfmi seqtract transembler, therefore the location of executable file seq2fun is S2F_HOME/bin/seq2fun
Similarly, we create a new directory "database" in S2F_HOME, which is used to store databases of different groups of organisms. After downloading the database, we put them in the database directory, e.g., mammals, birds, fishes. Each database contains .fmi, protein_ko_go_species.txt files. Here, the database path is S2F_HOME/database/. Otherwise, you can put the database anywhere that is clear to you.

2.1. a typical run for a single sample of pair-end (PE) reads with comparative mode, which will only generate s2f_id/ortholog bundance tables for each sample (total s2f_id/ortholog abundance tables will not be produced, using -s batch mode to produce it) (not recommended):

S2F_HOME/bin/seq2fun -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R2.fastq.gz --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt --prefix A1.CE2-S1_R1 -w 8

or for batch sample processing, and it will produce total s2f_id/ortholog abundance tables for all samples, and s2f_id/ortholog abundant tables for each sample (recommended)

S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt -w 8

a sample.txt is a table consisting of four columns must be separated by tab '\t'. The first column is the prefix name of sample, the second one is filename of forward reads, the third one is the filename of reverse reads and last one is the class info of that sample.

A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   A1.CE2-S1_R2.fastq.gz   control

A3.CE1-M2   A3.CE1-M2_R1.fastq.gz   A3.CE1-M2_R2.fastq.gz   medium

...                     ...                     ...

or a sample.txt with path for output and input.

/path/to/A1.CE2-S1   /path/to/A1.CE2-S1_R1.fastq.gz   /path/to/A1.CE2-S1_R2.fastq.gz   control

/path/to/A3.CE1-M2   /path/to/A3.CE1-M2_R1.fastq.gz   /path/to/A3.CE1-M2_R2.fastq.gz   medium

...                     ...                     ...

2.2. a typical run for a single sample of pair-end (PE) reads with profiling mode, which will generate several tables of a s2f_id/ortholog abundance, reads s2f_id/ortholog, as well as html reports summarizing these tables and rarefaction curve, as well as reads quality checks for all samples and each sample.

S2F_HOME/bin/seq2fun -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R2.fastq.gz --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt --prefix A1.CE2-S1 -w 8 --profiling

or for batch sample processing

S2F_HOME/bin/seq2fun --sampletable sample.txt --tfmi S2F_HOME/database/birds/birds.fmi --genemap S2F_HOME/database/birds/birds_annotation.txt -w 8 --profiling

3. Input files

Seq2Fun supports both paired-end(PE) and single-end(SE) reads from various sequencing platforms e.g., Illumina. There is no reads length limitation. It supports both fastq and fastq.gz file format.

3.1 For PE reads using -i or --in1 and -I or --in2 for forward and reverse reads, respectively.
e.g.: -i A1.CE2-S1_R1.fastq.gz -I A1.CE2-S1_R1.fastq.gz

3.2 For SE reads using -i or --in1 for forward reads.
e.g.: -i A1.CE2-S1_R1.fastq.gz

4. Batch sample processing (highly recommended)

It is highly recommended to use bach processing mode. Batch sample processing not only avoids repeatedly loading and deleting database when processing many samples, but also produces KO/GO abundance tables consisting all samples.

Using -s or --sampletable table to specify the file containing reads files and output file.
e.g.: -s sample.txt

4.1 For PE reads using -i or --in1 and -I or --in2 for forward and reverse reads, respectively.
For PE reads, sample_table must consist of four columns using tab ("\t") as a separator. The first column is prefix name of sample, the second is the filename of forward reads, the third one is the filename of reverse reads and last one is the sample class info.

A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   A1.CE2-S1_R2.fastq.gz   control

A3.CE1-M2   A3.CE1-M2_R1.fastq.gz   A3.CE1-M2_R2.fastq.gz   medium

...                     ...                     ...

4.2 For SE reads, sample_table must consist of three columns using the tab ("\t") as a separator. The first column is the prefix name of sample and the second one is the filename of forward reads, and last one is the sample group info

A1.CE2-S1   A1.CE2-S1_R1.fastq.gz   control

A3.CE1-M2   A3.CE1-M2_R1.fastq.gz   medium

...                     ...

5. Output files

By default, Seq2Fun generate 3 main output text files of s2f id, KO and GO abundance tables for all samples and each sample.
If you are using --profiling mode, there are 6 main output text files: s2f, KO and GO abundance tables, hit pathway table and hit species table, reads ko table (reads annotation table); and two html reports containing tables and figures of summary these tables, as well as reads quality check.

5.1 If you use input files from 2.1
Using -X or --prefix to specify the prefix name of output files
e.g.: -X A1.CE2-S1

5.2 If you use batch sample process mode from 3.1, the last column in sample.txt is the prefix name of output files
Using -s or --sampletable to specify file containing both input and output files.
e.g.: -s sample.txt

5.3.1. An output file of s2f_id/ortholog abundance table for all samples. It is only generated by batch mode

#NAME	A1.CE2-S1	A3.CE1-M2	A4.CE1-H5	B1.CE2-S2	B3.CE1-M3	C1.CE2-S3	C3.CE1-M4	D1.CE2-S4	D3.CE1-M5	E1.CE2-S5	E3.CE1-H1	F3.CE1-H2	G3.CE1-H3	H2.CE1-M1	H3.CE1-H4	annotation

#CLASS:XX	control	middle	high	control	middle	control	middle	control	middle	control	high	high	high	middle	high	-

s2f_7617	158	338	475	130	329	178	342	185	479	160	312	429	241	309	345	K04198|GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100|EDNRB|endothelin receptor type B

s2f_7570	10	9	9	8	1	8	23	8	23	9	9	7	15	7	12	K07437|GO:0004497;GO:0005506;GO:0016491;GO:0016705;GO:0020037;GO:0046872|CYP26A1|cytochrome P450 26A1

s2f_14150	0	1	0	0	0	0	0	0	0	1	1	0	1	0	0	K05254|GO:0005179;GO:0005576;GO:0007165;GO:0016608|GHRL|appetite-regulating hormone

s2f_6367	22	43	48	50	50	31	55	29	38	16	52	45	40	40	45	K03370|GO:0005794;GO:0006486;GO:0008373;GO:0016020;GO:0016021;GO:0016740;GO:0016757;GO:0097503|ST3GAL5|lactosylceramide alpha-2,3-sialyltransferase isoform X1

s2f_7077	42	81	44	63	61	61	63	66	60	83	46	65	72	74	51	U|GO:0000775;GO:0000776;GO:0005694|CFDP1|craniofacial development protein 1

s2f_12441	3	4	8	4	2	1	3	0	13	2	3	7	6	4	5	K15680|GO:0016491|HSD11B1L|hydroxysteroid 11-beta-dehydrogenase 1-like protein isoform X1

...

5.3.2. An output file of s2f_id/ortholog abundance table for each sample.

#s2f_id Reads_cout      annotation

s2f_7617        158     K04198|GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100|EDNRB|endothelin receptor type B

s2f_7570        10      K07437|GO:0004497;GO:0005506;GO:0016491;GO:0016705;GO:0020037;GO:0046872|CYP26A1|cytochrome P450 26A1

s2f_6367        22      K03370|GO:0005794;GO:0006486;GO:0008373;GO:0016020;GO:0016021;GO:0016740;GO:0016757;GO:0097503|ST3GAL5|lactosylceramide alpha-2,3-sialyltransferase isoform X1

s2f_7077        42      U|GO:0000775;GO:0000776;GO:0005694|CFDP1|craniofacial development protein 1

s2f_12441       3       K15680|GO:0016491|HSD11B1L|hydroxysteroid 11-beta-dehydrogenase 1-like protein isoform X1

s2f_12567       13      K09784|GO:0005154;GO:0007165;GO:0008083;GO:0016020;GO:0016021|EREG|proepiregulin

s2f_7711        6       K04357|GO:0005509;GO:0007165;GO:0008083;GO:0016020;GO:0016021|EGF|pro-epidermal growth factor


...

5.7. Html report for functional summary

if using --sampletable and --profiling, there will be a full html reports for summarizing tables for all samples.

Seq2Fun_summary_all_samples.html

6. Protein database for translated search

Translating each read into amino acid sequences and subsequently searching them in the protein database to find their homologies. If homologies identified, the homology protein's ID will be assigned to read
Seq2Fun provides various pre-built databases of protein sequences. The size of the protein database has an impact on the speed of Seq2Fun.
The protein database was built from fasta file of protein sequences based on burrows wheeler transform (BWT) and FM index for fast search. Download the pre-built database from DATABASES page.
Or you can build your own protein database (see 16).

Using -d or --tfmi to specify protein database
e.g.: -d birds.fmi

7. Gene annotation table

Each assigned read has protein IDs attached, this is used for search its corresponding s2f_id/ortholog from gene annotation map.
For most cases, one read could be annotated with multiple protein ids, which usually shares the same s2f_id/ortholog.
Occasionally, one read could be annotated with multiple ortholog ids. In this case, only ortholog id with the highest frequency is kept. e.g.: read A00266:275:HLFTWDSXX:2:1101:10013:26537 mapped to s2fid_1 and s2fid_2 for 10 and 2 times, respectively. Only s2fid_1 was kept for read A00266:275:HLFTWDSXX:2:1101:10013:26537. Also, see 4.3.
Gene annotation table is a mapping table linking Protein id to s2f_id/ortholog, KO/GO id, gene symbol, description and species name.

Using -D or --genemap to specify protein_ko_go table file
e.g.: -D bird_protein_ko_go_species.txt

It contains 7 columns separated by '\t'. The first column is protein id, it must be the same protein id in protein database from 5. The second column is s2f_id/ortholog, followed by KO id and GO set ID corresponding to its protein ID, and then are the gene symbol and description, as well as species name.

gga:408082      s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Gallus gallus (chicken)

pcoc:116239860  s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Phasianus colchicus (Ring-necked pheasant)

mgp:100546485   s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Meleagris gallopavo (turkey)

cjo:107308043   s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Coturnix japonica (Japanese quail)

nmel:110408339  s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Numida meleagris (helmeted guineafowl)

apla:101790262  s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Anas platyrhynchos (mallard)

acyg:106043935  s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Anser cygnoides domesticus (swan goose)

tgu:100228255   s2f_7617        K04198  GO:0004930;GO:0004962;GO:0005886;GO:0007165;GO:0007186;GO:0008217;GO:0016020;GO:0016021;GO:0042310;GO:0048484;GO:0086100        EDNRB   endothelin receptor type B      Taeniopygia guttata (zebra finch)

...             ...     ...

8. Single sample profiling

It is known that it is very difficulty to obtain larger number of samples for lots of environmental species, and sometimes only one sample for each experimental group. In this case, it is very difficult to conduct comparative study without enough statistical power
To cope with this challenge, --profiling mode is implemented in Seq2Fun. Using this mode, 4 main levels output files KO abundance table, hit pathway table, hit species table and read KO table are generated. An additional html report summarizing and visualizing these tables is also generated.
A rarefaction curve is also generated in html report to show the sequence depth against number of KOs. This figure infers 1). the sequence depth is enough or not. It is useful when considering the difficulty to obtaining enough samples or hight-quality RNA for environmental species. 2). If you have too much reads and you want a faster processing, you can use --reads_to_process to specify the number of reads to process.
By default profiling mode is off, using --profiling to turn it on, it is ~20% slower than non-profiling mode and could consume a little bit more memory at the last stage of generating tables
e.g.: --profiling

see more details about the output files in 4.3 - 4.7.

9. MEM or GREEDY modes in amino acid sequences search in the database

Seq2Fun translates clean read into amino acid sequences with all six possible reading frames, followed by searching and comparing with reference protein sequences in database. Seq2Fun offers two searching modes of maximum exact matches (MEM) and GREEDY. MEM mode is a fast running mode and implemented as that translated amino acid sequences must be perfectly matched with reference protein sequences without any mismatches. Any mismatched amino acid sequence and its corresponding read will be discarded. It is designed for organisms who have reference sequence in the database. However, due to the evolutionary distance between the target organism and reference organisms as well as sequencing error, there could be mismatches between amino acid sequences translated from reads and reference protein sequences. Using MEM mode could result in the various number of unmapped reads and rapid sensitivity decay depending on the phylogenetic distance between organisms as well as the number of sequencing errors. Therefore, Greedy mode is introduced to allow mismatches between query and subject sequences to overcome the evolutionary divergences and sequencing errors. Greedy mode is designed for organisms who do not have reference sequences in database.
The default mode is Greedy mode. Or using -K or --mode to specify mode.
e.g.: -K greedy
The MEM mode can be turn on e.g.: -K mem

10. Minimum matching length of amino acid sequence with reference protein sequences

The minimum matching length determines cut off length of how long the amino acid sequence matches with reference protein sequences. It is used for both MEM and GREEDY modes. Our results in the tested datasets show that minlength generally has complicated impacts in Greedy mode with the hightest values of coefficient of determination R square obtained with minlength ranging from 19 to 28. Therefore, the default value of minlength is set to 19, which means that the matching amino acid length must be no less than 19 aa. However, it is should be carefully tested on different organisms especially when your organisms have large evolutionary distances with organisms in the database. Decreasing the minlength would help overcome the large sequence divergence between query and subject sequences. But it would tradeoff specificity, resulting in some false positive matches. Using -J or --minlength to set minimum matching length with defulat value 19
e.g.: -J 19

11. Allowed amino acid mismatches (Greedy mode only)

Seq2Fun employs GREEDY model to overcome evolutionary distance and sequencing error. It is implemented with setting the number of mismatches. The number of mismatches depends on the nature of protein itself, evolutionary distance and sequence error. Our results in the tested datasets show that minlength generally has negative impacts in Greedy mode. Therefore, the default value of minlength is set to 2, that means allowing two amino acid mismatches. However, it is should be carefully tested on different organisms especially when your organisms have large evolutionary distances with organisms in the database. Increasing the number of mismatches would help overcome the large sequence divergence between query and subject sequences. But it would tradeoff specificity, resulting in some false positive matches. As there are more than 20 amino acids, increasing the number of mismatches could result in exponential increasing searches in the database (also dependent on composition in the database) and would significantly slow down the search speed.This number of mismatches will translate into the affection of abundance values of KO in the output table.
Using -E or --mismatch to specify number of amino acid mismatches with default value of 2.
e.g.: -E 2

12. Minimum matching score (Greedy mode only)

Matching score based on blosum62 is used to assess how closely the translated amino acid sequence with mutations match with reference protein sequences. Just as number of the mismatched amino acids, it could trades off sensitivity and precision. Decreasing the value of the minimum matching score will increase sensitivity with a decreasing precision. It could have affect on the runtime as it servers as filter to filter out low blosum62 score AA fragments. Therefore, higher minimum matching score could results in a speed gain. Based on our assessment in three datasets, the minimum matching score has very limited affect on the overall gene abundance quantification. Thus, we set the default minimum matching score to 80.
Using -j or --minscore to specify minimum matching score
e.g.: -j 80

13. Keep all the translated amino acid fragments(not recommended)

After read translation into all possible amino acid fragments using the six reading frames, only the top longest fragments are select for homolog search. This filtering will greatly reduce the number of false translations if there is no true stop codon in that read, which in turn can increase the speed of Seq2Fun. This hard filtering will probably filter out some reads containing true stop codon, but it has very limited on the downstream analysis such as fold change in comparative studies.
We offer the option to keep all the fragments, but their length must be > minimum matching length.
Using --allFragments keep all the fragments
e.g.: --allFragments

14. Select codon table

Select the codon table (same as blastx in NCBI), we provide 33 codon tables from 'https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG31'. Be default is the Standard Code

Standard1

VertebrateMitochondria2

YeastMitochondrial3

MoldProtozoanCoelenterateMitochondrialMycoplasmaSpiroplasma4

InvertebrateMitochondrial5

CiliateDasycladaceanHexamitaNuclear6

EchinodermFlatwormMitochondrial9

EuplotidNuclear10

AlternativeYeastNuclear12

AscidianMitochondrial13

InvertebrateMitochondiral5

ChlorophyceanMitochondrial16

TrematodeMitochondrial21

ScenedesmusobliquusMitochondrial22

RhabdopleuridaeMitochondrial24

PachysolentannophilusNuclear26

KaryorelictNuclear27

MesodiniumNuclear29

PeritrichNuclear30

BlastocrithidiaNuclear31

CephalodiscidaeMitochondrial33

Using --codontable to specify the codon table; "Standard1" is the default
e.g.: --codontable VertebrateMitochondria2

15. Trim the fist 6 base of read

The fist 6 base of a read could have higher mismatch rates due to the random-hexamer priming, we advise to trim the fist 6 bases of a read
Using --trim_front1 and/or --trim_front2 for single or pair-end reads
e.g.: --trim_front1 6 --trim_front2 6

16. Custom built database

Seq2Fun fully supports customer built database. Seq2Fun needs two input files, FM index file, which is built from protein sequences and used for database; and protein annotation table, which is used for mapping protein ID to s2f_id/KO/GO/gene/species. To build your own database, you must supply a protein fasta file
e.g.: birds.fasta

>gga:419912

MEGNEGDPPGAHRRRDSHGRRPSREINAEDQVAQETEEVFRSYTFYRYQQEREEGGAEVPMDPEIMEIQQELGSTGSQVGRRLAIIGDDINKRYDAEFRHMLKSLQPTKENAYEYFTTIASSLFDSGINWGRVIALLGFGYCMAIHVYQQGITGFLRRIARYVTEFMLRNRIARWIAQQGGWVAALDLDNVYMKYMLVVLALVMVGHLVVRRFFRP

>gga:427365

MAGDGGVYRLRCSLLGHEQDVRGITRGLFPEGGFVSVSRDRTARLWTPDGLNRGFTEMHCMRGHSNFVSCVCIIPPSDLYPRGLIATGGNDHNICVFTVDNTAPLYVLKGHTNTVCSLSSGKFGTLLSGSWDTTCKVWLNDRCMMTLQGHTAAIWAVKILPEQGLMLTGSADKTIKLWKAGRCERTYAGHEDCVRGLAILSEMEFLSCSNDASVRRWHISGECLHVYYGHTNYIYSVSVFPHCKDFITTGEDRSLRIWKQGECVQTIRLPAQSVWCCCVLDNDDIVVGASDGIIRVFTKSLERTASAEEIQAFENELSQASIDPKTGDLGDINADDLPGKEHLKDPGMRDGQTRLIKDNGKVEAYQWSVSEGRWIKIGDVVGSSGGTQQTSGKVLFEGKEYDYVFTIDVNESGPSYKLPYNISDDPWLTAYNFLQKNDLNPMFLDQVAKFIIDNTKGQAPLNTNTEFSDPFTGAGRYVPGSSSLSNTAPAADPFTGMGAYQSAKAENIYFPKKDAVTFDQANPTQILGKLKELNGNAAEEQKLTEDDLIILEKLLSATCNTSAEMPTAQQIQTLWKAINWPEDIVFPALDILRLSVRHPIVNENFCGEKAHVQFIHLLLKFLNPQGKPANQLLALRALCNCFVSQAGQKLLMEQRDEIMTQAIEVKSGNNKNVHIALATLTLNYAVCLHKVNNIEGKAQCLSVISTVMEVVKDLEAVFRLLVALGTLISDDKNAVQLAKSLGVDSQIKKYVSVSEPAKVKECCRFVLNLL

...

Please be noted that the protein sequences must consist of the standarad 20 amino acid (ACDEFGHIKLMNPQRSTVWY) in the uppercase.

S2F_HOME/bin/mkbwt -n 8 -a ACDEFGHIKLMNPQRSTVWY -o brids_proteins birds.fasta;
S2F_HOME/bin/mkfmi brids_proteins;

It generate a fmi file named brids.fmi. See 5. Protein database for translated search.

To use the database, you must obtain the ortholog among your protein ids, we recommend to use OrthoFinder to identify ortholog groups for these phylogenetic groups. For example. s2f_0000000 represent one ortholog. Then prepare a mapping file containing protein ID and its corresponding s2f_id/ortholog, KO, GO, gene symbol and description, as well as species name, and the last column is the core gene frequency (N. of organisms in that orthogroup / total N of organisms in the species group (eg. birds database) separated by "\t". The core gene frequency will be used for additional mapping assessment.
Please be noted that the same ortholog groups could contain different KO, GO, gene symbol/description, only the feature with highest frequency will be kept. The number of protein id and id in protein sequences used for FM index construction and annotation map must be identical.

e.g.: birds_annotation.txt.
Please be noted that the protein ID in the first column must be unique and identical with protein ID in birds.fasta, and must be corresponding to s2f_ID, KO/GO, in the second column.

aam:106485833	s2f_0000000	U	GO:0003676;GO:0015074	GIN1	endogenous retrovirus group k member 18 pol protein-like	Apteryx mantelli mantelli (brown kiwi)	0.9394

phi:102109513   s2f_0000001     K01317  GO:0004252;GO:0006508;GO:0008233;GO:0008236;GO:0016787  CORIN   acrosin-like    Pseudopodoces humilis (Tibetan ground-tit)      0.9394

etl:114070275   s2f_0000004     K17854  GO:0004497;GO:0005506;GO:0016020;GO:0016021;GO:0016491;GO:0016705;GO:0016712;GO:0020037;GO:0046872      CYP2AC7 cytochrome p450 2j2-like        Empidonax traillii (willow flycatcher)  0.9394

tgu:116809310	s2f_0569608	U	U	U	fibroin heavy chain-like isoform x1	Taeniopygia guttata (zebra finch)	0.0303

tgu:115498598	s2f_0569609	U	U	U	uncharacterized protein loc115498598	Taeniopygia guttata (zebra finch)	0.0303

...             ...     ...

17. Automatically trimming polyA tail in reads

Reads with polyA tail could affect protein comparison, resulting in a high artificial matching score. Removing or trimming reads containing polyA tail could increase precision.
By default, trimming polyA tail is activated.
If you want to disable this function
using --no_trim_polyA

18. Minimum length for merging overlapped PE read pair (PE reads only)

The length of query amino acid sequences could have heavily impact the comparison with reference protein sequences. It generally believes that the longer of query sequences, the higher precision of protein annotation. Therefore, merging overlapped PE read into longer read could yield longer translated amino acid sequences, which could reduce number of ill mapped reads and boost annotation precision. The default value of minimum length for merging PE read pair is 30 bp if the read pair is overlapped.
Using -v or --overlap_len_require to specify miminum length of overlapped PE read pair
e.g: -v 30

19. Maximum cutoff length of translated peptides

Seq2Fun only keeps the longest AA fragment for identification of homologies in protein database, this will filter out lots of false fragments and accelerate Seq2Fun. In some cases, only keeping at most the longest fragments could filter out a proportion of fragments that originated from the merged PE reads if start and stop codons are present in the middle of the reads. Therefore, we recap the maximum cut-off length of peptide fragment to be 60 aa (by default, user can change this cut-off), which will prevent the filtering out of some true peptide fragments from the merged reads.
Using -m or --maxtranslength to specify maximum cutoff length of translated peptides.
Note: --maxtranslength must be larger than the --minlength (minimum matching length of amino acid sequence) in 7.
e.g: -m 60

20. Output mapped reads

For non-model organisms, it could be important to obtain the assembled genes/contigs of interest.
Seq2Fun offers a feature to capture and label each mapped clean read with KO/GO tag and output as fastq.gz files. The clean reads can be directly used to assembled into genes/contigs using various de novo assemblers such as Trinity, et al. Because of only a small proportion (~30%) of total reads are mapped and retrieved, the de novo assembly for each genes is very fast and usually can be finished within several minutes.
Various downstream analysis such as designing primers, comparable analysis, on these assembled genes/contigs can be applied, and this will greatly compensate the information loss by Seq2Fun compared with the conventional de novo assemblers.

Using --outputMappedCleanReads to enable this function.

Forward mapped clean reads file A1.CE2-S1-LT_mapped_R1.fastq.gz:
@A00266:275:HLFTWDSXX:2:1467:22182:12414 1:N:0:TTACCGAC+CGAATACG s2f_148
GCGTTTGATGAGGTTGATAAGCAATTTCATGTTCCCTAAATCCCGTTCATCAGAAGAAGCAGTGCACAATTTTTCATCGAAGCTGAACATATCAGGGTCAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00266:275:HLFTWDSXX:2:1354:31304:33943 1:N:0:TTACCGAC+CGAATACG s2f_3047
AGCTGCTTTGGCATCTTCTGGAGAGCTGAAGTCCACAAACCCAAACCCTTTAGATGATCCAGTGTCTCTGTCTGTGACTATTCTAGCACTTATAGAGCCTT
+
FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

21. General guideline for parameter settings

Several parameters drives the trade-offs between sensitivity and precision. The default parameter setting is obtained from the parameter sensitivity test on four organisms: mouse, chicken, zebrafish and roundworm. If your studied species has no or limited number of close-related reference species in the database. You can tune parameters to obtain a better balance between sensitivity and precision.
For example: 1) increasing the number of mismatches (--mismatch) from default 2 to 3 or 4; 2) or decreasing the minimum matching length (--minlength) from default 19; 3) or decreasing the minimum BLOSUM62 score (--minscore) from default 80; 4) or decreasing the maximum length cutoff of the translated amino acid sequences for overlapped paired-end reads from default 60; will increase the mapping reads or the mapping chance for the highly divergent homologs.
It is worth mentioning that this parameter adjustment could slow down the Seq2Fun and/or increase the false positives with various degrees.

22. Submit s2f id abundance to ExpressAnalyst for downstream analysis (optional)

After obtaining s2f_id/ortholog abundance table, we provide an online website-based analysis tool to conduct downstream analysis such as identification of differentially expressed genes/ortholog, pathway enrichment and GO analysis, et al.
Submit s2f id abundance and annotation tables to our website-based tool ExpressAnalyst (optional)
S2fid_abundance_table_all_samples_submit_2_expressanalyst.txt will be used to submit to ExpressAnalyst.

23. SeqTract full usage options

SeqTract is used to extract mapped reads from output mapped fastq.gz files from Seq2Fun for target gene assemble, which is useful for downstream analysis such as designing primers, phylogenetic analysis, et al. SeqTract can extract reads annotated with s2f_id, but it is recommended to use only ko id for the extraction.
Use seqtract or seqtract --help to show the full usage options

  options:

  --sampleTable     input table for fastq files with annotation from Seq2Funs output, eg: one column for SE samples or two columns (separated by '\t') of samples for PE sample

  --geneTable       input table of target genes for sequences extraction, eg: one columns of s2f_ids

  --outputDir       output folder for extract fastq files, default is current working dir (string [=.])

  --undetermined    the file name of undetermined data, default is Undetermined (string [=Undetermined])

  --compression     compression level for gzip output (1 - 9), default is 6 (int [=6])

  -w, --thread      worker thread number, default is 2 (int [=2])

23.1. A typical seqtract run

S2F_HOME/bin/seqtract --sampleTable sample.txt --geneTable gene.txt --outputDir output -w 8 -V

23.2. --sampleTable sample.txt consisting of one column of forward read files for SE reads or two columns of forward and reverse read files (must be separated by '\t') for PE reads of

A1.CE2-S1-LT_mapped_R1.fastq.gz A1.CE2-S1-LT_mapped_R2.fastq.gz

A2.CE2-M4-LT_mapped_R1.fastq.gz A2.CE2-M4-LT_mapped_R2.fastq.gz

B1.CE2-S2-LT_mapped_R1.fastq.gz	B1.CE2-S2-LT_mapped_R2.fastq.gz

B2.CE2-M5-LT_mapped_R1.fastq.gz	B2.CE2-M5-LT_mapped_R2.fastq.gz

C1.CE2-S3-LT_mapped_R1.fastq.gz	C1.CE2-S3-LT_mapped_R2.fastq.gz

D1.CE2-S4-LT_mapped_R1.fastq.gz	D1.CE2-S4-LT_mapped_R2.fastq.gz

D2.CE2-H2-LT_mapped_R1.fastq.gz	D2.CE2-H2-LT_mapped_R2.fastq.gz

E1.CE2-S5-LT_mapped_R1.fastq.gz	E1.CE2-S5-LT_mapped_R2.fastq.gz

E2.CE2-H3-LT_mapped_R1.fastq.gz	E2.CE2-H3-LT_mapped_R2.fastq.gz

F1.CE2-M1-LT_mapped_R1.fastq.gz	F1.CE2-M1-LT_mapped_R2.fastq.gz

F2.CE2-H4-LT_mapped_R1.fastq.gz	F2.CE2-H4-LT_mapped_R2.fastq.gz

G1.CE2-M2-LT_mapped_R1.fastq.gz	G1.CE2-M2-LT_mapped_R2.fastq.gz

G2.CE2-H5-LT_mapped_R1.fastq.gz	G2.CE2-H5-LT_mapped_R2.fastq.gz

H1.CE2-M3-LT_mapped_R1.fastq.gz	H1.CE2-M3-LT_mapped_R2.fastq.gz

...                             ...

23.3. --geneTable gene.txt consisting of one column of annotated feature s2f_id

s2f_10

s2f_100 

s2f_1000

...

23.4. --outputDir output is specified for the extracted fastq.gz files

s2f_10_R1.fastq.gz

s2f_10_R2.fastq.gz

s2f_100_R1.fastq.gz

s2f_100_R2.fastq.gz

s2f_1000_R1.fastq.gz

s2f_1000_R2.fastq.gz

...

24. Target gene de novo assemble

You can use transcriptome assemblers such as SOAPdenovo-Trans to conduct target gene assemble.

Table of Contents