This analysis module wraps Trimmomatic, a read-trimming tool for Illumina high-throughput sequencing data. It supports both paired-end and single-end data.

Trimmomatic provides the following steps:

- ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read
- SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold
- MINLEN: Drop the read if it is below a specified length
- LEADING: Cut bases off the start of a read, if below a threshold quality
- TRAILING: Cut bases off the end of a read, if below a threshold quality
- CROP: Cut the read to a specified length
- HEADCROP: Cut the specified number of bases from the start of the read

Input:

For single-end data, supply a single FASTQ file.
For paired-end data, supply two FASTQ files (R1 and R2).
Set the quality encoding parameter: "Illumina 1.3-1.7 Phred+64" corresponds to early Illumina platforms, and "Illumina 1.8+ Phred+33" corresponds to the latest Illumina platforms. The default is Illumina 1.8+ Phred+33.

Output:

For single-end data, the output is a single FASTQ file containing the trimmed and filtered clean data.
For paired-end data, the output is four files:

- Two FASTQ files (R1-paired and R2-paired), containing the reads for which both mates (R1 and R2) passed quality control.
- Two additional FASTQ files (R1-unpaired and R2-unpaired), containing the reads for which only one mate (R1 or R2) passed quality control while the other failed, so that only the surviving mate is kept.

Appendix:

For routine RNA or DNA sequencing on the HiSeq4000 or HiSeqXTen platform, PE100 or PE150, the following settings are recommended:

Perform initial ILLUMINACLIP step: Yes
  Maximum mismatch count which will still allow a full match to be performed: 2
  How accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment: 30
  How accurate the match between any adapter etc. sequence must be against a read: 10
Perform Sliding window trimming (SLIDINGWINDOW): Yes
  Number of bases to average across: 20
  Average quality required: 20
Drop reads below a specified length (MINLEN): Yes
  Minimum length of reads to be kept: 35
Cut bases off the end of a read, if below a threshold quality (TRAILING): Yes
  Minimum quality required to keep a base: 20

That is: remove adapter contamination, allowing at most 2 mismatches in the alignment, with a match threshold of 30 in palindrome mode and 10 in simple mode. Trim bases with quality below 20 from the end of each read. Use a 20 bp window and, if the average quality within the window falls below 20, cut the read from the start of that window onward. Finally, discard reads shorter than 35 bp after trimming.

For amplicon sequencing, MiSeq PE 250, the following settings are recommended:

Perform Sliding window trimming (SLIDINGWINDOW): Yes
  Number of bases to average across: 50
  Average quality required: 20
Drop reads below a specified length (MINLEN): Yes
  Minimum length of reads to be kept: 50
Cut bases off the end of a read, if below a threshold quality (TRAILING): Yes
  Minimum quality required to keep a base: 20

That is: trim bases with quality below 20 from the end of each read. Use a 50 bp window and, if the average quality within the window falls below 20, cut the read from the start of that window onward. Finally, discard reads shorter than 50 bp after trimming.

This analysis module uses Trimmomatic v0.32 ( http://www.usadellab.org/cms/index.php?page=trimmomatic ).

Reference:

Bolger, A.M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
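
To illustrate how the recommended settings in the appendix map onto Trimmomatic's command-line step syntax, here is a minimal Python sketch. It only builds the argument list and does not run anything. The function names, the adapter file name `adapters.fa`, the jar path, and the output file names are hypothetical placeholders, not values taken from this module. The step order (TRAILING, then SLIDINGWINDOW, then MINLEN) follows the prose summary in the appendix and is an assumption; Trimmomatic applies steps in the order they appear on the command line.

```python
def build_steps(adapter_fasta=None, seed_mismatches=2, palindrome_threshold=30,
                simple_threshold=10, trailing=20, window=20, window_quality=20,
                minlen=35):
    """Return Trimmomatic step strings; defaults mirror the HiSeq recommendation."""
    steps = []
    if adapter_fasta is not None:
        # ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome threshold>:<simple threshold>
        steps.append(
            f"ILLUMINACLIP:{adapter_fasta}:{seed_mismatches}:"
            f"{palindrome_threshold}:{simple_threshold}"
        )
    steps.append(f"TRAILING:{trailing}")                      # trim low-quality 3' bases
    steps.append(f"SLIDINGWINDOW:{window}:{window_quality}")  # window size : required mean quality
    steps.append(f"MINLEN:{minlen}")                          # drop reads that end up too short
    return steps


def build_pe_command(jar, r1, r2, prefix, steps, phred="-phred33"):
    """Assemble a paired-end command line (output file names are placeholders)."""
    # Trimmomatic PE expects: paired R1, unpaired R1, paired R2, unpaired R2.
    outputs = [
        f"{prefix}_R1_paired.fastq",
        f"{prefix}_R1_unpaired.fastq",
        f"{prefix}_R2_paired.fastq",
        f"{prefix}_R2_unpaired.fastq",
    ]
    return ["java", "-jar", jar, "PE", phred, r1, r2, *outputs, *steps]


# HiSeq PE100/PE150 recommendation (adapter file name is an assumption).
hiseq_steps = build_steps(adapter_fasta="adapters.fa")

# MiSeq PE 250 amplicon recommendation: no ILLUMINACLIP, 50 bp window, MINLEN 50.
amplicon_steps = build_steps(adapter_fasta=None, window=50, minlen=50)

command = build_pe_command("trimmomatic-0.32.jar", "sample_R1.fastq",
                           "sample_R2.fastq", "sample", hiseq_steps)
print(" ".join(command))
```

For Phred+64 input, passing `phred="-phred64"` would select the older encoding instead of the default.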