Using FLASH to merge paired-end reads

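This analysis module wraps FLASH (Fast Length Adjustment of SHort reads). FLASH uses the overlap between the two reads of a pair to merge them into a single sequence.

Note: paired-end reads can be merged only when the original DNA fragment is shorter than twice the read length, that is, when the two reads of a pair read through into each other and overlap.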
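
The condition in the note above can be checked with a little arithmetic. The following is a minimal sketch, not part of the module: the function names are hypothetical, and the default `min_overlap=10` is assumed to match FLASH's default minimum overlap.

```python
def expected_overlap(fragment_len: int, read_len: int) -> int:
    """Overlap (in bp) between two reads of read_len taken from both ends
    of a fragment of fragment_len. Zero or negative means no overlap."""
    return 2 * read_len - fragment_len


def reads_overlap_enough(fragment_len: int, read_len: int, min_overlap: int = 10) -> bool:
    """True if the expected overlap reaches min_overlap (assumed default: 10 bp)."""
    return expected_overlap(fragment_len, read_len) >= min_overlap


# A 180 bp fragment sequenced with 2 x 100 bp reads overlaps by 20 bp.
print(expected_overlap(180, 100))
# A 250 bp fragment sequenced with 2 x 100 bp reads leaves a gap instead.
print(reads_overlap_enough(250, 100))
```

This sketch only covers the case described in the note (fragment shorter than twice the read length); reads longer than the fragment are discussed separately further down.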


Input:

Two files:

1. Raw sequencing data in FASTQ format (R1 / first of pair).

2. Raw sequencing data in FASTQ format (R2 / second of pair).


Output:

The merged sequences, in FASTQ format.


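Notes:

A. Conventional libraries (e.g. transcriptome or genome).

Suppose a DNA library has a mean fragment length of m = 180 bp with a standard deviation of sd = 18 bp, and the read length is r = 100 bp. The recommended Maximum overlap setting is then: 2*r - m + 2.5*sd = 2*100 - 180 + 2.5*18 = 65 bp

B. Amplicon sequencing libraries.

Suppose the amplified 16S region is V4 + V5, the mean amplicon length is m = 400 bp with a standard deviation of sd = 20 bp, and the MiSeq read length is r = 250 bp. The recommended Maximum overlap setting is then: 2*r - m + 2.5*sd = 2*250 - 400 + 2.5*20 = 150 bp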
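
The rule of thumb used in A and B can be written as a small helper. This is a sketch only: the function name and the `sd_factor` parameter are illustrative and not part of the module. It reproduces the two worked examples above.

```python
def suggested_max_overlap(read_len: float, fragment_mean: float,
                          fragment_sd: float, sd_factor: float = 2.5) -> int:
    """Suggested Maximum overlap: 2*r - m + 2.5*sd, rounded to whole bases."""
    return int(round(2 * read_len - fragment_mean + sd_factor * fragment_sd))


# Example A: conventional library, m = 180 bp, sd = 18 bp, r = 100 bp
print(suggested_max_overlap(100, 180, 18))   # 65
# Example B: 16S V4 + V5 amplicon, m = 400 bp, sd = 20 bp, r = 250 bp
print(suggested_max_overlap(250, 400, 20))   # 150
```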


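Important:

In most cases the read length is shorter than the amplicon length, as shown below:

The two reads can then be merged through the overlap between their tails.

If the read length is longer than the amplicon length, as shown below:

the tails of the two reads do not overlap each other, and the reads cannot be merged.

In this situation, truncate the tails of both reads during raw-data trimming and filtering, so that the read length becomes shorter than the amplicon length.

In the "Trimmomatic PE (paired-end)" analysis module, set up a CROP step to cut the reads down to the required length.

Once both reads have been truncated, as shown below, they can be merged through the overlap between their tails.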
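
As a rough guide to choosing the CROP length, the sketch below computes the range of read lengths that are shorter than the amplicon yet still long enough to overlap. It is an illustration under stated assumptions, not the module's own logic: the function names are hypothetical, a single amplicon length is assumed (real amplicons vary, so leave a margin), and `min_overlap=10` is assumed to match FLASH's default minimum overlap. Trimmomatic's CROP step is written as `CROP:<length>`.

```python
import math


def crop_length_range(amplicon_len: int, min_overlap: int = 10) -> tuple:
    """Return (shortest, longest) usable read length after cropping.

    longest:  reads must stay shorter than the amplicon.
    shortest: 2 * read_len - amplicon_len must still reach min_overlap.
    """
    shortest = math.ceil((amplicon_len + min_overlap) / 2)
    longest = amplicon_len - 1
    return shortest, longest


def trimmomatic_crop_step(length: int) -> str:
    """Format the Trimmomatic step that cuts reads to the given length."""
    return f"CROP:{length}"


# Hypothetical case: 250 bp amplicon sequenced with 300 bp reads.
shortest, longest = crop_length_range(250)
print(shortest, longest)              # 130 249
print(trimmomatic_crop_step(200))     # CROP:200
```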


About FLASH:

FLASH (Fast Length Adjustment of SHort reads) is an accurate and fast tool to merge paired-end reads that were generated from DNA fragments whose lengths are shorter than twice the length of reads. Merged read pairs result in unpaired longer reads, which are generally more desired in genome assembly and genome analysis processes. Briefly, the FLASH algorithm considers all possible overlaps at or above a minimum length between the reads in a pair and chooses the overlap that results in the lowest mismatch density (proportion of mismatched bases in the overlapped region).  Ties between multiple overlaps are broken by considering quality scores at mismatch sites.  When building the merged sequence, FLASH computes a consensus sequence in the overlapped region. More details can be found in the original publication.
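
The overlap selection described above can be illustrated with a simplified sketch. This is not FLASH's actual implementation: it ignores quality scores (so ties are not broken the way FLASH breaks them, and no quality-aware consensus is built), it treats N like any other base, and all function names are hypothetical. The defaults `min_overlap=10` and `max_mismatch_density=0.25` are assumed to mirror FLASH's defaults.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (uppercase A/C/G/T/N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def best_overlap(read1: str, read2_rc: str, min_overlap: int = 10,
                 max_mismatch_density: float = 0.25):
    """Try every overlap of at least min_overlap between the tail of read1
    and the head of read2_rc; return (overlap_len, mismatch_density) for the
    lowest density, or None if no overlap is acceptable."""
    best = None
    for overlap in range(min_overlap, min(len(read1), len(read2_rc)) + 1):
        tail = read1[-overlap:]
        head = read2_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        density = mismatches / overlap
        if density > max_mismatch_density:
            continue
        # Simplification: on ties the first (shortest) overlap is kept.
        if best is None or density < best[1]:
            best = (overlap, density)
    return best


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_density: float = 0.25):
    """Merge a read pair into one sequence, or return None if no overlap fits.
    Simplification: bases in the overlapped region are taken from read1."""
    read2_rc = reverse_complement(read2)
    best = best_overlap(read1, read2_rc, min_overlap, max_mismatch_density)
    if best is None:
        return None
    overlap, _density = best
    return read1 + read2_rc[overlap:]


# A 30 bp fragment read with 2 x 20 bp reads overlaps by 10 bp.
fragment = "ACGTTGCAAGCTTAGGCTAACCGGTTAAGC"
r1 = fragment[:20]
r2 = reverse_complement(fragment[10:])
print(merge_pair(r1, r2) == fragment)   # True
```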

Limitations of FLASH include:

- FLASH cannot merge paired-end reads that do not overlap.

- FLASH is not designed for data that has a significant amount of indel errors (such as Sanger sequencing data). It is best suited for Illumina data.

This analysis module uses FLASH v1.2.11 (https://ccb.jhu.edu/software/FLASH/).
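
How the module invokes FLASH is not documented here. For readers running FLASH directly, the sketch below assembles a command line. It assumes a `flash` executable on the PATH and uses the `-m` (minimum overlap), `-M` (maximum overlap), `-o` (output prefix) and `-d` (output directory) options. The helper name and file names are hypothetical, and the command is only built, not executed.

```python
def build_flash_command(r1_fastq: str, r2_fastq: str, max_overlap: int,
                        min_overlap: int = 10, output_prefix: str = "out",
                        output_dir: str = ".") -> list:
    """Build the argument list for a FLASH run (suitable for subprocess.run)."""
    return [
        "flash",
        "-m", str(min_overlap),    # minimum overlap required to merge a pair
        "-M", str(max_overlap),    # maximum overlap expected for most pairs
        "-o", output_prefix,       # prefix of the output file names
        "-d", output_dir,          # directory for the output files
        r1_fastq,
        r2_fastq,
    ]


# Amplicon example from the notes: Maximum overlap of 150 bp.
cmd = build_flash_command("sample_R1.fastq", "sample_R2.fastq", max_overlap=150)
print(" ".join(cmd))
```

With these options, FLASH writes the merged reads to a file named after the prefix (here `out.extendedFrags.fastq`).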


Reference:

FLASH: Fast length adjustment of short reads to improve genome assemblies. T. Magoc and S. Salzberg. Bioinformatics 27:21 (2011), 2957-63.