When it comes to genomics, contemporary bioinformatics follows a simple dogma: assemble, detect, annotate. This over-simplification, however, glosses over many key features, such as insertions and deletions (INDELs), which may in fact be pathogenic[1].
High-throughput NGS short-read data, and the algorithms developed to support them, have overwhelmingly focused on the detection of SNPs, treating structural variants as an afterthought. This has largely been due to the combined limitations of chemistry and computation. Popular sequencers (I'm looking at you, Illumina) produce strings with character lengths of several hundred bases. Matching millions of string-sets across highly repetitive regions, where a single character or pattern of characters repeats over and over, is a nightmare from an algorithm-design perspective.
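To see why repeats defeat short reads, consider a toy sketch (the repeat unit, lengths, and sequences below are entirely made up for illustration): any read drawn from inside a tandem repeat matches equally well at every copy of the unit, so an aligner has no principled way to place it.

# Toy illustration of alignment ambiguity in a repetitive region.
# The reference and read below are synthetic, not real genomic data.
ref  = "ACGT" + "TTAGGG" * 20 + "CGTA"    # telomere-like tandem repeat
read = "TTAGGG" * 3                        # short read from inside the repeat

# Enumerate every position where the read matches the reference exactly.
hits = [i for i in range(len(ref) - len(read) + 1)
        if ref[i:i + len(read)] == read]

print(f"{len(hits)} equally good placements: {hits}")
# Every placement has an identical alignment score, so mapping quality
# collapses toward zero and any INDEL evidence in the region is ambiguous.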
Because scientists and engineers are a lazy bunch with deep-rooted personality flaws, we've brushed the problem of sequence alignment in repetitive and low-complexity regions under the rug. Sure, most aligners and callers have some implementation of INDEL realignment and detection (say, Picard or HaplotypeCaller in GATK), but if you've ever seen the results you know they don't hold up well to scrutiny. Against this backdrop, the work of Kees Albers stands out, and anyone who has written a validated pipeline will attest to the results of Dindel's four-step process[2]:
# Stage 1: extract candidate INDELs (and the insert-size library file) from the BAM
$ dindel --analysis getCIGARindels --bamFile sample.bam \
    --outputFile sample.dindel_output --ref ref.fa

# Stage 2: group candidate variants into realignment windows
$ python makeWindows.py \
    --inputVarFile sample.dindel_output.variants.txt \
    --windowFilePrefix sample.realign_windows \
    --numWindowsPerFile 1000

# Stage 3: realign and genotype each window file; the runs are
# independent, so they can be farmed out in parallel
$ for winfile in sample.realign_windows.*.txt; do
    n=${winfile%.txt}; n=${n##*.}
    dindel --analysis indels --doDiploid \
        --bamFile sample.bam --ref ref.fa \
        --varFile "$winfile" \
        --libFile sample.dindel_output.libraries.txt \
        --outputFile sample.dindel_stage2_windows."$n"
  done

# Stage 4: merge the per-window results into a single VCF
$ ls sample.dindel_stage2_windows.*.glf.txt > sample.dindel_stage2_out.txt
$ python mergeOutputDiploid.py \
    --inputFiles sample.dindel_stage2_out.txt \
    --outputFile variantCalls.VCF --ref ref.fa
The process outlined above is quite compute-intensive, and as can be seen in Figure 2, the number of candidate haplotypes needed to resolve a single potential INDEL can run into the triple digits; each of these has to be aligned, then selected by statistical relevance. Guillermo del Angel gives a great outline for those who use GATK (we find the results from Samtools more agreeable) on how to utilize Dindel; however, even this approach has a fairly significant false-positive rate[3].
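To make that haplotype blow-up concrete, here is a toy enumeration. This is NOT Dindel's actual model (Dindel scores candidates with a probabilistic realignment framework); the window, the candidate sites, and the three-way event choice are hypothetical, purely to show the combinatorics.

# Toy model of candidate-haplotype explosion around a putative INDEL.
from itertools import product

window = "ACGTAGGATCCATGCAAGCTTGA"     # hypothetical reference window
sites  = [6, 9, 12, 15]                # candidate INDEL positions (made up)
edits  = ["ref", "del", "insT"]        # keep base, delete it, or insert a T

haplotypes = set()
for combo in product(edits, repeat=len(sites)):
    hap = list(window)
    # Apply edits right-to-left so earlier indices stay valid.
    for pos, op in sorted(zip(sites, combo), reverse=True):
        if op == "del":
            del hap[pos]
        elif op == "insT":
            hap.insert(pos, "T")
    haplotypes.add("".join(hap))

print(f"{len(haplotypes)} distinct candidate haplotypes from "
      f"{len(sites)} sites x {len(edits)} events each")
# Every one of these haplotypes must be realigned against every read
# and ranked, which is where the compute cost comes from.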
An algorithm, regardless of how well designed, can't overcome bad input data, and short reads are simply not good starting material for resolving structural variants. Luckily, several companies have taken on the challenge of developing long-read technologies, where string lengths can reach tens of thousands of characters. We've been lucky enough to work with Pacific Biosciences, using data generated with their SMRT hardware and P5-C3 chemistry. Reads in suspected SV regions were extracted, the long reads were used to span the low-complexity/highly repetitive regions, and the short reads were reassembled using Bayesian and HMM methods available in SeqAn and Boost.
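The core of the hybrid idea can be sketched in a few lines: treat a long read as a backbone across the repeat, and place short reads by k-mers that are unique in that backbone (k-mers from inside the repeat occur many times and anchor nothing). To be clear, this toy k-mer anchoring is a stand-in for illustration, not our actual SeqAn/Boost pipeline, and all sequences below are synthetic.

# Sketch: anchor short reads to a long-read backbone via unique k-mers.
from collections import defaultdict

K = 11

def kmer_index(seq, k=K):
    """Map each k-mer to the positions where it occurs in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def anchor_read(read, index, k=K):
    """Place a read by its first k-mer that is unique in the backbone."""
    for offset in range(len(read) - k + 1):
        positions = index.get(read[offset:offset + k], [])
        if len(positions) == 1:            # unique => unambiguous anchor
            return positions[0] - offset   # implied start of the read
    return None                            # read lies entirely in the repeat

# A long read spans the repeat plus unique flanks on both sides (toy data).
backbone = "GATTACACAGT" + "TTAGGG" * 15 + "CCATGGTTGCA"
index = kmer_index(backbone)

reads = [
    "ACACAGT" + "TTAGGG" * 2,   # left flank + repeat: anchors uniquely
    "TTAGGG" * 3,               # pure repeat: cannot be anchored
]
for read in reads:
    print(read[:20], "->", anchor_read(read, index))

Only a read that overlaps the unique flanking sequence gets a placement; reads living entirely inside the repeat come back unanchored, which is exactly the gap the spanning long read fills.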
At this point I should mention that while Boost is absolutely fabulous in terms of providing great tools for C++ development, which is a great way to handle compute-intensive tasks, it is without a doubt pull-your-hair-out frustrating to build and use. Moreover, while our methods have significantly decreased the number of total SVs in our results, we haven't had the chance to do a rigorous comparison of false-positive rates; I think a good way to do that would be similar to what Heng Li has shown recently[4].