Hybrid Assemblies From Short and Long-Reads for INDEL Resolution, Finding The Missing Puzzle Piece

When it comes to genomics, contemporary bioinformatics follows the dogma of, assemble, detect, and annotate. This over-simplification however, washes over many key features, such as insertions and deletions, which may in fact be pathogenic[1].

Screen Shot 2014-05-22 at 12.31.47 PM

Fig 1: A Low Complexity Region, featuring Homopolymers, leading to Ambiguous Mapping

High throughput, NGS short read data, and the algorithms developed to support them have overwhelmingly focused on detection of SNPs, while processing Structural Variants as an afterthought. This has largely been due to the combined limitations of chemistry and computation. Popular sequencers (I’m looking at you Illumina) produce strings with character lengths of several hundred bases. Matching millions of string-sets across highly-repetitive regions, where a single character or pattern of characters repeats over and over, is a nightmare from a algorithm design perspective.

Using Haplotype Calculations with Dindel, SeqAn, and Boost to Resolve INDELs

Fig 2: Using Haplotype Calculations with Dindel, SeqAn, and Boost C++ Libraries to Resolve INDELs

Because scientists and engineers are a lazy bunch with deep-rooted personality flaws, we’ve brushed the problem of sequence alignment in repetitive and low complexity regions under the rug. Sure, most aligners and callers will have some implementation for INDEL realignment, and detection (say Picard or Haplotype Caller in GATK), but if you’ve ever seen the results you know they don’t hold up well to scrutiny. Amongst this, the work of Kees Albers stands out, and anyone who has written a validated pipeline will attest to the results by Dindel’s four step process[2]:

$dindel --analysis getCIGARindels --bamFile sample.bam
        --outputFile sample.dindel_output --ref ref.fa

$python makeWindows.py 
        --inputVarFile sample.dindel_output.variants.txt
        --windowFilePrefix sample.realign_windows
        --numWindowsPerFile 1000

$while count < total.realign_windows:
       dindel --analysis indels --doDiploid 
              --bamFile sample.bam --ref ref.fa
              --varFile sample.realign_windows.2.txt
              --libFile sample.output.libraries.txt
              --outputFile sample.stage2_windows.2
       count +1

        --inputFiles sample.dindel_stage2_out.txt
        --outputFile variantCalls.VCF --ref ref.fa

The process outlined above is quite compute intensive, and as can be seen in Figure 2, the number of possible haplotypes to resolve a single potential INDEL can range in the triple digits; each of which has to be aligned, and selected for by statistical relevance. Guillermo del Angel gives a great outline for those who use GATK (we find the results from Samtools more agreeable) to utilize Dindel, however even this has a fairly significant false positive rate[3].

Fig 3: GATK with Dindel Still has High FPR

Fig 3: GATK with Dindel Still has High FPR

An algorithm, regardless of how well designed, can’t overcome bad input data. And short reads are simply not good starting material for resolving structural variants. Luckily, several companies have taken on the challenge of developing long-read technologies, where the string lengths can reach tens of thousands of characters. We’ve been lucky enough to work with Pacific Biosciences, using data generated with their SMRT hardware and P5-C3 chemistry. Reads in suspected SV regions were extracted, the long reads were used to span the low-complexity/highly repetitive regions, and short reads reassembled using bayesian and HMM methods available in SeqAn and Boost.

Fig 4: Resolved Structural Variant

Fig 4: Resolved Structural Variant

At this point I should mention that while Boost is absolutely fabulous in-terms of providing great tools for C++ development, which is a great way to handle compute intensive tasks, it is without a doubt: pull-your-hair-out-frustraiting to build and use. Moreover, while our methods have significantly decreased the number of total SVs in our results, we haven’t had the chance to do a rigorous comparison for FPR; but I think a good way to do that would be similar to what Heng Li has shown recently[4].

Leave a comment

Filed under Genomics

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s