Anomaly Detection In The Human Genome

Discovering a genomic variation within a single individual that is also the underlying factor in a previously undiagnosed pathology can be thought of as an anomaly detection problem: colloquially, finding the needle in a haystack.

Multi-pass exome filtering is illustrated above

The NCBI’s human reference genome provides the largest filter, enabling identification of the initial variant set. Next, alternate loci patches to the primary build of the human reference genome, which account for large regions of known variability, reduce the number of variants, though the set remains too large for efficient annotation. An additional pass taps into SNP databases: the NCBI’s dbSNP provides a large set of known SNP locations, while the National Cancer Institute maintains a large curated database of SNPs placed within three categories: confirmed, validated, and candidate SNPs.

Shown in the figure above are three exomes which, after comparison with the primary human reference build, contain large variant sets. These are then passed through the alternate loci and, finally, the SNP filters. The end result is the discovery of novel variants, which may be responsible for idiopathic indications.
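
The filtering cascade described above can be sketched as successive set differences. A minimal Python sketch with entirely placeholder variant strings and database contents (a real pipeline would operate on VCF records against dbSNP and the alt-loci patches, not bare position strings):

```python
# Sketch of multi-pass exome filtering as set differences.
# All variant identifiers below are made up for illustration.
exome_variants = {"chr7:1002:A>G", "chr7:2450:C>T", "chrX:881:G>A", "chr2:77:T>C"}

alt_loci_variants = {"chr7:2450:C>T"}          # explained by alternate loci patches
known_snps = {"chrX:881:G>A", "chr2:77:T>C"}   # dbSNP / NCI curated SNPs

# Pass 1 already happened: exome_variants = calls against the primary build.
# Pass 2: drop variants accounted for by the alternate loci.
after_alt_loci = exome_variants - alt_loci_variants
# Pass 3: drop known SNPs, leaving candidate novel variants.
novel = after_alt_loci - known_snps

print(sorted(novel))  # the remaining needles in the haystack
```

Each pass shrinks the haystack, so only the variants explained by no existing resource survive to annotation.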

Filed under Genomics

Common Visualization Methods in Genomics

Because the genome contains such a wealth of information, and in a language we don’t quite understand, it is of the utmost importance to organize it in meaningful ways. As there is no precedent for this syntax, recognizing and documenting adequate patterns is key. Currently there are five conventional methods of visualizing genomic data.

a) Tracks: The bane of academics and government researchers everywhere: sequences represented as rows. Track browsers can display multiple dimensions of the same set of data, as well as comparative sets. The UCSC Genome Browser is the standard example of this method. However, its usefulness is often limited to close-up views of specific targets.

Whole-genome exploration and comparison within these browsers tend to be tedious and unfruitful.

b) Heat Maps: These are likely what the layperson imagines when they think of genomic visualization: a rectangle of multi-colored blocks in rows and columns. This method is particularly prevalent for microarray data. IGV, the Integrative Genomics Viewer, can generate heat maps quite well, as can general-purpose tools such as R and gnuplot.

Heat maps have the advantage of providing a larger-scale, correlative representation of two-dimensional data, while still holding on to some of the qualities of track browsing.

c) Circular Genome Maps: A method that attempts to combine the robustness of track maps in a less overwhelming form. Here, strips of data are aligned in concentric circles, making it easier to see possible correlations. Circular maps have recently proven especially helpful in showcasing drug/gene interactions. Genome Projector by Arakawa et al. can generate great circular genome maps, among other formats.

Circular genome maps have become quite popular of late, as they allow whole-genome visualization in a single snapshot. Moreover, multiple dimensions can be represented as concentric circles.

d) DNA Walks: A relatively new method that represents genomic data as a path in a two-dimensional plane, where each letter (A, T, G, C) denotes a direction (up, down, left, right). This method is helpful in generating a single, unique image representative of the sequence in question, and is particularly adept at showcasing small changes in structural content, e.g., GC-rich regions or poly-A tails.

In a DNA walk each base is assigned a direction, e.g., A up, T down, G left, C right.
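
As a toy illustration of the rule in the caption, here is a minimal Python sketch (the function name `dna_walk` and the direction table are just illustrative choices):

```python
# A minimal DNA walk: each base steps the walk one unit in its assigned
# direction (A up, T down, G left, C right, as in the caption above).
STEPS = {"A": (0, 1), "T": (0, -1), "G": (-1, 0), "C": (1, 0)}

def dna_walk(seq):
    """Return the list of (x, y) points visited, starting at the origin."""
    x, y = 0, 0
    path = [(0, 0)]
    for base in seq.upper():
        dx, dy = STEPS.get(base, (0, 0))  # skip ambiguous bases like N
        x, y = x + dx, y + dy
        path.append((x, y))
    return path
```

A poly-A tail drifts straight upward, while a GC-rich region oscillates horizontally around its starting point, which is exactly why small structural changes stand out in these images.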

e) Network Maps: Originally used to help understand computer networks, this method has quickly proven valuable to systems biology. When applied to the genome, once-obscure pathways of interaction move to the foreground, as do inherent divisions of function within the genome. The go-to software at the moment is Cytoscape, although Ayasdi is showing itself to be a wonderful competitor.

One drawback to network-mapping genomes is the likely requirement of annotation. That is, functions must be somewhat defined; raw genomic data is unlikely to be mapped in a meaningful manner with network maps in their current state.

Life-science researchers, clinicians, and software engineers all stand to benefit from, and are all required for, humanity getting a useful grasp on this powerful language. These visualization techniques are, as mentioned, tried and tested. We must learn from them, evolve them, and iterate them into new tools that strike a balance between imposing our own will on the data and showcasing its inherent, underlying structures.


Previous post “Chaos Game Analysis of Genomes” & work by GeneDrop.

Filed under Genomics

What Are You Waiting For- A Certain Shade of Green? Core Science & Tech Development

Solving difficult scientific or engineering problems has proven to be the greatest driver of long-term growth and development. However, support for fundamental technological development has come increasingly under fire in recent years.

From “Amusing Ourselves to Death” by Neil Postman, a book about the possibility that Aldous Huxley, not Orwell, was right.

This is not just crying wolf; we have all heard the message before: funding for science is low, the space program takes cuts, there are fewer technical majors, Justin Bieber is more popular than The Doors.

A fantastic metric for determining whether our resources, in sum, are being allocated fruitfully is the pooled returns of venture fund indexes. From its birth in the 1960s through the 1990s, venture capital had excellent returns and was closely associated with the high-capital, slow-growth semiconductor and biotechnology industries.

VC funds have posted negative mean and median returns from 1999 through the present; a small fraction of firms are the exception.

In the new millennium, however, we have encountered a new paradigm for returns amongst these indexes: a shift from funding transformational technologies to supporting companies solving incremental, or “hype”-based, problems; a shift from long-term, garden-like growth to the equivalent of big-game hunting. Steve Blank, who is invested in Ayasdi, said it best recently, stating:

If investors have a choice of investing in a blockbuster cancer drug that will pay them nothing for fifteen years or a social media application that can go big in a few years, which do you think they’re going to pick? If you’re a VC firm, you’re phasing out your life science division.

This perspective is beyond the bubble argument, or the oscillations of markets. It marks the creeping penetration of triviality into our investment culture. Furthermore, it is not a decision by any individual; rather, the whole return-on-investment ecosystem has created an illusion highlighting consumer, social, and entertainment products.

Illumina HiSeq systems, a core technology driving contemporary life-science discoveries.

Venture is often associated with bravely expanding our horizons: seeking out new lands and bringing back riches that will ensure growth for generations to come. Where will we go after all the shoe stores and matchmakers have migrated online? After the saturation of social media has reached nauseating ubiquity? To truly create long-term returns that assure the future financial stability of the investor, the scientist and engineer, and society, we must lead: not follow the bandwagon or be part of the “me too” culture.


“Cambridge Associates LLC U.S. Venture Capital Index® And Selected Benchmark Statistics” 2011

“Lessons from Twenty Years of the Kauffman Foundation’s Investments in Venture Capital Funds and The Triumph of Hope over Experience” 2012

“What Happened To The Future” – FoundersFund Manifesto


Filed under BigPharma, Genomics

K-Mistry Typeface By Ranmalee Jayaratne

I wanted to take a moment to thank Ranmalee for her wonderful concept and execution of the K-Mistry typeface in the banner for this site.

It is refreshing to see young designers take on the difficult challenge of presenting the life sciences and other technical sectors as appealing. It does not come as a surprise, though, that Ranmalee, a 21-year-old from Sri Lanka who decided to switch her studies from advanced mathematics to design, found such a wonderful way to balance the two.

If you’re like me, and look forward to what Ms. Jayaratne will create next, be sure to keep an eye on her Behance page.

Filed under Uncategorized

Closing The Gap Between Computational & Pharmaceutical Innovation

When confronted with the mortality of life, it becomes painfully clear that medicine has not kept up with information and computational innovations. At the heart of the problem stands the drug development process, where an average of 5 to 10 years of research and billions of dollars of investment often fails to produce a product.

Drug Probability of Success to Market

Figure 1 | Probability of success to market from key milestones. Data: cohort of 14 companies.

In the past few years, molecules in development have seen a frightening rate of attrition. The most capital- and resource-intensive period comes during the clinical trials, which can be broken down into the following stages: Phase I trials evaluate whether a new drug is safe; Phase II and Phase III trials assess a drug’s efficacy, monitor side effects, and compare the drug to similar compounds already on the market. Recent studies by the Centre for Medicines Research place Phase II success rates at 18%, lower than at any other point during drug development [1]. Spending an average of $300 million to $1 billion up to this point of research is par for the course [2].
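
To see why late-stage attrition dominates the economics, note that per-phase success rates compound multiplicatively on the way to market. A quick Python sketch; only the 18% Phase II figure comes from the CMR data cited above, while the other rates are illustrative assumptions:

```python
# Cumulative probability that a molecule entering the clinic reaches market.
# Only the Phase II rate is from the Centre for Medicines Research figure
# cited in the post; the other per-phase rates are assumed for illustration.
phase_success = {
    "Phase I": 0.60,      # assumed
    "Phase II": 0.18,     # CMR figure cited above
    "Phase III": 0.55,    # assumed
    "Approval": 0.85,     # assumed
}

p = 1.0
for phase, rate in phase_success.items():
    p *= rate
    print(f"After {phase}: {p:.1%} cumulative chance of reaching market")
```

Under these assumptions only about one molecule in twenty that enters Phase I ever reaches the market, which is why a single weak phase, here Phase II, dominates the overall risk.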

Successful Discovery Strategies

Figure 2 | Computer-assisted screening and traditional discovery strategy distributions of new molecular entities (NMEs). Followers are in the same class as previously approved drugs.

By contrast, computational drug design strategies have made tremendous advances in the new millennium, with new tools to identify targets and virtual screening assays. These include structure-based tools for lead identification and optimization utilizing X-ray crystallography, as well as high-throughput, target-based screenings of key protein families like G protein-coupled receptors. Promising indicators of computational drug design are encouraging new companies to court Big Pharma, which to date has relied on academia or internal projects for computation. For a company like GeneDrop, even a fraction of the development budget would be adequate to deliver favorable results.

Drug development’s addressable market size for global corporations such as Novartis or Roche, which have between 20 and 100 molecules in the pipeline at any given time, was estimated at $1.11 trillion in 2011, down from $1.24 trillion in 2001 [2]. There are approximately ten large pharmaceutical companies, and many small ones with one or two late-stage molecules in development.

Early-Stage Computational Drug Design

Fig 3 | Early-stage computational drug design flow

To date, most computation in the space has been limited to early-stage research: the discovery of molecules prior to the clinical trial phases. However, the fall in market cap has sent drug companies scrambling as patents on existing blockbuster drugs near expiration and molecules in development fail at increasingly high rates. This raises the question: why are computational resources being spent in the early stage, when most failures occur in the late stage, during Phase II?


Fig 4 | Pharmacogenomics attempts to correlate how individuals will respond to drugs based on genomic variability.

As always, cost has been a primary factor. Late-stage computation has meant analysis of biometric data, which has been limited to blood work and questionnaires of trial subjects. The pie in the sky, of course, has always been genomics, whose price was deemed too high: even a couple of years ago, it cost over $10,000 to sequence an individual. With Phase II and III trials consisting of hundreds to thousands of patients, the method was rarely used. As of the last few months this is no longer the case, with the cost hovering around $5,000 and quickly approaching $1,000 per patient.

So we are faced with an enticing opportunity for information technology to rescue a high-capital, old-world industry. Threading this needle, however, is no easy task; entrenched industries with high quarterly revenues are notoriously conservative about adopting innovation, especially from the outside. Adding to this are the high barriers of the technical languages of the hard sciences and the networking culture of global corporations. Luckily, both are boundaries that have been broken before in other industries, so we can be optimistic: if anyone can break them, it is the passionate and talented.


[1] Trial watch: Phase II failures: 2008–2010 by J. Arrowsmith – Nature Reviews Drug Discovery 10, 328-329 (May 2011) | doi:10.1038/nrd3439

[2] – Fig 1- A decade of change by J. Arrowsmith – Nature Reviews Drug Discovery 11, 17-18 (January 2012) | doi:10.1038/nrd3630 

[3] – Fig 2- How were new medicines discovered? by David C. Swinney & Jason Anthony – Nature Reviews Drug Discovery 10, 507-519 (July 2011) | doi:10.1038/nrd3480

[4] – Fig 4 – Genomics in drug discovery and development by Dimitri Semizarov, Eric Blomme (2008) ISBN 0470096047, 9780470096048

Filed under BigPharma, Genomics

Chaos Game Analysis of Genomes

Triforce Power

The genomic code that makes us is written in four letters: A, T, G, C. Billions of these letters together create a lifeform. Iterated function systems (IFS) are anything that can be made by repeating the same simple rules over and over. The easiest example is tree branches: add a simple structure repeatedly, ad infinitum, and before you know it you have a complex and beautiful system; the popular example is the Sierpinski Triangle, or “triforce” for the Zelda fans. As the cost of DNA sequencing falls day by day, we are confronted with a tsunami of data, and it has become exceedingly difficult to derive meaningful answers from all the information contained within us.

H. Sapiens

Finding any advantage in how we organize and view the data helps us discover minute differences between individuals, or between, say, a normal cell and a cancer cell. This is where Chaos Game Representation (CGR) becomes helpful: CGR is a form of IFS useful for mapping seemingly random information that we suspect, or know, to have some underlying structure.

In our case this would be the human genome. Although the letters coming from our DNA look like billions of random babbles, they are of course organized in a manner that gives the blueprint for our bodies. So let’s roll the dice: do we get any meaningful structure when applying CGR to DNA? If you are so inclined, something fun to try is the following:

(* Read sequences from a FASTA file; note the escaped backslashes in the path *)
genome = Import["c:\\data\\sequence.fasta", "Sequence"];
genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
chars = StringCases[genome, "G" | "C" | "T" | "A"];
(* Each base maps the current point halfway toward its corner of the unit square *)
f[x_, "A"] := x/2;                (* toward {0, 0} *)
f[x_, "T"] := x/2 + {1/2, 0};     (* toward {1, 0} *)
f[x_, "G"] := x/2 + {1/2, 1/2};   (* toward {1, 1} *)
f[x_, "C"] := x/2 + {0, 1/2};     (* toward {0, 1} *)
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]

g1346a094 on Chromosome 7

For example, reading the sequence in order, apply transformation T1 whenever C is encountered, T2 whenever A is encountered, T3 whenever T is encountered, and T4 whenever G is encountered. Really, though, any assignment of transformations to C, A, T, and G can be used, and multiple methods can be compared. Self-similarity is immediately noticeable in these maps, which isn’t all that surprising, since fractals are abundant in nature and DNA, after all, is a natural syntax. Being aware that these patterns exist within our data opens us up to new questions for evaluating whether IFS, CGR, and fractals in general are helpful tools in the interpretation of genomic data.

Signal transducer and activator of transcription 5B (STAT5B), on chromosome 17

Since the mapping is one-to-one and we see patterns emerge, there is a hint of biological relevance, especially because different genes yield different patterns. But what exactly are the correlations between the patterns and the biological functions? It would also be very interesting to see introns and exons colored differently, or to color amino acids and various codons. One thing is for sure: genomes aren’t just endless columns and rows of letters; they are pictures. It is much easier to compare pictures and discover variations, which can ultimately help us find meaningful interpretations of this invaluable data.
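
Since the mapping is one-to-one, it can also be run backwards: the final point alone encodes the trailing bases, which can be recovered quadrant by quadrant. A small Python sketch using the same corner assignment as the snippet above (`cgr_point` and `cgr_decode` are hypothetical helper names, not part of any library):

```python
# The CGR map moves the point halfway toward each base's corner, so the
# quadrant of the final point reveals the most recent base; undoing the
# half-step reveals the one before it, and so on.
# Corner assignment matches the snippet above: A->(0,0), T->(1,0),
# G->(1,1), C->(0,1).
CORNERS = {"A": (0.0, 0.0), "T": (1.0, 0.0), "G": (1.0, 1.0), "C": (0.0, 1.0)}
QUADRANTS = {(0, 0): "A", (1, 0): "T", (1, 1): "G", (0, 1): "C"}

def cgr_point(seq, start=(0.5, 0.5)):
    """Forward CGR: fold the sequence into a single point."""
    x, y = start
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y

def cgr_decode(point, k):
    """Recover the last k bases from a CGR point by inverting the map."""
    x, y = point
    bases = []
    for _ in range(k):
        base = QUADRANTS[(x >= 0.5, y >= 0.5)]  # quadrant -> most recent base
        bases.append(base)
        cx, cy = CORNERS[base]
        x, y = 2 * x - cx, 2 * y - cy  # undo the half-step toward the corner
    return "".join(reversed(bases))
```

This invertibility is what makes the emergent patterns meaningful: no information is lost in the picture, only rearranged.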


Jeffrey, H. J., “Chaos game visualization of sequences,” Computers & Graphics 16 (1992), 25-33.

Ashlock, D. Golden, J.B., III. Iterated function system fractals for the detection and display of DNA reading frame (2000) ISBN: 0-7803-6375-2

VV Nair, K Vijayan, DP Gopinath ANN based Genome Classifier using Frequency Chaos Game Representation (2010)


Filed under Fractals, Genomics

Bioinformatics In Bengal

Dept. of Biochemistry, University of Dhaka

Visiting Bengal for the holidays, I didn’t expect a thriving bioinformatics community. Yet that’s exactly what I found when Dr. Haseena Khan invited me to visit her lab at the University of Dhaka. The Jute Genome Project was a consortium of academia, industry, and government that sequenced and analyzed the genome of the jute plant.

What Dr. Khan and her researchers lacked in cutting-edge equipment, they made up for in passion, ingenuity, and thorough knowledge of the most minuscule advancements in the field. After spending the day with them, Dr. Khan insisted I meet with the industrial wing of the project.

Tucked away amidst one of the most crowded places on the planet are a few small buildings covered in plants, and within them incredible things are happening.

Lush green home of scientists, developers & supercomputers at DataSoft

DataSoft Systems Ltd. created a sub-division, Swapnojaatra (dream journey), which would “put scientists, developers, and supercomputers in one room and throw away the key,” as Palash, the Director of Technology for DataSoft, told me. Although the Jute Genome Project is now complete, the developers of Swapnojaatra are hooked on informatics. From the minute we met they were excited to show what they had done (within the bounds of existing NDAs) and to ask what was new in the field from San Francisco. Indeed, the team here had discovered genomic re-assortment in the influenza virus, performed molecular-docking studies for pneumonia, and created many of their own informatics tools.

For a well-educated, computer-savvy developing region, bioinformatics is a near-perfect industry. With overhead costs low compared to the traditional wet-lab sciences, and endless data being generated in more economically developed countries, it’s only a matter of time. Bengal and bioinformatics may have been made for each other.



A Putative Leucine-Rich Repeat Receptor-Like Kinase of Jute Involved in Stress Response (2010) by MS Islam, SB Nayeem, M Shoyaib, H Khan DOI: 10.1007/s11105-009-0166-4

Molecular-docking study of capsular regulatory protein in Streptococcus pneumoniae portends the novel approach to its treatment (2011) by S Thapa, A Zubaer DOI:10.2147/OAB.S26236

Palindromes drive the re-assortment in Influenza A (2011) by A Zubaer, S Thapa ISSN 0973-2063

Filed under Genomics, Microbiology

Cancer, Stress, Genomic Treatments & Steve Jobs

Of the billions of letters making up our life-code, a few simple mistakes can cause a cell to stop following the rules, multiply endlessly, and destroy its host ecosystem: you. What can introduce such mistakes into our DNA? The riskiest time is when DNA is copied, a completely molecular-mechanical process, something that happens billions of times each day in our bodies. It then comes as no surprise that, given enough time, the likelihood of cancer only increases. However, life is so incredible that it has managed to develop ingenious ways to spell-check and prevent these mistakes. One of the more rockstar mechanisms is telomeres, buffer regions at the ends of our chromosomes.

Telomere Caps

Because replication of our chromosomes begins near the center and works its way toward the ends, the two strands of DNA tend not to match up just right at the edges. This is where telomeres reside, with their repetitive regions. The idea is that it wouldn’t matter if we lose some of them, which is what happens: telomeres shorten over time. Recent research, however, has shown strong correlation between the degree of shortening and disease risk, specifically diseases with genomic causes, i.e., cancer. In 2009 the Nobel Prize in Physiology or Medicine was awarded to Elizabeth Blackburn of UCSF for her work on telomeres, including demonstration of the relationship between telomere length, mental stress, and cancer.

Genomic Instability and an Increased Incidence of Spontaneous Cancer in Aging mTR−/− Mice

As far as clinical applications of genomics for cancer, two conclusions can be drawn. For prevention: the importance of mental stress to the machinery of DNA replication. For diagnostics and treatment: the adoption of DNA sequencing, both in risk assessment, i.e., measuring telomere lengths, and in using pharmacogenomics to pick a drug regimen. Availability of such sequencing technologies is for the moment confined to localities with active genomics industries, like San Francisco and the greater Cambridge, Massachusetts region, and is often unknown to those in need.

This made me all the more flustered by the failure of clinicians to win Steve Jobs’ battle against cancer. As a case study, he clearly had the resources and lived in the birthplace of the genomics industry. Jobs was a meticulous individual who was very involved with his work, which leaves us to question his ability to deal with stress and the role that played in his remission and recurrence. Another lingering question remains: are the “best” oncologists money can buy necessarily those who would employ sequencing techniques? Perhaps as time goes on we will learn more about his treatment regimen, but for me it shines a bright light on the gap between what researchers see as possible and what clinicians feel comfortable utilizing.

Impartial comparative analysis of measurement of leukocyte telomere length/DNA content by Southern blots and qPCR. Nucleic Acids Res. 2011 Aug 8. Aviv A, Hunt Sc, Lin J, Cao X, Kimura M, Blackburn E.

Longevity, Stress Response, and Cancer in Aging Telomerase-Deficient Mice. Karl Lenhard Rudolph, Sandy Chang, Han-Woong Lee, Maria Blasco, Geoffrey J. Gottlieb, Carol Greider, Ronald A. DePinho

Leukocyte Telomere Length in Major Depression: Correlations with Chronicity, Inflammation and Oxidative Stress – Preliminary Findings. Wolkowitz OM, Mellon SH, Epel ES, Lin J, Dhabhar FS, et al. 2011 PLoS ONE 6(3): e17837.

Filed under Genomics, Meditation, Neuroscience

Zero-mode Waveguide & Single Molecule Real Time Sequencing

Zero-mode waveguide: a hole, tens of nanometers in diameter, smaller than the wavelength of the light used, providing a window for watching DNA polymerase write.

Last week saw something of a historic announcement that may well be seen in a grander light by our offspring than by us: 23andMe announced the $999 exome, covering all the protein-coding regions of our genome. This for the first time makes it feasible to extract significant amounts of genomic information from patients, consumers, and trial subjects. At the same time, the founders of Complete Genomics wrote an article on the falling cost of sequencing, showing some great numbers on technologies that use fewer reagents, run on cheaper machines, and deliver faster results. Advancements are so frequent in this field, it seems. What gets me really excited, though, is the concept of SMRT (Single Molecule Real Time) sequencing: the idea of reading a strand of DNA one letter at a time as it’s written. Most of our progress, like the $999 exome or the success of Complete Genomics, has been possible thanks to high-throughput sequencing, which evolved from the original Sanger methods. Whereas Sanger sequencing would spit out a few hundred letters of DNA at a time, HT sequencing produces much shorter reads, but at a far faster rate. SMRT promises long reads, thousands of letters, at a fast rate. The in-progress technologies promising long reads range from pulling a DNA strand through nanopores to running a large single atom across a strand. ZMW (zero-mode waveguide) takes a unique approach in that it uses DNA polymerase, the default DNA-writing nano-machine in our cells, to give readings as a strand is written with fluorescent letters.

PacBio's SMRT Prototype

Although this is exciting stuff, I wonder how much it will help the adoption of genomics in medicine. The old paradigm of pills, vaccines, and the seeking of magic bullets is, it seems, embedded deep in our economic and psychological fabric. Now that the cost of sequencing is comparable to most other medical tests, will it become as commonplace as ordering an MRI or getting a new hip? It doesn’t seem like a $1,000 price tag would stop anyone. No, what has happened is that we’ve handed kindergartners college textbooks: it’s too much information, and they haven’t a clue what to use it for. More than faster and cheaper, we need user-friendly, digestible data interpretation. ZMW and SMRT are sci-fi cool, but if there’s anything to be learned from the world’s largest technology company, it’s that adoption is more a game of collective psyches than raw science and engineering.


PacBio Technology Backgrounder 

How Low Can We Go? Molecules, Photons, and Bits by Clifford Reid

To the man who showed my whole generation the correlation between engineering, usability & adoption 

Filed under Genomics

Drug Development from Binary to Gradient Model

Earlier this year, a study by the Center for the Study of Drug Development at Tufts University placed the cost of developing a new drug at $1.3 billion [1].

Distribution of Development Funding

Though the number is contested by other researchers [2], it is well within the trend of previous studies and has now been widely accepted as an industry-wide average. Exacerbating the issue is the all-or-nothing nature of drug development, where failure during any phase of clinical trials can cause the termination of a project. It is therefore advantageous to consider technologies that reduce the risk of this binary success/failure model and transition to a gradient definition of therapeutic efficacy.

Trending Costs of Drug Development

Much of the high cost comes during Phase II and III trials, where patient care, clinical production, and regulatory legwork consume funds at an alarming rate. With everything riding on the individual trial subjects, their well-being is directly linked to success. Undesirable reactions to experimental treatments are unavoidable, and the margins for serious adverse events are kept tight by regulatory agencies to protect healthcare consumers. Often, however, ground-breaking treatments have to be shelved because they affect 10-15% of trial subjects detrimentally.

R&D costs of a new chemical entity (NCE)

This makes any ability to view trial subjects at increased resolution, and to discern subtle correlations between their reactions and their demographics, key to cutting the risk of total loss. Here I hope a story from my own experience is helpful, as I know it better than anyone else’s. My time at Novartis began when I was brought on board to help with the development of a drug entering a repeat Phase IIb trial, as the first time around approximately 15% of subjects had shown an adverse reaction of note.

Draft FDA Guidance on DNA Sequencing & Clinical Trials

Soon, however, folks began to get cold feet: do we put further resources behind this project, or cut our losses and iterate to the next one? A third option now becoming available is to ask whether there was something specific to those 15% of patients that caused the unwanted reaction. Identifying it would allow the drug to move along its pipeline with contraindications covering the failing demographics, no longer limiting projects to pass/fail while hedging development risks.
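
That third option amounts to stratifying trial subjects by genomic markers and asking whether the adverse reactions cluster in one group. A toy Python sketch with entirely made-up genotypes at a hypothetical variant; a real analysis would run a proper association test (e.g., Fisher’s exact) across many variants:

```python
# Toy stratification: tally adverse reactions per genotype group to see
# whether the reacting subjects share a marker. All data below is invented.
from collections import defaultdict

subjects = [
    # (genotype at a hypothetical variant, had_adverse_reaction)
    ("AA", False), ("AA", False), ("AA", False), ("AA", False),
    ("AG", False), ("AG", False), ("AG", True),
    ("GG", True), ("GG", True), ("GG", False),
]

counts = defaultdict(lambda: [0, 0])  # genotype -> [adverse, total]
for genotype, adverse in subjects:
    counts[genotype][1] += 1
    if adverse:
        counts[genotype][0] += 1

for genotype, (adverse, total) in sorted(counts.items()):
    print(f"{genotype}: {adverse}/{total} adverse ({adverse / total:.0%})")
```

If the adverse reactions concentrate in one genotype group, the drug could proceed with a contraindication for that group instead of being shelved outright.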

DiMasi et al,(2003) The price of innovation: new estimates of drug development costs
Ernst & Young Global Pharmaceutical Industry Report (2011) Progressions Building Pharma 3.0
Tufts Center for the Study of Drug Development (2011) Outlook 2011 report

Filed under BigPharma, Genomics