Tag Archives: Computational Biology

Mapping KEGG Pathway Interactions with Bioconductor

Continuing from the previous post, which dealt with the structural effects of variants, we can now abstract one more level up and investigate our sequencing results through a relational pathway model.

Global Metabolic Pathway Map of H. sapiens by Kanehisa Laboratories

The Kyoto Encyclopedia of Genes and Genomes (KEGG) has become an indispensable resource, one which has laboriously, and often manually, curated the high-level functions of biological systems. Bioconductor, though not as essential as KEGG, provides some valuable tools for applying graph theory to genomic analysis. If your data is well annotated and you happen to care about high-level genomic interactions, then you may already have pathway annotations attached to your variants.

KEGG IDs can be stored in an external file separate from the sequence data they are derived from. Storing the IDs alongside their respective variants is more helpful, though, and it is possible to do so while maintaining the VCF 4.1 specification.

KEGG Annotations in VCF 4.1
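As an illustration (the coordinate and annotation below are hypothetical), a KEGG ID can travel in the INFO column while keeping the file valid under VCF 4.1:

```
##fileformat=VCFv4.1
##INFO=<ID=KEGG,Number=.,Type=String,Description="KEGG pathway ID">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
19	41903766	.	G	A	50	PASS	KEGG=hsa00280
```

Because KEGG is declared in the header as a String-typed INFO field, any VCF 4.1-compliant parser will carry the annotation along without complaint.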

As most Bioconductor tools are based on the R programming language, an up-to-date installation is recommended; this post uses version 3.0.1 “Good Sport”. Creating interaction maps with KEGG data will require three packages: KEGGgraph, Rgraphviz, and org.Hs.eg.db. These packages can be downloaded as separate tarballs, but installation from within R is likely best:
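A sketch of the in-R installation as it worked at the time of writing (current Bioconductor releases use BiocManager instead of the biocLite script):

```r
source("http://bioconductor.org/biocLite.R")
biocLite(c("KEGGgraph", "Rgraphviz", "org.Hs.eg.db"))
```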

Use the method above for all three packages. KEGG relational information is stored within XML files written in the KEGG Markup Language (KGML). KGML files can be accessed through several methods, including directly from R, over FTP, and, subjectively the best method, through the REST-style KEGG API.
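For example, the KGML for a pathway can be pulled straight from the REST API (KEGGgraph's retrieveKGML function wraps the same idea; the destination path here is illustrative, and network access is assumed):

```r
# Download the KGML for the BCAA degradation pathway (hsa00280)
# from the REST-style KEGG API
download.file("http://rest.kegg.jp/get/hsa00280/kgml",
              destfile = "hsa00280.xml")
```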

The Bioconductor packages downloaded above come with a few KGML files pre-loaded, which can be viewed with the following command. It is also important to note that any KGML files we want to use should be placed in this same directory to avoid unnecessary errors.
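The bundled files live in the package's extdata directory and can be listed like so:

```r
library(KEGGgraph)
# List the KGML files shipped with KEGGgraph; user-supplied KGML files
# should be copied into this same directory
dir(system.file("extdata", package = "KEGGgraph"), pattern = "xml$")
```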

In this post the branched-chain amino acid (BCAA) degradation pathway (KEGG ID hsa00280) will be mapped in relation to variants from the BCKDHA gene. The variable names below are illustrative, and BCKDHA's KEGG gene ID, hsa:593, follows from its Entrez Gene ID.

library(KEGGgraph)
library(Rgraphviz)
library(org.Hs.eg.db)

# Parse the pathway's KGML; hsa00280.xml should first be placed in
# KEGGgraph's extdata directory (or pass any local path instead)
kgmlFile <- system.file("extdata/hsa00280.xml", package = "KEGGgraph")
bcaaGraph <- parseKGML2Graph(kgmlFile, genesOnly = TRUE)

# BCKDHA and its immediate neighbors in the pathway
bckdha <- "hsa:593"
nodeSet <- c(bckdha, edges(bcaaGraph)[[bckdha]])
bcaaSub <- subKEGGgraph(nodeSet, bcaaGraph)

# Flag nodes carrying at least one incoming or outgoing edge
hasOut <- sapply(edges(bcaaSub), length) > 0
hasIn <- sapply(inEdges(bcaaSub), length) > 0
connected <- hasOut | hasIn

# Relabel the nodes with gene symbols instead of KEGG IDs
entrezIDs <- translateKEGGID2GeneID(names(connected))
geneSymbols <- sapply(mget(entrezIDs, org.Hs.egSYMBOL, ifnotfound = NA), "[[", 1)
plotGraph <- bcaaSub
nodes(plotGraph) <- geneSymbols

# Fill the nodes with a color and draw the interaction map
nodeAttrs <- list()
nodeAttrs$fillcolor <- setNames(rep("lightgreen", length(nodes(plotGraph))), nodes(plotGraph))
plot(plotGraph, nodeAttrs = nodeAttrs)

Executing these steps will result in a graph whose nodes and edges should help clarify any relevant connections between the genomic regions in question.



While dynamic visualization tools (e.g. Gephi, Ayasdi, Cytoscape) produce similar maps, and can with some work utilize KEGG, they may lack the specificity and control that Bioconductor provides through its foundations in R. These methods are necessary for understanding more than just metabolic diseases; they also play a crucial role in drug interactions, compound heterozygosity and other complex non-Mendelian traits, and other high-level biological functions.


Filed under Genomics

Exome Sequence Assembly Utilizing Bowtie & Samtools

At the end of all the wet chemistry in a genome sequencing project we are left with the raw data in the form of FASTQ files. The following post documents the processing of said raw files into assembled genomes using Bowtie & Samtools.

Raw data is split into approximately 20-30 fastq files per individual

Each of these raw files, once uncompressed, contains somewhere around 1 gigabyte of nucleotide, machine, and quality information following the FASTQ guidelines. Headers from two of our reads look like the following; in the full records it is quickly noticeable where our nucleotide data, consisting of ATGCs, lives.

@HWI-ST1027:182:D1H4LACXX:5:2306:21024:142455 1:N:0:ACATTG
@HWI-ST1027:182:D1H4LACXX:5:2306:21190:142462 1:N:0:ACATTG
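A complete FASTQ record spans four lines: the header, the nucleotide sequence, a "+" separator, and per-base quality scores. As a sketch (the sequence and qualities below are made up), the number of reads in a file is simply its line count divided by four:

```shell
# Write one hypothetical FASTQ record: header, sequence, separator, qualities
cat > sample.fastq <<'EOF'
@HWI-ST1027:182:D1H4LACXX:5:2306:21024:142455 1:N:0:ACATTG
GATTACAGATTACAGA
+
IIIIIIIIIIIIIIII
EOF

# Four lines per record, so reads = lines / 4
expr $(wc -l < sample.fastq) / 4   # → 1
```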

At this point the raw reads need to be assembled into contiguous overlapping sets, then chromosomes, and finally the entire genome. There are two general approaches here: template-based and de novo assembly. For this particular exome data set it is prudent to move forward with template-based assembly using the latest build of the human reference genome. An index of the reference genome must first be built for Bowtie; pre-built indexes are also available for download, though the file sizes can be quite large.

$ bowtie-build /Users/mokas/Desktop/codebase/max/hg19.fa hg19
 Line rate: 6 (line is 64 bytes)
 Lines per side: 1 (side is 64 bytes)
 Offset rate: 5 (one in 32)
 FTable chars: 10
Getting block 6 of 6
 Reserving size (543245712) for bucket
 Calculating Z arrays
 Calculating Z arrays time: 00:00:00
 Entering block accumulator loop:
numSidePairs: 6467211
 numSides: 12934422
 numLines: 12934422
Total time for backward call to driver() for mirror index: 02:00:28

The entire reference build should be complete within an hour or two, which may be faster than downloading a pre-built index. At this point the raw FASTQ file is ready to be processed using our indexed template.

$ bowtie -S [ref] [fastq] [output.sam]

At the end of this step we will have a .sam (Sequence Alignment Map) file, which will have each of our raw reads aligned to certain positions on the human reference. However, the reads will be in no useful order, and all the chromosomes and locations are mixed together.
To be able to move through such a large file with speed and ease it must be converted into a binary format, at which point all the reads can be sorted into a meaningful manner.

$ samtools view -bS -o [output.bam] [input.sam]
$ samtools sort [input.bam] [output.sorted]

We are now left with a useful file where our raw reads are assembled and sorted based on a template.

This file can be visualized and analyzed in a wide variety of available programs, and the format is accessible enough to quickly build your own tools around it. Once each of the 20-30 FASTQ files in a single sample has been processed in this manner, the files can be merged, converted into binary for reduced file size, and indexed for quick browsing. IGV is one of the more useful browsers as a result of its simplicity and its ability to quickly jump all around the genome, giving a cursory look at how an assembly went.
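The merge and index steps can also be handled by samtools; as with the commands above, the bracketed names are placeholders:

$ samtools merge [merged.bam] [lane1.sorted.bam] [lane2.sorted.bam]
$ samtools index [merged.bam]

samtools merge accepts any number of sorted inputs, and the resulting .bai index is what lets a browser like IGV jump to arbitrary coordinates quickly.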

Integrative Genomics Viewer

This post is part of a set providing initial documentation of a systematic comparison of various pipelines with a wide range of algorithms and strategies. Check out the next post in the series on assembly with BWA & Picard.


Filed under Genomics

Virtualization of Raw Experimental Data

Earlier today it was announced that the 2012 Nobel Prize in Physiology or Medicine would be shared by Shinya Yamanaka for his discovery of four genes that can turn a normal cell back into a pluripotent cell, an effect originally shown by John B. Gurdon in his work on frog eggs over 40 years ago.

The NCBI's Gene Expression Omnibus (GEO) database, under accession number GSE5259, contains data for all 24 candidate genes suspected to play a role in returning a cell to a non-specialized state. A practical near-term impact of the research, however, may be overlooked: you can have all of Dr. Yamanaka's experimental DNA microarray data used in making the prize-winning discovery.
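As a sketch, the series can be pulled down directly in R with the GEOquery package from Bioconductor (network access assumed):

```r
library(GEOquery)
# Fetch the microarray series covering the 24 candidate genes (GSE5259)
gse <- getGEO("GSE5259", GSEMatrix = TRUE)
```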

Unless you’ve been living under a rock on Mars, or you don’t care what dorky scientists are up to, you will have heard of the ENCODE project. The Encyclopedia of DNA Elements isn’t winning any Nobel Prizes, not yet anyway, and if what many researchers believe proves true, it never will. All the datasets can be found, spun up, played with, and used as fodder for a new round of pure in silico research from the ENCODE Virtual Machine and Cloud Resource.

What ENCODE and the Nobel Prize in Medicine have in common is ushering in a new paradigm of sharing raw experimental data, protocols, and methodology. ENCODE, which generated huge amounts of varied data across 400+ labs, has made all of the raw data available online. They go one step further to provide the exact analytic pipelines utilized per experiment, including the raw datasets, as virtual machines. The lines between scientists and engineers are blurring; the best of either will have to be a bit of both. From the Nobel data, can you find the 4 genes out of the 24 responsible for pluripotent mechanisms? Are there similarly valuable needles lost in the haystack of ENCODE data? Go ahead, give it a grep through.



Filed under Genomics, Microbiology

The Fall in Gov Funding & Rise of Privatization in Genome Databases

Government Funded Sequence Database

As the space shuttle program comes to an end we are reminded of the role of governments in birthing industries. Just like the manned space program, genomics has been mostly government funded, and just like the space program it’s about to take a big hit:

Recently, NCBI announced that due to budget constraints, it would be discontinuing its Sequence Read Archive (SRA) and Trace Archive repositories for high-throughput sequence data. However, NIH has since committed interim funding for SRA in its current form until October 1, 2011.

With its fall there will be few if any (Japan’s DDBJ & Europe’s ERA) centralized public databases for next-gen sequencing. Once again we’re left to ride with the ruskies, figuratively speaking. Enter private industry and its first batter from the SF Bay Area, DNAnexus. Though Sundquist and his team have managed to create a very well polished and modern platform, unlike the SRA there is no data aggregation: there is no public pool where researchers can access data. This is a problem in that much of the power of genomics comes from studying statistical trends, and a large, public data pool is to date the best way to make any sense of what our genes say.

A similar private effort from the great innovators across the ocean comes in the form of Simbiot by Japan Bioinformatics K.K. At the moment Simbiot is edging a lead, as they’ve recently released two mobile applications allowing sequence data management and analysis on the go. However, just as with DNAnexus, users are only given the option to keep their data within their own accounts or share it with select others. Both of the aforementioned companies have well-thought-out price plans, sleek interfaces and well-produced videos. But what makes government efforts like the SRA valuable is that for a time they provided a centralized public data pool.

Said Infamous Graph

As anyone who’s seen the now-infamous graph of the rate of decrease in sequencing costs versus that of Moore’s law will have figured out, the costs associated with maintaining a sequencing database only increase with adoption of the technology. As such, it was reasonable for Uncle Sam to pay for this party at first, but the costs rise every year, by leaps. There must be a private model that is both aggregate and open in nature but can also pull its own weight in terms of cost; the future of healthcare and any form of “genomics industry” may well be dependent on it.


Filed under Genomics

Decided? No, we just finished saying Good Morning: Sage Congress 2011

“Therefore a sage has said, ‘I will do nothing (of purpose), and the people will be transformed of themselves; I will be fond of keeping still, and the people will of themselves become correct. I will take no trouble about it, and the people will of themselves become rich; I will manifest no ambition, and the people will of themselves attain to the primitive simplicity’ ” reads Ch. 57 of the Tao Te Ching. How chillingly this two-millennia-old caricature of a wise, learned man holds true to this day.

Sage Bionetworks is a medical research organization whose goal is “to share research and development of biological network models and their application to human disease and biology.” To this end, top geneticists, clinicians, computer scientists and pharmaceutical researchers gathered this weekend at UC San Francisco. We were given an inspirational speech by a cancer survivor, followed by a report of the progress since last year’s congress. Although admirable on their own, the research and programs built in the last year seemed to remind us all again that in silico research still moves closer to the speed of traditional life science than the leaps and bounds by which the internet moves.

Example of an effort which aligns with & was presented at Sage

Projects like GenomeSpace by the Broad Institute give us hope of what’s possible while watching hours of debate and conjecture at Sagecon. There were many distinguished scientists, authors, Nobel laureates and government representatives, the totality of whose achievement here was coming to agreement on what should be built, who should build it and by when. Groups were divided into subgroups, and then those were divided yet again. All the little policy details, software choices and even funding options would be worked out. There was a lot of talk.

Normal Conference VS Developer Conference. SHDH Illustrated by Derek Yu

Compared with the hackathons of Silicon Valley’s software-developer gatherings, events like Sagecon leave much to be desired, the least of which being the beer. I doubt anyone enjoys sitting in a stuffy blazer listening to talks for hours on end. The hacker events are very informal and have no set goal, yet by the end of 24 hours there are often great new programs, friendships and even companies formed. Iteration rate is key to finding solutions, and the rate-limiting step in the life sciences & medicine isn’t the talent or the resources, it’s the culture; an opinion echoed by Sage’s own shorts-wearing heroes Aled Edwards & Eric Schadt.

“You must understand, young Hobbit, it takes a long time to say anything in Old Entish. And we never say anything unless it is worth taking a long time to say. “


Filed under Genomics, Microbiology

Biotech for Hackers: Computational Genomics 1 of 2

A low hurdle to entry, along with the ability to iterate rapidly, is key to taking on problems and creating solutions. What do these solutions look like in genomics, and why can hackers lead the way? Fig 1 shows something very similar to the social interaction maps one comes across at places like Facebook.

Fig 1: Interaction map of genes implicated in Alzheimer's. Genes were grouped into those with similar functions (squares) and those with different functions (circles). Modules with a red border have high-confidence interactions, and the weight of the connecting green lines corresponds to the number of interactions between two sets.

The map above shows individual gene relationships: an algorithm began with 12 seed genes that previous experiments had shown to play a role in Alzheimer’s disease. These seeds were compared with 185 new candidate genes from regions deemed susceptible to carrying Alzheimer’s genes. From here, both experimental and computational data were combined to generate Fig 1, which the authors dubbed AD-PIN (Alzheimer’s Disease Protein Interaction Network).

Fig 2: Interactions discovered in the High-Confidence (HC) set generated by this study, in the context of known relationships in the Human Interactome (created in past studies).

What we learn by simply tracking genes already known to play a role in Alzheimer’s is the discovery of new regions of genetic code that also participate in the expression of related functions, in this case those affected by the disease, such as memory. In Fig 2 we see that between seeds this algorithm produced 7 high-confidence interaction results, of which 3 were in common with previous studies, in addition to almost 200 new interactions, each of which can lead to new therapies, blockbuster drugs and a better understanding of the disease itself.

Many software developers have extensive experience and interest in dealing with large data sets, finding correlations and creating meaningful solutions. However, much of our generation has had little exposure to these problems, often resulting in the bandwagon effect, as one recent article put it: “the latest fucking location fucking based fucking mobile fucking app.” Progress has often been linked to literacy, from books to programming; being able to read and write in life-code just might be the next stage.

Original published study: Interactome mapping suggests new mechanistic details underlying Alzheimer’s disease by Soler-Lopez et al.


Filed under Genomics