Category Archives: Microbiology

Retooling Analysis Pipelines from Human to EBOV NGS Data for Rapid Alignment and Strain Identification

Can we use pipelines developed for human NGS analysis and quickly apply them to viral analysis? With ebolavirus in the news, it seemed like a good time to try. Just as with a human sequencing project, it helps to have a good reference genome. NCBI hosts four different ebolavirus strain reference files on its FTP site:

Remote directory: /genomes/Viruses/*
Accession: NC_002549.1 : 18,959 bp linear cRNA
Accession: NC_014372.1 : 18,935 bp linear cRNA
Accession: NC_004161.1 : 18,891 bp linear cRNA
Accession: NC_014373.1 : 18,940 bp linear cRNA

Currently, everything that has happened in West Africa looks to match best with NC_002549.1, the Zaire strain. The Broad Institute began metagenomic sequencing from human serum this summer, and the data can be accessed under accession PRJNA257197. We can take some of these datasets and map them to NC_002549.1. The datasets are in .sra format and must first be extracted with fastq-dump.
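
For anyone following along, extraction is a one-liner with the SRA Toolkit; a minimal sketch (SRR1553514 is one of the PRJNA257197 runs aligned below):

fastq-dump SRR1553514          # fetch the run by accession and convert to FASTQ
fastq-dump SRR1553514.sra      # ...or convert an archive already on disk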

Coverage map of SRA data from the 2014 outbreak in Sierra Leone to the Zaire reference genome.

We can see that the data maps very well to this strain. All four of the reference genomes above were indexed with a fresh build of bwa (0.7.10-r876-dirty, built from git clone https://github.com/lh3/bwa.git). Because EBOV genomes are so small compared to the human genome, the only bwa alignment algorithm that seemed suitable was mem.
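
Indexing, for reference, is one command per FASTA; shown here for the Zaire strain, with the filename matching the alignment session below:

./bwa/bwa index Zaire_ebolavirus_uid14703.fa   # writes the .amb/.ann/.bwt/.pac/.sa index files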

EBOV mokas$ ./bwa/bwa mem Zaire_ebolavirus_uid14703.fa SRR1553514.fastq > SRR1553514.sam
[M::main_mem] read 99010 sequences (10000010 bp)...
[M::mem_process_seqs] Processed 99010 reads in 8.988 CPU sec, 9.478 real sec
[M::main_mem] read 99010 sequences (10000010 bp)...
[M::mem_process_seqs] Processed 99010 reads in 8.964 CPU sec, 9.671 real sec

If we take the same SRA data and try to map it to some of the other strain references, e.g. the Reston strain isolated in Virginia in 1989, we can get a rough idea of how closely related the 2014 outbreak is to the rest of the family.
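
The pipeline is identical, just pointed at a different index; a sketch, assuming a Reston FASTA named Reston_ebolavirus.fa and a reasonably recent samtools:

# Map the same reads against the Reston reference (NC_004161.1)
./bwa/bwa index Reston_ebolavirus.fa
./bwa/bwa mem Reston_ebolavirus.fa SRR1553514.fastq > SRR1553514_reston.sam
# Sort the alignment and compute per-base depth to draw the coverage map
samtools view -b SRR1553514_reston.sam | samtools sort -o SRR1553514_reston.bam -
samtools depth SRR1553514_reston.bam > SRR1553514_reston.cov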

Very few regions from 2014 map to the Reston reference

Apart from a few highly conserved regions where many reads align, the coverage map shows that the data collected in West Africa and sequenced on the Illumina HiSeq 2500 does not match NC_004161.1. Even against the Zaire reference there were still approximately 500 variants in the 2014 samples, a good number of differences considering the entire genome is only about 19,000 bp.
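
A variant count like that can be pulled out with samtools and bcftools; a minimal sketch, reusing the SAM file from the alignment session above (exact flags differ between bcftools versions):

# Sort the Zaire alignment, call variants, and count the records
samtools view -b SRR1553514.sam | samtools sort -o SRR1553514.bam -
samtools faidx Zaire_ebolavirus_uid14703.fa
bcftools mpileup -f Zaire_ebolavirus_uid14703.fa SRR1553514.bam | bcftools call -mv -o SRR1553514.vcf
grep -vc '^#' SRR1553514.vcf   # counts variant records, excluding the VCF header
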
All of this is, of course, good news. We can take sequencing data from new EBOV strains and apply slightly modified pipelines to get meaningful results. And with the Ion PGM now FDA approved, data can be generated in roughly 3 hours, with Federal approval. There have even been publications showing that the protein VP24 can stop EBOV altogether [DOI: 10.1086/520582], with the structures available for analysis as well. So it looks like it’s all coming up humanity: our capabilities are there, and with proper resources this scary little bug can become a thing of history.


Filed under Genomics, Microbiology

Virtualization of Raw Experimental Data

Earlier today it was announced that the 2012 Nobel Prize in Physiology or Medicine would be shared by Shinya Yamanaka, for his discovery of 4 genes that can turn a normal cell back into a pluripotent cell, and by John B. Gurdon, whose work on frog eggs demonstrated the same effect over 40 years ago.

The NCBI’s Gene Expression Omnibus (GEO) database, under accession number GSE5259, contains data for all 24 candidate genes suspected to play a role in returning a cell to a non-specialized state. A practical near-term impact of the research, however, may be overlooked: you can have all of Dr. Yamanaka’s experimental DNA microarray data used in making the prize-winning discovery.
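
A sketch of pulling the record down over FTP; the path follows GEO’s standard series layout, though the exact filenames underneath it may differ:

# Recursively fetch the GSE5259 series directory from NCBI GEO
wget -r -np ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5259/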

Unless you’ve been living under a rock on Mars, or you don’t care what dorky scientists are up to, you’ve probably heard of the ENCODE project. The Encyclopedia of DNA Elements isn’t winning any Nobel Prizes, not yet anyway, and if what many researchers believe is true, it never will. Still, all the datasets can be found, spun up, played with, and used as fodder for a new round of pure in silico research from the ENCODE Virtual Machine and Cloud Resource.

What ENCODE and the Nobel Prize in Medicine have in common is that both usher in a new paradigm of sharing raw experimental data, protocols, and methodology. ENCODE, which generated huge amounts of varied data across 400+ labs, has made all of the raw data available online. It goes one step further by providing the exact analytic pipelines used per experiment, raw datasets included, as virtual machines. The lines between scientist and engineer are blurring; the best of either will have to be a bit of both. From the Nobel data, can you find the 4 genes out of the 24 responsible for pluripotency? Are there similarly valuable needles lost in the haystack of ENCODE data? Go ahead, give it a GREP through.
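
(Spoiler ahead.) To check your work: the four factors turned out to be Oct3/4 (Pou5f1), Sox2, Klf4, and c-Myc. A hypothetical one-liner against the series matrix, assuming GEO’s usual file naming and that gene symbols appear in the annotation:

# Search the downloaded record for the four reprogramming factors
zgrep -iE "pou5f1|sox2|klf4|myc" GSE5259_series_matrix.txt.gz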



Filed under Genomics, Microbiology

Bioinformatics In Bengal

Dept. of Biochemistry, University of Dhaka

Visiting Bengal for the holidays, I didn’t expect a thriving bioinformatics community. Yet that’s exactly what I found when Dr. Haseena Khan invited me to visit her lab at the University of Dhaka. The Jute Genome Project was a consortium of academia, industry, and government that sequenced and analyzed the genome of the jute plant.

What Dr. Khan and her researchers lacked in cutting-edge equipment, they made up for in passion, ingenuity, and a thorough knowledge of even the most minuscule advancements in the field. After spending the day with them, Dr. Khan insisted I meet the industrial wing of the project.

Tucked away amidst one of the most crowded places on the planet are a few small buildings covered in plants, and within them incredible things are happening.

Lush green home of scientists, developers & supercomputers at DataSoft

DataSoft Systems Ltd. created a sub-division, Swapnojaatra (dream journey), which would “put scientists, developers, and supercomputers in one room and throw away the key,” as Palash, the Director of Technology for DataSoft, told me. Although the Jute Genome Project is now complete, the developers of Swapnojaatra are hooked on informatics. From the minute we met they were excited to show me what they had done (within the lines of existing NDAs) and to ask what was new in the field back in San Francisco. Indeed, the team here had discovered genomic re-assortment in the influenza virus, performed molecular docking studies on Streptococcus pneumoniae, and created many of their own informatics tools.

For a well-educated, computer-savvy, developing region, bioinformatics is a near-perfect industry. With low overhead costs compared to traditional wet-lab sciences, and endless data being generated in more economically developed countries, it’s only a matter of time. Bengal and bioinformatics may have been made for each other.

Citations:

A Putative Leucine-Rich Repeat Receptor-Like Kinase of Jute Involved in Stress Response (2010) by MS Islam, SB Nayeem, M Shoyaib, H Khan. DOI: 10.1007/s11105-009-0166-4

Molecular docking study of capsular regulatory protein in Streptococcus pneumoniae portends the novel approach to its treatment (2011) by S Thapa, A Zubaer. DOI: 10.2147/OAB.S26236

Palindromes drive the re-assortment in Influenza A (2011) by A Zubaer, S Thapa. ISSN: 0973-2063



Filed under Genomics, Microbiology

Decided? No, we just finished saying Good Morning: Sage Congress 2011

“Therefore a sage has said, ‘I will do nothing (of purpose), and the people will be transformed of themselves; I will be fond of keeping still, and the people will of themselves become correct. I will take no trouble about it, and the people will of themselves become rich; I will manifest no ambition, and the people will of themselves attain to the primitive simplicity,’” reads Ch. 57 of the Tao Te Ching. How chillingly this two-millennia-old caricature of a wise, learned man holds true to this day.

Sage Bionetworks is a medical research organization whose goal is “to share research and development of biological network models and their application to human disease and biology.” To this end, top geneticists, clinicians, computer scientists, and pharmaceutical researchers gathered this weekend at UC San Francisco. We were given an inspirational speech by a cancer survivor, followed by a report on the progress since last year’s congress. Although admirable on their own, the research and programs built over the last year seemed to remind us all that in silico research still moves closer to the speed of traditional life science than to the leaps and bounds of the internet.

Example of an effort which aligns with & was presented at Sage

Projects like GenomeSpace by the Broad Institute give us hope of what’s possible while we watch hours of debate and conjecture at Sagecon. There were many distinguished scientists, authors, Nobel laureates, and government representatives, the totality of whose achievement here was coming to agreement on what should be built, who should build it, and by when. Groups were divided into subgroups, and those were divided yet again. All the little policy details, software choices, and even funding options would be worked out. There was a lot of talk.

Normal Conference VS Developer Conference. SHDH Illustrated by Derek Yu

Having attended gatherings for software developers in Silicon Valley, I can say their hackathons make events like Sagecon leave much to be desired, the least of which being the beer. I doubt anyone enjoys sitting in a stuffy blazer listening to talks for hours on end. Hacker events are very informal and have no set goal, yet by the end of 24 hours there are often great new programs, friendships, and even companies formed. Iteration rate is key to finding solutions, and the rate-limiting step in the life sciences and medicine isn’t talent or resources, it’s the culture; an opinion echoed by Sage’s own shorts-wearing heroes Aled Edwards and Eric Schadt.

“You must understand, young Hobbit, it takes a long time to say anything in Old Entish. And we never say anything unless it is worth taking a long time to say.”


Filed under Genomics, Microbiology

Library of Life: Genomic Databases & Browsers

DNA at its heart is an enormous chunk of information. The genome of an organism like yeast, mouse, or human contains an ocean of data. Currently there are several online genomic databases, a great example being SGD, dedicated to the yeast S. cerevisiae. SGD has become a necessary tool for life scientists over the past 10 years, but it has not kept up with information technology, resulting in a platform that works like a 10-year-old website.

SGD is clunky but necessary, for now

Above we see a typical SGD search: it takes 5 windows to arrive at the sequence data for 1 gene. Nevertheless, SGD is used by drug companies trying to find the next big hit, academic labs trying to cure cancer, and field biologists studying wildlife.

DNA is extracted and run through a sequencing machine, which spits the information out into a computer file. Just as an aged internet browser hampers our productivity, the browser one uses to view these files can have a large impact. Following the web-browser analogy, we take a look at 3 different sequence browsers, starting with Vector NTI.

Vector NTI is enterprise software.

Vector NTI is well established and often bundled with hardware. It has many features but can feel like information overload, causing most users to stumble through its many menus and windows. A step up in usability comes from the third-party software suite Sequencher, popular amongst Mac users.

Sequencher is your friend

Sequencher strikes a healthy balance between features and usability, but it is a fairly resource-intensive program, requiring CDs and hard-drive space to store local algorithms. The most up-to-date browser, however, is likely the free and lightweight download 4Peaks.

4Peaks Simplicity & Usability

4Peaks allows the user to go in, read their sequence file, and get out. What it lacks in features it makes up for in simplicity. The end goal of any such software or database is to help researchers wade through all this information and continue their studies. In this environment, services such as GENEART offer to perform much of the genomics-related legwork on a given project.

These are all tools, the databases, browsers, and services that enable researchers to answer the questions lining our horizon. The progress of our tools has always correlated directly with our advancement, and the life sciences’ adoption of information technology is a necessity as we discover that so much of life is condensed data in every nook.


Filed under Genomics, Microbiology, PCR

The Polymerase Chain Reaction, A Microcosm

Creating a new life-form is an awe-inspiring experience. Writing DNA like a mere sentence and watching creation unfold through the mechanism of life is both breathtaking and humbling. None of it would be possible without the Polymerase Chain Reaction (PCR), a simple process where all the ingredients for DNA, a teaspoon of reagents, a pinch of polymerase enzyme, and a handful of the “letters” that make up our genetic code, are thrown into the oven. Literally, well, a very accurate oven that can step between temperatures rather quickly. Within hours, the sentence you had written on a computer screen exists as molecules floating in a tiny tube, ready to be put into a cell, which will read the instructions and attempt to build or act accordingly. With this simple idea the human race has been handed the keys to the Build-a-Life Workshop, yet the process itself often goes without scrutiny, without improvement.

Basic Principles of PCR

Much of the drug discovery in both academia and industry is now focused on protein mechanics. How does this receptor behave? What buttons turn this enzyme on and off? Focusing on protein structure and mechanism often makes PCR a boring chore that most researchers grudgingly get past before reaching the interesting part. As a result, the basic process of PCR has remained the same for decades. I literally remember a P.I. handing me a paper from 1985 to look up what settings I should use for my reaction. None of this would be a problem, except that people often waste weeks to months trying to get the right PCR outcomes. At the root of both the problem and the solution is information. PCR is a “black box” process: you throw all the ingredients together, turn on the machine, and hope the right molecules bump into each other at the right times. Traditionally it has been an exasperating trial-and-error system. Now, however, information technology has given a glimpse of a solution and a way forward to the next chapter in the development of this life-science staple.
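
For the curious, those “settings” are a thermocycler program. A typical textbook three-step protocol looks something like the sketch below; actual times and temperatures depend on the polymerase, the primers, and the length of the target:

# Illustrative three-step PCR program, not a recipe
95°C  2:00             initial denaturation
repeat 30 cycles:
  95°C  0:30           denature: the double helix separates into single strands
  55°C  0:30           anneal: primers bind, typically ~5°C below the primer Tm
  72°C  1:00 per kb    extend: the polymerase copies each strand
72°C  5:00             final extension
4°C   hold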


Filed under Genomics, Microbiology, PCR