Super Happy Dev House pays off once again, with a day that blurred the line between play and work. It was reminiscent of the days in middle and high school when my parents would provide tables, power cords and snacks for those all-night LAN parties, except now Google was the host and playing games still ranked high on the agenda.
The National Center for Biotechnology Information (NCBI) provides a standalone, command-line Basic Local Alignment Search Tool (BLAST) package known as BLAST+ for analyzing and playing with genomic sequence data. Although the legacy web-based BLAST can perform a range of functions, BLAST+ as a command-line tool is much better suited to analyzing large amounts of nucleotide data. It may be best to get an idea of what sort of data we’re dealing with by getting into the government’s database:
    mokas$ ftp ftp.ncbi.nlm.nih.gov
    Connected to ftp.wip.ncbi.nlm.nih.gov.
    220-
     Warning Notice!
     This is a U.S. Government computer system, which may be accessed and used
     only for authorized Government business by authorized personnel.
     Unauthorized access or use of this computer system may subject violators to
     criminal, civil, and/or administrative action.
     All information on this computer system may be intercepted, recorded, read,
     copied...
     There is no right of privacy in this system.
Don’t worry about the scary message; this is all public data… well, until the funding stops. Take a look in the blast/db directory for the many pre-formatted databases NCBI provides, e.g. genomic and protein reference sequences, and patent nucleotide sequence databases from the USPTO and the EU/Japan patent agencies. Get yourself the latest BLAST+ from blast/executables/LATEST; I used ncbi-blast-2.2.25+-universal-macosx.tar.gz.
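If you would rather skip the interactive FTP client, the same files can be pulled non-interactively. The following is only a sketch: the `blast_url` and `fetch` helpers are names I made up for illustration, the server and paths are the ones mentioned above, and whatever currently sits in LATEST will almost certainly be a newer release than the one I used, so check the listing before downloading.

```shell
# Server name taken from the FTP session above.
NCBI_FTP="ftp://ftp.ncbi.nlm.nih.gov"

blast_url() {
    # Print the full URL for a path relative to the blast/ directory.
    printf '%s/blast/%s\n' "$NCBI_FTP" "$1"
}

fetch() {
    # -O keeps the remote file name; -C - resumes a partial download.
    curl -O -C - "$(blast_url "$1")"
}

# Show the URL that would be fetched for the tarball named in this post.
blast_url "executables/LATEST/ncbi-blast-2.2.25+-universal-macosx.tar.gz"

# Uncomment to actually download it (requires network access):
# fetch "executables/LATEST/ncbi-blast-2.2.25+-universal-macosx.tar.gz"
```

The pre-formatted databases under blast/db can be fetched the same way, one archive at a time.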
    mokas$ tar zxvpf ncbi-blast-2.2.25+-universal-macosx.tar.gz
    mokas$ PATH=/Users/mokas/Desktop/ncbi-blast-2.2.25+/bin
    mokas$ export PATH
    mokas$ echo $PATH
    /Users/mokas/Desktop/ncbi-blast-2.2.25+/bin
    mokas$ mkdir ./blast-2.2.25+/db
    mokas$ blastn -help
    USAGE
      blastn [-h] [-help] [-import_search_strategy filename] ...
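One caveat with the session above: assigning PATH outright replaces the existing search path, so everyday commands such as mkdir may stop resolving in that shell. A gentler variant, sketched here with a `BLAST_BIN` variable of my own naming, appends the BLAST+ bin directory instead and only does so once:

```shell
# Directory from the session above; override BLAST_BIN if yours differs.
BLAST_BIN="${BLAST_BIN:-/Users/mokas/Desktop/ncbi-blast-2.2.25+/bin}"

# Append to PATH only if the directory is not already listed.
case ":$PATH:" in
    *":$BLAST_BIN:"*) ;;                  # already present, nothing to do
    *) PATH="$PATH:$BLAST_BIN" ;;         # keep the existing entries
esac
export PATH

# Report where blastn resolves from, if it is installed at that location.
command -v blastn || echo "blastn not found on PATH yet"
```

Dropping these lines into a shell startup file such as ~/.bash_profile would make the change stick across sessions.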
Databases should be loaded directly into the db directory created above with the mkdir command. The last thing that needs to be done is to make a “.ncbirc” text file in the main directory containing the following:
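The file itself is short: a [BLAST] section with a BLASTDB entry pointing at the database directory. Here is a minimal sketch that writes one. The `BLAST_HOME` and `NCBIRC` variables are my own names, and the path is an assumption (it presumes you run this from the folder that holds the unpacked ncbi-blast-2.2.25+ directory), so adjust it to wherever your db directory actually lives.

```shell
# Assumed location of the unpacked BLAST+ folder; override as needed.
BLAST_HOME="${BLAST_HOME:-$PWD/ncbi-blast-2.2.25+}"
# Where to write the config file; defaults to the current directory.
NCBIRC="${NCBIRC:-./.ncbirc}"

# Make sure the database directory exists (-p also creates missing parents).
mkdir -p "$BLAST_HOME/db"

# Write the config: one section header and one key.
cat > "$NCBIRC" <<EOF
[BLAST]
BLASTDB=$BLAST_HOME/db
EOF

cat "$NCBIRC"
```

As far as I understand it, BLAST+ looks for .ncbirc in the current working directory and in your home directory, so copying the file to ~/.ncbirc should make it apply wherever you run the tools.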
This tells the program where the data is kept. At the end of the day we should get something like this:
    mokas$ blastn -query Homo_sapiens.NCBI36.apr.rna.fa -db refseq_rna
    BLASTN 2.2.25+
    ...
    Query= ENST00000361359 ncrna:Mt_rRNA chromosome:NCBI36:MT:650:1603:1 gene:ENSG00000198714
    Length=954
                                                                          Score     E
    Sequences producing significant alignments:                          (Bits)  Value

    ref|XR_109154.1| PREDICTED: Homo sapiens hypothetical LOC1005054...   464    5e-128

    >ref|XR_109154.1| PREDICTED: Homo sapiens hypothetical LOC100505479 (LOC100505479), partial miscRNA
    Length=266

     Score = 464 bits (251),  Expect = 5e-128
     Identities = 255/257 (99%), Gaps = 0/257 (0%)
     Strand=Plus/Minus

    Query  334  CACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACAT  393
                |||||||||||||||||||||||||||| |||||||| ||||||||||||||||||||||
    Sbjct  257  CACCTGAGTTGTAAAAAACTCCAGTTGATACAAAATAAACTACGAAAGTGGCTTTAACAT  198
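The pairwise report above is nice to read but awkward to script against. If you want something easier to filter, blastn can also write tab-separated output with `-outfmt 6`, where column 3 is percent identity and column 12 is the bit score. A rough sketch follows: the `top_hits` helper and the hits.tsv file name are mine, and the E-value cutoff is an arbitrary choice rather than a recommendation.

```shell
QUERY="Homo_sapiens.NCBI36.apr.rna.fa"

# Only run the search if blastn is installed and the query file is present.
if command -v blastn >/dev/null 2>&1 && [ -f "$QUERY" ]; then
    blastn -query "$QUERY" -db refseq_rna -outfmt 6 -evalue 1e-10 -out hits.tsv
fi

top_hits() {
    # $1 = tabular BLAST file, $2 = minimum percent identity.
    # Keep rows whose identity (column 3) meets the threshold,
    # then sort by bit score (column 12), highest first.
    awk -F '\t' -v min="$2" '$3 + 0 >= min' "$1" |
        sort -t "$(printf '\t')" -k12,12nr
}

# Usage: top_hits hits.tsv 99
```

Calling `top_hits hits.tsv 99` would then list only the near-identical matches, strongest first.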
Many thanks are in order to Dr. Tao Tao of NCBI and all the great folks who showed up, hung out and helped out; to Google for the food and drinks (no beer?!); and to everyone on the SHDH team who scrambled all this together, which I’m told is par for the course. Hopefully this will be a fun tool for folks not well acquainted with genomics/programming to sandbox and explore in. #funsaturday