Thursday, July 30, 2009

You can't control what you can't measure

In an opinion piece in IEEE Software, "Software Engineering: An Idea Whose Time Has Come and Gone?", Tom DeMarco puts his finger on something that has been bugging me for a long time about what passes for software engineering.

I still believe it makes excellent sense to engineer software. But that isn’t exactly what software engineering has come to mean. The term encompasses a specific set of disciplines including defined process, inspections and walkthroughs, requirements engineering, traceability matrices, metrics, precise quality control, rigorous planning and tracking, and coding and documentation standards. All these strive for consistency of practice and predictability.

I'll try to paraphrase his point. Although you can't control what you can't measure, control and predictability are important only in projects of marginal value. In high-value projects, rigid cost control becomes insignificant. Maybe that's why so much real innovation takes place in hackers' garages rather than on corporate campuses.

Monday, July 27, 2009

Getting sequence data out of NCBI Entrez

Thanks to a coworker, I finally found out how to get sequence data out of NCBI programmatically. The catch was that I wanted to get a chunk of sequence at a time, without needing to download the whole genome. Now, I can do that through NCBI's eutils. Yay! Here's a link to the key efetch help page.

First we can use the elink call to get a list of sequences (seems to return GI accession numbers) related to a genome project:
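As a sketch of what that elink call can look like, here's the URL built up in Python. The dbfrom value and the genome project ID (217) are assumptions for illustration, not values confirmed by this post; check the eutils docs for the databases that apply to your case.

```python
from urllib.parse import urlencode

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# elink finds records in one database linked to a record in another.
params = {
    'dbfrom': 'genomeprj',   # assumed source database for genome projects
    'db': 'nucleotide',      # link out to nucleotide sequences
    'id': '217',             # hypothetical genome project ID
}
url = base + 'elink.fcgi?' + urlencode(params)
print(url)
```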

I suppose you'll have to make a few more calls to figure out which sequence is which, but I happen to know the one I want is 15789340. So, getting a chunk of sequence is as simple as this:
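The call is an efetch URL with seq_start and seq_stop limiting the range, so only the chunk you want comes back. A minimal sketch, with made-up coordinates:

```python
from urllib.parse import urlencode

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# efetch with seq_start/seq_stop returns just a slice of the sequence,
# so there's no need to download the whole genome.
params = {
    'db': 'nucleotide',
    'id': '15789340',        # the GI accession mentioned above
    'rettype': 'fasta',
    'retmode': 'text',
    'seq_start': '528770',   # hypothetical start of the chunk
    'seq_stop': '528880',    # hypothetical end of the chunk
}
url = base + 'efetch.fcgi?' + urlencode(params)
print(url)
```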


You can also use refseq accession numbers instead of GIs:
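The id parameter takes a RefSeq accession just as happily as a GI. Here I'm assuming NC_002607 is the RefSeq accession corresponding to GI 15789340; the coordinates are again made up:

```python
from urllib.parse import urlencode

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Same efetch call, but keyed on a RefSeq accession instead of a GI.
params = {
    'db': 'nucleotide',
    'id': 'NC_002607',       # assumed RefSeq accession for GI 15789340
    'rettype': 'fasta',
    'retmode': 'text',
    'seq_start': '528770',   # hypothetical coordinates
    'seq_stop': '528880',
}
url = base + 'efetch.fcgi?' + urlencode(params)
print(url)
```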


You can even do tricky stuff like get a gene (in this case VNG1179C) in the reverse strand along with 70 nucleotides on either side.
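The trick is efetch's strand parameter: strand=2 asks for the reverse complement, and widening seq_start/seq_stop by 70 picks up the flanking nucleotides. The coordinates for VNG1179C below are hypothetical, for illustration only:

```python
from urllib.parse import urlencode

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

gene_start, gene_end = 883040, 884060    # hypothetical coordinates for VNG1179C
params = {
    'db': 'nucleotide',
    'id': '15789340',
    'rettype': 'fasta',
    'retmode': 'text',
    'strand': '2',                       # 1 = plus strand, 2 = minus strand
    'seq_start': str(gene_start - 70),   # 70 nt of upstream flank
    'seq_stop': str(gene_end + 70),      # 70 nt of downstream flank
}
url = base + 'efetch.fcgi?' + urlencode(params)
print(url)
```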


For more laughs, see my previous inept fumbling related to NCBI.

At least I'm creating jobs...

David Parnas, who runs an undergraduate program in software engineering at McMaster University, is quoted here in Jeff Atwood's Coding Horror, saying that bad programmers create jobs.

Q: What is the most often-overlooked risk in software engineering? A: Incompetent programmers. There are estimates that the number of programmers needed in the U.S. exceeds 200,000. This is entirely misleading. It is not a quantity problem; we have a quality problem. One bad programmer can easily create two new jobs a year. Hiring more bad programmers will just increase our perceived need for them. If we had more good programmers, and could easily identify them, we would need fewer, not more.

I think it's easier than most people think to create negative value. And, it's particularly easy in software. This relates to my crackpot theory that the reason the internet exists is to soak up the excess productivity of humanity. Still, we as a profession can't come anywhere near the negative value creation capability of the financial sector. Those guys have talent.

Programmer (in)competence

If you want to check whether you really know what you're doing, score yourself against the Programmer Competency Matrix. Or, do what I do and recognize your own failures in 5 Stages of Programmer Incompetence. We're all "gurus" on this one.

Sunday, July 26, 2009

Select operations on R data frames

The R Project

The R language is weird - particularly for those coming from a typical programmer's background, which likely includes OO languages in the curly-brace family and relational databases using SQL. A key data structure in R, the data.frame, is used something like a table in a relational database. In terms of R's somewhat byzantine type system (which is explained nicely here), a data.frame is a list of vectors of varying types. Each vector is a column in the data.frame making this a column-oriented data structure as opposed to the row-oriented nature of relational databases.

In spite of this difference, we often want to do the same sorts of things to an R data.frame that we would to a SQL table. The R docs confuse the SQL-savvy by using different terminology, so here is a quick crib-sheet for applying SQL concepts to data.frames.

We're going to use a sample data.frame with the following configuration of columns, or schema, if you prefer: (sequence:factor, strand:factor, start:integer, end:integer, common_name:character, value:double) where the type character is a string and a factor is something like an enum. Well, more accurately, value is a vector of type double and so forth. Anyway, our example is motivated by annotation of genome sequences, but the techniques aren't particular to any type of data.

> head(df)
    sequence strand start   end common_name      value
1 chromosome      +  1450  2112        yvrO  0.9542516
2 chromosome      + 41063 41716       graD6  0.2374012
3 chromosome      + 62927 63640       graD3  1.0454790
4 chromosome      + 63881 64807         gmd  1.4383845
5 chromosome      + 71811 72701        moaE -1.8739953
6 chromosome      + 73639 74739        moaA  1.2711058

So, given a data.frame of that schema, how do we do some simple select operations?

Selecting columns by name is easy:

> df[,c('sequence','start','end')]
       sequence   start     end
1    chromosome    1450    2112
2    chromosome   41063   41716
3    chromosome   62927   63640
4    chromosome   63881   64807
5    chromosome   71811   72701

As is selecting rows by number, or both rows and columns:

> df[566:570,c('sequence','start','end')]
      sequence  start    end
566 chromosome 480999 479860
567 chromosome 481397 480999
568 chromosome 503053 501275
569 chromosome 506476 505712
570 chromosome 515461 514277

Selecting rows that meet certain criteria is a lot like a SQL where clause:

> df[df$value>3.0,]
      sequence strand   start     end common_name    value
199 chromosome      +  907743  909506        hutU 3.158821
321 chromosome      + 1391811 1393337        nadB 3.092771
556 chromosome      -  431600  431037         apt 3.043373
572 chromosome      -  519043  518186        hbd1 3.077040

For extra bonus points, let's find tRNAs.

> df[grep("tRNA", df$common_name),]
      sequence strand   start     end common_name        value
18  chromosome      +  115152  115224    Asn tRNA -0.461038128
19  chromosome      +  115314  115422    Ile tRNA -0.925268307
31  chromosome      +  167315  167388    Tyr tRNA  0.112527023
32  chromosome      +  191112  191196    Ser tRNA  0.986357577

Duplicate row names

Row names are not necessarily unique in R, which breaks the method shown above for selecting by row name. Take matrix a:

> a = matrix(1:18, nrow=6, ncol=3)
> rownames(a) <- c('a', 'a', 'a', 'b', 'b', 'b')
> colnames(a) <- c('foo', 'bar', 'bat')
> a
  foo bar bat
a   1   7  13
a   2   8  14
a   3   9  15
b   4  10  16
b   5  11  17
b   6  12  18

It looks to me like trying to index by the row names just returns the first row of a given name:

> a['a',]
foo bar bat 
  1   7  13
> a['b',]
foo bar bat 
  4  10  16 

But this works:

> a[rownames(a)=='a',]
  foo bar bat
a   1   7  13
a   2   8  14
a   3   9  15

More Resources:

Help for R, the R language, or the R project is notoriously hard to search for, so I like to stick in a few extra keywords, like R, data frames, data.frames, select subset, subsetting, selecting rows from a data.frame that meet certain criteria, and find.

...more on R.

Friday, July 17, 2009

Parsing GEO SOFT files with Python and Sqlite

NCBI's GEO database of gene expression data is a great resource, but its records are very open ended. That lack of rigidity was perhaps necessary to accommodate the variety of measurement technologies, but all that flexibility is a curse from the point of view of extracting data. The scripts I end up with are not general parsers for GEO data and will need to be adapted to the specifics of other datasets.

Note: It could be that I'm doing things the hard way. Maybe there's an easier way.

A GEO record consists of a platform, which describes (for example) a microarray and its probes, and a series of samples. In this example, we need to do a join between the platform and the sample records to end up with a matrix of the form (seq, strand, start, end, value1, value2, ..., valueN), where the value1 column holds measurements from the first sample and so on. That gives us coordinates on the genome and a value for each measurement. My goal is to feed data into a genome browser known as HeebieGB, with a stop-over in R along the way.

Merging on a common key is only slightly complicated, but tiling arrays are big (~244,000 probes in this case). I hesitate to try merging several 244K row tables in memory. Database engines are made for this sort of thing, so I'll use SQLite to get this done and Python to script the whole process.

I like to start python scripts with a template similar to Guido's main function, except that I prefer optparse to getopt. An --overwrite option will force the user to be conscious of overwriting files.

import sys
from optparse import OptionParser

def main():
 usage = "%prog [options] input_file sqlite_db_file"
 parser = OptionParser(usage=usage)
 parser.add_option("-o", "--overwrite", dest="overwrite", default=False, action="store_true", 
  help="if output db file exists, overwrite it")
 (options, args) = parser.parse_args()

 if len(args) < 2:
  parser.error("missing required arguments.")

 input_filename = args[0]
 db_filename = args[1]

if __name__ == "__main__":
 sys.exit(main())

GEO records a bunch of descriptive data about each sample, some of which we want. I've read that storing arbitrary key-value pairs in a relational DB is considered bad by some. But, I'm going to do it anyway. The entity attributes will go in a table called attributes whose schema is (entity_id, key, value).

The function parse_platform_table pulls the platform data from a tab-separated section in the SOFT file into a table with a schema something like this: (id, sequence, strand, start, end). There's also a tab-separated section for each of the samples that refers back to its platform, so I extract that in a similar manner in parse_sample_table. It's easiest to start out with each sample in its own table, even though that's not really what we want.

The complete script -also available from SVN here- ends up like this:

import sys
from optparse import OptionParser
import re
import os
import os.path
import sqlite3

# GEO SOFT format is documented here:

# ID field in platform joins with ID_REF field in samples

entity       = re.compile(r'\^(\S+) = (.+)')
kvp          = re.compile(r'!(\S+) = (.+)')

# states for the line-by-line parser
STATE_START       = 0
STATE_IN_SERIES   = 1
STATE_IN_PLATFORM = 2
STATE_IN_SAMPLE   = 3


def overwrite(name):
 """Delete the named file, if it exists, so it can be recreated."""
 if os.path.exists(name):
  os.remove(name)

def parse_series_file(file, conn):
 entity_id = None
 state = STATE_START

 # create an attributes table
 cursor = conn.cursor()
 try:
  cursor.execute('create table attributes (entity_id, key, value);')
 finally:
  cursor.close()

 for line in file:
  line = line.strip()

  # read entity tags
  if line.startswith('^'):
   m = entity.match(line)
   if m:
    entity_type = m.group(1)
    entity_id = m.group(2)
    if entity_type == 'SERIES':
     state = STATE_IN_SERIES
    elif entity_type == 'PLATFORM':
     state = STATE_IN_PLATFORM
    elif entity_type == 'SAMPLE':
     state = STATE_IN_SAMPLE

  # read attribute key-value pairs and tab-separated tables
  elif line.startswith('!'):
   m = kvp.match(line)
   if m:
    key = m.group(1)
    value = m.group(2)
    handle_attribute(conn, entity_id, key, value)
   elif state==STATE_IN_PLATFORM and line=='!platform_table_begin':
    parse_platform_table(file, conn, entity_id)
   elif state==STATE_IN_SAMPLE and line=='!sample_table_begin':
    parse_sample_table(file, conn, entity_id)

def parse_platform_table(file, conn, platform_id):
 """
 Read the tab-separated platform section of a SOFT file and store the ID,
 sequence, strand, start, and end columns in a SQLite database.

 file: a file object open for reading
 conn: a SQLite database connection
 platform_id: a string identifying a GEO platform
 """
 cursor = conn.cursor()
 try:
  # throw away line containing column headers
  file.next()
  # create platform table
  cursor.execute('create table %s (id integer primary key not null, sequence text not null, strand not null, start integer not null, end integer not null, control_type integer);' % (platform_id))
  sql = 'insert into %s values(?,?,?,?,?,?)' % (platform_id)
  for line in file:
   line = line.strip('\n')
   if (line.strip() == '!platform_table_end'):
    break
   fields = line.split("\t")
   cursor.execute(sql, (int(fields[0]), fields[6], fields[10], fields[7], fields[8], fields[4]))
 finally:
  cursor.close()

def parse_sample_table(file, conn, sample_id):
 """
 Read a tab-separated sample section from a SOFT file and store ID_REF and
 VALUE in a SQLite DB.

 file: a file object open for reading
 conn: a SQLite database connection
 sample_id: a string identifying a GEO sample
 """
 cursor = conn.cursor()
 try:
  # throw away line containing column headers
  file.next()
  # create sample table
  cursor.execute('create table %s (id_ref integer not null, value numeric not null);' % (sample_id))
  sql = 'insert into %s values(?,?)' % (sample_id)
  for line in file:
   line = line.strip('\n')
   if (line.strip() == '!sample_table_end'):
    break
   fields = line.split("\t")
   cursor.execute(sql, (int(fields[0]), float(fields[1])))
 finally:
  cursor.close()

def handle_attribute(conn, entity_id, key, value):
 """Store an entity attribute in the attributes table."""
 cursor = None
 try:
  cursor = conn.cursor()
  cursor.execute("insert into attributes values(?,?,?);", (entity_id, key, value))
 finally:
  if cursor:
   cursor.close()

def main():
 usage = "%prog [options] input_file sqlite_db_file"
 parser = OptionParser(usage=usage)
 parser.add_option("-o", "--overwrite", dest="overwrite", default=False, action="store_true", 
  help="if output db file exists, overwrite it")
 (options, args) = parser.parse_args()

 if len(args) < 2:
  parser.error("missing required arguments.")

 input_filename = args[0]
 db_filename = args[1]

 if options.overwrite:
  overwrite(db_filename)

 input_file = None
 conn = None
 try:
  conn = sqlite3.connect(db_filename)
  input_file = open(input_filename, 'r')
  parse_series_file(input_file, conn)
  # commit, or the inserted rows evaporate when the connection closes
  conn.commit()
 finally:
  if input_file:
   input_file.close()
  if conn:
   conn.close()

if __name__ == "__main__":
 sys.exit(main())

The specific series I'm interested in (GSE12923) has 53 samples. The platform (GPL7255) is a custom design on Agilent's 244k-feature microarrays, so across 53 samples that's just short of 13 million individual features. The SOFT file is 708 MB and the script takes a good 5 or 6 minutes to ingest all that data. The next step is merging all the data into a single matrix.

This turned out to be harder than I thought. At first, I naively tried to do a big 54-way join between the platform table and all the sample tables, with an order-by to sort by chromosomal location. I let this run for a couple of hours, then gave up. Sure, a big join on unindexed tables was bound to be ugly, but it only had to run once. I'm still surprised that this choked; after all, it's not that much data.

There are two ways around it. One is to index the sample tables on ID_REF and the platform table on (sequence, strand, start, end). The other is to do the big join first and then sort the result into a second table. Either takes several minutes, but it's just a one-off, so that's OK.
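Here's a minimal sketch of the indexing route, using Python's sqlite3 on a couple of toy tables. The table and column names mimic the real ones; the rows are fabricated for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# A toy platform table and one toy sample table, shaped like the real ones.
cur.execute('create table GPL7255 (id integer, sequence text, strand text, '
            'start integer, end integer, control_type integer)')
cur.execute('create table GSM320660 (id_ref integer, value numeric)')
cur.executemany('insert into GPL7255 values (?,?,?,?,?,?)',
                [(1, 'chromosome', '+', 100, 160, 0),
                 (2, 'chromosome', '-', 300, 360, 0)])
cur.executemany('insert into GSM320660 values (?,?)', [(1, 0.5), (2, -1.2)])

# Index the sample table on ID_REF and the platform on genome coordinates,
# so the join and the order-by don't have to scan and sort everything.
cur.execute('create index idx_gsm320660 on GSM320660 (id_ref)')
cur.execute('create index idx_gpl7255 on GPL7255 (sequence, strand, start, end)')

rows = cur.execute('''
    select GPL7255.sequence, GPL7255.strand, GPL7255.start, GPL7255.end,
           GSM320660.value
    from GPL7255
    join GSM320660 on GPL7255.id = GSM320660.id_ref
    where GPL7255.control_type = 0
    order by sequence, strand, start, end
''').fetchall()
print(rows)
```

The real version does the same thing 53 joins wide, one per sample table.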

insert into matrix
select GPL7255.sequence, GPL7255.strand, GPL7255.start, GPL7255.end,
GSM320660.VALUE as GSM320660,
GSM320661.VALUE as GSM320661,
...GSM320712.VALUE as GSM320712
from GPL7255
join GSM320660 on GPL7255.ID = GSM320660.ID_REF
join GSM320661 on GPL7255.ID = GSM320661.ID_REF
...join GSM320712 on GPL7255.ID = GSM320712.ID_REF
where GPL7255.control_type==0 and sequence!='NA'
order by sequence, strand, start, end;

Now that we've done that, there's cleanup. Do you ever find data that doesn't need to be cleaned up a little bit?

-- swap mislabeled + and - strands (how embarrassing!)
update matrix set strand='Z' where strand='-';
update matrix set strand='-' where strand='+';
update matrix set strand='+' where strand='Z';

-- fix up sequence names
update matrix set sequence='chromosome' where sequence='chr1';
update matrix set sequence='pNRC200' where sequence='chr2';
update matrix set sequence='pNRC100' where sequence='chr3';

-- fix probes crossing the "zero" point
update matrix set start=end, end=start where end-start > 60;
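That last statement relies on SQLite evaluating the right-hand sides of an update against the row's old values, so start=end, end=start really does swap the two columns rather than clobbering one. A quick sanity check with Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table matrix (start integer, end integer)')
cur.execute('insert into matrix values (100, 900)')

# Both right-hand sides see the pre-update values, so this swaps the columns.
cur.execute('update matrix set start=end, end=start where end-start > 60')
print(cur.execute('select start, end from matrix').fetchone())  # → (900, 100)
```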

That's about all the data munging I can stand for now. The rest, I'll leave for Part 2.

Thursday, July 09, 2009

Visualizing networks

Network visualization is one of the central tools of modern biology. I like this style of zooming used in a graphic on human disease from Redefining Disease, Genes and All in the NY times. (Based on a paper called The Human Disease Network from the Barabasi lab).

Richard Schneider's lab wrote a review of software tools for network visualization.


Cytoscape Web is slick. Christian Lopes and Max Franz have managed to reimplement a good part of Cytoscape (a desktop Java/Swing program) using Flash, Flex and the Flare visualization framework from the Berkeley Visualization Lab.

The March 2010 Nature Methods has a nice supplement on visualizing biological data.

Thursday, July 02, 2009

R String processing

Note: Nowadays, stringr's str_match solves this problem nicely. Another option is gsubfn's very R-ish strapply.

Here's a little vignette of data munging using the regular expression facilities of R (aka the R-project for statistical computing). Let's say I have a vector of strings that looks like this:

> coords
[1] "chromosome+:157470-158370" "chromosome+:158370-158450" "chromosome+:158450-158510"
[4] "chromosome+:158510-159330" "chromosome-:157460-158560" "chromosome-:158560-158920"

What I'd like to do is parse these out into a data.frame with a column for each of sequence, strand, start, end. A regex that would do that kind of thing looks like this: (.*)([+-]):(\d+)-(\d+). R does regular expressions, but it's missing a few pieces. For example, in python you might say:

import re

coords = """
chromosome+:157470-158370
chromosome+:158370-158450
chromosome+:158450-158510
chromosome+:158510-159330
chromosome-:157460-158560
chromosome-:158560-158920
"""

regex = re.compile("(.*)([+-]):(\\d+)-(\\d+)")

for line in coords.split("\n"):
 line = line.strip()
 if (len(line)==0): continue
 m = regex.match(line)
 if (m):
  seq = m.group(1)
  strand = m.group(2)
  start = int(m.group(3))
  end = int(m.group(4))
  print "%s\t%s\t%d\t%d" % (seq, strand, start, end)

As far as I've found, R doesn't seem to have an equivalent of Python's regex.match, which is a shame. The gsub function supports capturing groups in regular expressions, but isn't very flexible about what you do with them. One way to solve this problem is to use gsub to pull out each individual column. Not efficient, but it works:

> coords.df = data.frame(
 seq=gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\1", coords, perl=T),
 strand=gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\2", coords, perl=T),
 start=as.integer(gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\3", coords, perl=T)),
 end=as.integer(gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\4", coords, perl=T)))
> coords.df
         seq strand  start    end
1 chromosome      + 157470 158370
2 chromosome      + 158370 158450
3 chromosome      + 158450 158510
4 chromosome      + 158510 159330
5 chromosome      - 157460 158560
6 chromosome      - 158560 158920

It seems strange that R doesn't have a more direct way of accomplishing this. I'm not an R expert, so maybe it's there and I'm missing it. I guess it's not called the R project for string processing, but still... By the way, if you're ever tempted to name a project with a single letter, consider the poor schmucks trying to google for help.