Friday, September 23, 2011

Network Science

Network analysis is hip. Applications range over social networks, security, biology, and economics. At this point, you'll hardly be the first one to the party, but if you want to give network science a try, here's a random grab-bag of resources to get started.

Coordination of frontline defense mechanisms under severe oxidative stress, Kaur et al. 2010

Learning network science

Jon Kleinberg, a professor of computer science at Cornell University, co-wrote Networks, Crowds, and Markets: Reasoning About a Highly Connected World along with David Easley. He also wrote Algorithm Design, an undergraduate textbook.

A 2004 review paper by Barabasi and Oltvai Network biology: understanding the cell's functional organization. covers a broad range of applications of networks in modern biology. Barabasi is also author of Linked.

A Science special issue on networks, from July 2009, revisits the foundations of network analysis, and delves into applications to ecological interactions, counter-terrorism, and finance.

Video and slides are available for Drew Conway's presentation on social network analysis in R, which mostly focuses on software tools.

Tools for analyzing networks

Software tools for working with networks include the R packages graph, igraph, network. Also, the NetworkX library for Python looks quite powerful. Visualization tools tend to come and go, but some well-known tools are: Cytoscape, Gephi, and GraphViz.

More network stuff

Monday, September 19, 2011

Applying control theory to the cell

In a talk about research goals in the systems biology of microbes, Adam Arkin referenced the Internal Model Principle of control theory. Here are a couple definitions.

A regulator for which both internal stability and output regulation are structurally stable properties must utilize feedback of the regulated variable and incorporate in the feedback loop a suitably reduplicated model of the dynamic structure of the exogenous signals which the regulator is required to process.

Towards an Abstract Internal Model Principle Wonham, 1976

That's a mouthful. This one's a little less scary.

Internal Model Principle: control can be achieved only if the control system encapsulates, either implicitly or explicitly, some representation of the process to be controlled.

Lecture notes on Introduction to Robust Control by Ming T. Tham, 2002

Driving this thinking is the discovery that microbes show anticipatory behavior and the associations can be fairly readily entrained and lost in a few generations. Ilias Tagkopoulos and Saeed Tavazoie, in Predictive behavior within microbial genetic networks, demonstrated associative learning through rewiring gene regulatory networks. It turns out that when E. coli senses a shift to mammalian body temperature, it begins the transition to anaerobic metabolism, nicely anticipating the correlated structure of it's environment.

In another example, Amir Mitchell working at Weizmann, showed that yeast anticipates the stages of fermentation in Adaptive prediction of environmental changes by microorganism.

This raises some important questions. How is the internal model encoded within the cell? And how does the cell acquire, parameterize and adjust its internal model over evolutionary time scales? The answers will lead to a deeper understanding of living systems and might even feed new techniques and principles back to control theory.

An interesting challenge will be to experimentally read out the information embedded in the cell's control systems and then the informatics problem of how to represent and work with such things.

Understanding how this works is a prerequisite for re-engineering living systems, otherwise known as synthetic biology, championed by George Church and Drew Endy. This month, by the way, the journal Science has a special issue on synthetic biology.

I'm fascinated by the idea of applying engineering principles to biology - evolved systems, rather than engineered artifacts. Maybe that's because my spaghetti code looks a lot like the messy interconnectedness of biology. Creating software feels organic, rather than wholly predesigned. The engineering of complex software systems tends to be an adaptive evolutionary process. As messy as biology is, modularity naturally emerges. Maybe biology has something to teach us about organizing this chaos.

Thursday, August 25, 2011

String functions in R

Here's a quick cheat-sheet on string manipulation functions in R, mostly cribbed from Quick-R's list of String Functions with a few additional links.

  • substr(x, start=n1, stop=n2)
  • grep(pattern,x, value=FALSE, ignore.case=FALSE, fixed=FALSE)
  • gsub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
  • gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE)
  • strsplit(x, split)
  • paste(..., sep="", collapse=NULL)
  • sprintf(fmt, ...)
  • toupper/tolower(x)
  • nchar(x)

Also see Regular Expressions as used in R and R String processing.

Note: Just to be clear, R is far from an ideal platform for processing text. For anything where that's the major concern, you're better off going to Python or Ruby.

Monday, August 15, 2011

MySQL and R

Using MySQL with R is pretty easy, with RMySQL. Here are a few notes to keep me straight on a few things I always get snagged on.

Typically, most folks are going to want to analyze data that's already in a MySQL database. Being a little bass-ackwards, I often want to go the other way. One reason to do this is to do some analysis in R and make the results available dynamically in a web app, which necessitates writing data from R into a database. As of this writing, INSERT isn't even mentioned in the RMySQL docs, sadly for me, but it works just fine.

The docs are a bit clearer for RS-DBI, which is the standard R interface to relational databases and of which RMySQL is one implementation.

Opening and closing connections

The best way to close DB connections, like you would do in a finally clause in Java, is to use on.exit, like this:

con <- dbConnect(MySQL(),
         user="me", password="nuts2u",
         dbname="my_db", host="localhost")
on.exit(dbDisconnect(con))

Building queries

Using sprintf to build the queries feels a little primitive. As far as I can tell, there's no prepared statements in RMySQL. I don't suppose SQL-injection is a concern here, but prepared statements might be a little tidier, anyway.

Processing query results

You can process query results row by row, in blocks or all at once. The highly useful function dbGetQuery(con, sql) returns all query results as a data frame. With dbSendQuery, you can get all or partial results with fetch.

con <- dbConnect(MySQL(), user="network_portal", password="monkey2us", dbname=db.name, host="localhost")
rs <- dbSendQuery(con, "select name from genes limit 10;")
data <- fetch(rs, n=10)
huh <- dbHasCompleted(rs)
dbClearResult(rs)
dbDisconnect(con)

If there's no more results, fetch returns a data frame with 0 columns and 0 rows. dbHasCompleted is supposed to indicate whether there are more records to be fetched, but seems broken. The value of huh in the code above is false, which seems wrong to me.

Retrieving AUTO_INCREMENT IDs

A standard newbie question with MySQL is how to retrieve freshly generated primary keys from AUTO_INCREMENT fields. That's what MySQL's LAST_INSERT_ID() is for.

You can retrieve the most recent AUTO_INCREMENT value with the LAST_INSERT_ID() SQL function or the mysql_insert_id() C API function. These functions are connection-specific, so their return values are not affected by another connection which is also performing inserts.

The same works with RMySQL, but there are some traps to watch out for. Let's say you're inserting a row into a table of networks. Don't worry about the specifics. You want to insert related data in another table, so you need the ID of the newly inserted row.

create.network <- function(species.id, network.name, data.source, description) {
  
  con <- dbConnect(MySQL(),
           user="super_schmuck", password="nuts2u",
           dbname="my_db", host="localhost")
  on.exit(dbDisconnect(con))

  sql <- sprintf("insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, '%s', '%s', '%s', NOW());",
                 species.id, network.name, data.source, description)
  rs <- dbSendQuery(con, sql)
  dbClearResult(rs)

  id <- dbGetQuery(con, "select last_insert_id();")[1,1]

  return(id)
}

Don't forget to clear the result of the insert. If you do, you'll get 0 from the last_insert_id(). Also, using dbGetQuery for the insert produces an strange error when you go to call last_insert_id:

Error in mysqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not run statement: Commands out of sync; you can't run this command now)

Alternatively, you can also combine both SQL statements into one call to dbSendQuery, but, you have to remember to set a flag when you make the connection: client.flag=CLIENT_MULTI_STATEMENTS. Trying to use multiple queries seems not to work with dbGetQuery.

create.network <- function(species.id, network.name, data.source, description) {

  con <- dbConnect(MySQL(),
           user="super_schmuck", password="nuts2u",
           dbname="my_db", host="localhost",
           client.flag=CLIENT_MULTI_STATEMENTS)
  on.exit(dbDisconnect(con))

  sql <- sprintf("insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, '%s', '%s', '%s', NOW());
                  select last_insert_id();",
                 species.id, network.name, data.source, description)

  rs <- dbSendQuery(con, sql)

  if (dbMoreResults(con)) {
    rs <- dbNextResult(con)
    id <- fetch(rs)[1,1]
  } else {
    stop('Error getting last inserted id.')
  }

  dbClearResult(rs)

  return(id)
}

Any effort saved by combining the SQL queries is lost in the extra house-keeping so I prefer the first method.

In spite of these few quirks, RMySQL generally works fine and is pretty straightforward.

Wednesday, August 03, 2011

Microarrays

Microarrays are one of the workhorses of modern biology. Measuring transcript levels enables studies of differential expression - asking what the difference is, at the gene expression level, for example, between cancer tumor cells and normal cells.

Bruz Marzolf, who up 'til recently ran my local microarray facility, spoke recently, tracing the journey of microarrays through the full technology life-cycle, starting in 1995 with the publication of Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray in Science. Bruz put microarrays in the category of a utility technology, but not quite to the point of commoditization as there remain major differences between manufacturers.

  • Affymetrix, first to commercialize microarray technologies, is the 800 pound gorilla. Their photolithography process borrows from computer chip manufacturing and their standardized probe sets are well supported by tools such as Bioconductor. The technology is robust but producing the masks is quite expensive, thus custom arrays are not economical.
  • Agilent, which spun out of HP, uses ink-jet technology. Custom arrays can be designed using Agilent's eArray software. Agilent arrays come in a variety of resolutions including 8x60k, 1x244k and 1x1m with 60mer probes.
  • Illumina builds arrays out of beads coated in oligo probes. Beads are laid out randomly on the slides, necessitating a layout discovery step. These chips have extra redundancy to account for randomness in bead-probe count.
  • Nimblegen's maskless photolithography process is more flexible for custom arrays. Nimblegen provides arrays in 385K, 4x72K, and 12x135K resolutions using 60mer probes. They emphasize high array-to-array data reproducibility.

As an aside, our group uses custom spotted arrays and Agilent arrays. We tried Nimblegen and found that inter-array consistency was excellent, but inter-probe consistency was not. Below we see the ribosomal RNAs and adjacent genes with total RNA measured by a custom Agilent array (in blue) plotted next to a custom Nimblegen array (in green). To be fair there might be other explanations for what we saw, but it certainly looks like there is significant variability between probes that we would expect to have identical readings.

In RNA-Seq: a revolutionary tool for transcriptomics (Nature Reviews Genetics, 2009), Zhong Wang, Mark Gerstein & Michael Snyder show this comparison between microarrays and RNA-Seq.

While RNA-seq, no doubt, has a higher dynamic range, does it really have less noise? Some folks say so. With tens or hundreds of thousands of probes, fairly dense coverage of whole microbial genomes is possible. If you know what you're looking for, microarrays are still cheaper. Discovery oriented work is going increasingly toward sequencing.

Links

Friday, July 08, 2011

Notes on Engineering Data Analysis (with R and ggplot2)

Hadley Wickham gave a Google Tech Talk a couple weeks back titled Engineering Data Analysis (with R and ggplot2). These are my notes.

The data analysis cycle is to iteratively transform, visualize and model. Leading into the cycle is data access and the output of the process is knowledge, insight and understanding which can be communicated to others. Transforming the data is almost always necessary to bring data into a workable form. Visualization and modeling have something of a duality where visualization is good at revealing the unexpected but has problems scaling. Models scale better, but will only find expected relationships. A larger cycle comes about when answers to one question lead to more questions.

Hadley makes a case for data analysis in code, rather than GUIs and for R in particular. Working in a programming language gives you a means of:

  • reproducibility
  • automation
  • version control
  • communication

Advantages of R:

  • open source
  • runs anywhere
  • well established community
  • huge library of packages
  • connectivity to other languages

Downsides of R are it's learning curve, strangeness relative to other programming languages, lack of programming infrastructure and prickliness of the community. R scales well up to about a million observations. How to scale the interactive analysis cycle up to billions of observations is an open question. Programming infrastructure is an area where programmers can contribute.

DSLs help express and think clearly about common problems in data analysis. Hadley views his libraries as DSLs (domain specific languages) within R for the phases of the analysis cycle. For visualization, there's ggplot2. DSLs align nicely with ggplot's philosophy as a grammar of graphics. R's model formula is the DSL for modeling. Plyr is the DSL for data transformation.

The four key verbs of data transformation are:

  • subset
  • mutate
  • arrange
  • summarize

...plus...

  • group by
  • join

Data can be divided by subsetting or filtering; mutated, for example adding new columns to a table that are functions of other columns; rearranged or sorted; and summarized, condensing a data set down to a smaller number of values. These actions can be combined with a group by operator. Finally, data sets can be joined to other related data sets.

The second half of a talk is a case study, dissecting a set of cause-of-death statistics from the Mexican government. Finally, Hadley makes a familiar sounding point about the tension between making new things and making well-engineered user-friedly software that does old things.

Saturday, July 02, 2011

Running in Queen Anne

The crown of Queen Anne is a great place for running. It's mostly level along tree-lined streets and has great views of Elliot Bay and the Seattle skyline. The Queen Anne Boulevard route is 4.1 miles, according to Google. Jen and I often do a slightly shorter 3.7 mile run, by cutting short the north-west loop staying on 7th West past Coe Elementary School.

More Seattle running routes