Wednesday, February 03, 2010

Upgrade MacBook Pro Memory?

I was thinking of upgrading my 2007 MacBook Pro with more RAM. It came with 2GB, and the specs say it can take up to 3GB, although some online sources say they can successfully install 4GB. Apparently, these older MacBooks map "system functions", I guess meaning IO mapping and ROM into the region between 3GB and 4GB.

... at least 3 GB of RAM should be fully accessible, while when 4 GB of RAM installed, ~700 MB of of the RAM is overlapping critical system functions, making it non-addressable by the system.

OK, so no 4GB for me, but what about replacing one of the 1GB sticks with a 2GB stick for a total of 3GB? It turns out that if I did that, I'd take a small performance hit.

All Intel Core Macs support dual channel memory access if matching modules are installed. The customary estimate is that this gives a 6% - 8% real world performance benefit. The modules do not have to be the same brand. That means it is quite possible but not 100% guaranteed, that adding a 3rd party SODIMM to an Apple supplied SODIMM of the same size will make a matched pair.

Verdict: don't bother...

Thursday, January 28, 2010

Analytics in Seattle

Seattle hosts a lot of activity in analytics (aka data analysis, visualization, and business intelligence).

Robert Gentleman, one of the principles of the R project, was for a long time associated with Fred Hutchinson Cancer Research Institute, just across the lake from where I sit. He got hired in September of 2009 by Genentech (now a subsidiary of Roche), and was also recently named to the board of REvolution computing. REvolution's product development office is here in seattle.

R is an open source implementation of the S language, which originated at Bell Labs. S-plus, a commercial offspring of S, was introduced in 1988 by UW professor R. Douglas Martin's company Statistical Sciences, Inc. That was bought in 1993 by MathSoft, maker of MathCAD, then sold off again in 2001 as Insightful corporation. MathSoft continued to struggle and was itself acquired in 2006 by PTC Corporation. TIBCO bought Insightful in 2008. TIBCO, as of 2007, also owns SpotFire.

Tableau, maker of analytics and visualization software that competes with SpotFire, is next to Google in Fremont. Probably doesn't hurt that UW has a strong statistics, biostats, and computer science departments.

Just last year, Microsoft nabbed former South Lake Union residents Rosetta biosoftware for integration into the spooky sounding Microsoft Amalga Unified Intelligence System.

Vaguely related

Monday, January 18, 2010

Split with escaped delimiters

I have a lot of little scripts that read tab-delimited text files, 'cause that's the way things get done in the creaky world of bioinformatics. Due to the limitations of flat files, you often see an attributes field, which holds key-value pairs in some more-or-less parsable way, typically something like this:

key1=value1;key2=value2;

Parsing this with regular expressions is pretty easy. Take this fragment of Ruby:

 attributes = {}
File.foreach('attributes.txt') do |line|
line.chomp!()
line.scan(/([^;=]+)=([^;]+)/) do |key, value|
attributes[key] = value
end
end

attributes.each_pair do |key, value|
puts("#{key}=#{value}")
end

Because I'm a knucklehead, I decided to make it hard. What if we wanted to allow escaped delimeter characters? RegexGuru tells you that Split() is Not Always The Best Way to Split a String. This will work.

 def unescape(string)
string.gsub("\\\\","\\").gsub("\\=","=").gsub("\\;",";")
end

attributes = {}
File.foreach('attributes.txt') do |line|
line.chomp!()
line.scan(/((?:\\\\|\\=|\\;|[^;=])+)=((?:\\\\|\\;|[^;])+)/) do |key, value|
attributes[unescape(key)] = unescape(value)
end
end

attributes.each_pair do |key, value|
puts("#{key}=#{value}")
end

Take home message: This is way more trouble than it's worth. Also, don't forget that split works differently with regards to empty fields from one language to another.

Friday, January 15, 2010

State of data integration for biology

The field of biology has a proliferation of data repositories and software tools. Probably, it's inevitable. Lab techniques, computational methods, and algorithms advance rapidly. Technology trends come and go. And funding for sustained maintenance is hard to come by. Biological data comes in a diversity in types and varying quality driving the need to coordinate several lines of evidence.

Data integration has been a persistent problem. Lincoln Stein's Creating a bioinformatics nation and Integrating biological databases give the background. A pair of reviews update this topic.

State of the nation in data integration for bioinformatics

In State of the nation in data integration for bioinformatics, Carole Goble and Robert Stevens give an update (to 2007) with a semantic web spin. The complexity of biological data arises from:

  • diversity of data types
  • varying quality
  • describing a sample and its originating context
  • ridiculous situation with identifiers: Every conceivable pathology of identifiers is standard practice in biology - multiple names for the same thing, the same name for different things. Equality is often hazy and context dependent.

To address this complication, various strategies that have been tried, which they categorize as:

  • Service oriented architectures: CORBA, WS, ReST
  • Link integration
  • Data warehousing
  • View integration (distributed queries)
  • Model-driven service oriented architecture (caBIG, Gaggle?)
  • Integration applications (not clear to me what these are)
  • Workflows (Taverna)
  • Mashups
  • Semantic Web (smashups):
    • RDF
    • Ontologies
    • Linked data sets
    • Semantic mapping

Messaging probably belongs on this list. They call Gaggle a Model-driven service oriented architecture, which I'm not sure is the right classification. Message passing fits Gaggle better. Messaging tends to get overlooked, in spite of lots of work on message-oriented middleware and enterprise application integration in the business domain. (See Gregor Hohpe.) Messaging has evolved from a basis on the remote procedure call to a more document-centric view, but it seems that shift is not always recognized.

They use the term touch-points to mean common keys on which to join data.

They integrate on a range of common or corresponding touch-points: data values, names, identities, schema properties, ontology terms, keywords, loci, spatial-temporal points, etc. These touch-points are the means by which integration is possible, and a great deal of a bioinformatician’s work is the mapping of one touch-point to another.

They mention BioMoby, a tool for composing web-services into workflows based on describing services and their input and output data types with controlled vocabularies. The founders of BioMoby seem to have drifted steadily towards the semantic web, but at one time they said this:

[...] interoperability in the domain of bioinformatics is, unexpectedly, largely a syntactic rather than a semantic problem. That is to say, interoperability between bioinformatics Web Services can be largely achieved simply by specifying the data structures being passed between the services (syntax) even without rich specification of what those data structures mean (semantics).

Microformats get a brief mention as a means to enrich web content with semantics - stuctured representation, links to supporting evidence or related data, provenance, etc.

Goble and Stevens also cite the late great SIMILE Piggy Bank, a firefox extension allowing users to add their own semantic markup to web pages - a key source of inspiration for Firegoose. Another interesting reference is Alon Halevy's Principles of dataspace systems (2006), an effort to think about (just-in-time) data integration at web scale. (See Dynamic Fusion of Web Data for more.)

State of the cyberinfrastructure

Lincoln Stein's Towards a cyberinfrastructure for the biological sciences (2008) assesses the state of computing in biology with the aim of “the ability to create predictive, quantitative models of complex biological processes”. He defines cyberinfrastructure as consisting of data sources, computing resources, communication (not just networks, but syntactic and semantic connectivity as well), and human infrastructure (skills and cultural aspects).

The current biology cyberinfrastructure has a strong data infrastructure, a weak to non‑existent computational grid, patchy syntactic and semantic connectivity, and a strengthening human infrastructure.

He seems to conclude that the semantic web is promising, but not quite there yet. Almost echoing the quote above from the BioMoby paper, Stein says:

[...] in the short term, [...] we can gain many of the benefits of more sophisticated systems by leveraging the human capacity to make sense of noisy and contradictory information. Semantic integration will occur in the traditional way: many human eyes interpreting what they read and many human hands organizing and reorganizing the information.

Both of these papers suggest that the semantic web may finally solve these persistent data integration problems, but this cynical reader was left with a picture of a string of technologies that promised the same and never lived up to the hype. Semantic markup is a significant effort, whose costs fall on the data provider but whose benefits go mainly to others. Maybe tools will come along that reduce that effort, but I can't imagine there will ever be a day when quick-and-dirty looses its appeal. This is research, after all. In an exploratory analysis, a scientist is likely to try several false starts before finding a fruitful path. If integration at the (quicker and dirtier) syntactic level is even a little easier, it seems like a better choice for this purpose. The opposite is probably true for final published results.

In both articles, mashups and just-in-time integration show signs of getting some respect, which I like. Same for ideas related to linked data.

The really important goal

There is probably a small set of principles of data organization within a given domain that would impose little or no burden on data providers. And if we all accepted those priciples up front, any two arbitrary peices of data from different sources could be joined together either automatically or with reasonably little effort, and without any need for further central coordination.

Mo' links

Saturday, January 09, 2010

Pivot tables in R

The R ProjectA common data-munging operation is to compute cross tabulations of measurements by categories. SQL Server and Excel have a nice feature called pivot tables for this purpose. Here we'll figure out how to do pivot operations in R.

Let's imagine an experiment where we're measuring the gene activity of an organism under different conditions -- exposure to different nutrients and toxins. Our conditions are silly: copper, beer, pizza, and cheetos. First we make a list of genes. Then expand.grid generates all combinations of genes and conditions. Finally, we tack on a column of randomly generated measurements.

> genes = paste('MMP', sprintf("%04d",1:10), sep="")
> data = expand.grid(gene=genes, condition=c('copper', 'cheetos', 'beer', 'pizza'))
> data$value = rnorm(40)
> data
gene condition value
1 MMP0001 copper 0.90412805
2 MMP0002 copper 0.92664376
3 MMP0003 copper 0.27772147
4 MMP0004 copper 0.08958930
5 MMP0005 copper -0.20132304
6 MMP0006 copper 0.34524729
7 MMP0007 copper -0.33910206
8 MMP0008 copper 1.21006486
9 MMP0009 copper 0.78008022
10 MMP0010 copper 1.05364315
11 MMP0001 cheetos -2.31796229
12 MMP0002 cheetos 0.76706591
13 MMP0003 cheetos -2.93692935
14 MMP0004 cheetos 0.25452306
15 MMP0005 cheetos 0.24168329
16 MMP0006 cheetos 0.28739734
17 MMP0007 cheetos 0.69233543
18 MMP0008 cheetos 0.48865250
19 MMP0009 cheetos -0.11129319
20 MMP0010 cheetos 0.53322842
21 MMP0001 beer -0.74965948
22 MMP0002 beer 0.27105205
23 MMP0003 beer -0.99261363
24 MMP0004 beer 0.65143639
25 MMP0005 beer -0.35589696
26 MMP0006 beer 1.40147484
27 MMP0007 beer 0.37492710
28 MMP0008 beer 0.64453865
29 MMP0009 beer 0.35925345
30 MMP0010 beer 0.96394785
31 MMP0001 pizza -1.91818504
32 MMP0002 pizza 0.31690523
33 MMP0003 pizza -1.20566043
34 MMP0004 pizza -1.91750166
35 MMP0005 pizza 1.98010023
36 MMP0006 pizza 0.90468249
37 MMP0007 pizza 0.04284970
38 MMP0008 pizza -0.08141461
39 MMP0009 pizza -0.72471771
40 MMP0010 pizza -0.01085060

We want to pivot the conditions into columns so that we end up with one column for each condition and one row for each gene. The easy way is to use the reshape package by Hadley Wickham, which is made for restructuring data and does this job nicely. If you don't already have it, you'll have to run install.packages, then load the library.

> install.packages('reshape')
> library(reshape)

Using cast to move conditions into columns is a snap.

> cast(data, gene ~ condition)
gene copper cheetos beer pizza
1 MMP0001 0.9041281 -2.3179623 -0.7496595 -1.91818504
2 MMP0002 0.9266438 0.7670659 0.2710521 0.31690523
3 MMP0003 0.2777215 -2.9369294 -0.9926136 -1.20566043
4 MMP0004 0.0895893 0.2545231 0.6514364 -1.91750166
5 MMP0005 -0.2013230 0.2416833 -0.3558970 1.98010023
6 MMP0006 0.3452473 0.2873973 1.4014748 0.90468249
7 MMP0007 -0.3391021 0.6923354 0.3749271 0.04284970
8 MMP0008 1.2100649 0.4886525 0.6445386 -0.08141461
9 MMP0009 0.7800802 -0.1112932 0.3592535 -0.72471771
10 MMP0010 1.0536432 0.5332284 0.9639479 -0.01085060

Done!

That was too easy

Just as an exercise, what would we have to do without reshape? And, just to keep ourselves honest, let's make sure we can deal with missing data (as reshape can). Make some data go missing:

> data.incomplete <- data[data$value > -1.0,]
> dim(data.incomplete)
[1] 35 3

Now, split the data frame up by condition. This produces a list where each element is a data frame containing a subset of the data for each condition. Notice that the cheetos data frame has values for 8 of the 10 genes.

> data.by.condition <- split(data.incomplete, data.incomplete$condition)
> typeof(data.by.condition)
[1] "list"
> names(data.by.condition)
[1] "copper" "cheetos" "beer" "pizza"
> data.by.condition$cheetos
gene condition value
12 MMP0002 cheetos 0.7670659
14 MMP0004 cheetos 0.2545231
15 MMP0005 cheetos 0.2416833
16 MMP0006 cheetos 0.2873973
17 MMP0007 cheetos 0.6923354
18 MMP0008 cheetos 0.4886525
19 MMP0009 cheetos -0.1112932
20 MMP0010 cheetos 0.5332284

We're going to recombine the data into a data frame with one row for each gene, so let's get that started:

> result = data.frame(gene=genes)

Now comes some executable line noise. We're going to loop through the list and add a column to the result data frame during each iteration of the loop. We pull the column out of the data frame in the list, but we have to make sure the column has an element for each gene. Merging with the all parameter set is like an outer join. We get a row for each gene, inserting NA's where there data is missing.

> for (i in seq(along=data.by.condition)) { result[[names(data.by.condition)[i]]] <- merge(data.by.condition[[i]], genes, by.x='gene', by.y=1, all=T)$value }

> result
gene copper cheetos beer pizza
1 MMP0001 0.9041281 NA -0.7496595 NA
2 MMP0002 0.9266438 0.7670659 0.2710521 0.31690523
3 MMP0003 0.2777215 NA -0.9926136 NA
4 MMP0004 0.0895893 0.2545231 0.6514364 NA
5 MMP0005 -0.2013230 0.2416833 -0.3558970 1.98010023
6 MMP0006 0.3452473 0.2873973 1.4014748 0.90468249
7 MMP0007 -0.3391021 0.6923354 0.3749271 0.04284970
8 MMP0008 1.2100649 0.4886525 0.6445386 -0.08141461
9 MMP0009 0.7800802 -0.1112932 0.3592535 -0.72471771
10 MMP0010 1.0536432 0.5332284 0.9639479 -0.01085060

Extra finesse points if you can figure out how to do that last step with Reduce instead of a loop.

Sunday, December 27, 2009

SQL group by in R

The R ProjectThe R statistical computing environment is awesome, but weird. How to do database operations in R is a common source of questions. The other day I was looking for an equivalent to SQL group by for R data frames. You need this to compute summary statistics on subsets of a data set divided up by some categorical variable. It turns out there are several ways to get the same effect, some more limited than others.

The best answer seems to be plyr. It automates the The split-apply-combine strategy for data analysis you'd use otherwise. The ddply splits a data frame into subset data frames, performs some function on the subsets, and returns the results as a recombined data frame. Here's a A Fast Intro to Plyr for R, Block-processing a data frame with plyr and Split, apply, and combine in R using PLYR.

Some commands that get you part of the way there are: split, by, tapply (nicely explained here), aggregate. The R wiki has an entry on Performing calculations within sub-sets of a data-frame that uses the reshape library. You could always use sqldf or RSQLite. Several options are discussed here. You can cobble up a fully general process using split, some form of sapply, and unsplit. But, that's what plyr does automatically.

Side notes: While fooling around with this, I noticed that, for some crazy reason, split.data.frame splits matrices into nice subunits, but split has the ugly side-effect of reducing matrices to vectors. Also, Google has a style guide for R.

More R mini-tutorials:

Sunday, December 20, 2009

Layered architecture for a Java application

This is a template for an application architecture that came out of my genome browser project. This Java/Swing desktop application aspires to be very flexible - implementing new visualizations and accommodating task-specific components when necessary. I'm not sure how well I did at this goal, so don't take this as here's-what-you-should-do, just here's-what-one-poor-schmuck-did.

The Application object holds application scope state, and binds components together, performing dependency injection where necessary. The entry point is myapp.Main, which configures Options using command-line arguments and starts the application. The application manages a concurrent event dispatch queue.

Components communicate with each other by events placed on the event queue and have access to an application API. Like in Swing, component code may run in the thread of the event dispatch, or, if it is long running, spin up its own thread. Components are responsible for managing their own threading and synchronization.

Layering, in this template, works something like this:

+--------+
| UI |
+--------+
+-----------------+ +----------------+
| Application | | components |
+-----------------+ +----------------+
+--------------+
| services |
+--------------+
+------------+ +----------+
| domain | | util |
+------------+ +----------+

Higher layers can depend on lower layers, while dependencies may not flow up. Utilities and domain model can have no internal dependencies and anything may depend on them. Services are things like file I/O, persistence, and maybe algorithms. The application exposes an API to components and will in turn implement that API in terms of components. I think this circular dependency doesn't bother me too much. If it did, the Application could have no dependencies on the components and the components could have some organized dependency structure among themselves. The UI sits on top where nothing should depend on it.