Monday, November 30, 2009

Design Patterns 15 years later

Design Patterns 15 Years Later: An Interview with Erich Gamma, Richard Helm, and Ralph Johnson was recently featured on Lambda the Ultimate.

Some say design patterns are just work-arounds for the defects of C++. The paper Essential Programming Paradigm argues that design patterns occur because the programming paradigm disallows certain run-time composition of dynamic and static code. The GoF authors confirm that their design patterns fit object-oriented languages and arose specifically from experience with C++ and Smalltalk, so they are tied to the implementation language. "Design patterns eventually emerge for any language. Design déjà-vu is language neutral." Different design patterns may be emerging for dynamic languages or for functional languages.

They discuss the development of more design patterns beyond the 23 examples chosen for the Design Patterns book. Erich Gamma suggests some sort of collective-intelligence approach for editing design patterns and rating their importance and applicability. Sounds like a good idea. Some new patterns they mention as candidates for inclusion in a revised set are: Null Object, Type Object, Dependency Injection, and Extension Object/Interface. The interview also sketches a new (draft) categorization of design patterns.

They seem to have dropped several patterns, some of which I won't miss. But why axe Composite or Observer? And Bridge: maybe not the most useful in practice, but when I finally understood what they meant, I felt like I had accomplished something.
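Of the candidate patterns, Null Object is the easiest to sketch. A minimal Python toy (the class names are mine, not from the interview): a do-nothing stand-in object lets client code drop its `None` checks entirely.

```python
# Null Object pattern sketch: NullLogger implements the same interface
# as Logger but does nothing, so callers never test for None.

class Logger:
    def log(self, message):
        print(message)

class NullLogger:
    """Null Object: same interface as Logger, no behavior."""
    def log(self, message):
        pass

def process(items, logger=NullLogger()):
    # No "if logger is not None" guards needed anywhere in here.
    total = 0
    for item in items:
        logger.log(f"processing {item}")
        total += item
    return total
```

Passing a real `Logger` switches logging on without changing a line of `process`.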


Saturday, November 28, 2009

Leroy Hood on a career in science

Leroy Hood, the founder of the Institute for Systems Biology (ISB), where I've worked for 3 years now, wrote up some career advice for scientists last year. It probably applies fairly well to any professional.

I leave students (and even some of my colleagues) with several pieces of advice. First, I stress the importance of a good cross-disciplinary education. Ideally, I suggest a double major with the two fields being orthogonal, say biology with computer science or applied physics. Some argue that there is insufficient time to learn two fields deeply at the undergraduate level.

I argue that this is not true. If we realize that many undergraduate courses now taught are filled with details that are immediately forgotten after the course is finished, we must then learn to teach in an efficiently conceptual manner. As I noted above, as an undergraduate at Caltech I had Feynman for physics and Pauling for chemistry, and both provided striking examples of the power of conceptual teaching.

Second, I argue that students should grow accustomed to working together in teams: in the future, there will be many hard problems (like P4 medicine) that will require the focused integration of many different types of expertise.

Third, I suggest that students acquire an excellent background in mathematics and statistics and develop the ability to use various computational tools. Fourth, I argue that a scholar, academic, scientist, or engineer should have four major professional objectives: (a) scholarship, (b) education (teaching), (c) transferring knowledge to society, and (d) playing a leadership role in the local community to help it become the place in which one would like one’s children and grandchildren to live.

Fifth, with regard to the scientific careers of many scientists: they can be described as bell-shaped curves of success; they rise gradually to a career maximum and then slowly fall back toward the baseline. To circumvent this fate, I propose a simple solution: a major change in career focus every 10 or so years. By learning a new field and overcoming the attendant insecurities that come from learning new areas, one can reset the career clock. Moreover, with a different point of view and prior experience, one can make fundamental new contributions to the new field by thinking outside the box. Then the new career curve can be a joined series of the upsides of the bell-shaped curve, each reinvigorated by the ten-year changes.

Finally, science is all about being surrounded by wonderful colleagues and having fun with them, so I recommend choosing one’s science, environment, and colleagues carefully. I end this discussion with what I stressed at the beginning: I am so fortunate to have been surrounded by outstanding colleagues who loved science and engineering. Science for each of us is a journey with no fixed end goal. Rather, our goals are continually being redefined.

ISB recently topped 300 employees and, as of early 2009, had a budget of $55 million. Dr. Hood turned 70 in 2008.

Monday, November 09, 2009

Computational representation of biological systems

Computational representation of biological systems by Zach Frazier, Jason McDermott, Michal Guerquin, Ram Samudrala is a book chapter in Springer's Computational Systems Biology. It introduces basic data warehousing concepts along with a data warehousing effort targeted at biology called Bioverse.

They contrast data warehousing with online transaction processing. OLTP entails frequent concurrent updates. Updates traditionally look like bank machine operations or travel reservations. Data warehousing, in contrast, typically updates only occasionally in an additive way as new data arrives or annotations are added. The star schema, which supports efficient subsetting and computing of aggregates (min, max, sum, count, average), centers on a table of atomic data elements called facts surrounded by related tables holding different types of search criteria called dimensions.
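The star schema is easy to demonstrate in miniature. Here's a sketch using SQLite from the Python standard library; the gene-expression schema and numbers are invented for illustration, not taken from Bioverse.

```python
# Toy star schema: one fact table of atomic measurements surrounded by
# dimension tables that hold the search criteria.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_gene   (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE dim_tissue (tissue_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: one atomic measurement per row, keyed by dimensions.
    CREATE TABLE fact_expression (
        gene_id   INTEGER REFERENCES dim_gene,
        tissue_id INTEGER REFERENCES dim_tissue,
        level     REAL
    );
""")
db.executemany("INSERT INTO dim_gene VALUES (?, ?)",
               [(1, "TP53"), (2, "BRCA1")])
db.executemany("INSERT INTO dim_tissue VALUES (?, ?)",
               [(1, "liver"), (2, "brain")])
db.executemany("INSERT INTO fact_expression VALUES (?, ?, ?)",
               [(1, 1, 5.0), (1, 2, 7.0), (2, 1, 2.0), (2, 2, 4.0)])

# The typical warehouse query: subset via a dimension, aggregate the facts.
rows = db.execute("""
    SELECT g.symbol, AVG(f.level)
    FROM fact_expression f JOIN dim_gene g USING (gene_id)
    GROUP BY g.symbol ORDER BY g.symbol
""").fetchall()
```

The same shape supports min, max, sum, and count by swapping the aggregate function.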

Bioverse nicely illustrates several of the main challenges of data warehousing. First, its data (54 organisms) appears to have been last updated in 2005. Also, we must choose at what granularity to create the fact table based on the questions we expect to ask, but questions come at many scales.

Hierarchical data occurs throughout the Bioverse. Representation of these structures is particularly difficult in relational databases.

They go on to cite a method for supporting efficient hierarchical queries from Kimball, R. and Ross, M. (2002), The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling.
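If I'm reading the technique correctly, it amounts to materializing a hierarchy bridge table (sometimes called a closure table) of every ancestor/descendant pair, so subtree queries become flat lookups instead of recursive joins. A toy Python/SQLite sketch with an invented taxonomy:

```python
# Hierarchy bridge table sketch: enumerate (ancestor, descendant, depth)
# for every pair in the tree, then query subtrees with a plain WHERE.
import sqlite3

# parent -> children edges of a toy protein-family hierarchy
edges = {"root": ["kinase", "protease"], "kinase": ["MAPK1", "CDK2"]}

def bridge_rows(node, ancestors=()):
    """Yield (ancestor, descendant, depth) for every pair in the tree."""
    lineage = list(ancestors) + [node]
    for depth, anc in enumerate(reversed(lineage)):
        yield (anc, node, depth)
    for child in edges.get(node, []):
        yield from bridge_rows(child, lineage)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bridge (ancestor TEXT, descendant TEXT, depth INTEGER)")
db.executemany("INSERT INTO bridge VALUES (?, ?, ?)", bridge_rows("root"))

# "Everything under kinase" is now a single lookup, no recursion needed.
subtree = {r[0] for r in db.execute(
    "SELECT descendant FROM bridge WHERE ancestor = ?", ("kinase",))}
```

The trade-off is write-side cost: the bridge table must be rebuilt or patched whenever the hierarchy changes, which suits the warehouse's occasional-update pattern.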

In addition to hierarchies, graphs and networks are common structures in biological systems, including protein-protein interaction networks, biochemical pathways, and others. However, the techniques outlined for trees and directed acyclic graphs no longer apply to general graphs. Answering anything more than very basic graph queries is hard in relational databases.

So, with all these drawbacks, you have to wonder whether the relational database is a good basis for data mining applications, especially given the prevalence of networks in biological data. I'm a lot more intrigued by ideas along the lines of NoSQL schema-less databases, dynamic fusion of web data, and the web as a channel for structured data.

Friday, November 06, 2009

The Stories Networks Tell

Dr. Carl Bergstrom gave a very cool lecture on “The Stories Networks Tell” at UW a week or two ago. His Eigenfactor project applies network-analysis methods to mapping the citation structure of academic journals. He has a nice method of discovering modularity in networks using random walks, and there are some nice interactive Flash graphics on his lab's website. It's cool that he's published on economics, evolution, infectious disease, and information theory, including work on the theme of biology and economics.
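As a toy illustration of the random-walk intuition (my own simplification, not Bergstrom's actual method): on a network with two dense clusters, a random walker crosses between modules only rarely, so long within-module stretches reveal the community structure.

```python
# Random walk on two triangles joined by a single edge. The walker
# should spend most steps moving within a module, not between them.
import random

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],   # module 0: nodes 0-2
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}   # module 1: nodes 3-5
module = lambda n: 0 if n < 3 else 1

random.seed(42)
node, within, steps = 0, 0, 2000
for _ in range(steps):
    nxt = random.choice(adj[node])
    if module(nxt) == module(node):
        within += 1
    node = nxt

within_fraction = within / steps  # high fraction = strong modularity
```

On this graph the only crossing edge is 2-3, so in expectation roughly six of every seven steps stay within a module.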

Tuesday, November 03, 2009

Elements of Programming

Don't you hate it when someone reads the first chapter of a book and then decides they're qualified to review it? So do I. But based on the first chapter, Elements of Programming by Alexander Stepanov and Paul McJones is something I might want to come back to, so here's my reminder to myself.

Stepanov is the guy behind the C++ Standard Template Library (STL), an advocate of generic programming, and a critic of OOP. (papers here)

The authors have done for C++ something that is more commonly done for the functional languages, which is to put the language on a formal mathematical basis. Their central insight is the parallel between algorithms and data structures and algebraic structures. Abstract algebraic structures consist of one or more sets and one or more operations on those sets.

The formalism introduced in the book is based on objects with mutable state: "computers with memory constitute the only available realization of a universal computing device". They seem to value that C++ is "faithful to the underlying machine". I wonder how this formalism will hold up to parallel computation, in contrast to the immutability and statelessness that seem popular in these increasingly parallel days. I also wonder how this formalism compares to the lambda calculus.

They define a notion called regularity. A regular function is a pure, side-effect-free function (if I'm reading right). Procedures and types can also be regular. A regular type is required to implement operations for equality, assignment, destruction, default construction, copy construction, total ordering, and underlying type. A concept is a minimal set of requirements a data type must satisfy for an algorithm to be applicable. Algorithms can then be implemented in terms of concepts rather than concrete data types. (I think?) Concepts can have semantic requirements, like linear time complexity for a given operation, bringing efficiency into the mix.
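As a loose analogy (in Python, not the book's C++ formalism): a generic algorithm can be written against a concept's minimal requirements, here just a total ordering via `__lt__`, rather than against any concrete type.

```python
# A "concept" as a minimal requirement set: min_element works for any
# type that models Ordered, i.e. supports __lt__.
from typing import Iterable, TypeVar, Protocol

class Ordered(Protocol):
    def __lt__(self, other) -> bool: ...

T = TypeVar("T", bound=Ordered)

def min_element(xs: Iterable[T]) -> T:
    """Return the smallest element; requires only the Ordered concept."""
    it = iter(xs)
    best = next(it)
    for x in it:
        if x < best:
            best = x
    return best
```

The same function then works for ints, strings, or any user-defined type with an ordering, which is the spirit (if not the rigor) of implementing algorithms in terms of concepts.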

Whether one likes this book will be strongly influenced by how well one likes generic programming in C++. And maybe whether one wants one's theory sullied by implementation details. I like the idea that generic programming can be something more than making a type checker happy -- more than the type-safe collections of Java's puny generics. And who would have thought you could extract a pretty mathematical theory from the warty ugster known as C++?

A reviewer on Amazon suggests a related Google Tech Talk "A Possible Future of Software Development" by Sean Parent.