<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5964816804623588850</id><updated>2012-01-26T07:24:28.653-08:00</updated><category term='lectures'/><category term='Firefox extension'/><category term='data integration'/><category term='Python'/><category term='visualization'/><category term='technology'/><category term='data mining'/><category term='crackpot theory'/><category term='clojure'/><category term='Javascript'/><category term='bug'/><category term='Computer science'/><category term='books'/><category term='tutorial'/><category term='messaging'/><category term='sematic web'/><category term='graphics'/><category term='interoperability'/><category term='gaggle'/><category term='analytics'/><category term='Java'/><category term='concurrency'/><category term='links'/><category term='NoSQL'/><category term='networks'/><category term='microformats'/><category term='Swing'/><category term='Scala'/><category term='software architecture'/><category term='biology'/><category term='Ruby'/><category term='Programming languages'/><category term='software engineering'/><category term='reference'/><category term='seattle'/><category term='stats'/><category term='scientific computing'/><category term='UsingR'/><category term='machine learning'/><category term='rant'/><category term='Bioinformatics'/><category term='R'/><category term='db'/><title type='text'>Digithead's Lab Notebook</title><subtitle type='html'>Mad science in silico...</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5964816804623588850/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default?start-index=101&amp;max-results=100'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>197</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-8084898280630599596</id><published>2011-12-23T14:09:00.000-08:00</published><updated>2011-12-24T11:49:53.976-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Practical advice for applying machine learning</title><content type='html'>&lt;p&gt;Sprinkled throughout Andrew Ng's &lt;a href="http://ml-class.org"&gt;machine learning class&lt;/a&gt; is a lot of &lt;i&gt;practical advice for applying machine learning&lt;/i&gt;. That's what I'm trying to compile and summarize here. Most of Ng's advice centers around the idea of making decisions in an empirical way, fitting to a data-driven discipline, rather than relying on gut feeling.&lt;/p&gt;

&lt;h4&gt;Training / validation / test&lt;/h4&gt;
&lt;p&gt;The key is dividing data into training, cross-validation and test sets. The test set is used only to evaluate performance, not to train parameters or select a model representation. The rationale for this is that training set error is not a good predictor of how well your hypothesis will generalize to new examples. In the course, we saw the cross-validation set used to select degrees of polynomial features and find optimal regularization parameters.&lt;/p&gt;
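&lt;p&gt;As a rough illustration, here is a minimal Python sketch of such a split. The 60/20/20 proportions, the fixed seed and the name &lt;i&gt;split_dataset&lt;/i&gt; are assumptions made for the example, not something prescribed above.&lt;/p&gt;

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle examples and split them into training, cross-validation and test sets."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # Whatever is left over is held out and touched only for the final evaluation.
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

&lt;p&gt;Parameters would be fit on the first set, model choices such as polynomial degree made on the second, and the third used once at the end to estimate generalization error.&lt;/p&gt;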

&lt;h4&gt;Model representation&lt;/h4&gt;
&lt;p&gt;The representation of the hypothesis, the function &lt;i&gt;h&lt;/i&gt;, defines the space of solutions that your algorithm can find. The example used in the class was modeling house price as a function of size. The model tells you what parameters your algorithm needs to learn. If we've selected a linear function, then there are two parameters, the slope and intercept of the line.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/--WWoxH4nUyc/TvQf64GO0dI/AAAAAAAADWY/MhulcDMTZfs/s1600/hypothesis.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="375" width="400" src="http://4.bp.blogspot.com/--WWoxH4nUyc/TvQf64GO0dI/AAAAAAAADWY/MhulcDMTZfs/s400/hypothesis.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;h4&gt;Feature selection and treatment&lt;/h4&gt;
&lt;p&gt;Are the given features sufficiently informative to make predictions? Asking whether a human expert can confidently predict the output given the input features will give a good indication.&lt;/p&gt;

&lt;p&gt;At times, it may be necessary to derive new features. Polynomial features, for example, can let linear regression fit non-linear functions. Computing products, ratios, differences or logarithms may be informative. Creativity comes in here, but remember to test the effectiveness of your new features on the cross-validation set.&lt;/p&gt;

&lt;p&gt;Features that are on different scales may benefit from &lt;b&gt;feature scaling&lt;/b&gt;. Mean normalizing and scaling to a standard deviation of one puts features on an even footing.&lt;/p&gt;
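&lt;p&gt;Mean normalization and scaling can be sketched in a few lines of Python. The function name &lt;i&gt;scale_feature&lt;/i&gt; and the house sizes are hypothetical, and the population standard deviation is used here as an assumption (Octave's std, for example, defaults to the sample version).&lt;/p&gt;

```python
from statistics import mean, pstdev


def scale_feature(values):
    """Subtract the mean and divide by the standard deviation of one feature."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical house sizes in square feet.
sizes = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
scaled = scale_feature(sizes)
print(scaled)
```

&lt;p&gt;The mean and standard deviation should be computed on the training set and then reused, unchanged, to scale validation and test examples.&lt;/p&gt;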

&lt;p&gt;Gathering data might be expensive. Another option is &lt;b&gt;artificial data synthesis&lt;/b&gt;, either creating new examples out of whole cloth or by transforming existing examples. In text recognition, a library of digital fonts might be used to generate examples, or existing examples might be warped or reflected.&lt;/p&gt;

&lt;h4&gt;Overfitting&lt;/h4&gt;
&lt;p&gt;Often, a learning algorithm may fit the training data very well, but perform poorly on new examples. This failure to generalize is called &lt;b&gt;overfitting&lt;/b&gt;.&lt;/p&gt;
  
&lt;p&gt;The classic example is fitting a high degree polynomial, which can lead to a very curvy line that closely fits a large number of data points. Our hypothesis is complex and might be fitting noise rather than an underlying relationship and will therefore generalize poorly.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-EWiwOUgECbw/TvQf7nSHBLI/AAAAAAAADW8/slVDhoXnkjM/s1600/regularization_plot.png" imageanchor="1" style="clear:left; float:left; margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="144" width="200" src="http://4.bp.blogspot.com/-EWiwOUgECbw/TvQf7nSHBLI/AAAAAAAADW8/slVDhoXnkjM/s400/regularization_plot.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;One way to combat this problem is to use a simpler model. This is valid, but might be limiting. Another option is &lt;b&gt;regularization&lt;/b&gt;, which penalizes large parameter values. This prioritizes solutions fitting the training data reasonably well without curving around wildly.&lt;/p&gt;

&lt;p&gt;Regularization can be tuned by plotting training set error and validation set error as functions of the regularization parameter, lambda.&lt;/p&gt;
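&lt;p&gt;To make that concrete, here is a small Python sketch that picks lambda by validation error instead of by eye. Everything in it is an assumption for illustration: the toy data, the candidate values and a one-parameter model &lt;i&gt;y = theta * x&lt;/i&gt;, for which regularized least squares has the closed form &lt;i&gt;theta = sum(x*y) / (sum(x*x) + lambda)&lt;/i&gt;.&lt;/p&gt;

```python
def fit_slope(xs, ys, lam):
    """Regularized least squares for y = theta * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mean_squared_error(theta, xs, ys):
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: noisy training examples and clean validation examples.
train_x, train_y = [1.0, 2.0, 3.0], [1.5, 2.4, 3.6]
val_x, val_y = [4.0, 5.0], [4.0, 5.0]
candidates = [0.0, 0.3, 1.0, 3.0, 10.0]


def validation_error(lam):
    """Fit on the training set only, then measure error on the validation set."""
    theta = fit_slope(train_x, train_y, lam)
    return mean_squared_error(theta, val_x, val_y)


best_lambda = min(candidates, key=validation_error)

# Tabulate both errors per lambda; these are the two curves one would plot.
for lam in candidates:
    theta = fit_slope(train_x, train_y, lam)
    print(lam, mean_squared_error(theta, train_x, train_y), validation_error(lam))
print("best lambda:", best_lambda)
```

&lt;p&gt;Training error can only grow as lambda increases, while validation error typically falls and then rises again; the chosen lambda sits near the bottom of that second curve.&lt;/p&gt;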

&lt;h4&gt;Tuning the trade-off between bias and variance&lt;/h4&gt;

&lt;p&gt;The steps we take to improve performance depend on whether our algorithm is suffering from bias or variance. A &lt;b&gt;learning curve&lt;/b&gt; is a diagnostic that can tell which of these situations we're in, by plotting training error and validation error as a function of training set size. Training and cross-validation errors that are both high indicate high bias. A steadily decreasing validation error, with a persistent gap between validation and training error, indicates high variance.&lt;/p&gt;

&lt;h4&gt;Bias&lt;/h4&gt;

&lt;p&gt;A high &lt;b&gt;bias&lt;/b&gt; model has few parameters and may result in underfitting. Essentially we're trying to fit an overly simplistic hypothesis, for example linear where we should be looking for a higher order polynomial. In a high bias situation, training and cross-validation error are both high and more training data is unlikely to help much.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-_SPyaixmkZo/TvQf7Ea3hgI/AAAAAAAADWg/1j1EEkMTlWA/s1600/learning_curve_bias.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="207" width="400" src="http://2.bp.blogspot.com/-_SPyaixmkZo/TvQf7Ea3hgI/AAAAAAAADWg/1j1EEkMTlWA/s400/learning_curve_bias.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Possible remedies for high bias:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;find more features&lt;/li&gt;
  &lt;li&gt;add polynomial features&lt;/li&gt;
  &lt;li&gt;increase parameters (more hidden layer neurons, for example)&lt;/li&gt;
  &lt;li&gt;decrease regularization&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Variance&lt;/h4&gt;

&lt;p&gt;&lt;b&gt;Variance&lt;/b&gt; is the opposite problem, having lots of parameters, which carries a risk of overfitting. If we are overfitting, the algorithm fits the training set well, but has high cross-validation and testing error. If we see low training set error, with cross-validation error trending downward, then the gap between them might be narrowed by training on more data.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-UbXALHInZAc/TvQf7LG406I/AAAAAAAADWw/uxoTRaTOQEo/s1600/learning_curve_variance.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="275" width="400" src="http://3.bp.blogspot.com/-UbXALHInZAc/TvQf7LG406I/AAAAAAAADWw/uxoTRaTOQEo/s400/learning_curve_variance.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Possible remedies for high variance:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;more training data&lt;/li&gt;
  &lt;li&gt;reduce number of features, manually or using a model selection algorithm&lt;/li&gt;
  &lt;li&gt;increase regularization&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Error analysis&lt;/h4&gt;
&lt;p&gt;To improve performance of a machine learning algorithm, one helpful step is to manually examine the cases your algorithm gets wrong. Look for systematic trends in the errors. What features would have helped correctly classify these cases?&lt;/p&gt;

&lt;p&gt;For multi-step machine learning pipelines, &lt;b&gt;ceiling analysis&lt;/b&gt; can help decide where to invest effort to improve performance. The error due to each stage is estimated by substituting labeled data for that stage, revealing how well the whole pipeline would perform if that stage had no error. Stepping through the stages, we note the potential for improvement at each one.&lt;/p&gt;

&lt;h4&gt;Precision/Recall&lt;/h4&gt;
&lt;p&gt;It's helpful to have a single number to easily compare performance. Precision, recall and the F1 statistic can help when trying to classify very skewed classes, where one class is rare in the data. Simply taking a percentage of correct classifications can be misleading, since always guessing the more common class means you'll almost always be right.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;precision&lt;/b&gt;: true positives / predicted positives, predicted positives = true positives + false positives
(Of all the patients we predicted to have cancer, what fraction actually has cancer?)&lt;/p&gt;

&lt;p&gt;&lt;b&gt;recall&lt;/b&gt;: true positives / actual positives, actual positives = true positives + false negatives
(Of all the patients that really had cancer, how many did we correctly predict?)&lt;/p&gt;

&lt;p&gt;&lt;i&gt;&lt;b&gt;F1&lt;/b&gt; = 2·p·r / (p + r)&lt;/i&gt;&lt;/p&gt;
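&lt;p&gt;The definitions above translate directly into Python. This is a sketch with made-up labels, where 1 marks the rare positive class; the function name and the choice to return zero when a denominator is zero are assumptions.&lt;/p&gt;

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 positives out of 10 examples.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(precision_recall_f1(actual, predicted))

# A classifier that always guesses the common class is right 70% of the time here,
# yet its recall and F1 are both zero.
always_negative = [0] * len(actual)
accuracy = sum(1 for a, p in zip(actual, always_negative) if a == p) / len(actual)
print(accuracy, precision_recall_f1(actual, always_negative))
```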

&lt;h4&gt;Miscellaneous tips&lt;/h4&gt;
&lt;p&gt;Principal component analysis (&lt;b&gt;PCA&lt;/b&gt;) can help by &lt;b&gt;reducing dimensionality&lt;/b&gt; of high-dimensional features. Collapsing highly correlated features can help learning algorithms run faster.&lt;/p&gt;

&lt;p&gt;Often, incorrectly implemented machine learning algorithms can appear to work, producing no obvious error, but simply converging more slowly or with more error than a correct implementation would. &lt;b&gt;Gradient checking&lt;/b&gt; is a technique for checking your work, applied in the class to back propagation, but probably more generally applicable.&lt;/p&gt;

&lt;h4&gt;The recommended approach&lt;/h4&gt;

&lt;p&gt;Quickly testing ideas empirically and optimizing developer time is the approach embodied in these steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;First, implement a simple, quick-and-dirty algorithm, plot learning curves, and perform error analysis.&lt;/li&gt;
  &lt;li&gt;Create a list of potential ideas to try to improve performance. Then, start trying promising ideas, using the validation set to test for improvement.&lt;/li&gt;
  &lt;li&gt;Use a learning algorithm with many parameters and many features - low bias. Get a very large training set.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supporting that last point are the findings of &lt;a href="http://dl.acm.org/citation.cfm?id=1073012.1073017"&gt;Banko and Brill, 2001&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote style="font-style:italic;font-size:22pt;font-family:Times New Roman;line-height:100%;"&gt;&amp;ldquo;It's not who has the best algorithm that wins. It's who has the most data.&amp;rdquo;&lt;/blockquote&gt;


&lt;ul&gt;
  &lt;li&gt;diagrams by Andrew Ng&lt;/li&gt;
  &lt;li&gt;..more on the &lt;a href="/p/ml-class.html"&gt;machine learning class&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-8084898280630599596?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/8084898280630599596/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/practical-advice-for-applying-machine.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8084898280630599596'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8084898280630599596'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/practical-advice-for-applying-machine.html' title='Practical advice for applying machine learning'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/--WWoxH4nUyc/TvQf64GO0dI/AAAAAAAADWY/MhulcDMTZfs/s72-c/hypothesis.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-5709277421412978421</id><published>2011-12-14T18:16:00.000-08:00</published><updated>2011-12-15T11:55:12.582-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='interoperability'/><category scheme='http://www.blogger.com/atom/ns#' term='db'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Bioinformatics'/><title type='text'>SDCube and hybrid data storage</title><content type='html'>&lt;p&gt;The &lt;a href="http://sorger.med.harvard.edu/"&gt;Sorger lab&lt;/a&gt; at Harvard published a piece in the February 2011 Nature Methods that shows some really clear thinking on the topic of designing data storage for biological data. That paper, &lt;a href="http://www.nature.com/nmeth/journal/v8/n6/full/nmeth.1600.html"&gt;Adaptive informatics for multifactorial and high-content biological data&lt;/a&gt;, Millard et al., introduces a storage system called SDCubes, short for &lt;i&gt;semantically typed data hypercubes&lt;/i&gt;, which boils down to HDF5 plus XML. The software is hosted at &lt;a href="http://www.semanticbiology.com/"&gt;semanticbiology.com&lt;/a&gt;.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-3mb_E60DxjQ/TulSMbxVJeI/AAAAAAAADVY/phJtGBuVy3I/s1600/sdcube.png" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="300" width="200" src="http://2.bp.blogspot.com/-3mb_E60DxjQ/TulSMbxVJeI/AAAAAAAADVY/phJtGBuVy3I/s320/sdcube.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;This two-part strategy applies &lt;a href="http://www.hdfgroup.org/"&gt;HDF5&lt;/a&gt; to store high-dimensional numeric data efficiently, while storing sufficient metadata in XML to reconstruct the design of the experiment. This is smart, because with modern data acquisition technology, you're often dealing with volumes of data where attention to efficiency is required. But experimental design, the place where creativity directly intersects with science, is a rapidly moving target. The only hope of capturing that kind of data is a flexible semi-structured representation.&lt;/p&gt;

&lt;p&gt;This approach is reminiscent of, if a bit more sophisticated than, something that was tried in the Baliga lab called &lt;a href="http://gaggle.systemsbiology.net/dataStandards/"&gt;EMI-ML&lt;/a&gt;, which was roughly along the same lines except that the numeric data was stored in tab-separated text files rather than HDF5. Or to put it another way, TSV + XML instead of HDF5 + XML.&lt;/p&gt;

&lt;p&gt;Another ISB effort, &lt;a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3092287/"&gt;Addama&lt;/a&gt; (Adaptive Data Management) started as a ReSTful API over a content management system and has evolved into a ReSTful service layer providing authentication and search and enabling access to underlying CMS, SQL databases, and data analysis services. &lt;a href="http://code.google.com/p/addama/"&gt;Addama&lt;/a&gt; has ambitions beyond data management, but shares with SDCube the emphasis on adaptability to the unpredictable requirements inherent in research by enabling software to reflect the design of individual experiments.&lt;/p&gt;

&lt;p&gt;There's something to be said for these hybrid approaches. Once you start looking, you see a similar pattern in lots of places.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NoSQL - SQL hybrids&lt;/li&gt;
&lt;li&gt;SQL - XML hybrids. SQL-Server, for one, has great &lt;a href="http://msdn.microsoft.com/en-us/library/ms189887.aspx"&gt;support for XML&lt;/a&gt; enabling &lt;a href="http://msdn.microsoft.com/en-us/library/ms189075.aspx"&gt;XQuery&lt;/a&gt; and &lt;a href="http://msdn.microsoft.com/en-us/library/ms172038.aspx"&gt;XPath&lt;/a&gt; mixed with SQL.&lt;/li&gt;
&lt;li&gt;Search engines, like Solr, are typically used next to an existing database&lt;/li&gt;
&lt;li&gt;Key/value tables in a database (aka Entity–attribute–value)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combining structured and semi-structured data allows room for flexibility where you need it, while retaining RDBMS performance where it fits. Using HDF5 adapts the pattern for scientific applications working with vectors and matrices, structures not well served by either relational or hierarchical models. Where does that leave us in biology with our networks? I don't know whether HDF5 can store graphs. Maybe we need a triple hybrid relational-matrix-graph database.&lt;/p&gt;
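&lt;p&gt;The key/value variant of the pattern is easy to sketch with Python's built-in sqlite3 module. The table names, column names and sample rows below are all hypothetical; the point is only that a fixed relational table and a free-form attribute table can sit side by side and be joined.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Structured part: columns every experiment is known to have.
CREATE TABLE experiment (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Semi-structured part: arbitrary key/value pairs per experiment (EAV).
CREATE TABLE experiment_attr (
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    attr_name     TEXT NOT NULL,
    attr_value    TEXT
);
""")

conn.execute("INSERT INTO experiment (id, name) VALUES (1, 'growth_curve')")
# New kinds of metadata need no schema change, only new rows.
conn.executemany(
    "INSERT INTO experiment_attr VALUES (?, ?, ?)",
    [(1, "temperature_c", "37"), (1, "strain", "wild type")],
)

rows = conn.execute(
    "SELECT e.name, a.attr_name, a.attr_value "
    "FROM experiment e JOIN experiment_attr a ON a.experiment_id = e.id "
    "ORDER BY a.attr_name"
).fetchall()
print(rows)
```

&lt;p&gt;The trade-off is the usual one for entity-attribute-value tables: values lose their types and constraints, which is the price of not having to predict every experimental factor in advance.&lt;/p&gt;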

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-hVD9LmP2xns/TulWxa8kcsI/AAAAAAAADVk/iexWnzSFhyU/s1600/sdcube2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="108" width="320" src="http://1.bp.blogspot.com/-hVD9LmP2xns/TulWxa8kcsI/AAAAAAAADVk/iexWnzSFhyU/s320/sdcube2.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;By the way, HDF5 libraries exist for &lt;a href="http://www.hdfgroup.org/tools5desc.html"&gt;several languages&lt;/a&gt;. SDCube is in Java. MATLAB can read HDF5. There is an &lt;a href="http://cran.r-project.org/web/packages/hdf5/"&gt;HDF5 package for R&lt;/a&gt;, but it seems incomplete.&lt;/p&gt;

&lt;p&gt;Relational databases work extremely well for some things. But, flexibility has never been their strong point. They've been optimized for 40 years, but &lt;a href="/2010/03/analytics-vs-transaction-processing.html"&gt;shaped by the transaction processing problems&lt;/a&gt; they were designed to solve, and they just get awkward for certain uses, to name some - graphs, matrices and frequently changing schemas.&lt;/p&gt;

&lt;p&gt;Maybe, before too long the big database vendors go multi-paradigm and we'll see matrices, graphs, key-value pairs, XML and JSON as native data structures that can be sliced and diced and joined at will right along with tables.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-5709277421412978421?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/5709277421412978421/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/sdcube-and-hybrid-data-storage.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5709277421412978421'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5709277421412978421'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/sdcube-and-hybrid-data-storage.html' title='SDCube and hybrid data storage'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-3mb_E60DxjQ/TulSMbxVJeI/AAAAAAAADVY/phJtGBuVy3I/s72-c/sdcube.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2026439450289696718</id><published>2011-12-11T11:11:00.001-08:00</published><updated>2011-12-12T12:38:09.141-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Neural Networks</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://ml-class.org" imageanchor="1" style="clear:right;float:right;margin-left:1em; margin-bottom:1em"&gt;&lt;img border="0" height="160" width="138" src="http://3.bp.blogspot.com/-u7Fkk8haI40/TuOkKrTbbWI/AAAAAAAADUU/Hsib8G7daYA/s320/ml-robot.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Exercise 4 in &lt;a href="http://www.ml-class.org/"&gt;Andrew Ng's Machine Learning class&lt;/a&gt; is on neural networks. Back in the Cretaceous period, in 1994 or so, in one of the periodic phases of popularity for neural networks, I hacked up some neural network code in Borland Turbo C++ on a screaming 90MHz Pentium. That code is probably on a 3 1/2 inch floppy in the bottom of a box somewhere in my basement. Back then, I remember being fascinated by the idea that we could get a computer to learn by example.&lt;/p&gt;

&lt;p&gt;Neural networks go in and out of style like miniskirts. [cue lecherous graphic.] Some people scoff. Neural networks are just a special case of Bayesian networks, they say, and SVMs have more rigorous theory. Fine. But, it's a bit like seeing an old friend to find that neural networks are respectable again.&lt;/p&gt;

&lt;p&gt;To see how neural networks work, let's look at how information flows through them, first in the forward direction.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/--QHu9YfgzTc/TuUBGacYL5I/AAAAAAAADVA/AgPs6nJqllE/s1600/neural_network_forward.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="334" width="400" src="http://1.bp.blogspot.com/--QHu9YfgzTc/TuUBGacYL5I/AAAAAAAADVA/AgPs6nJqllE/s400/neural_network_forward.png" /&gt;&lt;/a&gt;
&lt;p style="font-size:8pt;font-style:italic;text-align:right;"&gt;(Andrew Ng)&lt;/p&gt;&lt;/div&gt;

&lt;h4&gt;Forward propagation&lt;/h4&gt;

&lt;p&gt;An input example &lt;i&gt;X&lt;sub&gt;t&lt;/sub&gt;&lt;/i&gt; becomes the activation at the first layer, the &lt;i&gt;n&lt;/i&gt; x 1 vector &lt;i&gt;a1&lt;/i&gt;. Prepending a bias node whose value is always 1, we then multiply by the weight matrix, &lt;i&gt;Theta1&lt;/i&gt;. This returns &lt;i&gt;z&lt;sub&gt;2&lt;/sub&gt;&lt;/i&gt;, whose rows are the sum of the products of the input neurons with their respective weights. We pass those sums through the sigmoid function to get the activations of the next layer. Repeating the same procedure for the output layer gives the outputs of the neural network.&lt;/p&gt;

&lt;p&gt;Implementing this in Octave, I think, could be fully vectorized to compute all training examples at once, but I did this in a loop over &lt;i&gt;t&lt;/i&gt; which indexes a single training example, like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;a1 = X(t,:)';
z2 = Theta1 * [1; a1];
a2 = sigmoid(z2);
z3 = Theta2 * [1; a2];
a3 = sigmoid(z3);
&lt;/pre&gt;
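&lt;p&gt;For readers following along outside Octave, the same steps might look like this in plain Python. This is a sketch, not the exercise solution: the helper names are made up, each weight row lists the bias weight first, and the sample weights are illustrative values that make the network compute XNOR of two binary inputs.&lt;/p&gt;

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(theta, activations):
    """Compute one layer: prepend the bias unit, take weighted sums, apply sigmoid."""
    with_bias = [1.0] + list(activations)
    return [sigmoid(sum(w * a for w, a in zip(row, with_bias))) for row in theta]


def forward(theta1, theta2, x):
    """Forward propagation through one hidden layer and one output layer."""
    a1 = x
    a2 = layer_forward(theta1, a1)
    a3 = layer_forward(theta2, a2)
    return a3


# Illustrative weights: hidden units compute AND and NOR, the output unit ORs them.
theta1 = [[-30.0, 20.0, 20.0], [10.0, -20.0, -20.0]]
theta2 = [[-10.0, 20.0, 20.0]]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward(theta1, theta2, x)[0]))
```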

&lt;p&gt;As in the previous cases of linear and logistic regression, neural networks have a cost function to be minimized by moving in short steps along a gradient.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-gQAIgNEATw0/TuUBGcb4RuI/AAAAAAAADU0/Fja9bgAsqQU/s1600/cost_function_nn.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="95" width="400" src="http://1.bp.blogspot.com/-gQAIgNEATw0/TuUBGcb4RuI/AAAAAAAADU0/Fja9bgAsqQU/s400/cost_function_nn.png" /&gt;&lt;/a&gt;
&lt;p style="font-size:8pt;font-style:italic;text-align:right;"&gt;(Andrew Ng)&lt;/p&gt;
&lt;/div&gt;

&lt;h4&gt;Back propagation&lt;/h4&gt;

&lt;p&gt;The gradients are computed by back propagation, which pushes the error backwards through the hidden layers. It was the publication of this algorithm in &lt;i&gt;Nature&lt;/i&gt; in 1986 that led to the resurgence that I caught the tail end of in the '90s.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-ezAuzrTeZDQ/TuUBHbX3l3I/AAAAAAAADVM/AUnXq_zxx6w/s1600/2051_001.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="400" width="309" src="http://4.bp.blogspot.com/-ezAuzrTeZDQ/TuUBHbX3l3I/AAAAAAAADVM/AUnXq_zxx6w/s400/2051_001.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;h4&gt;Cross-validation, regularization and gradient checking&lt;/h4&gt;

&lt;p&gt;When using neural networks, choices need to be made about &lt;b&gt;architecture&lt;/b&gt;, the number of layers and number of units in each layer. Considerations include over-fitting, bias and computational costs. Trying a range of architectures and cross-validating is a good way to make this choice.&lt;/p&gt;

&lt;p&gt;The layered approach gives neural networks the ability to fit highly non-linear boundaries, but also makes them prone to over-fitting, so it's helpful to add a &lt;b&gt;regularization&lt;/b&gt; term to the cost function that penalizes large weights. Selecting the regularization parameter can be done by cross-validation.&lt;/p&gt;

&lt;p&gt;Translating the math into efficient code is tricky and it's not hard to get incorrect implementations that still seem to work. It's a good idea to confirm the correctness of your computations with a technique called &lt;b&gt;gradient checking&lt;/b&gt;. You compute the partial derivatives numerically and compare with your implementation.&lt;/p&gt;
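&lt;p&gt;A sketch of the idea in Python, using a two-sided difference to approximate each partial derivative. The cost function here is a stand-in chosen so the analytic gradient is easy to verify by hand; in practice the comparison would be against the gradients from back propagation. The function names and the step size are assumptions.&lt;/p&gt;

```python
def numerical_gradient(cost, theta, epsilon=1e-4):
    """Approximate each partial derivative of cost at theta by central differences."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += epsilon
        minus[i] -= epsilon
        grad.append((cost(plus) - cost(minus)) / (2 * epsilon))
    return grad


def cost(theta):
    """Stand-in cost function: J = theta0^2 + 3 * theta0 * theta1."""
    return theta[0] ** 2 + 3 * theta[0] * theta[1]


def analytic_gradient(theta):
    """Hand-derived gradient of the stand-in cost; this is the code under test."""
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]


theta = [1.5, -2.0]
approx = numerical_gradient(cost, theta)
exact = analytic_gradient(theta)
max_diff = max(abs(a - e) for a, e in zip(approx, exact))
print(approx, exact, max_diff)
```

&lt;p&gt;If the largest difference is not tiny, the analytic gradient is suspect. The numerical version needs two cost evaluations per parameter, so it is usually switched off once the check has passed.&lt;/p&gt;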

&lt;p&gt;Back in the day, I implemented back-prop twice. Once in my C++ code and again in Excel to check the results.&lt;/p&gt;

&lt;h4&gt;A place in the toolbox&lt;/h4&gt;

&lt;p&gt;The progression of ideas leading up to this point in the course is very cleverly arranged. Linear regression starts you on familiar ground and helps introduce gradient descent. Logistic regression adds the simple step of passing the weighted sum of your inputs through the sigmoid function. Neural networks then follow naturally. It's just logistic regression in multiple layers.&lt;/p&gt;

&lt;p&gt;In spite of their tumultuous history, neural networks can be looked at as just another tool in the machine learning toolbox, with pluses and minus like other tools. The history of the idea is interesting, in terms of seeing inside the sausage factory of science.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2026439450289696718?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2026439450289696718/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/neural-networks.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2026439450289696718'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2026439450289696718'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/neural-networks.html' title='Neural Networks'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-u7Fkk8haI40/TuOkKrTbbWI/AAAAAAAADUU/Hsib8G7daYA/s72-c/ml-robot.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3226449262581588083</id><published>2011-12-08T18:48:00.001-08:00</published><updated>2011-12-22T11:33:50.120-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='seattle'/><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='books'/><category scheme='http://www.blogger.com/atom/ns#' term='analytics'/><title type='text'>Effective Data Visualizations</title><content type='html'>&lt;p&gt;&lt;a href="http://twitter.com/#!/noahi"&gt;Noah Iliinsky&lt;/a&gt; spoke at UW, yesterday, on the topic of Effective Visualization. Iliinsky has a new book out, &lt;a href="http://shop.oreilly.com/product/0636920022060.do"&gt;Designing Data Visualizations&lt;/a&gt; (&lt;a href="http://vallandingham.me/designing_data_visualizations_review.html"&gt;review&lt;/a&gt;), and served as editor of &lt;a href="http://shop.oreilly.com/product/0636920022060.do"&gt;Beautiful Visualization&lt;/a&gt;, both from O'Reilly. And, yay &lt;a href="/search/label/seattle"&gt;Seattle&lt;/a&gt;, he lives here in town and has a degree from UW.&lt;/p&gt;

&lt;p&gt;If you had to sum up the talk in a sentence, it would be this: Take the advice from your college technical writing class and apply it to data visualization. Know your audience. Have a goal. Consider the needs, interests and prior knowledge of your readers / viewers. Figure out what you want them to take away. Ask, &amp;ldquo;&lt;a href="http://blog.buzzdata.com/post/6483500638/noah-iliinsky-on-good-visualizations"&gt;who is my audience, and what do they need?&lt;/a&gt;&amp;rdquo; I guess that's more than a sentence.&lt;/p&gt;

&lt;h4&gt;Encoding data&lt;/h4&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-vFmURyOMraA/TuF4T4VlxRI/AAAAAAAADTY/fGjymWI5mHc/s1600/data_encoding.png" imageanchor="1" style="clear:right; float:right; margin-left:1em; margin-bottom:1em"&gt;&lt;img border="0" height="165" width="320" src="http://4.bp.blogspot.com/-vFmURyOMraA/TuF4T4VlxRI/AAAAAAAADTY/fGjymWI5mHc/s320/data_encoding.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;The human eye is great at perceiving small differences in position. Use position for your most salient features.&lt;/p&gt;

&lt;p&gt;Color is often used poorly. Question: Is orange higher or lower than purple? Answer: No! Color is not ordered. However, brightness and saturation are ordered and can be used effectively to convey quantitative information. Temperature is something of an exception, since it is widely understood that blue is cold and red is hot. Also, color is often loaded with cultural meanings - think of black hats and white hats or the political meanings of red, orange or green, &lt;a href="/2010/08/using-r-for-introductory-statistics-33.html"&gt;boy/girl = blue/pink&lt;/a&gt;, etc.&lt;/p&gt;

&lt;h4&gt;Appropriate encodings by data type&lt;/h4&gt;
&lt;p&gt;Click to expand this handy chart!&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-GVslptzhpA4/TuF5ERmpb8I/AAAAAAAADTw/lRXR5Wv1nfk/s1600/visualization_chart.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="252" width="400" src="http://2.bp.blogspot.com/-GVslptzhpA4/TuF5ERmpb8I/AAAAAAAADTw/lRXR5Wv1nfk/s400/visualization_chart.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;As an example of how to do it right, Iliinsky points to &lt;a href=""&gt;Hipmunk&lt;/a&gt;, which crams an enormous amount of data into a simple chart of flights from Seattle to Phuket, Thailand.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-jH0F0o2nvpM/TuF4lyZ1XVI/AAAAAAAADTk/8ywNjKVusOA/s1600/hipmunk.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="265" width="400" src="http://4.bp.blogspot.com/-jH0F0o2nvpM/TuF4lyZ1XVI/AAAAAAAADTk/8ywNjKVusOA/s400/hipmunk.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;We can see departure and arrival time and duration, in both the absolute and relative senses, plus layovers, airline, airport and price. And, you can sort by "Agony", which is cool. They've encoded lengths (of time) as lengths, used text (sparingly) for exact amounts, color to show categorical variables (airline) and iconography to indicate the presence or absence of wireless internet on flights.&lt;/p&gt;

&lt;p&gt;The cool chart and the quote about encoding were expropriated from the &lt;a href="http://strataconf.com/stratany2011/public/schedule/detail/21525"&gt;slides from Iliinsky's talk at Strata&lt;/a&gt;. If you want more, there's a video of a related talk on &lt;a href="http://www.youtube.com/watch?v=lTAeMU2XI4U"&gt;YouTube&lt;/a&gt; and a podcast on &lt;a href="http://www.uie.com/brainsparks/2011/03/24/noah-iliinsky-beautiful-visualization-letting-data-tell-the-story/"&gt;Letting Data Tell the Story&lt;/a&gt;. Tools recommended by &lt;a href="http://complexdiagrams.com/"&gt;Iliinsky&lt;/a&gt; include &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; and &lt;a href="http://had.co.nz/ggplot2/"&gt;ggplot2&lt;/a&gt;, &lt;a href="http://mbostock.github.com/d3/"&gt;D3&lt;/a&gt; and &lt;a href="http://mbostock.github.com/protovis/"&gt;Protovis&lt;/a&gt;, and &lt;a href="http://www.tableausoftware.com/"&gt;Tableau&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3226449262581588083?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3226449262581588083/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/effective-data-visualizations.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3226449262581588083'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3226449262581588083'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/effective-data-visualizations.html' title='Effective Data Visualizations'/><author><name>Christopher 
Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-vFmURyOMraA/TuF4T4VlxRI/AAAAAAAADTY/fGjymWI5mHc/s72-c/data_encoding.png' height='72' width='72'/><thr:total>0</thr:total><georss:featurename>3720 15th Ave NE, University of Washington, Seattle, WA 98195, USA</georss:featurename><georss:point>47.65180177401242 -122.31310844421387</georss:point><georss:box>47.64645327401242 -122.32297894421387 47.657150274012416 -122.30323794421386</georss:box></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3431807255822292656</id><published>2011-12-06T21:27:00.000-08:00</published><updated>2011-12-06T11:18:53.910-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>K-means</title><content type='html'>&lt;p&gt;The topics in this week's programming exercise in the &lt;a href="http://ml-class.org"&gt;machine learning class&lt;/a&gt; are K-means and PCA.&lt;/p&gt;

&lt;p&gt;K-means is a fairly easily understood clustering algorithm. Once you specify &lt;i&gt;K&lt;/i&gt;, the number of clusters, and pick some random initial centroids, it's just two steps. First, assign each data point to a cluster according to its nearest centroid. Next, recompute the centroids based on the average of the data points in each cluster. Repeat until convergence. It's that simple.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-SwU1pTHgil0/TtxWuevJHLI/AAAAAAAADS0/tIuLRMMMtAw/s1600/k-means_algorithm.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="219" width="400" src="http://1.bp.blogspot.com/-SwU1pTHgil0/TtxWuevJHLI/AAAAAAAADS0/tIuLRMMMtAw/s400/k-means_algorithm.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Vectorizing the two steps is a little tricky. Let's take a look at what we're trying to accomplish in step 1. If we had an &lt;i&gt;m&lt;/i&gt; by &lt;i&gt;k&lt;/i&gt; matrix with the distances where each element &lt;i&gt;i&lt;/i&gt;,&lt;i&gt;j&lt;/i&gt; was the distance from data point &lt;i&gt;i&lt;/i&gt; to centroid &lt;i&gt;j&lt;/i&gt;, we could take the &lt;a href="http://www.mathworks.com/help/techdoc/ref/min.html"&gt;min&lt;/a&gt; of each row. Actually, we want the &lt;i&gt;index&lt;/i&gt; of the &lt;i&gt;min&lt;/i&gt; of each row. That would give the assignments for all &lt;i&gt;m&lt;/i&gt; data points in one shot.&lt;/p&gt;

&lt;p&gt;For example, say we have just 3 data points (&lt;i&gt;m&lt;/i&gt;=3) with 2 features each and 2 centroids, (&lt;i&gt;a&lt;/i&gt;,&lt;i&gt;b&lt;/i&gt;) and (&lt;i&gt;c&lt;/i&gt;,&lt;i&gt;d&lt;/i&gt;). How do you get to a distance matrix given &lt;i&gt;X&lt;/i&gt; and the centroids?&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-lv9gkTs1jVg/Tt5nJeN9AjI/AAAAAAAADTM/ueGz6EcuuRs/s1600/2052_001.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="349" width="400" src="http://3.bp.blogspot.com/-lv9gkTs1jVg/Tt5nJeN9AjI/AAAAAAAADTM/ueGz6EcuuRs/s400/2052_001.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;What I came up with can surely be improved upon, but here it is anyway. I loop over each cluster &lt;i&gt;k&lt;/i&gt;, finding the distance between that cluster's centroid and all data points.&lt;/p&gt;

&lt;p&gt;The cryptic &lt;a href="http://www.mathworks.com/help/techdoc/ref/bsxfun.html"&gt;bsxfun&lt;/a&gt; may sound like something Republicans wouldn't approve of, but really, it's a bit like the apply functions in &lt;b&gt;R&lt;/b&gt;. It can work several ways, but in this case it takes a function, the &lt;i&gt;m&lt;/i&gt; by &lt;i&gt;n&lt;/i&gt; matrix &lt;i&gt;X&lt;/i&gt;, and the 1 by &lt;i&gt;n&lt;/i&gt; row vector holding the &lt;i&gt;k&lt;/i&gt;th centroid. It applies the function to each row of &lt;i&gt;X&lt;/i&gt; paired with the centroid vector. The result is an &lt;i&gt;m&lt;/i&gt; by &lt;i&gt;n&lt;/i&gt; matrix whose &lt;i&gt;i&lt;/i&gt;th row is the difference between the &lt;i&gt;i&lt;/i&gt;th example and the &lt;i&gt;k&lt;/i&gt;th centroid. We square that matrix, element-wise. Then, we sum across each row to get the vector of &lt;i&gt;m&lt;/i&gt; squared distances. After we've filled in our distances for all the centroids, we take the &lt;i&gt;min&lt;/i&gt;, row-wise, returning the index of the nearest centroid.&lt;/p&gt;

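&lt;pre class="codebox"&gt;function idx = findClosestCentroids(X, centroids)
  % X is m by n (one example per row); centroids is K by n
  K = size(centroids, 1);
  m = size(X, 1);

  % dist(i,k) holds the squared distance from example i to centroid k
  dist = zeros(m, K);

  for k = 1:K
    % subtract the kth centroid from every row, square, and sum across each row
    dist(:,k) = sum(bsxfun(@minus, X, centroids(k,:)) .^ 2, 2);
  end

  % the column index of each row's minimum is the nearest centroid
  [min_dist, idx] = min(dist, [], 2);
end
&lt;/pre&gt;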

&lt;p&gt;There must be a cleaner way to do that. If we looped over &lt;i&gt;m&lt;/i&gt; rather than &lt;i&gt;k&lt;/i&gt;, we could compute mins one row at a time and never need the whole dist matrix in memory. Maybe there's some magic linear algebra that could efficiently do the whole thing. Anyone wanna clue me in on that?&lt;/p&gt;
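
&lt;p&gt;One candidate for that magic linear algebra, offered as a sketch rather than a tested drop-in: the squared distance |x - c|&lt;sup&gt;2&lt;/sup&gt; expands to |x|&lt;sup&gt;2&lt;/sup&gt; - 2 x.c + |c|&lt;sup&gt;2&lt;/sup&gt;, so all the cross terms can come from a single matrix product. The function name &lt;i&gt;findClosestCentroidsVectorized&lt;/i&gt; is made up for illustration, and &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;centroids&lt;/i&gt; are assumed to be laid out as above, one example or centroid per row.&lt;/p&gt;

&lt;pre class="codebox"&gt;function idx = findClosestCentroidsVectorized(X, centroids)
  x2 = sum(X .^ 2, 2);            % m by 1, squared length of each example
  c2 = sum(centroids .^ 2, 2)';   % 1 by K, squared length of each centroid
  cross = X * centroids';         % m by K, dot product of every example with every centroid

  % squared distance from every example to every centroid, no loop
  dist = bsxfun(@plus, x2, c2) - 2 * cross;

  [min_dist, idx] = min(dist, [], 2);
end
&lt;/pre&gt;

&lt;p&gt;This still builds the full &lt;i&gt;m&lt;/i&gt; by &lt;i&gt;K&lt;/i&gt; matrix, so it trades memory for speed. Rounding can leave a tiny distance slightly negative, which doesn't matter when all we want is the index of the minimum.&lt;/p&gt;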

&lt;p&gt;Luckily, the next step is easier. To recompute the centroids, we're finding the average of each cluster. Again, I used a loop over the &lt;i&gt;k&lt;/i&gt; clusters. We grab the subset of data points belonging to each cluster using &lt;a href="http://www.gnu.org/software/octave/doc/interpreter/Finding-Elements-and-Checking-Conditions.html#doc_002dfind"&gt;find&lt;/a&gt;.&lt;/p&gt;

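&lt;pre class="codebox"&gt;for k = 1:K
  % rows of X currently assigned to cluster k
  example_idx = find(idx == k);
  % leave the centroid alone if its cluster is empty, to avoid dividing by zero
  if ~isempty(example_idx)
    centroids(k,:) = mean(X(example_idx,:), 1);
  end
end
&lt;/pre&gt;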

&lt;p&gt;With those two steps in place, we're clustering away. One puzzle that comes with this particular algorithm is how to choose &lt;i&gt;K&lt;/i&gt;. According to Andrew Ng, the elbow method, where you plot the cost against &lt;i&gt;K&lt;/i&gt; and look for the point where the improvement levels off, can be problematic because a clear elbow may not present itself. Letting downstream requirements dictate the choice of &lt;i&gt;K&lt;/i&gt; seems to be better.&lt;/p&gt;
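
&lt;p&gt;For anyone who wants to try the elbow method anyway, the quantity to plot is the K-means cost: the average squared distance from each point to its assigned centroid. Here's a minimal sketch. The helper name &lt;i&gt;kMeansCost&lt;/i&gt; is hypothetical, and &lt;i&gt;idx&lt;/i&gt; and &lt;i&gt;centroids&lt;/i&gt; are assumed to be the results of the two steps above.&lt;/p&gt;

&lt;pre class="codebox"&gt;function J = kMeansCost(X, centroids, idx)
  % centroids(idx,:) lines up each example with its assigned centroid
  diffs = X - centroids(idx, :);
  % average squared distance over all m examples
  J = sum(sum(diffs .^ 2)) / size(X, 1);
end
&lt;/pre&gt;

&lt;p&gt;Run the clustering for several values of &lt;i&gt;K&lt;/i&gt;, record the cost for each, and plot one against the other. The cost generally keeps shrinking as &lt;i&gt;K&lt;/i&gt; grows, which is exactly why the elbow can be hard to spot.&lt;/p&gt;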

&lt;p&gt;It's getting pretty clear that the trick to most of these programming exercises is vectorization. The combination of functional programming and vectorized operations is very powerful, but definitely comes at the cost of some brain-strain, at least if you have as few working brain cells as I've got left.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3431807255822292656?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3431807255822292656/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/k-means.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3431807255822292656'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3431807255822292656'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/k-means.html' title='K-means'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-SwU1pTHgil0/TtxWuevJHLI/AAAAAAAADS0/tIuLRMMMtAw/s72-c/k-means_algorithm.png' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-5751312665099318519</id><published>2011-12-05T07:55:00.001-08:00</published><updated>2011-12-05T20:49:16.274-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>International Open Data Hackathon</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://dev.socrata.com/images/posts/2011-11-18-iodh.jpg" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="77" width="163" src="http://dev.socrata.com/images/posts/2011-11-18-iodh.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;


&lt;p&gt;This past Saturday, I hung out at the Seattle branch of the &lt;a href="http://www.opendataday.org/"&gt;International Open Data Hackathon&lt;/a&gt;. The event was hosted at the Pioneer Square office of &lt;a href="http://www.socrata.com/"&gt;Socrata&lt;/a&gt;, a small company that helps governments provide public open data.&lt;/p&gt;

&lt;p&gt;A pair of data analysts from &lt;a href="http://www.tableausoftware.com/"&gt;Tableau&lt;/a&gt; were showing off a visualization for the Washington Post's &lt;a href="http://www.washingtonpost.com/blogs/fact-checker"&gt;FactChecker&lt;/a&gt; blog called &lt;a href="http://www.washingtonpost.com/blogs/fact-checker/post/perry-vs-romney-vs-huntsman-on-jobs/2011/09/09/gIQATUy7FK_blog.html"&gt;Comparing Job Creation Records&lt;/a&gt;. Tableau pays these folks to play with data and make cool visualizations that make their software look good. One does politics and the other does pop culture. A nice gig, if you can get it!&lt;/p&gt;

&lt;p&gt;A pair of devs from Microsoft's &lt;a href="http://www.odata.org/"&gt;Open Data Protocol (OData)&lt;/a&gt; also showed up. OData looks to be a well thought out set of tools for ReST data services. If I understand correctly, it seems to have grown up around pushing relational data over Atom feeds. They let you define typed entities and associations between them, then do CRUD operations on them. You might call it ReSTful enterprise application integration.&lt;/p&gt;

&lt;p&gt;Socrata's &lt;a href="http://opendata.socrata.com/"&gt;OpenData portal&lt;/a&gt; has all kinds of neat stuff, from White House staff salaries to radiation contamination measurements to investors who were bilked by Bernie Madoff. 13,710 datasets in all. They're available for download as well as through a nice &lt;a href="http://dev.socrata.com/"&gt;ReST/JSON API&lt;/a&gt;. Socrata's platform runs &lt;a href="http://explore.data.gov/"&gt;explore.data.gov&lt;/a&gt; and &lt;a href="http://data.seattle.gov/"&gt;data.seattle.gov&lt;/a&gt;, among &lt;a href="http://www.socrata.com/customer-spotlight/"&gt;others&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, if you've got reasonably fat pipes, and want to know about &lt;a href="http://data.seattle.gov/Permitting/Building-Permits-Current/mags-97de"&gt;building permits in Seattle&lt;/a&gt;, fire up R and enter this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; permits.url &amp;lt;- &amp;#x27;http://data.seattle.gov/api/views/mags-97de/rows.csv&amp;#x27;
&amp;gt; p &amp;lt;- read.csv(permits.url)
&amp;gt; head(p)
&lt;/pre&gt;

&lt;p&gt;Socrata follows the Rails-ish convention of letting you indicate the return format like a file extension. In this case, we're asking for .csv, 'cause R parses it so easily. You can get JSON, XML, RDF and several other formats.&lt;/p&gt;

&lt;p&gt;Let's say you want to know what Seattleites are paying for kitchen remodels. Holy crap, it's appalling how boring and middle-aged I've gotten. Someone, shoot me!&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; cost &amp;lt;- as.numeric(gsub(&amp;#x27;\\$(.*)&amp;#x27;, &amp;#x27;\\1&amp;#x27;, p$Value))
&amp;gt; a &amp;lt;- cost[ grepl(&amp;#x27;kitchen&amp;#x27;, p$Description) &amp;amp; p$Category==&amp;quot;SINGLE FAMILY / DUPLEX&amp;quot; &amp;amp; cost &amp;gt; 0 &amp;amp; cost &amp;lt; 200000 ]
&amp;gt; hist(a, xlab=&amp;#x27;cost $&amp;#x27;, main=&amp;#x27;Distribution of kitchen remodels in Seattle&amp;#x27;, col=sample(gray.colors(10),10))
&lt;/pre&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-fbho01weuEM/TtzugtghXxI/AAAAAAAADTA/DbwighEx2UA/s1600/kitchen_remodels_hist.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="326" width="400" src="http://2.bp.blogspot.com/-fbho01weuEM/TtzugtghXxI/AAAAAAAADTA/DbwighEx2UA/s400/kitchen_remodels_hist.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;You saw it here, first, folks!&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; a &amp;lt;- cost[ grepl(&amp;#x27;kitchen&amp;#x27;, p$Description) &amp;amp; p$Category==&amp;quot;SINGLE FAMILY / DUPLEX&amp;quot; &amp;amp; cost &amp;gt; 0]
&amp;gt; summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500   18000   35000   46290   59770  420000
&lt;/pre&gt;

&lt;p&gt;I dunno whose 420 thousand dollar kitchen that is, but if I find out, I'm coming over for dinner!&lt;/p&gt;

&lt;p&gt;Socrata's API offers a JSON based way of defining queries. Several datasets are updated in near real time. There's gotta be loads of cool stuff to be done with this data. Let's hope the government sees the value in cheap and innovative ideas like these and continues &lt;a href="http://www.readwriteweb.com/archives/fate_of_datagov_revealed_us_gov_almost_completely.php"&gt;funding for data.gov&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-5751312665099318519?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/5751312665099318519/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/international-open-data-hackathon.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5751312665099318519'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5751312665099318519'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/12/international-open-data-hackathon.html' title='International Open Data Hackathon'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-fbho01weuEM/TtzugtghXxI/AAAAAAAADTA/DbwighEx2UA/s72-c/kitchen_remodels_hist.png' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3930508400830165342</id><published>2011-11-30T22:26:00.001-08:00</published><updated>2011-12-06T11:01:31.555-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Support Vector Machines</title><content type='html'>&lt;p&gt;Week 7 of &lt;a href="http://ml-class.org"&gt;Andrew Ng's Machine Learning class&lt;/a&gt; covers &lt;b&gt;support vector machines&lt;/b&gt;, pragmatically from the perspective of calling a library rather than implementation. I've been wanting to learn more about SVMs for quite a while, so I was excited for this one.&lt;/p&gt;

&lt;p&gt;A support vector machine is a supervised classification algorithm. Given labeled training data, typically high-dimensional vectors, SVM finds the maximum-margin hyperplane separating the positive and negative examples. The algorithm selects a decision boundary that does the best job of separating the classes, with the extra stipulation that the boundary be as far as possible from the nearest samples on either side. This is why SVMs are also known as large margin classifiers.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-fg0jkx9TRyU/TtcfpqCmU6I/AAAAAAAADSQ/ZOG4BKwgTMA/s1600/svm_diagram_large_margin.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="313" width="400" src="http://2.bp.blogspot.com/-fg0jkx9TRyU/TtcfpqCmU6I/AAAAAAAADSQ/ZOG4BKwgTMA/s400/svm_diagram_large_margin.png" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p style="font-size:smaller;"&gt;(from Andrew Ng's &lt;a href="http://www.ml-class.org/"&gt;ml-class&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The cost function used with SVMs is a slightly modified version of that used with logistic regression:&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-2yv3-ljgiSM/TtcfpFFj6FI/AAAAAAAADR4/qeMZEFkrPaA/s1600/cost_function_logistic_regression.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="44" width="400" src="http://1.bp.blogspot.com/-2yv3-ljgiSM/TtcfpFFj6FI/AAAAAAAADR4/qeMZEFkrPaA/s400/cost_function_logistic_regression.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;With SVMs, we replace the sigmoid-based terms with piecewise-linear approximations called &lt;i&gt;cost&lt;sub&gt;1&lt;/sub&gt;&lt;/i&gt; and &lt;i&gt;cost&lt;sub&gt;0&lt;/sub&gt;&lt;/i&gt;.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-DJ7fqw-0gzI/Ttcfpe6GoEI/AAAAAAAADSA/S4XB93MfKxM/s1600/cost_function_svm.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="51" width="400" src="http://1.bp.blogspot.com/-DJ7fqw-0gzI/Ttcfpe6GoEI/AAAAAAAADSA/S4XB93MfKxM/s400/cost_function_svm.png" /&gt;&lt;/a&gt;&lt;/div&gt;

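&lt;p&gt;Some error is accepted, allowing for misclassification of some training examples in the interest of getting the majority correct. The parameter C acts as a form of regularization, specifying tolerance for training error: a large C penalizes misclassified training examples heavily, while a small C tolerates more of them in exchange for a wider margin.&lt;/p&gt;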
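
&lt;p&gt;To make the formula a bit more concrete, here's a rough Octave sketch of that cost. It assumes the standard hinge-shaped penalties for the two piecewise-linear terms (the slope is my assumption, not something pinned down above), labels of 0 or 1, a leading column of ones in &lt;i&gt;X&lt;/i&gt;, and an unregularized intercept. The name &lt;i&gt;svmCost&lt;/i&gt; is hypothetical.&lt;/p&gt;

&lt;pre class="codebox"&gt;function J = svmCost(theta, X, y, C)
  % X is m by n+1 with a leading column of ones; y is m by 1 holding 0 or 1
  z = X * theta;
  costPos = max(0, 1 - z);   % penalty on positive examples, zero once z reaches 1
  costNeg = max(0, 1 + z);   % penalty on negative examples, zero once z drops to -1
  % C scales the training error; the intercept theta(1) is not regularized
  J = C * sum(y .* costPos + (1 - y) .* costNeg) + 0.5 * sum(theta(2:end) .^ 2);
end
&lt;/pre&gt;

&lt;p&gt;Notice that an example only stops contributing to the cost once it lands on the right side of the boundary with room to spare, which is where the margin comes from.&lt;/p&gt;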

&lt;p&gt;But, what if the boundary between classes is non-linear, like the one shown here?&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-zqPadKCrd2Y/Ttcfp_YD8JI/AAAAAAAADSg/XC-3UvzHi_A/s1600/svm_diagram_nonlinear.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="255" width="400" src="http://4.bp.blogspot.com/-zqPadKCrd2Y/Ttcfp_YD8JI/AAAAAAAADSg/XC-3UvzHi_A/s400/svm_diagram_nonlinear.png" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p style="font-size:smaller;"&gt;(from Andrew Ng's &lt;a href="http://www.ml-class.org/"&gt;ml-class&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The SVM algorithm generalizes to non-linear cases with the aid of kernel functions. The plain hyperplane case, a flat boundary in n dimensions, can be viewed as using a linear kernel. The other widely used kernel is the Gaussian kernel. It's my understanding that the kernel function maps a non-linear boundary in the problem space to a linear boundary in a higher dimensional space.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-EGeuaZkSmCY/TtcfqaeCUCI/AAAAAAAADSo/_mkhp4L-3rc/s1600/Kernel_Machine.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="181" width="400" src="http://1.bp.blogspot.com/-EGeuaZkSmCY/TtcfqaeCUCI/AAAAAAAADSo/_mkhp4L-3rc/s400/Kernel_Machine.png" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p style="font-size:smaller;"&gt;(from the wikipedia entry for &lt;a href="http://en.wikipedia.org/wiki/Support_vector_machine"&gt;support vector machine&lt;/a&gt;.)&lt;/p&gt;
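
&lt;p&gt;For reference, the Gaussian kernel scores the similarity of two vectors as exp(-|x1 - x2|&lt;sup&gt;2&lt;/sup&gt; / (2 sigma&lt;sup&gt;2&lt;/sup&gt;)). A minimal sketch, where the function name is an assumption and &lt;i&gt;sigma&lt;/i&gt; is a bandwidth parameter you have to choose:&lt;/p&gt;

&lt;pre class="codebox"&gt;function sim = gaussianKernel(x1, x2, sigma)
  % force both inputs into column vectors and take their difference
  d = x1(:) - x2(:);
  % 1 when the vectors are identical, falling toward 0 as they move apart
  sim = exp(-(d' * d) / (2 * sigma ^ 2));
end
&lt;/pre&gt;

&lt;p&gt;A small &lt;i&gt;sigma&lt;/i&gt; makes the similarity drop off quickly, giving a wigglier boundary; a large &lt;i&gt;sigma&lt;/i&gt; smooths it out.&lt;/p&gt;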

&lt;p&gt;The SVM algorithm is sped up by a performance hack called the kernel trick, which I understand just in general outline: The &lt;a href="http://en.wikipedia.org/wiki/Kernel_trick"&gt;kernel trick&lt;/a&gt; is a way of mapping observations into a higher dimensional space &lt;i&gt;V&lt;/i&gt;, without ever having to compute the mapping explicitly. The trick is to use learning algorithms that only require dot products between the vectors in &lt;i&gt;V&lt;/i&gt;, and choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a kernel function.&lt;/p&gt;
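
&lt;p&gt;A tiny worked example may help. It uses the polynomial kernel (x.z)&lt;sup&gt;2&lt;/sup&gt; on 2-D vectors, picked only because the arithmetic is easy to check by hand; the vectors and the mapping &lt;i&gt;phi&lt;/i&gt; are made up for illustration. The explicit route maps both vectors into 3 dimensions and takes the dot product there. The shortcut never leaves 2 dimensions.&lt;/p&gt;

&lt;pre class="codebox"&gt;x = [1; 2];
z = [3; 4];

% explicit mapping from 2 dimensions into 3
phi = @(v) [v(1)^2; sqrt(2)*v(1)*v(2); v(2)^2];

explicit = phi(x)' * phi(z);   % dot product in the higher dimensional space
shortcut = (x' * z) ^ 2;       % kernel function evaluated in the original space
&lt;/pre&gt;

&lt;p&gt;Both come out to 121, up to floating-point rounding. With the Gaussian kernel the implied space is infinite-dimensional, so the shortcut is the only practical route.&lt;/p&gt;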

&lt;p&gt;There is some equivalence between SVMs and neural networks that I don't quite grasp. The process of computing the kernel function on the input vectors is something like the hidden layer of the neural network, which transforms and weighs the input features. I'm not sure if the analogy between &lt;a href="http://www.svms.org/anns.html"&gt;SVMs and ANNs&lt;/a&gt; goes deeper. Also by virtue of the kernels, SVMs are a member of a more general class of statistical algorithms called kernel methods.&lt;/p&gt;

&lt;p&gt;The exercise was to build a spam filter based on a small subset of the &lt;a href="http://spamassassin.apache.org/publiccorpus/"&gt;SpamAssassin public corpus&lt;/a&gt; of 6047 messages, of which roughly a third are spam. I trained an SVM and tried it on email from my spam-magnet yahoo email address, and it worked!&lt;/p&gt;

&lt;p&gt;So, I guess the upshot is that I'm still a little hazy on SVMs, if a bit less so than before. If I really want to know more, there's the source code that came with the homework. Or, I could read &lt;a href="http://dl.acm.org/citation.cfm?id=130401"&gt;A training algorithm for optimal margin classifiers&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;More SVM links&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v24/n12/full/nbt1206-1565.html"&gt;What is a support vector machine?&lt;/a&gt; by William S Noble from Nature Biotechnology's wonderful &lt;a href="/2011/06/primers-in-computational-biology.html"&gt;Primers in Computational Biology&lt;/a&gt; series.&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.jstatsoft.org/v15/i09/paper"&gt;Support Vector Machines in R&lt;/a&gt; by Karatzoglou, Meyer, and Hornik&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/"&gt;libSVM&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;There's a Ruby wrapper around the Java port of libSVM, described in &lt;i&gt;&lt;a href="http://rubyforscientificresearch.blogspot.com/2011/11/creating-rbf-models-with-svmtoolkit.html"&gt;Creating RBF models with svm_toolkit&lt;/a&gt;&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3930508400830165342?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3930508400830165342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/11/support-vector-machines.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3930508400830165342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3930508400830165342'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/11/support-vector-machines.html' title='Support Vector Machines'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-fg0jkx9TRyU/TtcfpqCmU6I/AAAAAAAADSQ/ZOG4BKwgTMA/s72-c/svm_diagram_large_margin.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-7298524970851653821</id><published>2011-11-30T11:20:00.001-08:00</published><updated>2011-12-11T16:36:45.791-08:00</updated><title type='text'>2012 conference dates</title><content type='html'>&lt;p&gt;Here are a few conferences for 2012 in computing or bioinformatics:&lt;/p&gt;

&lt;style type="text/css"&gt;
  #conf_list li {
    padding-top:6pt;
  }
&lt;/style&gt;

&lt;ul id="conf_list"&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://strataconf.com/strata2012"&gt;Strata&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    February 28-March 1, 2012&lt;br&gt;
    Santa Clara, CA&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://vizbi.org/"&gt;Visualizing Biological Data&lt;/a&gt; (VIZBI 2012)&lt;/b&gt;&lt;br&gt;
    6-8 March 2012&lt;br&gt;
    Heidelberg, Germany&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://www.systemsbiology.net/symposium/"&gt;ISB Symposium&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    15-16 April 2012&lt;br&gt;
    Seattle, WA&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://www.alleninstitute.org/events/overview.html"&gt;Allen Brain Atlas Hackathon&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    18-22 June 2012&lt;br&gt;
    Seattle, WA&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://googlecode.blogspot.com/2011/11/google-io-2012-extended-to-three-days.html"&gt;Google I/O&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    June 27-29, 2012&lt;br&gt;
    Moscone Center West, San Francisco&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://www.iscb.org/ismb2012"&gt;ISMB Intelligent Systems for Molecular Biology&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    13-17 July 2012&lt;br&gt;
    Long Beach, CA&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://www.oscon.com/oscon2012"&gt;OSCON 2012&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    July 16-20, 2012&lt;br&gt;
    Portland, Oregon&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://wiki.g2.bx.psu.edu/Events/GCC2012"&gt;2012 Galaxy Community Conference&lt;/a&gt;&lt;/b&gt; (GCC2012)&lt;br&gt;
    July 25-27, 2012&lt;br&gt;
    Chicago, Illinois&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://www.icsb2012toronto.com/"&gt;International Conference on Systems Biology (ICSB-2012)&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    19-23 August 2012&lt;br&gt;
    Toronto, Canada&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://www.eccb12.org/"&gt;European Conference on Computational Biology&lt;/a&gt; (ECCB 2012)&lt;/b&gt;&lt;br&gt;
    9-12 September 2012&lt;br&gt;
    Basel, Switzerland&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://thestrangeloop.com/"&gt;Strange Loop&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    September 23-25, 2012&lt;br&gt;
    St. Louis, MO&lt;/li&gt;
&lt;li&gt;&lt;b&gt;ACM International Conference on Bioinformatics, Computational Biology
    and Biomedicine&lt;/b&gt;&lt;br&gt;
    8-10 October 2012&lt;br&gt;
    Orlando, FL&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://visweek.org/"&gt;VisWeek&lt;/a&gt;&lt;/b&gt;&lt;br&gt;
    October 14-19, 2012&lt;br&gt;
    Seattle, WA&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href="http://splashcon.org/2012/"&gt;ACM SPLASH&lt;/a&gt;&lt;/b&gt; formerly known as OOPSLA&lt;br&gt;
    October 19-26, 2012&lt;br&gt;
    Tucson, AZ&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, here's the full list of &lt;a href="http://conferences.oreillynet.com/"&gt;O'Reilly conferences&lt;/a&gt;. Last year's Strata on big data/data science was really fun.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://biovis.net/"&gt;Biovis&lt;/a&gt;, not to be confused with VizBi (see above) is part of VisWeek and will be in Seattle next October.&lt;/p&gt; 

&lt;p&gt;We'll soon be adding our own Systems Bioinformatics Workshop to the schedule, probably sometime in the Summer. Hope to see you there.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-7298524970851653821?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/7298524970851653821/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/11/2012-conference-dates.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/7298524970851653821'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/7298524970851653821'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/11/2012-conference-dates.html' title='2012 conference dates'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4536414120585082906</id><published>2011-11-10T21:18:00.001-08:00</published><updated>2011-11-10T22:01:14.166-08:00</updated><title type='text'>Matrix arithmetic</title><content type='html'>&lt;p&gt;Here are a couple bits of basic linear algebra that'll come in handy in the &lt;a href="http://digitheadslabnotebook.blogspot.com/2011/10/stanford-machine-learning-class.html"&gt;Machine Learning class&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;How to multiply matrices&lt;/h4&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/--5tsZUPw0j8/TrywbQwyP2I/AAAAAAAADNI/D46GvqE3S4I/s1600/multiply_matrices.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="106" width="358" src="http://2.bp.blogspot.com/--5tsZUPw0j8/TrywbQwyP2I/AAAAAAAADNI/D46GvqE3S4I/s400/multiply_matrices.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;h4&gt;Matrix Identities&lt;/h4&gt;

&lt;p&gt;If X and Y are two vectors of length m:&lt;/p&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-WCknZ2DDabI/Try3FpaKT8I/AAAAAAAADNg/x2tyYp2ZXyg/s1600/dot_product.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="68" width="263" src="http://3.bp.blogspot.com/-WCknZ2DDabI/Try3FpaKT8I/AAAAAAAADNg/x2tyYp2ZXyg/s400/dot_product.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-wv5vnRtNl8c/Try3F1z5IqI/AAAAAAAADNo/f9TA27PkvFo/s1600/sum_of_squares_vector.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="68" width="160" src="http://3.bp.blogspot.com/-wv5vnRtNl8c/Try3F1z5IqI/AAAAAAAADNo/f9TA27PkvFo/s400/sum_of_squares_vector.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;If X is an (m x n) matrix and Y is an (n x 1) vector:&lt;/p&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-YTnPSjpVmXM/Try3FjrMaaI/AAAAAAAADNU/_9qErIJ7lJE/s1600/matrix_times_vector.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="125" width="321" src="http://1.bp.blogspot.com/-YTnPSjpVmXM/Try3FjrMaaI/AAAAAAAADNU/_9qErIJ7lJE/s400/matrix_times_vector.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="http://tutorial.math.lamar.edu/Classes/LinAlg/LinAlg.aspx"&gt;Paul's Online Math Notes on Linear Algebra&lt;/a&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4536414120585082906?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4536414120585082906/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/11/matrix-arithmetic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4536414120585082906'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4536414120585082906'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/11/matrix-arithmetic.html' title='Matrix arithmetic'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/--5tsZUPw0j8/TrywbQwyP2I/AAAAAAAADNI/D46GvqE3S4I/s72-c/multiply_matrices.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2919507916573621442</id><published>2011-10-26T22:19:00.000-07:00</published><updated>2011-12-20T16:14:36.616-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='reference'/><category scheme='http://www.blogger.com/atom/ns#' term='db'/><title 
type='text'>PostgreSQL cheat sheet</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-8ubh-X_hT3o/TqjpFRpU9uI/AAAAAAAADJE/RbtAU8PDUKI/s1600/postgres-logo.png" imageanchor="1" style="clear:right; float:right; margin-left:1em; margin-bottom:1em"&gt;&lt;img border="0" height="80" width="100" src="http://4.bp.blogspot.com/-8ubh-X_hT3o/TqjpFRpU9uI/AAAAAAAADJE/RbtAU8PDUKI/s320/postgres-logo.png" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p&gt;A handy cheat-sheet for the PostgreSQL database, for when I'm too lazy to dig through the &lt;a href="http://www.postgresql.org/docs/"&gt;docs&lt;/a&gt; or find &lt;a href="http://www.petefreitag.com/cheatsheets/postgresql/"&gt;another cheat-sheet&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Start and stop server&lt;/h4&gt;
&lt;pre class="codebox"&gt;sudo su postgres -c '/opt/local/lib/postgresql/bin/postgres -D /opt/local/var/db/postgres/defaultdb'
# press Ctrl-Z to suspend the server, then resume it in the background:
bg&lt;/pre&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;pre class="codebox"&gt;su -c 'pg_ctl start -D /opt/local/var/db/postgres/defaultdb -l postgreslog' postgres&lt;/pre&gt;

&lt;p&gt;To shut down:&lt;/p&gt;
&lt;pre class="codebox"&gt;sudo su postgres -c 'pg_ctl stop -D /opt/local/var/db/postgres/defaultdb'&lt;/pre&gt;

&lt;h4&gt;Run client&lt;/h4&gt;
&lt;pre class="codebox"&gt;psql -U postgres&lt;/pre&gt;

&lt;h4&gt;Granting access privileges&lt;/h4&gt;
&lt;pre class="codebox"&gt;create database dbname;
create user joe_mamma with password 'password';
grant all privileges on database dbname to joe_mamma;
grant all privileges on all tables in schema public to joe_mamma;
grant all privileges on all sequences in schema public to joe_mamma;&lt;/pre&gt;

&lt;p&gt;See the docs for &lt;a href="http://www.postgresql.org/docs/current/static/sql-grant.html"&gt;GRANT&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;SQL dump and restore&lt;/h4&gt;
&lt;pre class="codebox"&gt;pg_dump -U postgres dbname | gzip &gt; dbname.dump.2011.10.24.gz&lt;/pre&gt;
&lt;pre class="codebox"&gt;gunzip &lt; dbname.dump.2011.10.24.gz | sudo -u postgres psql --dbname dbname&lt;/pre&gt;

&lt;p&gt;For more, see &lt;a href="http://www.postgresql.org/docs/current/interactive/backup.html"&gt;Backup and Restore&lt;/a&gt; from the Postgres manual.&lt;/p&gt;

&lt;h4&gt;Truncate&lt;/h4&gt;
&lt;p&gt;Delete all data from a table and from any tables that reference it through foreign keys.&lt;/p&gt;
&lt;pre class="codebox"&gt;truncate my_table CASCADE;&lt;/pre&gt;

&lt;h4&gt;Sequences&lt;/h4&gt;
&lt;p&gt;&lt;a href="http://www.postgresql.org/docs/current/static/functions-sequence.html"&gt;Sequences can be manipulated&lt;/a&gt; with currval and setval.&lt;/p&gt;
&lt;pre class="codebox"&gt;select currval('my_table_id_seq');
select setval('my_table_id_seq',1,false);&lt;/pre&gt;

&lt;h4&gt;Trouble-shooting&lt;/h4&gt;
&lt;p&gt;If you see an Ident authentication error...&lt;/p&gt;
&lt;pre class="codebox"&gt;FATAL:  Ident authentication failed for user "postgres"&lt;/pre&gt;
&lt;p&gt;... look in your pg_hba.conf file. Ask Postgres where this file is by typing, "show hba_file;".&lt;/p&gt;
&lt;pre class="codebox"&gt;sudo cat /etc/postgresql/9.0/main/pg_hba.conf&lt;/pre&gt;
&lt;p&gt;You might see a line that looks like this:&lt;/p&gt;
&lt;pre class="codebox"&gt;local  all  all      ident&lt;/pre&gt;

&lt;p&gt;The 'ident' method means Postgres uses your shell account name to log you in, so specifying the user on the command line with "psql -U postgres" doesn't help. Either change "ident" in the pg_hba.conf to "md5" or "trust" and restart postgres, or just do what it wants: "sudo -u postgres psql". More on this can be found in &lt;a href="http://www.depesz.com/index.php/2007/10/04/ident/"&gt;“FATAL: Ident authentication failed”, or how cool ideas get bad usage schemas&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2919507916573621442?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2919507916573621442/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/postgresql-cheat-sheet.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2919507916573621442'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2919507916573621442'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/postgresql-cheat-sheet.html' title='PostgreSQL cheat sheet'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-8ubh-X_hT3o/TqjpFRpU9uI/AAAAAAAADJE/RbtAU8PDUKI/s72-c/postgres-logo.png' height='72'
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1676909275937209858</id><published>2011-10-26T21:34:00.000-07:00</published><updated>2011-10-26T22:07:48.191-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Machine-learning: gradient descent</title><content type='html'>&lt;p&gt;The first section of &lt;a href="http://ml-class.org"&gt;Andrew Ng's Machine Learning class&lt;/a&gt; is about applying gradient descent to linear regression problems.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-S11k6vJwxdc/TpDINkQ_bDI/AAAAAAAADGY/Q8t09bzT4y0/s1600/house_prices_portland.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="162" width="400" src="http://4.bp.blogspot.com/-S11k6vJwxdc/TpDINkQ_bDI/AAAAAAAADGY/Q8t09bzT4y0/s400/house_prices_portland.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Our input data is an &lt;i&gt;m-by-n&lt;/i&gt; matrix &lt;i&gt;X&lt;/i&gt;, where we have &lt;i&gt;m&lt;/i&gt; training examples with &lt;i&gt;n&lt;/i&gt; features each. For these training examples, we know the expected outputs &lt;i&gt;y&lt;/i&gt; where &lt;i&gt;y&lt;/i&gt; is the variable we're trying to predict. We want to find a line defined by the parameter vector ϴ that minimizes the squared error between the line and our data points.&lt;/p&gt;

&lt;p&gt;Gradient descent minimizes a cost function, which here is the squared error of the predictions against the training data.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-7vLgOxMhaBg/Tqjb34B9A9I/AAAAAAAADIg/w5E6cXN6x1c/s1600/cost_function.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="60" width="284" src="http://3.bp.blogspot.com/-7vLgOxMhaBg/Tqjb34B9A9I/AAAAAAAADIg/w5E6cXN6x1c/s320/cost_function.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;The update rule moves each ϴ&lt;sub&gt;j&lt;/sub&gt; a small step against the partial derivative of the cost function with respect to ϴ&lt;sub&gt;j&lt;/sub&gt;, scaled by the learning rate α.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-ysYz2pUSNXc/TqdJwZqGs9I/AAAAAAAADH8/0HROxzQ49HI/s1600/ml-class-notes-gradient-descent-1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="204" width="320" src="http://1.bp.blogspot.com/-ysYz2pUSNXc/TqdJwZqGs9I/AAAAAAAADH8/0HROxzQ49HI/s320/ml-class-notes-gradient-descent-1.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Part of the challenge is converting this to matrix notation, to take advantage of fast matrix arithmetic algorithms.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-XkseSMwpOaU/TqdJwu2QxII/AAAAAAAADIM/SWaFvyVtMdk/s1600/ml-class-notes-gradient-descent-2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="283" width="320" src="http://4.bp.blogspot.com/-XkseSMwpOaU/TqdJwu2QxII/AAAAAAAADIM/SWaFvyVtMdk/s320/ml-class-notes-gradient-descent-2.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Next we vectorize the update rule and show how to compute the least-squares solution directly with the normal equation.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-trhkyiBiaTs/TqdJxHBlVvI/AAAAAAAADIU/qFztMw_Ysys/s1600/ml-class-notes-gradient-descent-3.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="249" width="320" src="http://3.bp.blogspot.com/-trhkyiBiaTs/TqdJxHBlVvI/AAAAAAAADIU/qFztMw_Ysys/s320/ml-class-notes-gradient-descent-3.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;In action, gradient descent gradually approaches optimal values for ϴ. How gradual depends on the learning rate, α.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-Jq0vhkuzgfU/TqjfI_mXhLI/AAAAAAAADIs/juXvuk4qvYs/s1600/gradient_descent_linear_regression_screenshot.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="300" width="400" src="http://1.bp.blogspot.com/-Jq0vhkuzgfU/TqjfI_mXhLI/AAAAAAAADIs/juXvuk4qvYs/s400/gradient_descent_linear_regression_screenshot.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1676909275937209858?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1676909275937209858/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/machine-learning-gradient-descent.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1676909275937209858'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1676909275937209858'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/machine-learning-gradient-descent.html' title='Machine-learning: gradient descent'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-S11k6vJwxdc/TpDINkQ_bDI/AAAAAAAADGY/Q8t09bzT4y0/s72-c/house_prices_portland.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1132618042602310131</id><published>2011-10-22T23:57:00.000-07:00</published><updated>2011-11-30T20:55:48.241-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='reference'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><category scheme='http://www.blogger.com/atom/ns#' term='scientific computing'/><title type='text'>Octave cheat sheet</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-NOBHaVzE62I/TqO6wuij3AI/AAAAAAAADHw/97LYh9snM2M/s1600/octave_logo.png" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="80" width="80" src="http://3.bp.blogspot.com/-NOBHaVzE62I/TqO6wuij3AI/AAAAAAAADHw/97LYh9snM2M/s200/octave_logo.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;I'm mucking about with &lt;a href="http://www.gnu.org/software/octave/"&gt;Octave&lt;/a&gt;, MATLAB's open source cousin, as part of &lt;a href="http://ml-class.org"&gt;Stanford's Machine Learning class&lt;/a&gt;. Here are a few crib notes to keep me right side up.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://www.gnu.org/software/octave/doc/interpreter/"&gt;docs for Octave&lt;/a&gt; must be served from a Commodore 64 in Siberia judging by the speed, but Matlab's &lt;a href="http://www.mathworks.com/help/techdoc/ref/f16-6011.html"&gt;Function Reference&lt;/a&gt; is convenient.&lt;/p&gt;


&lt;h4&gt;Matrices&lt;/h4&gt;
&lt;p&gt;Try some &lt;a href="http://www.mathworks.com/help/techdoc/ref/f16-5872.html"&gt;matrix operations&lt;/a&gt;. Create a 2x3 matrix, and a 3x2 matrix. Multiply them to get a 2x2 matrix. Try &lt;a href="http://www.mathworks.com/help/techdoc/ref/colon.html"&gt;indexing&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="codebox"&gt;&amp;gt;&amp;gt; A = [1 2 3; 4 5 6]
A =
   1   2   3
   4   5   6

&amp;gt;&amp;gt; B = 2 * ones(3,2)
B =
   2   2
   2   2
   2   2

&amp;gt;&amp;gt; size(B)
ans =
   3   2

&amp;gt;&amp;gt; A * B  % matrix multiplication
ans =
   12   12
   30   30

&amp;gt;&amp;gt; who    % list variables
A    B    ans

&amp;gt;&amp;gt; A(2,3) % get row 2, column 3
ans =  6

&amp;gt;&amp;gt; A(2,:) % get 2nd row
ans =
   4   5   6

&amp;gt;&amp;gt; A'     % A transpose
ans =
   1   4
   2   5
   3   6

&amp;gt;&amp;gt; A' .* B  % element-wise multiply
ans =
    2    8
    4   10
    6   12
&lt;/pre&gt;

&lt;h4&gt;Sum&lt;/h4&gt;
&lt;p&gt;&lt;a href="http://www.mathworks.com/help/techdoc/ref/sum.html"&gt;sum&lt;/a&gt;(A,dim) is a little bass-ackwards in that the columns are dimension 1, rows are dimension 2, contrary to R and common sense.&lt;/p&gt;&lt;pre class="codebox"&gt;&amp;gt;&amp;gt; sum(A,2)
ans =
    6
   15
&lt;/pre&gt;

&lt;h4&gt;Max&lt;/h4&gt;
&lt;p&gt;The &lt;a href="http://www.mathworks.com/help/techdoc/ref/max.html"&gt;max&lt;/a&gt; function operates strangely. There are at least 3 forms of &lt;a href="http://www.gnu.org/software/octave/doc/interpreter/Utility-Functions.html#doc_002dmax"&gt;max&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="codebox"&gt;[C,I] = max(A)
C = max(A,B)
[C,I] = max(A,[],dim)
&lt;/pre&gt;
&lt;p&gt;If &lt;i&gt;v&lt;/i&gt; is a vector, &lt;i&gt;max(v)&lt;/i&gt; returns the largest element of v. If &lt;i&gt;A&lt;/i&gt; is an &lt;i&gt;m x n&lt;/i&gt; matrix, &lt;i&gt;max(A)&lt;/i&gt; returns a row vector of length &lt;i&gt;n&lt;/i&gt; holding the largest element from each column of A. The optional second return value, I, holds the indices of the largest values.&lt;/p&gt;
&lt;p&gt;To get the row maximums, use the third form, with an empty vector as the second parameter. Oddly, setting dim=1 gives you the max of the columns, while dim=2 gives the row maximums.&lt;/p&gt;

&lt;h4&gt;Navigation and Reading data&lt;/h4&gt;
&lt;p&gt;Perform &lt;a href="http://www.mathworks.com/help/techdoc/ref/f16-11063.html#f16-29665"&gt;file operations&lt;/a&gt; with Unix shell type commands: pwd, ls, cd. &lt;a href="http://www.mathworks.com/help/techdoc/ref/f16-5702.html#f16-14492"&gt;Import and export data&lt;/a&gt;, like this:&lt;/p&gt;
&lt;pre class="codebox"&gt;&amp;gt;&amp;gt; data = csvread(&amp;#x27;ex1data1.txt&amp;#x27;);&lt;/pre&gt;
&lt;pre class="codebox"&gt;&amp;gt;&amp;gt; load binary_file.dat&lt;/pre&gt;

&lt;h4&gt;Printing output&lt;/h4&gt;
&lt;p&gt;The disp function is Octave's word for 'print'.&lt;/p&gt;
&lt;pre class="codebox"&gt;disp(sprintf('pi to 5 decimal places: %0.5f', pi))&lt;/pre&gt;

&lt;h4&gt;Histogram&lt;/h4&gt;
&lt;p&gt;Plot a histogram for some normally distributed random numbers&lt;/p&gt;
&lt;pre class="codebox"&gt;&amp;gt;&amp;gt; w = -6 + sqrt(10)*(randn(1,10000))  % (mean = 1, var = 2)
&amp;gt;&amp;gt; hist(w,40)
&lt;/pre&gt;

&lt;h4&gt;Plotting&lt;/h4&gt;
&lt;p&gt;&lt;a href="http://www.gnu.org/software/octave/doc/interpreter/Two_002dDimensional-Plots.html#Two_002dDimensional-Plots"&gt;Plotting&lt;/a&gt;&lt;/p&gt;
&lt;pre class="codebox"&gt;
t = [0:0.01:0.99];
y1 = sin(2*pi*4*t); 
plot(t,y1);
y2 = cos(2*pi*2*t);
hold on;         % "hold off" to turn off
plot(t,y2,'r');
xlabel('time');
ylabel('value');
legend('sin','cos');
title('my plot');
print -dpng 'myPlot.png'
close;           % or,  "close all" to close all figs
&lt;/pre&gt;

&lt;p&gt;Multiple plots in a grid.&lt;/p&gt;
&lt;pre class="codebox"&gt;
figure(2), clf;  % select figure 2 and clear it
subplot(1,2,1);  % Divide plot into 1x2 grid, access 1st element
plot(t,y1);
subplot(1,2,2);  % Divide plot into 1x2 grid, access 2nd element
plot(t,y2);
axis([0.5 1 -1 1]);  % change axis scale
&lt;/pre&gt;

&lt;p&gt;Heatmap.&lt;/p&gt;
&lt;pre class="codebox"&gt;
figure;
imagesc(magic(15)), colorbar&lt;/pre&gt;

&lt;p&gt;These crib notes are based on the &lt;a href="http://s3.amazonaws.com/mlclass-resources/docs/octave_session.m"&gt;Octave tutorial&lt;/a&gt; from the ml class by Andrew Ng. Also check out the nice and quick &lt;a href="http://math.jacobs-university.de/oliver/teaching/iub/resources/octave/octave-intro/octave-intro.html"&gt;Introduction to GNU Octave&lt;/a&gt;. I'm also collecting a few notes on &lt;a href="/2011/11/matrix-arithmetic.html"&gt;matrix arithmetic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Defining a function&lt;/p&gt;
&lt;pre class="codebox"&gt;
function ret = test(a)
  ret = a + 1;
end
&lt;/pre&gt;

&lt;p&gt;Also see Peter Acklam's &lt;a href="http://home.online.no/~pjacklam/matlab/doc/mtt/"&gt;MATLAB array manipulation tips and tricks&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1132618042602310131?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1132618042602310131/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/octave-cheat-sheet.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1132618042602310131'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1132618042602310131'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/octave-cheat-sheet.html' title='Octave cheat sheet'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-NOBHaVzE62I/TqO6wuij3AI/AAAAAAAADHw/97LYh9snM2M/s72-c/octave_logo.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-5711788857186814015</id><published>2011-10-18T20:53:00.000-07:00</published><updated>2011-11-06T17:39:08.647-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='reference'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Python'/><title type='text'>Python cheat sheet</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://python.org" imageanchor="1" style="clear:right; float:right; margin-left:1em; margin-bottom:1em"&gt;&lt;img border="0" height="148" width="149" src="http://1.bp.blogspot.com/-o0mDgLBOZMc/Tp5IyqfWpMI/AAAAAAAADHI/rcSMmGnTn8s/s200/pythonlogo.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;The most important &lt;a href="http://docs.python.org/"&gt;docs&lt;/a&gt; at &lt;a href="http://python.org/"&gt;python.org&lt;/a&gt; are the tutorial and &lt;a href="http://docs.python.org/library/"&gt;library reference&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Pointers to the docs&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://docs.python.org/library/stdtypes.html#string-methods"&gt;string methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://docs.python.org/tutorial/datastructures.html#more-on-lists"&gt;list methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://docs.python.org/library/stdtypes.html#dict"&gt;dictionary methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://docs.python.org/library/stdtypes.html#set-types-set-frozenset"&gt;set methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://docs.python.org/library/stdtypes.html#bltin-file-objects"&gt;File objects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://docs.python.org/library/functions.html"&gt;built-in functions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important modules: &lt;a href="http://docs.python.org/library/sys.html"&gt;sys&lt;/a&gt;, &lt;a href="http://docs.python.org/library/os.html"&gt;os&lt;/a&gt;, &lt;a href="http://docs.python.org/library/os.path.html"&gt;os.path&lt;/a&gt;, &lt;a href="http://docs.python.org/library/re.html"&gt;re&lt;/a&gt;, &lt;a href="http://docs.python.org/library/math.html"&gt;math&lt;/a&gt;, &lt;a href="http://docs.python.org/library/io.html"&gt;io&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Inspecting objects&lt;/h4&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; help(obj)
&amp;gt;&amp;gt;&amp;gt; dir(obj)
&lt;/code&gt;&lt;/pre&gt;

&lt;h4&gt;List comprehensions and generators&lt;/h4&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;numbers = [2, 3, 4, 5, 6, 7, 8, 9]
[x * x for x in numbers if x % 2 == 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://docs.python.org/tutorial/classes.html#generators"&gt;Generators&lt;/a&gt; might be thought of as lazy list comprehensions. Really they're just a compact
syntax for defining an iterator.&lt;/p&gt;

&lt;pre class="codebox"&gt;&lt;code&gt;def generate_squares():
  i = 1
  while True:
    yield i*i
    i += 1
&lt;/code&gt;&lt;/pre&gt;
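&lt;p&gt;Since generate_squares never stops on its own, the caller decides how many values to pull. A small sketch using itertools.islice, plus a generator expression, which is the lazy twin of the list comprehension above (the variable names are just for illustration):&lt;/p&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;from itertools import islice

def generate_squares():
    i = 1
    while True:
        yield i*i
        i += 1

# take only the first five values from the infinite generator
first_five = list(islice(generate_squares(), 5))
print(first_five)          # [1, 4, 9, 16, 25]

# parentheses instead of brackets: nothing is computed until asked for
lazy_evens = (x * x for x in [2, 3, 4, 5, 6, 7, 8, 9] if x % 2 == 0)
first_even_square = next(lazy_evens)
print(first_even_square)   # 4
&lt;/code&gt;&lt;/pre&gt;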

&lt;h4&gt;Main method&lt;/h4&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;def main():
    # do something here

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See Guido's advice on &lt;a href="http://www.artima.com/weblogs/viewpost.jsp?thread=4829"&gt;main methods&lt;/a&gt;. To parse command line arguments use &lt;a href="http://docs.python.org/library/argparse.html"&gt;argparse&lt;/a&gt; instead of the older optparse or getopt.&lt;/p&gt;
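&lt;p&gt;A minimal argparse sketch. The program description, the argument names (filename, --limit, --verbose) and the sample command line are invented for illustration. Passing an explicit list to parse_args makes the parser easy to try out without touching sys.argv.&lt;/p&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;import argparse

def parse_args(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:]
    parser = argparse.ArgumentParser(description='Count lines in a file.')
    parser.add_argument('filename', help='file to read')
    parser.add_argument('-n', '--limit', type=int, default=10,
                        help='maximum number of lines')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='print extra detail')
    return parser.parse_args(argv)

args = parse_args(['data.txt', '--limit', '5'])
print(args.filename)   # data.txt
print(args.limit)      # 5
print(args.verbose)    # False
&lt;/code&gt;&lt;/pre&gt;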

&lt;h4&gt;Classes&lt;/h4&gt;
&lt;p&gt;The tutorial covers &lt;a href="http://docs.python.org/tutorial/classes.html"&gt;classes&lt;/a&gt;, but know that there are old-style classes and &lt;a href="http://www.python.org/doc/newstyle/"&gt;new-style classes&lt;/a&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;&lt;code&gt;class Foo(object):
  'Doc string for class'

  def __init__(self, a, b):
    'Doc string for constructor'
    self.a = a
    self.b = b
  
  def square_plus_a(self, x):
    'Doc string for a useless method'
    return x * x + self.a
    
  def __str__(self):
    return "Foo: a=%d, b=%d" % (self.a, self.b)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Preoccupation with classes is a bit pass&amp;eacute; these days. Javascript objects are just bags of properties to which you can &lt;a href="http://stackoverflow.com/questions/2827623/python-create-object-and-add-attributes-to-it"&gt;add arbitrary properties&lt;/a&gt; whenever you feel like it. In Ruby, you might use OpenStruct. It's quite easy in Python. You just have to define your own class. I'll follow the convention I've seen elsewhere of creating an empty class called &lt;i&gt;Object&lt;/i&gt; derived from the base &lt;i&gt;object&lt;/i&gt;. Why you &lt;a href="http://stackoverflow.com/questions/1529002/cant-set-attributes-of-object-class"&gt;can't set attributes on an object instance&lt;/a&gt; is something I'll leave to the Python gurus.&lt;/p&gt;

&lt;pre class="codebox"&gt;&lt;code&gt;class Object(object):
    pass

obj = Object()
obj.foo = 123
obj.bar = "A super secret message"
[name for name in dir(obj) if not name.startswith('__')]
['bar', 'foo']
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can add methods, too, but they act a little funny: a function assigned to an instance is stored as a plain attribute and isn't bound, so self isn't passed in automatically.&lt;/p&gt;
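&lt;p&gt;One way around this, sketched below, is to bind the function to the instance yourself with types.MethodType. The describe function and the attribute names are hypothetical.&lt;/p&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;import types

class Object(object):
    pass

def describe(self):
    return "foo is %d" % self.foo

obj = Object()
obj.foo = 123

# Plain assignment stores an ordinary function: no self is passed in,
# so calling obj.unbound() with no arguments raises a TypeError.
obj.unbound = describe
print(obj.unbound(obj))   # foo is 123

# types.MethodType binds the function to this one instance.
obj.describe = types.MethodType(describe, obj)
print(obj.describe())     # foo is 123
&lt;/code&gt;&lt;/pre&gt;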

&lt;h4&gt;Files&lt;/h4&gt;

&lt;p&gt;&lt;a href="/2010/03/how-to-read-file-line-by-line-in-python.html"&gt;Reading text files&lt;/a&gt; line by line can be done like so:&lt;/p&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;with open(filename, 'r') as f:
    for line in f:
        dosomething(line)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Be &lt;a href="http://docs.python.org/tutorial/inputoutput.html#methods-of-file-objects"&gt;careful&lt;/a&gt; not to mix iteration over lines in a &lt;a href="http://docs.python.org/library/stdtypes.html#bltin-file-objects"&gt;file&lt;/a&gt; with readline().&lt;/p&gt;

&lt;h4&gt;Exceptions&lt;/h4&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;try:
  raise Exception('My spammy exception!', 1234, 'zot')
except Exception as e:
  print(type(e))
  print(e)
finally:
  print("cleanup in finally clause!")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="http://docs.python.org/library/traceback.html"&gt;traceback&lt;/a&gt; module prints stack traces.&lt;/p&gt;
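&lt;p&gt;A small sketch of the traceback module in use: format_exc() hands back the current stack trace as a string, which is handy for logging. The &lt;i&gt;risky&lt;/i&gt; function is a made-up stand-in for whatever might fail.&lt;/p&gt;

```python
import traceback


def risky():
    """Hypothetical function that always fails, for illustration only."""
    raise ValueError("My spammy exception!")


try:
    risky()
except ValueError:
    # format_exc() returns the current stack trace as a string;
    # print_exc() would write the same text to stderr instead.
    trace = traceback.format_exc()
```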

&lt;h4&gt;Conditional Expressions&lt;/h4&gt;
&lt;p&gt;Finally added in Python 2.5:&lt;/p&gt;
&lt;pre class="codebox"&gt;x = true_value if condition else false_value&lt;/pre&gt;
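&lt;p&gt;Two quick illustrations, with function names invented for the example. The second one leans on the fact that only the chosen branch gets evaluated.&lt;/p&gt;

```python
def parity(n):
    """Label an integer using a conditional expression."""
    return "even" if n % 2 == 0 else "odd"


def safe_ratio(a, b):
    """Divide a by b, or return None when b is zero."""
    # Only the selected branch is evaluated, so the division
    # never runs when b == 0.
    return a / float(b) if b != 0 else None
```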

&lt;h4&gt;Packages&lt;/h4&gt;
&lt;p&gt;Here's a quick &lt;a href="http://www.djangobook.com/en/2.0/chapter02/"&gt;tip&lt;/a&gt; for finding out where installed packages are:&lt;/p&gt;
&lt;pre class="codebox"&gt;&lt;code&gt;python -c 'import sys, pprint; pprint.pprint(sys.path)'&lt;/code&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-5711788857186814015?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/5711788857186814015/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/python-cheat-sheet.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5711788857186814015'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5711788857186814015'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/python-cheat-sheet.html' title='Python cheat sheet'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-o0mDgLBOZMc/Tp5IyqfWpMI/AAAAAAAADHI/rcSMmGnTn8s/s72-c/pythonlogo.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2211411719600624680</id><published>2011-10-08T15:04:00.000-07:00</published><updated>2011-12-27T19:39:56.052-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='lectures'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category 
scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Stanford Machine Learning class</title><content type='html'>&lt;p&gt;Stanford is offering a free online version of its &lt;a href="http://www.ml-class.org/"&gt;Machine Learning&lt;/a&gt; class taught by Andrew Ng. Study groups are popping up everywhere. Cool!&lt;/p&gt;

&lt;p&gt;The class officially starts Monday, October 10th, but the first few lectures are up already, broken into bite-size pieces of 10 minutes or so. What I've seen so far is at a basic level, covering a course introduction and terminology. Ng then poses a linear regression problem.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-S11k6vJwxdc/TpDINkQ_bDI/AAAAAAAADGY/Q8t09bzT4y0/s1600/house_prices_portland.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="162" width="400" src="http://4.bp.blogspot.com/-S11k6vJwxdc/TpDINkQ_bDI/AAAAAAAADGY/Q8t09bzT4y0/s400/house_prices_portland.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;We want to find a line &lt;i&gt;y = ϴ&lt;sub&gt;0&lt;/sub&gt; + ϴ&lt;sub&gt;1&lt;/sub&gt; x&lt;/i&gt; such that we minimize the squared error between our line and our data points.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-cvSE5Src5Go/TpDIU6c62oI/AAAAAAAADGg/l0vEZohkEwI/s1600/linear_regression.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="183" width="400" src="http://4.bp.blogspot.com/-cvSE5Src5Go/TpDIU6c62oI/AAAAAAAADGg/l0vEZohkEwI/s400/linear_regression.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;The solution is our first learning algorithm, gradient descent.&lt;/p&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-y3IlYMHqvNs/TpDIcbOobbI/AAAAAAAADGo/JHMhfkEDO8k/s1600/gradient_descent.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="168" width="400" src="http://4.bp.blogspot.com/-y3IlYMHqvNs/TpDIcbOobbI/AAAAAAAADGo/JHMhfkEDO8k/s400/gradient_descent.png" /&gt;&lt;/a&gt;&lt;/div&gt;
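&lt;p&gt;The class exercises are in Octave, but the idea fits in a few lines of plain Python. This is only a rough sketch of batch gradient descent for the one-variable case: the learning rate &lt;i&gt;alpha&lt;/i&gt;, the iteration count, and the toy data are my own picks, not anything from the lectures.&lt;/p&gt;

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit y = theta0 + theta1 * x by batch gradient descent on squared error."""
    m = float(len(xs))
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error for every training example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost (1/2m) * sum(error^2).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data lying exactly on the line y = 1 + 2x (made up for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
```

&lt;p&gt;If alpha is too large the updates overshoot and diverge; if it's too small, convergence just takes longer.&lt;/p&gt;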

&lt;p&gt;The real class at Stanford is: &lt;a href="http://cs229.stanford.edu/"&gt;CS229&lt;/a&gt;. Exercises are to be done in &lt;a href="http://www.gnu.org/software/octave/"&gt;Octave&lt;/a&gt;. Recommended reading includes the &lt;a href="http://digitheadslabnotebook.blogspot.com/2011/03/learning-data-science-skills.html"&gt;usual suspects&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pattern Recognition and Machine Learning, Christopher Bishop&lt;/li&gt;
&lt;li&gt;Machine Learning, Tom Mitchell&lt;/li&gt;
&lt;li&gt;The Elements of Statistical Learning, Hastie, Tibshirani and Friedman&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several of the &lt;a href="http://digitheadslabnotebook.blogspot.com/2011/06/primers-in-computational-biology.html"&gt;Primers in Computational Biology&lt;/a&gt; series would probably make for good supplementary material.&lt;/p&gt;

&lt;p&gt;There are threads related to the class on &lt;a href="http://www.quora.com/Stanford-ML-Class"&gt;Quora&lt;/a&gt; and &lt;a href="http://www.reddit.com/r/mlclass"&gt;Reddit&lt;/a&gt;, for whatever that's worth. Also, see &lt;a href="http://www.quora.com/Machine-Learning/What-are-some-good-resources-for-learning-about-machine-learning-Why"&gt;some good resources for learning about machine learning&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2211411719600624680?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2211411719600624680/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/stanford-machine-learning-class.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2211411719600624680'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2211411719600624680'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/10/stanford-machine-learning-class.html' title='Stanford Machine Learning class'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-S11k6vJwxdc/TpDINkQ_bDI/AAAAAAAADGY/Q8t09bzT4y0/s72-c/house_prices_portland.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2152615348421112435</id><published>2011-09-26T22:57:00.000-07:00</published><updated>2011-10-26T22:33:50.430-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Scala'/><category scheme='http://www.blogger.com/atom/ns#' term='Programming languages'/><category scheme='http://www.blogger.com/atom/ns#' term='clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Hipster programming languages</title><content type='html'>&lt;p&gt;If you look at the programming languages that are popular these days, a few patterns emerge. I'm not talking about languages that have the most hits on the job sites. I'm talking about what the cool kids are coding in - the folks that hang out on &lt;a href="http://news.ycombinator.com"&gt;hacker-news&lt;/a&gt; or at &lt;a href="https://thestrangeloop.com/news/strange-loop-2011"&gt;Strange Loop&lt;/a&gt;. Languages like &lt;b&gt;Clojure&lt;/b&gt;, &lt;b&gt;Scala&lt;/b&gt; and &lt;b&gt;CoffeeScript&lt;/b&gt;. What do these diverse languages have in common other than an aura of geek-chic?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functional programming is emphasized over object-oriented programming.&lt;/li&gt;
&lt;li&gt;Common patterns for manipulating lists: map, filter, reduce.&lt;/li&gt;
&lt;li&gt;Modern syntax in which everything is an expression and syntactic noise like semicolons is reduced.&lt;/li&gt;
&lt;li&gt;CoffeeScript compiles to JavaScript, while both Clojure and Scala target the JVM. Targeting legacy platforms seems to be getting easier and more popular.&lt;/li&gt;
&lt;li&gt;Innovative approaches to concurrency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last point deserves some elaboration. It might be a stretch to compare the event-driven nature of node.js with immutable data structures, the actor model, and software transactional memory, but, at heart, these are all strategies for dealing with concurrency. One place where Java was ahead of its peers was &lt;a href="http://jcip.net/"&gt;concurrency&lt;/a&gt;, so it's cool that Clojure and Scala are taking the next steps in concurrent programming on the JVM.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-klx3hUsOk6Y/ToFk0uItX3I/AAAAAAAADF4/in1W6oD02EU/s1600/languages_stone_tablet.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="220" width="400" src="http://3.bp.blogspot.com/-klx3hUsOk6Y/ToFk0uItX3I/AAAAAAAADF4/in1W6oD02EU/s400/languages_stone_tablet.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;h4&gt;CoffeeScript&lt;/h4&gt;
&lt;p&gt;&lt;a href="http://jashkenas.github.com/coffee-script/"&gt;CoffeeScript&lt;/a&gt; is JavaScript, redesigned. It cleans up the syntax, adding many of the niceties familiar from Ruby and Python. Curly braces and semicolons are out. String interpolation, list-comprehensions, default arguments, and &lt;a href="http://jashkenas.github.com/coffee-script/"&gt;more tasty sugar&lt;/a&gt; are in. Its creator, &lt;a href="http://ashkenas.com/"&gt;Jeremy Ashkenas&lt;/a&gt;, believes in code as literature and it shows all the way through the project. Take a look at the &lt;a href="http://jashkenas.github.com/coffee-script/documentation/docs/grammar.html"&gt;annotated source code for the CoffeeScript grammar&lt;/a&gt; and see if it doesn't make you weep for the ugliness of your own code.&lt;/p&gt;

&lt;p&gt;Don't forget, CoffeeScript gets about 10 times hipper when combined with &lt;a href="http://nodejs.org/"&gt;node.js&lt;/a&gt;, the event-driven app-server based on Google's &lt;a href="http://code.google.com/p/v8/"&gt;V8 JavaScript engine&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Clojure&lt;/h4&gt;

&lt;div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://clojure.org/"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 100px; height: 100px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/SqLp-OZYfJI/AAAAAAAACX8/H8OtkbtoMpw/s200/clojure-icon.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5378118160259513490" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Clojure is a dialect of Lisp targeting the JVM. One of Clojure's key features is a set of immutable functional data structures, with efficiency comparable to Java's mutable collection classes. The beauty of a Lisp dialect is that the syntax of the language is also its data representation. Code is data, data is code. There are lots of &lt;a href="http://java.ociweb.com/mark/clojure/"&gt;Clojure resources&lt;/a&gt; floating around, including a famous talk by &lt;a href="http://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey"&gt;Rich Hickey on state, identity, value and time&lt;/a&gt; and a project to &lt;a href="http://sicpinclojure.com/"&gt;port SICP to Clojure&lt;/a&gt;. Extra points if your client-side code is written in &lt;a href="https://github.com/clojure/clojurescript"&gt;ClojureScript&lt;/a&gt;.&lt;/p&gt;
  
&lt;p&gt;Several hip people have recommended reading &lt;i&gt;The Joy of Clojure&lt;/i&gt;. Also on Rich Hickey's &lt;a href="http://www.amazon.com/Clojure-Bookshelf/lm/R3LG3ZBZS4GCTH"&gt;bookshelf&lt;/a&gt; is Chris Okasaki's book &lt;i&gt;Purely Functional Data Structures&lt;/i&gt;.&lt;/p&gt;

&lt;h4&gt;Scala&lt;/h4&gt;

&lt;div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.scala-lang.org/"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 75px; height: 75px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/S3W5la7FkGI/AAAAAAAAChA/aYPerEiQ8sA/s200/scala-logo.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5437456177653190754" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Scala is a strongly typed object-functional hybrid. Its targets include the JVM and Microsoft's CLR. It's an academic language derived from the ML family, but meant to be a pragmatic replacement for Java. It has a C++-like reputation for being fully understood only by guru-level developers. One of its key features is a type system that is Turing complete in itself. I guess I'm not completely convinced that a rocket-science type system is the answer, but there's cool stuff in there - generics done properly and higher-kinded types, which, as near as I can tell, take parametric types to a level of meta beyond generics. One nice thing is that Scala has a tighter mapping to Java than Clojure does, so the interop between the two is a little more reasonable.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://akka.io/"&gt;Akka project&lt;/a&gt; is a Scala platform for concurrent applications, providing both the &lt;b&gt;actor model&lt;/b&gt; and &lt;b&gt;software transactional memory&lt;/b&gt;. Those wanting to learn more can track down some interesting talks by language designer Martin Odersky, plus the &lt;a href="http://scalatypes.com/"&gt;Scala Types podcast&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;A couple more&lt;/h4&gt;
&lt;p&gt;Not enough languages for you? I'll throw in a couple more hip languages, &lt;b&gt;R&lt;/b&gt; and &lt;b&gt;Haskell&lt;/b&gt;. Truly cool kids know Haskell. What can I say, except that I am not yet that cool. I need to go out to a shed in the woods with a couple of books and learn me some Haskell.&lt;/p&gt;

&lt;p&gt;R may have a bastardized syntax, but, eventually, its &lt;a href="http://www.jstor.org/pss/1390807"&gt;functional core&lt;/a&gt; shines through. R is seeing a surge in popularity based on the highly hip and trendy field of &lt;a href="/2011/03/learning-data-science-skills.html"&gt;data science&lt;/a&gt;, where its powerful statistical methods and graphing come in handy. Aside from &lt;a href="http://www.rforge.net/doc/packages/multicore/multicore.html"&gt;mclapply&lt;/a&gt;, R is a bit lacking in support for concurrency. [&lt;span style="color:red;"&gt;See correction below in &lt;b&gt;comments&lt;/b&gt;!&lt;/span&gt;] &lt;a href="http://ml.stat.purdue.edu/rhipe/"&gt;Rhipe&lt;/a&gt; and Revolution Analytics's RHadoop are trying to change that by enabling &lt;a href="http://www.slideshare.net/jseidman/distributed-data-analysis-with-r-strangeloop-2011"&gt;distributed data analysis with R and Hadoop&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Fashion&lt;/h4&gt;
&lt;p&gt;You might be tempted to say it's all fashion. What goes around comes around. To some extent that's true, but, in each of these languages, there's something new and worthwhile to be learned. We have a ways to go before code is as expressive as we want it to be. Someone smart said that you'll like a programming language in proportion to what it teaches you. Mostly, I want to remind myself to set aside some time to play with these languages and see what new tricks they have to teach this old dog.&lt;/p&gt;

&lt;p&gt;PS: When this post grows up, it wants to be &lt;i&gt;&lt;a href="http://blog.fogus.me/2011/10/18/programming-language-development-the-past-5-years/"&gt;Programming language development: the past 5 years&lt;/a&gt;&lt;/i&gt; by the very hip Michael Fogus.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2152615348421112435?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2152615348421112435/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/09/hipster-programming-languages.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2152615348421112435'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2152615348421112435'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/09/hipster-programming-languages.html' title='Hipster programming languages'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-klx3hUsOk6Y/ToFk0uItX3I/AAAAAAAADF4/in1W6oD02EU/s72-c/languages_stone_tablet.png' height='72' 
width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4919771153115982248</id><published>2011-09-23T11:00:00.000-07:00</published><updated>2011-12-10T12:00:21.333-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='books'/><category scheme='http://www.blogger.com/atom/ns#' term='networks'/><category scheme='http://www.blogger.com/atom/ns#' term='Bioinformatics'/><title type='text'>Network Science</title><content type='html'>&lt;p&gt;Network analysis is hip. Applications range over social networks, security, biology, and economics. At this point, you'll hardly be the first one to the party, but if you want to give network science a try, here's a random grab-bag of resources to get started.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-hJ6trjBAnuI/TnzIeDJQqkI/AAAAAAAADFU/DpNZVPN6ao8/s1600/egrin_networks.jpg" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="285" width="320" src="http://3.bp.blogspot.com/-hJ6trjBAnuI/TnzIeDJQqkI/AAAAAAAADFU/DpNZVPN6ao8/s320/egrin_networks.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p style="font-size:smaller;"&gt;&lt;i&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/20664639"&gt;Coordination of frontline defense mechanisms under
severe oxidative stress&lt;/a&gt;&lt;/i&gt;, Kaur et al. 2010&lt;/p&gt;

&lt;h4&gt;Learning network science&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://www.cs.cornell.edu/home/kleinber/"&gt;Jon Kleinberg&lt;/a&gt;, a professor of computer science at Cornell University, co-wrote &lt;a href="http://www.cs.cornell.edu/home/kleinber/networks-book/"&gt;Networks, Crowds, and Markets: 
Reasoning About a Highly Connected World&lt;/a&gt; along with David Easley. He also wrote &lt;a href="http://www.aw-bc.com/info/kleinberg/"&gt;Algorithm Design&lt;/a&gt;, an undergraduate textbook.&lt;/p&gt;

&lt;p&gt;A 2004 review paper by Barabasi and Oltvai, &lt;i&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/14735121"&gt;Network biology: understanding the cell's functional organization&lt;/a&gt;&lt;/i&gt;, covers a broad range of applications of networks in modern biology. Barabasi is also the author of &lt;i&gt;Linked&lt;/i&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;a href="http://www.sciencemag.org/content/325/5939.toc"&gt;Science special issue on networks&lt;/a&gt;, from July 2009,  revisits the foundations of network analysis, and delves into applications to ecological interactions, counter-terrorism, and finance.&lt;/p&gt;

&lt;p&gt;Video and slides are available for Drew Conway's presentation on &lt;a href="http://www.drewconway.com/zia/?p=1221"&gt;social network analysis in R&lt;/a&gt;, which mostly focuses on software tools.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://networkx.lanl.gov/" imageanchor="1" style="float: right; margin:1em"&gt;&lt;img border="0" height="250" width="248" src="http://networkx.lanl.gov/_static/art1.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;h4&gt;Tools for analyzing networks&lt;/h4&gt;

&lt;p&gt;Software &lt;a href="http://www.wikiviz.org/wiki/Tools"&gt;tools&lt;/a&gt; for working with networks include the R packages &lt;a href="http://www.bioconductor.org/packages/release/bioc/html/graph.html"&gt;graph&lt;/a&gt;, &lt;a href="http://cran.r-project.org/web/packages/igraph/"&gt;igraph&lt;/a&gt;, and &lt;a href="http://cran.r-project.org/web/packages/network/"&gt;network&lt;/a&gt;. Also, the &lt;a href="http://networkx.lanl.gov/"&gt;NetworkX&lt;/a&gt; library for Python looks quite powerful. &lt;a href="http://www.wikiviz.org/wiki/Tools"&gt;Visualization tools&lt;/a&gt; tend to come and go, but some well-known tools are: &lt;a href="http://cytoscape.org/"&gt;Cytoscape&lt;/a&gt;, &lt;a href="http://gephi.org/"&gt;Gephi&lt;/a&gt;, and &lt;a href="http://www.graphviz.org/"&gt;GraphViz&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;More network stuff&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="/2011/04/you-cant-optimize-what-you-cant-predict.html"&gt;Synthetic biology&lt;/a&gt;: predicting and optimizing gene regulatory networks&lt;/li&gt;
&lt;li&gt;Uri Alon's &lt;a href="http://www.weizmann.ac.il/mcb/UriAlon/Papers/Network_motifs_nature_genetics_review.pdf"&gt;Network motifs: theory and experimental approaches&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4919771153115982248?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4919771153115982248/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/09/network-science.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4919771153115982248'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4919771153115982248'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/09/network-science.html' title='Network Science'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-hJ6trjBAnuI/TnzIeDJQqkI/AAAAAAAADFU/DpNZVPN6ao8/s72-c/egrin_networks.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-414078668138361111</id><published>2011-09-19T07:52:00.000-07:00</published><updated>2011-09-19T07:52:50.867-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='biology'/><category scheme='http://www.blogger.com/atom/ns#' term='Bioinformatics'/><title type='text'>Applying control theory to the cell</title><content type='html'>
&lt;p&gt;In a talk about research goals in the systems biology of microbes, &lt;a href="http://genomics.lbl.gov/"&gt;Adam Arkin&lt;/a&gt; referenced the &lt;b&gt;Internal Model Principle&lt;/b&gt; of control theory. Here are a couple definitions.&lt;/p&gt;

&lt;blockquote&gt;A regulator for which both internal stability and output regulation are structurally stable properties must utilize feedback of the regulated variable and incorporate in the feedback loop a suitably reduplicated model of the dynamic structure of the exogenous signals which the regulator is required to process.&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="http://dx.doi.org/10.1109/CDC.1976.267838"&gt;Towards an Abstract Internal Model Principle&lt;/a&gt; Wonham, 1976&lt;/p&gt;

&lt;p&gt;That's a mouthful. This one's a little less scary.&lt;/p&gt;

&lt;blockquote&gt;Internal Model Principle: control can be achieved only if the control system encapsulates, either implicitly or explicitly, some representation of the process to be controlled.&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="http://lorien.ncl.ac.uk/ming/robust/imc.pdf"&gt;Lecture notes on Introduction to Robust Control by Ming T. Tham, 2002&lt;/a&gt;&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://lorien.ncl.ac.uk/ming/robust/imc.pdf" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="212" width="400" src="http://1.bp.blogspot.com/-YtXjFiRadvE/TmpLUYCHuGI/AAAAAAAADDw/Rut7AJMPSbc/s400/control_system_schematic.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Driving this thinking is the discovery that microbes show anticipatory behavior and that the associations can be fairly readily entrained and lost in a few generations. &lt;a href="http://www.cs.ucdavis.edu/people/faculty/tagkopoulos.html"&gt;Ilias Tagkopoulos&lt;/a&gt; and &lt;a href="http://tavazoielab.c2b2.columbia.edu/lab/research/cellular-behavior/"&gt;Saeed Tavazoie&lt;/a&gt;, in &lt;i&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/18467556"&gt;Predictive behavior within microbial genetic networks&lt;/a&gt;&lt;/i&gt;, demonstrated associative learning through rewiring gene regulatory networks. It turns out that when &lt;i&gt;E. coli&lt;/i&gt; senses a shift to mammalian body temperature, it begins the transition to anaerobic metabolism, nicely anticipating the correlated structure of its environment.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/19536156" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="244" width="400" src="http://2.bp.blogspot.com/-vljQwiXJZyQ/TmpLkzxS5LI/AAAAAAAADD4/7OCjU0GHZVE/s400/wine_yeast_anticipation.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;In another example, &lt;a href="http://amirmitchell.wordpress.com/"&gt;Amir Mitchell&lt;/a&gt;, working at &lt;a href="http://longitude.weizmann.ac.il/"&gt;Weizmann&lt;/a&gt;, showed that yeast anticipates the stages of fermentation in &lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/19536156"&gt;Adaptive prediction of environmental changes by microorganisms&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This raises some important questions. How is the internal model encoded within the cell? And how does the cell acquire, parameterize and adjust its internal model over evolutionary time scales? The answers will lead to a deeper understanding of living systems and might even feed new techniques and principles back to control theory.&lt;/p&gt;

&lt;p&gt;An interesting challenge will be to experimentally read out the information embedded in the cell's control systems. After that comes the informatics problem of how to represent and work with such things.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://en.wikipedia.org/wiki/Synthetic_biology" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="108" width="110" src="http://4.bp.blogspot.com/-GhVJxb3PQdU/TndVLX79JeI/AAAAAAAADFI/CikDC-MHP-8/s320/220px-UT_HelloWorld.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Understanding how this works is a prerequisite for re-engineering living systems, otherwise known as &lt;b&gt;synthetic biology&lt;/b&gt;, championed by George Church and Drew Endy. This month, by the way, the journal Science has a &lt;a href="http://www.sciencemag.org/site/special/syntheticbio"&gt;special issue on synthetic biology&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'm fascinated by the idea of applying engineering principles to biology - evolved systems, rather than engineered artifacts. Maybe that's because my spaghetti code looks a lot like the messy interconnectedness of biology. Creating software feels organic, rather than wholly predesigned. The &lt;i&gt;engineering&lt;/i&gt; of complex software systems tends to be an adaptive evolutionary process. As messy as biology is, modularity naturally emerges. Maybe biology has something to teach us about organizing this chaos.&lt;/p&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-414078668138361111?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/414078668138361111/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/09/applying-control-theory-to-cell.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/414078668138361111'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/414078668138361111'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/09/applying-control-theory-to-cell.html' title='Applying control theory to the cell'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-YtXjFiRadvE/TmpLUYCHuGI/AAAAAAAADDw/Rut7AJMPSbc/s72-c/control_system_schematic.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1808307888546668195</id><published>2011-08-25T17:03:00.000-07:00</published><updated>2011-12-10T11:21:02.792-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>String functions in R</title><content type='html'>&lt;p&gt;Here's a quick cheat-sheet on string manipulation functions in &lt;a 
href="http://www.r-project.org/"&gt;R&lt;/a&gt;, mostly cribbed from &lt;a href="http://www.statmethods.net/management/functions.html"&gt;Quick-R's list of String Functions&lt;/a&gt; with a few additional links.&lt;/p&gt;

&lt;style type="text/css"&gt;
li { margin-top: 12pt; margin-bottom: 12pt; }
&lt;/style&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/substr.html"&gt;substr&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;x&lt;/em&gt;, &lt;strong&gt;start&lt;/strong&gt;=&lt;em&gt;n1&lt;/em&gt;, &lt;strong&gt;stop&lt;/strong&gt;=&lt;em&gt;n2&lt;/em&gt;) &lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/grep.html"&gt;grep&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;pattern&lt;/em&gt;,&lt;em&gt;x&lt;/em&gt;, &lt;strong&gt;value&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;, &lt;strong&gt;ignore.case&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;, &lt;strong&gt;fixed&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;) &lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/grep.html"&gt;gsub&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;pattern&lt;/em&gt;, &lt;em&gt;replacement&lt;/em&gt;, &lt;em&gt;x&lt;/em&gt;, &lt;strong&gt;ignore.case&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;, &lt;strong&gt;fixed&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;) &lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/grep.html"&gt;gregexpr&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;pattern&lt;/em&gt;, &lt;em&gt;text&lt;/em&gt;, &lt;strong&gt;ignore.case&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;, &lt;strong&gt;perl&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;,
        &lt;strong&gt;fixed&lt;/strong&gt;=&lt;em&gt;FALSE&lt;/em&gt;)&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/strsplit.html"&gt;strsplit&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;x&lt;/em&gt;, &lt;em&gt;split&lt;/em&gt;)&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html"&gt;paste&lt;/a&gt;&lt;/strong&gt;(..., &lt;strong&gt;sep&lt;/strong&gt;=&lt;em&gt;&amp;quot; &amp;quot;&lt;/em&gt;, &lt;strong&gt;collapse&lt;/strong&gt;=&lt;em&gt;NULL&lt;/em&gt;)&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/sprintf.html"&gt;sprintf&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;fmt&lt;/em&gt;, &lt;em&gt;...&lt;/em&gt;)&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/chartr.html"&gt;toupper/tolower&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;x&lt;/em&gt;)&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/nchar.html"&gt;nchar&lt;/a&gt;&lt;/strong&gt;(&lt;em&gt;x&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
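
&lt;p&gt;Here are a few of these in action, as a minimal sketch. The sample strings are made up for illustration, and the comments show the values I'd expect back at the R prompt.&lt;/p&gt;

&lt;pre class="codebox"&gt;
x &amp;lt;- &amp;quot;Hello, World&amp;quot;

first  &amp;lt;- substr(x, start=1, stop=5)                   # &amp;quot;Hello&amp;quot;
hits   &amp;lt;- grep(&amp;quot;o&amp;quot;, c(&amp;quot;foo&amp;quot;, &amp;quot;bar&amp;quot;, &amp;quot;bob&amp;quot;))              # indices of matches: 1 3
zeros  &amp;lt;- gsub(&amp;quot;o&amp;quot;, &amp;quot;0&amp;quot;, x)                             # &amp;quot;Hell0, W0rld&amp;quot;
where  &amp;lt;- as.integer(gregexpr(&amp;quot;o&amp;quot;, x)[[1]])            # match positions: 5 9
parts  &amp;lt;- strsplit(&amp;quot;a,b,c&amp;quot;, &amp;quot;,&amp;quot;)[[1]]                   # &amp;quot;a&amp;quot; &amp;quot;b&amp;quot; &amp;quot;c&amp;quot;
joined &amp;lt;- paste(parts, collapse=&amp;quot;+&amp;quot;)                   # &amp;quot;a+b+c&amp;quot;
msg    &amp;lt;- sprintf(&amp;quot;%s has %d characters&amp;quot;, x, nchar(x)) # &amp;quot;Hello, World has 12 characters&amp;quot;
loud   &amp;lt;- toupper(x)                                   # &amp;quot;HELLO, WORLD&amp;quot;
&lt;/pre&gt;

&lt;p&gt;Note that strsplit returns a list with one element per input string, hence the [[1]].&lt;/p&gt;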

&lt;p&gt;Also see &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html"&gt;Regular Expressions as used in R&lt;/a&gt; and &lt;a href="/2009/07/r-string-processing.html"&gt;R String processing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Note:&lt;/b&gt; Just to be clear, R is far from an ideal platform for processing text. For anything where that's the major concern, you're better off going to Python or Ruby.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1808307888546668195?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1808307888546668195/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/08/string-functions-in-r.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1808307888546668195'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1808307888546668195'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/08/string-functions-in-r.html' title='String functions in R'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-968603777560212135</id><published>2011-08-15T17:13:00.000-07:00</published><updated>2011-08-15T17:13:47.221-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R'/><category scheme='http://www.blogger.com/atom/ns#' term='db'/><title type='text'>MySQL and R</title><content type='html'>&lt;p&gt;Using MySQL with R is pretty easy, with &lt;a 
href="http://cran.r-project.org/web/packages/RMySQL/"&gt;RMySQL&lt;/a&gt;. Here are a few notes to keep me straight on a few things I always get snagged on.&lt;/p&gt;

&lt;p&gt;Typically, most folks are going to want to analyze data that's already in a MySQL database. Being a little bass-ackwards, I often want to go the other way. One reason to do this is to do some analysis in R and make the results available dynamically in a web app, which necessitates writing data from R into a database. As of this writing, INSERT isn't even mentioned in the &lt;a href="http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf"&gt;RMySQL docs&lt;/a&gt;, sadly for me, but it works just fine.&lt;/p&gt;

&lt;p&gt;The docs are a bit clearer for &lt;a href="http://stat.bell-labs.com/RS-DBI/doc/html/DBI.html"&gt;RS-DBI&lt;/a&gt;, which is the standard R interface to relational databases and of which RMySQL is one implementation.&lt;/p&gt;

&lt;h4&gt;Opening and closing connections&lt;/h4&gt;
&lt;p&gt;The best way to close DB connections, like you would do in a &lt;i&gt;finally&lt;/i&gt; clause in Java, is to use &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/on.exit.html"&gt;on.exit&lt;/a&gt;, like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
library(RMySQL)
con &amp;lt;- dbConnect(MySQL(),
         user=&amp;quot;me&amp;quot;, password=&amp;quot;nuts2u&amp;quot;,
         dbname=&amp;quot;my_db&amp;quot;, host=&amp;quot;localhost&amp;quot;)
on.exit(dbDisconnect(con))
&lt;/pre&gt;
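
&lt;p&gt;on.exit does its work inside a function: the registered expression runs when the function exits, whether it returns normally or fails with an error. Here's a small sketch that shows the behavior without needing a database. The names &lt;i&gt;events&lt;/i&gt; and &lt;i&gt;risky&lt;/i&gt; are made up for illustration, with &amp;quot;open&amp;quot; and &amp;quot;close&amp;quot; standing in for dbConnect and dbDisconnect.&lt;/p&gt;

&lt;pre class="codebox"&gt;
events &amp;lt;- character(0)

risky &amp;lt;- function(fail) {
  events &amp;lt;&amp;lt;- c(events, &amp;quot;open&amp;quot;)
  # registered now, runs when the function exits, normally or via an error
  on.exit(events &amp;lt;&amp;lt;- c(events, &amp;quot;close&amp;quot;))
  if (fail) stop(&amp;quot;something went wrong&amp;quot;)
  &amp;quot;ok&amp;quot;
}

result &amp;lt;- try(risky(TRUE), silent=TRUE)
# events is now &amp;quot;open&amp;quot; &amp;quot;close&amp;quot;, even though risky() failed
&lt;/pre&gt;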

&lt;h4&gt;Building queries&lt;/h4&gt;
&lt;p&gt;Using sprintf to build the queries feels a little primitive. As far as I can tell, there are no &lt;b&gt;prepared statements&lt;/b&gt; in RMySQL. I don't suppose SQL-injection is a concern here, but prepared statements might be a little tidier, anyway.&lt;/p&gt;
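
&lt;p&gt;One way to make sprintf a little less fragile is to pass string values through a small quoting helper. The &lt;i&gt;sql.quote&lt;/i&gt; function below is a hypothetical helper, not part of RMySQL, and it only doubles single quotes, so treat it as a sketch rather than real protection against injection. If your version of RMySQL provides &lt;i&gt;dbEscapeStrings(con, strings)&lt;/i&gt;, that is probably the better tool when a connection is at hand.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# Hypothetical helper (not part of RMySQL): wrap a string in single
# quotes, doubling any single quotes inside it. Backslashes and other
# special characters are NOT handled here.
sql.quote &amp;lt;- function(s) {
  paste(&amp;quot;&amp;#x27;&amp;quot;, gsub(&amp;quot;&amp;#x27;&amp;quot;, &amp;quot;&amp;#x27;&amp;#x27;&amp;quot;, s, fixed=TRUE), &amp;quot;&amp;#x27;&amp;quot;, sep=&amp;quot;&amp;quot;)
}

sql &amp;lt;- sprintf(&amp;quot;select * from genes where name = %s;&amp;quot;, sql.quote(&amp;quot;O&amp;#x27;Brien&amp;quot;))
# sql is now: select * from genes where name = &amp;#x27;O&amp;#x27;&amp;#x27;Brien&amp;#x27;;
&lt;/pre&gt;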

&lt;h4&gt;Processing query results&lt;/h4&gt;
&lt;p&gt;You can process query results row by row, in blocks or all at once. The highly useful function &lt;i&gt;dbGetQuery(con, sql)&lt;/i&gt; returns all query results as a data frame. With dbSendQuery, you can get all or partial results with &lt;i&gt;fetch&lt;/i&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
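# db.name is assumed to hold the name of the database
con &amp;lt;- dbConnect(MySQL(), user=&amp;quot;network_portal&amp;quot;, password=&amp;quot;monkey2us&amp;quot;, dbname=db.name, host=&amp;quot;localhost&amp;quot;)
# send the query; rows are not retrieved until fetch is called
rs &amp;lt;- dbSendQuery(con, &amp;quot;select name from genes limit 10;&amp;quot;)
# fetch up to 10 rows as a data frame
data &amp;lt;- fetch(rs, n=10)
# ask whether all rows have been fetched
huh &amp;lt;- dbHasCompleted(rs)
# release the result set, then close the connection
dbClearResult(rs)
dbDisconnect(con)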
&lt;/pre&gt;

&lt;p&gt;If there are no more results, fetch returns a data frame with 0 columns and 0 rows. &lt;i&gt;dbHasCompleted&lt;/i&gt; is supposed to indicate whether there are more records to be fetched, but seems broken. The value of &lt;i&gt;huh&lt;/i&gt; in the code above is FALSE, which seems wrong to me.&lt;/p&gt;
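
&lt;p&gt;Since an empty data frame signals the end of the results, processing in blocks can be written as a loop that stops when a block comes back with zero rows. The sketch below assumes exactly that behavior. Both &lt;i&gt;process.in.blocks&lt;/i&gt; and &lt;i&gt;fake.result&lt;/i&gt; are hypothetical names, and the latter is just an in-memory stand-in so the example runs without a database.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# Hypothetical helper: call fetch.block() until it returns zero rows,
# pass each block to handle(), and return the total number of rows seen.
process.in.blocks &amp;lt;- function(fetch.block, handle) {
  total &amp;lt;- 0
  repeat {
    block &amp;lt;- fetch.block()
    if (nrow(block) == 0) break
    handle(block)
    total &amp;lt;- total + nrow(block)
  }
  total
}

# In-memory stand-in for a result set: serves n.rows rows, block.size at a time.
fake.result &amp;lt;- function(n.rows, block.size) {
  pos &amp;lt;- 0
  function() {
    if (pos &amp;gt;= n.rows) return(data.frame())
    idx &amp;lt;- seq(pos + 1, min(pos + block.size, n.rows))
    pos &amp;lt;&amp;lt;- max(idx)
    data.frame(name=paste(&amp;quot;gene&amp;quot;, idx, sep=&amp;quot;&amp;quot;))
  }
}

sizes &amp;lt;- c()
total &amp;lt;- process.in.blocks(fake.result(25, 10),
                           function(block) sizes &amp;lt;&amp;lt;- c(sizes, nrow(block)))
# total is 25; sizes is 10 10 5
&lt;/pre&gt;

&lt;p&gt;With RMySQL, the first argument would be something like &lt;i&gt;function() fetch(rs, n=1000)&lt;/i&gt;, followed by dbClearResult(rs) once the loop finishes.&lt;/p&gt;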

&lt;h4&gt;Retrieving AUTO_INCREMENT IDs&lt;/h4&gt;

&lt;p&gt;A standard newbie question with MySQL is how to retrieve freshly generated primary keys from AUTO_INCREMENT fields. That's what MySQL's &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function_last-insert-id"&gt;LAST_INSERT_ID()&lt;/a&gt; is for.&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.0/en/example-auto-increment.html"&gt;You can retrieve the most recent AUTO_INCREMENT value with the LAST_INSERT_ID() SQL function or the mysql_insert_id() C API function. These functions are connection-specific, so their return values are not affected by another connection which is also performing inserts.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;The same works with RMySQL, but there are some traps to watch out for. Let's say you're inserting a row into a table of networks. Don't worry about the specifics. You want to insert related data in another table, so you need the ID of the newly inserted row.&lt;/p&gt;

&lt;pre class="codebox"&gt;
create.network &amp;lt;- function(species.id, network.name, data.source, description) {
  
  con &amp;lt;- dbConnect(MySQL(),
           user=&amp;quot;super_schmuck&amp;quot;, password=&amp;quot;nuts2u&amp;quot;,
           dbname=&amp;quot;my_db&amp;quot;, host=&amp;quot;localhost&amp;quot;)
  on.exit(dbDisconnect(con))

  sql &amp;lt;- sprintf(&amp;quot;insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, &amp;#x27;%s&amp;#x27;, &amp;#x27;%s&amp;#x27;, &amp;#x27;%s&amp;#x27;, NOW());&amp;quot;,
                 species.id, network.name, data.source, description)
  rs &amp;lt;- dbSendQuery(con, sql)
  dbClearResult(rs)

  id &amp;lt;- dbGetQuery(con, &amp;quot;select last_insert_id();&amp;quot;)[1,1]

  return(id)
}
&lt;/pre&gt;

&lt;p&gt;Don't forget to clear the result of the insert. If you forget, you'll get 0 from &lt;i&gt;last_insert_id&lt;/i&gt;(). Also, using &lt;i&gt;dbGetQuery&lt;/i&gt; for the insert produces a strange error when you go to call &lt;i&gt;last_insert_id&lt;/i&gt;:&lt;/p&gt;

&lt;pre class="codebox"&gt;
Error in mysqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not run statement: Commands out of sync; you can&amp;#x27;t run this command now)
&lt;/pre&gt;

&lt;p&gt;Alternatively, you can combine both SQL statements into one call to &lt;i&gt;dbSendQuery&lt;/i&gt;, but you have to remember to set a flag when you make the connection: &lt;b&gt;client.flag=CLIENT_MULTI_STATEMENTS&lt;/b&gt;. Multiple statements in one call don't seem to work with &lt;i&gt;dbGetQuery&lt;/i&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
create.network &amp;lt;- function(species.id, network.name, data.source, description) {

  con &amp;lt;- dbConnect(MySQL(),
           user=&amp;quot;super_schmuck&amp;quot;, password=&amp;quot;nuts2u&amp;quot;,
           dbname=&amp;quot;my_db&amp;quot;, host=&amp;quot;localhost&amp;quot;,
           client.flag=CLIENT_MULTI_STATEMENTS)
  on.exit(dbDisconnect(con))

  sql &amp;lt;- sprintf(&amp;quot;insert into networks
                  (species_id, name, data_source, description, created_at)
                  values (%d, &amp;#x27;%s&amp;#x27;, &amp;#x27;%s&amp;#x27;, &amp;#x27;%s&amp;#x27;, NOW());
                  select last_insert_id();&amp;quot;,
                 species.id, network.name, data.source, description)

  rs &amp;lt;- dbSendQuery(con, sql)

  if (dbMoreResults(con)) {
    rs &amp;lt;- dbNextResult(con)
    id &amp;lt;- fetch(rs)[1,1]
  } else {
    stop(&amp;#x27;Error getting last inserted id.&amp;#x27;)
  }

  dbClearResult(rs)

  return(id)
}
&lt;/pre&gt;

&lt;p&gt;Any effort saved by combining the SQL queries is lost in the extra house-keeping so I prefer the first method.&lt;/p&gt;

&lt;p&gt;In spite of these few quirks, RMySQL generally works fine and is pretty straightforward.&lt;/p&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-968603777560212135?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/968603777560212135/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/08/mysql-and-r.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/968603777560212135'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/968603777560212135'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/08/mysql-and-r.html' title='MySQL and R'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-5060977859116076350</id><published>2011-08-03T12:17:00.000-07:00</published><updated>2011-09-12T11:29:29.995-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='biology'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='Bioinformatics'/><title type='text'>Microarrays</title><content type='html'>&lt;p&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/About/primer/microarrays.html"&gt;Microarrays&lt;/a&gt; are one of the workhorses of modern biology. 
Measuring transcript levels enables studies of differential expression - asking what the difference is, at the gene expression level, for example, between cancer tumor cells and normal cells.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-Y6zlVZL52xY/TjmaMNF_REI/AAAAAAAAC9s/3TkNRqEUgew/s1600/life-cycle.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="221" width="320" src="http://3.bp.blogspot.com/-Y6zlVZL52xY/TjmaMNF_REI/AAAAAAAAC9s/3TkNRqEUgew/s320/life-cycle.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Bruz Marzolf, who until recently ran my local microarray facility, gave a talk tracing the journey of microarrays through the full technology life-cycle, starting in 1995 with the publication of &lt;a href="http://www.sciencemag.org/content/270/5235/467.abstract"&gt;Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray&lt;/a&gt; in Science. Bruz put microarrays in the category of a utility technology, but not quite to the point of commoditization as there remain major differences between manufacturers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Affymetrix&lt;/b&gt;, first to commercialize microarray technologies, is the 800 pound gorilla. Their photolithography process borrows from computer chip manufacturing and their standardized probe sets are well supported by tools such as Bioconductor. The technology is robust but producing the masks is quite expensive, thus custom arrays are not economical.&lt;/li&gt;

&lt;li&gt;&lt;b&gt;Agilent&lt;/b&gt;, which spun out of HP, uses ink-jet technology. Custom arrays can be designed using Agilent's &lt;a href="https://earray.chem.agilent.com/"&gt;eArray&lt;/a&gt; software. Agilent arrays come in a variety of resolutions including 8x60k, 1x244k and 1x1m with 60mer probes.&lt;/li&gt;

&lt;li&gt;&lt;b&gt;Illumina&lt;/b&gt; builds arrays out of beads coated in oligo probes. Beads are laid out randomly on the slides, necessitating a layout discovery step. These chips have extra redundancy to account for randomness in bead-probe count.&lt;/li&gt;

&lt;li&gt;&lt;b&gt;Nimblegen&lt;/b&gt;'s maskless photolithography process is more flexible for custom arrays. Nimblegen provides arrays in 385K, 4x72K, and 12x135K resolutions using 60mer probes. They emphasize high array-to-array data reproducibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As an aside, our group uses custom spotted arrays and Agilent arrays. We tried Nimblegen and found that inter-array consistency was excellent, but inter-probe consistency was not. Below we see the ribosomal RNAs and adjacent genes with total RNA measured by a custom Agilent array (in blue) plotted next to a custom Nimblegen array (in green). To be fair there might be other explanations for what we saw, but it certainly looks like there is significant variability between probes that we would expect to have identical readings.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-OOfGj0_GDNQ/TjmZSON5eHI/AAAAAAAAC9k/n8s1Xhu0ESM/s1600/nimblegen_vs_agilent.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="275" width="320" src="http://1.bp.blogspot.com/-OOfGj0_GDNQ/TjmZSON5eHI/AAAAAAAAC9k/n8s1Xhu0ESM/s320/nimblegen_vs_agilent.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;In &lt;a href="http://www.nature.com/nrg/journal/v10/n1/full/nrg2484.html"&gt;RNA-Seq: a revolutionary tool for transcriptomics&lt;/a&gt; (Nature Reviews Genetics, 2009), Zhong Wang, Mark Gerstein &amp;amp; Michael Snyder show this comparison between microarrays and RNA-Seq.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://www.nature.com/nrg/journal/v10/n1/fig_tab/nrg2484_F2.html" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="141" width="400" src="http://3.bp.blogspot.com/-NuP7IarybKM/Tm44kwELNaI/AAAAAAAADEA/litRsWrDQV0/s400/nrg2484-f2.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;While RNA-seq, no doubt, has a higher dynamic range, does it really have less noise? Some folks say so. With tens or hundreds of thousands of probes, fairly dense coverage of whole microbial genomes is possible. If you know what you're looking for, microarrays are still cheaper. Discovery oriented work is going increasingly toward sequencing.&lt;/p&gt;

&lt;h4&gt;Links&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;The technology life-cycle curve slide comes from a good &lt;a href="http://www.youtube.com/watch?v=5Oyf4vvJyy4"&gt;talk by Simon Wardley given at OSCON in 2010&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href="http://www.nature.com/nbt/focus/maqc/index.html"&gt;MicroArray Quality Control (MAQC) project&lt;/a&gt;, Shi et al, Nature Biotechnology, 2006&lt;/li&gt;
  &lt;li&gt;&lt;a href="http://genome.cshlp.org/content/18/9/1509.long"&gt;RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="http://www.nature.com/nrg/journal/v10/n1/full/nrg2484.html"&gt;RNA-Seq: a revolutionary tool for transcriptomics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-5060977859116076350?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/5060977859116076350/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/08/microarrays.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5060977859116076350'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5060977859116076350'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/08/microarrays.html' title='Microarrays'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Y6zlVZL52xY/TjmaMNF_REI/AAAAAAAAC9s/3TkNRqEUgew/s72-c/life-cycle.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-5635634352583947490</id><published>2011-07-08T21:05:00.000-07:00</published><updated>2011-07-09T11:14:44.274-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Notes on Engineering Data Analysis (with R and ggplot2)</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: 
center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-lzuta0l1844/ThiZNZQSVQI/AAAAAAAAC6w/TYZ5d7f2N2w/s1600/data_analysis_cycle.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="189" src="http://2.bp.blogspot.com/-lzuta0l1844/ThiZNZQSVQI/AAAAAAAAC6w/TYZ5d7f2N2w/s320/data_analysis_cycle.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="http://had.co.nz/"&gt;Hadley Wickham&lt;/a&gt; gave a &lt;a href="http://www.youtube.com/user/GoogleTechTalks"&gt;Google Tech Talk&lt;/a&gt; a couple weeks back titled &lt;a href="http://www.youtube.com/watch?v=TaxJwC_MP9Q"&gt;Engineering Data Analysis (with R and ggplot2)&lt;/a&gt;. These are my notes.&lt;/p&gt;

&lt;p&gt;The data analysis cycle is to iteratively transform, visualize and model. Leading into the cycle is data access and the output of the process is knowledge, insight and understanding which can be communicated to others. Transforming the data is almost always necessary to bring data into a workable form. Visualization and modeling have something of a duality where visualization is good at revealing the unexpected but has problems scaling. Models scale better, but will only find expected relationships. A larger cycle comes about when answers to one question lead to more questions.&lt;/p&gt;

&lt;p&gt;Hadley makes a case for data analysis in code, rather than GUIs and for R in particular. Working in a programming language gives you a means of:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;reproducibility&lt;/li&gt;
  &lt;li&gt;automation&lt;/li&gt;
  &lt;li&gt;version control&lt;/li&gt;
  &lt;li&gt;communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advantages of R:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;open source&lt;/li&gt;
  &lt;li&gt;runs anywhere&lt;/li&gt;
  &lt;li&gt;well established community&lt;/li&gt;
  &lt;li&gt;huge library of packages&lt;/li&gt;
  &lt;li&gt;connectivity to other languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Downsides of R are its learning curve, strangeness relative to other programming languages, lack of programming infrastructure and prickliness of the community. R scales well up to about a million observations. How to scale the interactive analysis cycle up to billions of observations is an open question. Programming infrastructure is an area where programmers can contribute.&lt;/p&gt;

&lt;p&gt;DSLs (domain specific languages) help express and think clearly about common problems in data analysis. Hadley views his libraries as DSLs within R for the phases of the analysis cycle. For visualization, there's &lt;a href="http://had.co.nz/ggplot2/"&gt;ggplot2&lt;/a&gt;. DSLs align nicely with ggplot's philosophy as a grammar of graphics. R's &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html"&gt;model formula&lt;/a&gt; is the DSL for modeling. &lt;a href="http://had.co.nz/plyr/"&gt;Plyr&lt;/a&gt; is the DSL for data transformation.&lt;/p&gt;
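
&lt;p&gt;To make the model formula idea concrete, here's a minimal sketch using base R's lm. The data is made up so that y is exactly 2x + 1, which means the fitted coefficients are easy to check.&lt;/p&gt;

&lt;pre class="codebox"&gt;
d &amp;lt;- data.frame(x=1:5)
d$y &amp;lt;- 2 * d$x + 1

# the formula y ~ x reads as: model y as a function of x
fit &amp;lt;- lm(y ~ x, data=d)
coefs &amp;lt;- coef(fit)   # intercept 1, slope 2
&lt;/pre&gt;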

&lt;p&gt;The four key verbs of data transformation are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;subset&lt;/li&gt;
  &lt;li&gt;mutate&lt;/li&gt;
  &lt;li&gt;arrange&lt;/li&gt;
  &lt;li&gt;summarize&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...plus...&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;group by&lt;/li&gt;
  &lt;li&gt;join&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data can be divided by subsetting or filtering; mutated, for example adding new columns to a table that are functions of other columns; rearranged or sorted; and summarized, condensing a data set down to a smaller number of values. These actions can be combined with a group by operator. Finally, data sets can be joined to other related data sets.&lt;/p&gt;
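
&lt;p&gt;As a rough illustration of these verbs, here's a sketch on a tiny made-up data frame. It uses base R stand-ins (subset, transform, order, aggregate and merge) rather than plyr's own functions, only to keep the example free of dependencies.&lt;/p&gt;

&lt;pre class="codebox"&gt;
d &amp;lt;- data.frame(group=c(&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;b&amp;quot;), x=c(1, 2, 3, 4))

# subset: keep the rows where x is greater than 1
big &amp;lt;- subset(d, x &amp;gt; 1)

# mutate: add a column that is a function of another column
doubled &amp;lt;- transform(d, y = x * 2)

# arrange: sort by x, descending
sorted &amp;lt;- d[order(-d$x), ]

# summarize with group by: mean of x within each group (a: 1.5, b: 3.5)
means &amp;lt;- aggregate(x ~ group, data=d, FUN=mean)

# join: attach a label to each row through the shared group column
labels &amp;lt;- data.frame(group=c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), label=c(&amp;quot;alpha&amp;quot;, &amp;quot;beta&amp;quot;))
joined &amp;lt;- merge(d, labels, by=&amp;quot;group&amp;quot;)
&lt;/pre&gt;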

&lt;p&gt;The second half of the talk is a case study, dissecting a set of cause-of-death statistics from the Mexican government. Finally, Hadley makes a familiar sounding point about the tension between making new things and making well-engineered user-friendly software that does old things.&lt;/p&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-5635634352583947490?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/5635634352583947490/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/07/notes-on-engineering-data-analysis-with.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5635634352583947490'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5635634352583947490'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/07/notes-on-engineering-data-analysis-with.html' title='Notes on Engineering Data Analysis (with R and ggplot2)'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-lzuta0l1844/ThiZNZQSVQI/AAAAAAAAC6w/TYZ5d7f2N2w/s72-c/data_analysis_cycle.png' height='72' width='72'/><thr:total>0</thr:total><georss:featurename>Unknown location.</georss:featurename><georss:point>47.633470089967 -122.35336303710938</georss:point><georss:box>47.590673089967 -122.43232703710937 47.676267089967 
-122.27439903710938</georss:box></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2178225022687567799</id><published>2011-07-02T15:39:00.000-07:00</published><updated>2011-07-03T13:09:23.174-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='seattle'/><title type='text'>Running in Queen Anne</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-Ll_UXuSxwz4/Tg-YkIm9EcI/AAAAAAAAC6g/UXVIMOvjOHc/s1600/qa_run_map.png" imageanchor="1" style="margin-left:1em; margin-bottom:1em"&gt;&lt;img border="0" height="337" width="400" src="http://3.bp.blogspot.com/-Ll_UXuSxwz4/Tg-YkIm9EcI/AAAAAAAAC6g/UXVIMOvjOHc/s400/qa_run_map.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;The crown of Queen Anne is a great place for running. It's mostly level along tree-lined streets and has great views of Elliot Bay and the Seattle skyline. The &lt;a href="http://maps.google.com/maps?saddr=Bigelow+Ave+N&amp;daddr=Bigelow+Ave+N+to:Kerry+Park,+Seattle,+WA+to:47.632863,-122.3676817+to:7th+Ave+W+to:8th+Avenue+West,+Seattle,+WA+to:10th+Avenue+West,+Seattle,+WA+to:10th+Ave+W+to:47.6396,-122.359886+to:47.6414386,-122.355769+to:47.6404812,-122.3536807+to:47.6408,-122.34961+to:47.6391,-122.3483+to:Bigelow+Ave+N&amp;hl=en&amp;sll=47.638444,-122.359285&amp;sspn=0.032791,0.063&amp;geocode=FUbg1gIdzBy1-A%3BFW3J1gIdCBe1-A%3BFa7E1gIdpu-0-CGOtSQkHmj3Mw%3BFd_R1gIdP9G0-CmH3I5fbhWQVDGrGa1jfuFXTQ%3BFRzs1gIdrta0-A%3BFXvw1gIdnNG0-CldjsVecxWQVDFI99W6DYHb6A%3BFXbx1gId4ce0-CnTBtuOdBWQVDELPCxlHnUGRA%3BFaQE1wId3sm0-A%3BFTDs1gIdsu-0-CkJnrUNDRWQVDH6kJAAKvueWw%3BFV7z1gIdx_-0-ClbPTF6DhWQVDFi4oMhl_-Efw%3BFaHv1gId8Ae1-ClvjKAXDhWQVDECyBbrAOz9xQ%3BFeDw1gId1he1-CnHOLUlEBWQVDE5fRmnLHMe6Q%3BFTzq1gId9By1-CkFHgJbEBWQVDEIWXgOq3_WSg%3BFW_g1gIdzRy1-A&amp;mra=dme&amp;mrsp=13&amp;sz=14&amp;via=3,8,9,10,11,12&amp;dirflg=w&amp;t=h&amp;z=14"&gt;Queen Anne Boulevard route&lt;/a&gt; is 4.4 miles, according to Google. 
Jen and I often do a slightly &lt;a href="http://maps.google.com/maps?saddr=Bigelow+Ave+N&amp;daddr=Bigelow+Ave+N+to:Kerry+Park,+Seattle,+WA+to:47.632863,-122.3676817+to:7th+Ave+W+to:47.64274,-122.36627+to:47.6396,-122.36093+to:47.64144,-122.35559+to:47.640488,-122.353689+to:47.63994,-122.35185+to:47.6407747,-122.3488081+to:Bigelow+Ave+N&amp;hl=en&amp;ll=47.639601,-122.359285&amp;spn=0.032791,0.063&amp;sll=47.639485,-122.359285&amp;sspn=0.032791,0.063&amp;geocode=FUbg1gIdzBy1-A%3BFW3J1gIdCBe1-A%3BFa7E1gIdpu-0-CGOtSQkHmj3Mw%3BFd_R1gIdP9G0-CmH3I5fbhWQVDGrGa1jfuFXTQ%3BFRzs1gIdrta0-A%3BFXT41gIdwta0-CmTAw5FCxWQVDFK64LnPWIYVg%3BFTDs1gIdnuu0-CmNBNYQDRWQVDHqbMTmZ96qGQ%3BFWDz1gIdegC1-ClbPTF6DhWQVDFi4oMhl_-Efw%3BFajv1gId5we1-ClvjKAXDhWQVDECyBbrAOz9xQ%3BFYTt1gIdFg-1-ClZnvHcERWQVDHpSbOyxl5WZA%3BFcbw1gId-Bq1-CnTRkg9EBWQVDEoFV8qB97_IQ%3BFejg1gId0hy1-A&amp;mra=dme&amp;mrsp=11&amp;sz=14&amp;via=3,5,6,7,8,9,10&amp;dirflg=w&amp;t=h&amp;z=14"&gt;shorter 3.7 mile run&lt;/a&gt;, by cutting short the north-west loop staying on 7th West past Coe Elementary School.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2178225022687567799?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2178225022687567799/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/07/running-in-queen-anne.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2178225022687567799'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2178225022687567799'/><link rel='alternate' type='text/html' 
href='http://digitheadslabnotebook.blogspot.com/2011/07/running-in-queen-anne.html' title='Running in Queen Anne'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Ll_UXuSxwz4/Tg-YkIm9EcI/AAAAAAAAC6g/UXVIMOvjOHc/s72-c/qa_run_map.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1233838854829437161</id><published>2011-06-25T22:05:00.000-07:00</published><updated>2011-08-08T20:56:47.926-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><title type='text'>The future of money</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-dlOzxKUVpU4/Tga8rTkVTXI/AAAAAAAAC6I/xznxg66QGTQ/s1600/MercuryCoin.jpeg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="120" src="http://3.bp.blogspot.com/-dlOzxKUVpU4/Tga8rTkVTXI/AAAAAAAAC6I/xznxg66QGTQ/s200/MercuryCoin.jpeg" width="122" /&gt;&lt;/a&gt;&lt;/div&gt;
Back during the first dot-com bubble, PayPal got started with revolutionary intentions. One of the founders, &lt;a href="http://www.crunchbase.com/person/peter-thiel"&gt;Peter Thiel&lt;/a&gt;, &lt;a href="http://techcrunch.com/2011/04/10/peter-thiel-were-in-a-bubble-and-its-not-the-internet-its-higher-education/"&gt;recently heard&lt;/a&gt; urging college students to log in, drop out, and start up, was even more radical back then.&lt;br /&gt;
&lt;blockquote&gt;
&lt;a href="http://www.cato.org/pub_display.php?pub_id=6429"&gt;
&lt;/a&gt;&lt;br /&gt;
&lt;a href="http://www.cato.org/pub_display.php?pub_id=6429"&gt;In his book The PayPal Wars, early PayPal marketing guru Eric M. Jackson recounts a stirring speech Thiel gave to the company's early staff.&lt;/a&gt;&lt;br /&gt;
&lt;a href="http://www.cato.org/pub_display.php?pub_id=6429"&gt;

"PayPal will give citizens worldwide more direct control over their currencies than they ever had before," Thiel said. "It will be nearly impossible for corrupt governments to steal wealth from their people through their old means because if they try the people will switch to dollars or pounds or yen, in effect dumping the worthless local currency for something more secure."&lt;br /&gt;


Unfortunately, that vision never panned out. PayPal thrived when it came to innovating and adapting to stay a step ahead of its early competitors. But the company proved less adept at slaying its more formidable antagonists: Lawyers and politicians.&lt;br /&gt;

&lt;/a&gt;&lt;/blockquote&gt;
&lt;a href="http://www.vanityfair.com/business/features/2011/04/jack-dorsey-201104"&gt;Jack Dorsey&lt;/a&gt;'s &lt;a href="https://squareup.com/"&gt;Square&lt;/a&gt;, a web 2.0 and mobile compliant version of PayPal, &lt;a href="http://www.fastcompany.com/1754859/how-square-is-accidentally-disrupting-the-entire-payments-industry"&gt;talks disruption&lt;/a&gt; but has already &lt;a href="http://mashable.com/2011/04/27/visa-square-investment/"&gt;accepted money from VISA&lt;/a&gt;. It's a shrewd business model: build anything reasonably credible in the payments space and the odds of being bought out by one of the incumbents are about 1000 percent. &lt;a href="https://www.dwolla.com/"&gt;Dwolla&lt;/a&gt; is a digital and mobile payments startup from Iowa, whose &lt;a href="http://www.businessweek.com/ap/financialnews/D9JQTJKG0.htm"&gt;early funding came from a credit union&lt;/a&gt;.

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-m_eKrfgAcB4/Tga71X2S85I/AAAAAAAAC6A/VjrxOCz32ok/s1600/1921%2BPeace%2BDollar.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="207" src="http://4.bp.blogspot.com/-m_eKrfgAcB4/Tga71X2S85I/AAAAAAAAC6A/VjrxOCz32ok/s320/1921%2BPeace%2BDollar.jpeg" width="207" /&gt;&lt;/a&gt;&lt;/div&gt;
Startups aren't the only ones who want a piece of VISA's &lt;a href="http://finance.yahoo.com/q/ks?s=V"&gt;37% profit margin on $8.6 billion in revenue&lt;/a&gt;. Big tech companies, like &lt;a href="http://www.computerworld.com/s/article/9206779/Apple_could_disrupt_mobile_payment_industry_analysts_say"&gt;Apple&lt;/a&gt; and &lt;a href="http://www.businessweek.com/technology/content/dec2010/tc20101231_087039.htm"&gt;Google&lt;/a&gt;, are getting into the payments game, too. Google just rolled out &lt;a href="http://www.google.com/wallet/"&gt;Google Wallet&lt;/a&gt;, &lt;a href="http://www.huffingtonpost.com/2011/05/26/google-wallet-money-data_n_867774.html"&gt;stirring privacy concerns&lt;/a&gt; and &lt;a href="http://www.bloomberg.com/news/2011-05-26/paypal-sues-google-over-trade-secret-theft-claims.html"&gt;getting sued by PayPal&lt;/a&gt;.&lt;br /&gt;
These days, the wild-eyed radicals look past these tame corporate offerings to &lt;a href="http://www.bitcoin.org/"&gt;Bitcoin&lt;/a&gt;, a peer-to-peer digital &lt;a href="http://www.readwriteweb.com/hack/2010/12/interview-bitcoin.php"&gt;currency&lt;/a&gt; created in 2009 by the mysterious &lt;a href="https://en.bitcoin.it/wiki/Satoshi_Nakamoto"&gt;Satoshi Nakamoto&lt;/a&gt;. &lt;a href="http://en.wikipedia.org/wiki/Bitcoin"&gt;Bitcoin&lt;/a&gt; relies on public key cryptography and digital signatures to guarantee payment and receipt. Satoshi's key insight was the means of validating transactions. Rather than clearing through a central authority, the system validates transactions through a distributed &lt;a href="http://en.wikipedia.org/wiki/Proof-of-work_system"&gt;proof-of-work&lt;/a&gt; system, relying on the majority of honest nodes on the peer-to-peer network to solve cryptographic puzzles faster than any attacker.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-RiS7IilB6Hw/Tga9IXrN6ZI/AAAAAAAAC6Q/WAV2XD-goNc/s1600/bitcoin530.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="140" src="http://3.bp.blogspot.com/-RiS7IilB6Hw/Tga9IXrN6ZI/AAAAAAAAC6Q/WAV2XD-goNc/s200/bitcoin530.png" width="142" /&gt;&lt;/a&gt;&lt;/div&gt;
The idea of an unregulated decentralized currency appeals to some, but don't expect the government to like it. The potential for &lt;a href="http://www.wired.com/threatlevel/2011/06/silkroad/"&gt;black markets&lt;/a&gt; and &lt;a href="http://bitcoinlaundry.com/"&gt;money laundering&lt;/a&gt; has already drawn &lt;a href="http://www.reuters.com/article/2011/06/08/us-financial-bitcoins-idUSTRE7573T320110608"&gt;scrutiny&lt;/a&gt; and calls for a &lt;a href="http://venturebeat.com/2011/06/08/government-crackdown-on-bitcoin/"&gt;crackdown&lt;/a&gt;. They're undoubtedly wondering how to tax it.&lt;br /&gt;
Possibly a bigger threat, Bitcoin has also attracted the attention of thieves. Last week, Mt. Gox, the currency's largest exchange, was &lt;a href="http://www.dailytech.com/Inside+the+MegaHack+of+Bitcoin+the+Full+Story/article21942.htm"&gt;hacked&lt;/a&gt;. The system relies critically on the security of end-user machines, a shaky proposition. One Bitcoin user, the aptly named &lt;a href="https://forum.bitcoin.org/index.php?topic=16457.0"&gt;allinvain&lt;/a&gt;, reported $500,000 worth stolen. A &lt;a href="http://www.geekosystem.com/bitcoin-trojan/"&gt;Bitcoin harvesting trojan&lt;/a&gt; has already been spotted in the wild.&lt;br /&gt;
Whether Bitcoin can overcome these problems or not, it's sure to be a wild ride. Bitcoin's technical underpinnings are fascinating and an impressive ecosystem has quickly sprung up around it. There are &lt;a href="https://www.bitcoin4cash.com/"&gt;dealers&lt;/a&gt;, &lt;a href="http://bitcoincharts.com/markets/"&gt;exchanges&lt;/a&gt;, &lt;a href="https://clearcoin.appspot.com/"&gt;an escrow service&lt;/a&gt;, &lt;a href="http://www.bitcoin-charity.com/"&gt;charities&lt;/a&gt; and a place to &lt;a href="http://draft.blogger.com/mybitcoin.com"&gt;keep your treasure hoard online&lt;/a&gt;.&lt;br /&gt;
I'm curious to see how much of the existing financial system gets ported to Bitcoin. Is fractional reserve banking in Bitcoin possible? Or how about securities denominated in Bitcoin? If you're a sceptic, can you sell short? One thing I love about Bitcoin is the mixture of engineering and economics, and even more, the engineering &lt;i&gt;of&lt;/i&gt; economics. Of course, this comes with all the caveats and warnings of version 0.1.&lt;br /&gt;
The &lt;a href="http://pragmaticpoliticaleconomy.blogspot.com/2010/03/future-of-money.html"&gt;future of money&lt;/a&gt; is here. Are we ready for it?&lt;br /&gt;
&lt;h4&gt;
PS&lt;/h4&gt;
I'm ready! Support this blog. Tips accepted here: 15Y9pepdBG9GJxyCc6HgsQS39BvsBUqi1W&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1233838854829437161?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1233838854829437161/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/future-of-money.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1233838854829437161'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1233838854829437161'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/future-of-money.html' title='The future of money'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-dlOzxKUVpU4/Tga8rTkVTXI/AAAAAAAAC6I/xznxg66QGTQ/s72-c/MercuryCoin.jpeg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-8929554257317411750</id><published>2011-06-24T10:40:00.000-07:00</published><updated>2011-12-14T23:58:23.762-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' 
term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Drawing heatmaps in R</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TQV5SBTa2KI/AAAAAAAACxk/KgtUFtpRZ50/s1600/ferrari-dino-246-gt-5w.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 150px; height:100px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TQV5SBTa2KI/AAAAAAAACxk/KgtUFtpRZ50/s400/ferrari-dino-246-gt-5w.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5549975466298497186" /&gt;&lt;/a&gt;

&lt;p&gt;A while back, while reading chapter 4 of &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt;, I fooled around with the &lt;i&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html"&gt;mtcars&lt;/a&gt;&lt;/i&gt; dataset giving mechanical and performance properties of cars from the early 70's. Let's plot this data as a hierarchically clustered &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/heatmap.html"&gt;heatmap&lt;/a&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# scale data to mean=0, sd=1 and convert to matrix
mtscaled &amp;lt;- as.matrix(scale(mtcars))

# create heatmap; columns are still clustered, but the column dendrogram isn't reordered
heatmap(mtscaled, Colv=FALSE, scale='none')
&lt;/pre&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-38wWn7KD6v0/TgTJskG1ujI/AAAAAAAAC5g/0k33b0L5fL8/s1600/heatmap.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="327" width="400" src="http://2.bp.blogspot.com/-38wWn7KD6v0/TgTJskG1ujI/AAAAAAAAC5g/0k33b0L5fL8/s400/heatmap.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;By default, heatmap clusters by both rows and columns. It then &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/reorder.dendrogram.html"&gt;reorders&lt;/a&gt; the resulting dendrograms according to mean. Setting &lt;i&gt;Colv&lt;/i&gt; to false still clusters the columns but tells heatmap not to reorder the column dendrogram, which will come in handy later. Let's also turn off the default &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/scale.html"&gt;scaling&lt;/a&gt; across rows. We've already scaled across columns, which is the sensible thing to do in this case.&lt;/p&gt;

&lt;p&gt;If our columns are already in some special order, say as a time-series or by increasing dosage, we might want to cluster only rows. We could do that by setting the &lt;i&gt;Colv&lt;/i&gt; argument to NA. One thing that clustering the columns tells us in this case is that some information is highly correlated, bordering on redundant. For example, displacement, horsepower and number of cylinders are quite similar. And the idea that to get more power (hp) and go faster (qsec) we need to burn more gas (mpg) is pretty well supported.&lt;/p&gt;
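
&lt;p&gt;As a rough sketch of both points, the snippet below clusters rows only and then checks a few of those correlations directly. It uses nothing beyond base R and the built-in mtcars data. The name &lt;i&gt;power.vars&lt;/i&gt; is mine, just for illustration.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# scale as before
mtscaled &amp;lt;- as.matrix(scale(mtcars))

# cluster rows only; Colv=NA drops the column dendrogram and keeps the column order
heatmap(mtscaled, Colv=NA, scale='none')

# pairwise correlations among cylinders, displacement and horsepower
power.vars &amp;lt;- cor(mtcars[, c('cyl', 'disp', 'hp')])
round(power.vars, 2)

# more horsepower goes with fewer miles per gallon, so this comes out negative
cor(mtcars$hp, mtcars$mpg)
&lt;/pre&gt;

&lt;p&gt;The three engine variables all come out strongly positively correlated with each other, which is what the column dendrogram was hinting at.&lt;/p&gt;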

&lt;h4&gt;Separating clusters&lt;/h4&gt;

&lt;p&gt;If we'd like to separate out the clusters, I'm not sure of the best approach. One way is to use &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html"&gt;hclust&lt;/a&gt; and &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cutree.html"&gt;cutree&lt;/a&gt;, which allows you to specify &lt;i&gt;k&lt;/i&gt;, the number of clusters you want. Don't forget that &lt;i&gt;hclust&lt;/i&gt; requires a distance matrix as input.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# cluster rows
hc.rows &amp;lt;- hclust(dist(mtscaled))
plot(hc.rows)

# transpose the matrix and cluster columns
hc.cols &amp;lt;- hclust(dist(t(mtscaled)))

# draw heatmap for first cluster
heatmap(mtscaled[cutree(hc.rows,k=2)==1,], Colv=as.dendrogram(hc.cols), scale='none')

# draw heatmap for second cluster
heatmap(mtscaled[cutree(hc.rows,k=2)==2,], Colv=as.dendrogram(hc.cols), scale='none')
&lt;/pre&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-sxbeFY-yzCo/TgTJsykT0kI/AAAAAAAAC5o/hZ5zF45pzs4/s1600/heatmap_cluster1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="327" width="400" src="http://1.bp.blogspot.com/-sxbeFY-yzCo/TgTJsykT0kI/AAAAAAAAC5o/hZ5zF45pzs4/s400/heatmap_cluster1.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-AI2dxe95VHk/TgTJtEkoBgI/AAAAAAAAC5w/XCyBw3qViGA/s1600/heatmap_cluster2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="327" width="400" src="http://3.bp.blogspot.com/-AI2dxe95VHk/TgTJtEkoBgI/AAAAAAAAC5w/XCyBw3qViGA/s400/heatmap_cluster2.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;That works, but I'd probably advise creating one heatmap and cutting it up in Illustrator, if need be. I have a nagging feeling that the color scale will end up being slightly different between the two clusters, since the range of values in each submatrix is different. Speaking of colors, if you don't like the default &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/palettes.html"&gt;heat colors&lt;/a&gt;, try creating a new palette with &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/colorRamp.html"&gt;color ramp&lt;/a&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
palette &amp;lt;- colorRampPalette(c(&amp;#x27;#f0f3ff&amp;#x27;,&amp;#x27;#0033BB&amp;#x27;))(256)
heatmap(mtscaled, Colv=FALSE, scale=&amp;#x27;none&amp;#x27;, col=palette)
&lt;/pre&gt;
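
&lt;p&gt;The worry about mismatched color scales between the two cluster heatmaps can probably be handled in R, too. Here's a sketch, assuming that extra arguments to heatmap are handed on to &lt;i&gt;image&lt;/i&gt; (which is what the heatmap help page says), so that &lt;i&gt;zlim&lt;/i&gt; can pin both plots to the same limits. The names &lt;i&gt;membership&lt;/i&gt;, &lt;i&gt;blues&lt;/i&gt; and &lt;i&gt;shared.zlim&lt;/i&gt; are made up for this example.&lt;/p&gt;

&lt;pre class="codebox"&gt;
mtscaled &amp;lt;- as.matrix(scale(mtcars))
hc.rows &amp;lt;- hclust(dist(mtscaled))
hc.cols &amp;lt;- hclust(dist(t(mtscaled)))

# named vector mapping each car to cluster 1 or 2
membership &amp;lt;- cutree(hc.rows, k=2)
table(membership)

# compare the range of values in each submatrix
range(mtscaled[membership==1,])
range(mtscaled[membership==2,])

# one palette and one set of limits, taken from the full matrix
blues &amp;lt;- colorRampPalette(c('#f0f3ff','#0033BB'))(256)
shared.zlim &amp;lt;- range(mtscaled)

# draw both clusters against the same color scale
for (i in 1:2) {
  heatmap(mtscaled[membership==i,], Colv=as.dendrogram(hc.cols),
          scale='none', col=blues, zlim=shared.zlim)
}
&lt;/pre&gt;

&lt;p&gt;Taking the limits from the full matrix means a given shade should stand for the same value in both plots.&lt;/p&gt;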

&lt;h4&gt;Confusing things&lt;/h4&gt;

&lt;p&gt;Another way to separate the clusters is to get the dendrograms out of heatmap and work with those. But &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cutree.html"&gt;cutree&lt;/a&gt; applies to objects of class hclust, as returned by hclust, not directly to dendrograms. It returns a vector assigning each row in the original data to a cluster. It takes either a height to cut at (&lt;i&gt;h&lt;/i&gt;) or the desired number of clusters (&lt;i&gt;k&lt;/i&gt;), which is nice.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/dendrogram.html"&gt;cut&lt;/a&gt; method, by contrast, applies to dendrograms, which can be returned by heatmap if the keep.dendro option is set. It takes only &lt;i&gt;h&lt;/i&gt;, not &lt;i&gt;k&lt;/i&gt;, and returns a list with members upper and lower. Lower is a list of subtrees below the cut point.&lt;/p&gt;
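
&lt;p&gt;To put the two side by side, here's a small sketch. It builds the dendrogram from hclust rather than pulling it out of heatmap, just to keep the example short. The trick of picking a height halfway between the top two merges is my own workaround for cut not accepting &lt;i&gt;k&lt;/i&gt;, and the names &lt;i&gt;by.k&lt;/i&gt;, &lt;i&gt;h.cut&lt;/i&gt; and &lt;i&gt;pieces&lt;/i&gt; are invented for illustration.&lt;/p&gt;

&lt;pre class="codebox"&gt;
hc.rows &amp;lt;- hclust(dist(as.matrix(scale(mtcars))))

# cutree works on the hclust object and accepts k
by.k &amp;lt;- cutree(hc.rows, k=2)
table(by.k)

# pick a height between the top two merges, so a cut there leaves two branches
h.cut &amp;lt;- mean(rev(hc.rows$height)[1:2])

# cut works on the dendrogram and accepts only h
pieces &amp;lt;- cut(as.dendrogram(hc.rows), h=h.cut)

# number of subtrees below the cut, and the cars in each
length(pieces$lower)
lapply(pieces$lower, labels)
&lt;/pre&gt;

&lt;p&gt;Each element of &lt;i&gt;pieces$lower&lt;/i&gt; is itself a dendrogram, so it can be plotted or cut again.&lt;/p&gt;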

&lt;p&gt;Doing graphics with R starts easy, but gets arcane quickly. There's also a heatmap.2 function in the &lt;a href="http://cran.r-project.org/web/packages/gplots/gplots.pdf"&gt;gplots&lt;/a&gt; package that adds color keys among other sparsely documented features.&lt;/p&gt;

&lt;p&gt;This all needs some serious straightening out, but the basics are easy enough. Here are a couple more resources to make your heatmaps extra-hot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/"&gt;Using R to draw a Heatmap from Microarray Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Flowing Data's tutorial &lt;a href="http://flowingdata.com/2010/01/21/how-to-make-a-heatmap-a-quick-and-easy-solution/"&gt;How to Make a Heatmap&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...&lt;a href="/p/r.html"&gt;more on R&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-8929554257317411750?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/8929554257317411750/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/drawing-heatmaps-in-r.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8929554257317411750'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8929554257317411750'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/drawing-heatmaps-in-r.html' title='Drawing heatmaps in R'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_dbECP0yvozc/TQV5SBTa2KI/AAAAAAAACxk/KgtUFtpRZ50/s72-c/ferrari-dino-246-gt-5w.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-316238564440272738</id><published>2011-06-07T07:51:00.000-07:00</published><updated>2011-06-07T07:51:07.841-07:00</updated><title type='text'>Ten Design Lessons</title><content type='html'>&lt;ol&gt;
&lt;li&gt;Respect &amp;#8220;the genius of a place.&amp;#8221;&lt;/li&gt;
&lt;li&gt;Subordinate details to the whole.&lt;/li&gt;
&lt;li&gt;The art is to conceal art.&lt;/li&gt;
&lt;li&gt;Aim for the unconscious.&lt;/li&gt;
&lt;li&gt;Avoid fashion for fashion’s sake.&lt;/li&gt;
&lt;li&gt;Formal training isn’t required.&lt;/li&gt;
&lt;li&gt;Words matter.&lt;/li&gt;
&lt;li&gt;Stand for something.&lt;/li&gt;
&lt;li&gt;Utility trumps ornament.&lt;/li&gt;
&lt;li&gt;Never too much, hardly enough.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;i&gt;from&lt;/i&gt; &lt;a href="http://37signals.com/svn/posts/2919-ten-design-lessons-from-frederick-law-olmsted-the-father-of-american-landscape-architecture" target="_blank"&gt;Frederick Law Olmsted&lt;/a&gt;, the father of American landscape architecture via &lt;a href="http://thisisnthappiness.com/post/5780542585/ten-design-lessons"&gt;This isn't happiness&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-316238564440272738?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/316238564440272738/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/ten-design-lessons.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/316238564440272738'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/316238564440272738'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/ten-design-lessons.html' title='Ten Design Lessons'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-9213682401266899536</id><published>2011-06-06T14:02:00.000-07:00</published><updated>2011-10-08T15:22:09.698-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='biology'/><category 
scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><category scheme='http://www.blogger.com/atom/ns#' term='links'/><category scheme='http://www.blogger.com/atom/ns#' term='Bioinformatics'/><title type='text'>Primers in Computational Biology</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://www.nature.com/nbt/journal/v29/n5/images/homecover.gif" imageanchor="1" style="clear:right; float:right; margin-left:1em; margin-bottom:1em"&gt;&lt;img border="0" height="171" width="130" src="http://www.nature.com/nbt/journal/v29/n5/images/homecover.gif" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="http://www.nature.com/nbt/current_issue/"&gt;Nature Biotechnology&lt;/a&gt; used to regularly feature primers on various topics in computational biology. Here's an incomplete listing based on what looked interesting to me. Some of these are old, but on topics that are fundamental enough not to go out of style. Lots of these are just mini-tutorials in machine learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v22/n7/full/nbt0704-909.html"&gt;What is dynamic programming?&lt;/a&gt; Sean R Eddy&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v22/n9/full/nbt0904-1177.html"&gt;What is Bayesian statistics?&lt;/a&gt; Sean R Eddy&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html"&gt;What is a hidden Markov model?&lt;/a&gt; Sean R Eddy&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html"&gt;How does gene expression clustering work?&lt;/a&gt; Patrik D'haeseleer&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v24/n1/full/nbt0106-51.html"&gt;Inference in Bayesian networks&lt;/a&gt; Chris J Needham, James R Bradford, Andrew J Bulpitt &amp;amp; David R Westhead&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v24/n4/full/nbt0406-423.html"&gt;What are DNA sequence motifs?&lt;/a&gt; Patrik D'haeseleer&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v24/n8/full/nbt0806-959.html"&gt;How does DNA sequence motif discovery work?&lt;/a&gt; Patrik D'haeseleer&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v24/n12/full/nbt1206-1565.html"&gt;What is a support vector machine?&lt;/a&gt; William S Noble&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v25/n7/full/nbt0707-755.html"&gt;How do shotgun proteomics algorithms identify proteins?&lt;/a&gt; Edward M Marcotte&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v26/n2/full/nbt1386.html"&gt;What are artificial neural networks?&lt;/a&gt; Anders Krogh&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html"&gt;What is principal component analysis?&lt;/a&gt; Markus Ringnér&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html"&gt;What is the expectation maximization algorithm?&lt;/a&gt; Chuong B Do &amp;amp; Serafim Batzoglou&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v26/n9/full/nbt0908-1011.html"&gt;What are decision trees?&lt;/a&gt; Carl Kingsford &amp;amp; Steven L Salzberg&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v27/n2/full/nbt0209-153.html"&gt;Understanding genome browsing&lt;/a&gt; Melissa S Cline &amp;amp; W James Kent&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v27/n5/full/nbt0509-455.html"&gt;How to map billions of short reads onto genomes&lt;/a&gt; Cole Trapnell &amp;amp; Steven L Salzberg&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v27/n10/full/nbt.1567.html"&gt;How to visually interpret biological data using networks&lt;/a&gt; Daniele Merico, David Gfeller &amp;amp; Gary D Bader&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v27/n12/full/nbt1209-1135.html"&gt;How does multiple testing correction work?&lt;/a&gt; William S Noble&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v28/n3/full/nbt.1614.html"&gt;What is flux balance analysis?&lt;/a&gt; Jeffrey D Orth, Ines Thiele &amp;amp; Bernhard Ø Palsson&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nbt/journal/v28/n4/full/nbt.1619.html"&gt;Analyzing 'omics data using hierarchical models&lt;/a&gt; Hongkai Ji &amp;amp; X Shirley Liu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...just in case you're in need of some bed-time reading or some mad comp-bio skillz. Sorry if some of these are behind a pay-wall, but there's usually a way around, under or over such walls.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-9213682401266899536?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/9213682401266899536/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/primers-in-computational-biology.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/9213682401266899536'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/9213682401266899536'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/primers-in-computational-biology.html' title='Primers in Computational Biology'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1508986024451151974</id><published>2011-06-04T15:51:00.000-07:00</published><updated>2011-06-04T20:38:42.758-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Programming languages'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Environments in R</title><content type='html'>&lt;a 
onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.r-project.org/"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 100px; height: 76px;" src="http://www.r-project.org/Rlogo.jpg" border="0" alt="The R Project" /&gt;&lt;/a&gt;

&lt;p&gt;One interesting thing about R is that you can get down into the insides fairly easily. You're allowed to see more of how things are put together than in most languages. One of the ways R does this is by having &lt;a href="http://funcall.blogspot.com/2009/09/first-class-environments.html"&gt;first-class environments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At first glance, &lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html#Environment-objects"&gt;&lt;i&gt;environments&lt;/i&gt;&lt;/a&gt; are simple enough. An environment is just a place to store variables - a set of bindings between symbols and objects. If you start up R and make an assignment, you're adding an entry in the global environment.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; a &amp;lt;- 1234
&amp;gt; e &amp;lt;- globalenv()
&amp;gt; ls()
[1] &amp;quot;a&amp;quot; &amp;quot;e&amp;quot;
&amp;gt; ls(e)
[1] &amp;quot;a&amp;quot; &amp;quot;e&amp;quot;
&amp;gt; e$a
[1] 1234
&amp;gt; class(e)
[1] &amp;quot;environment&amp;quot;
&lt;/pre&gt;

&lt;p&gt;Hmmm, the variable &lt;i&gt;e&lt;/i&gt; is part of the global environment and it refers to the global environment, too, which is kind of circular.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; ls(e$e$e$e$e$e$e$e)
[1] &amp;quot;a&amp;quot; &amp;quot;e&amp;quot;
&lt;/pre&gt;

&lt;p&gt;We'd better cut that out, before we're sucked into a cosmic vortex.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; rm(e)
&lt;/pre&gt;

&lt;p&gt;Most functional languages have some concept of environments, which serves as a higher level of abstraction over implementation details like allocating variables on the heap or stack. Saying that environments are &lt;i&gt;first-class&lt;/i&gt; means that you can manipulate them from within the language, which is less common. Several advanced language features of R are built out of environments. We'll look at functions, packages and namespaces, and point out several Scheme-like features in R.&lt;/p&gt;

&lt;p&gt;But first, the basics. The &lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html"&gt;R Language Definition&lt;/a&gt; gives this definition:&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html#Environment-objects"&gt;Environments can be thought of as consisting of two things: a frame, which is a set of symbol-value pairs, and an enclosure, a pointer to an enclosing environment. When R looks up the value for a symbol the frame is examined and if a matching symbol is found its value will be returned. If not, the enclosing environment is then accessed and the process repeated. Environments form a tree structure in which the enclosures play the role of parents. The tree of environments is rooted in an empty environment, available through emptyenv(), which has no parent.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;You can make a new environment with &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/environment.html"&gt;&lt;i&gt;new.env&lt;/i&gt;()&lt;/a&gt; and assign a couple of variables. The &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/assign.html"&gt;&lt;i&gt;assign&lt;/i&gt;&lt;/a&gt; function works, as does the odd but convenient dollar sign notation. Think of the dollar sign as equivalent to the 'dot' operator that dereferences object members in Java-ish languages.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; my.env &amp;lt;- new.env()
&amp;gt; my.env
&amp;lt;environment: 0x114a9d940&amp;gt;
&amp;gt; ls(my.env)
character(0)
&amp;gt; assign(&amp;quot;a&amp;quot;, 999, envir=my.env)
&amp;gt; my.env$foo = &amp;quot;This is the variable foo.&amp;quot;
&amp;gt; ls(my.env)
[1] &amp;quot;a&amp;quot;   &amp;quot;foo&amp;quot;
&lt;/pre&gt;

&lt;p&gt;Now we have two variables named &lt;i&gt;a&lt;/i&gt;, one in the global environment, the other in our new environment. Let's stick another variable &lt;i&gt;b&lt;/i&gt; in the global environment, just for kicks.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; a
[1] 1234
&amp;gt; my.env$a
[1] 999
&amp;gt; b &amp;lt;- 4567
&lt;/pre&gt;

&lt;p&gt;Also, note that the parent environment of &lt;i&gt;my.env&lt;/i&gt; is the global environment.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; parent.env(my.env)
&amp;lt;environment: R_GlobalEnv&amp;gt;
&lt;/pre&gt;
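
&lt;p&gt;To see the whole chain of enclosures the definition talks about, we can keep asking for the parent until we hit the empty environment. This is only a sketch: &lt;i&gt;env.chain&lt;/i&gt; is a hypothetical helper I made up, not something built into R, and what it prints in between depends on which packages are attached.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# env.chain is a hypothetical helper, not part of R
# it collects the name of each environment from env up to, but not including, emptyenv()
env.chain &amp;lt;- function(env) {
  chain &amp;lt;- character(0)
  while (!identical(env, emptyenv())) {
    chain &amp;lt;- c(chain, environmentName(env))
    env &amp;lt;- parent.env(env)
  }
  chain
}

# an anonymous environment has the empty string as its name
env.chain(new.env())
&lt;/pre&gt;

&lt;p&gt;The last entry is the base environment, whose parent is the empty environment.&lt;/p&gt;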

&lt;p&gt;A variable can be accessed using &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/get.html"&gt;&lt;i&gt;get&lt;/i&gt;&lt;/a&gt; or the dollar operator. By default, get continues up the chain of parents until it either finds a binding or reaches the empty environment. The dollar operator looks specifically in the given environment.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; get(&amp;#x27;a&amp;#x27;, envir=my.env)
[1] 999
&amp;gt; get(&amp;#x27;b&amp;#x27;, envir=my.env)
[1] 4567
&amp;gt; my.env$a
[1] 999
&amp;gt; my.env$b
NULL
&lt;/pre&gt;
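
&lt;p&gt;If it helps to see the lookup rule spelled out, here is a toy model of it in Ruby. It's only a sketch: the &lt;i&gt;Env&lt;/i&gt; class and its method names are made up for illustration and have nothing to do with how R is actually implemented. The &lt;i&gt;get&lt;/i&gt; method walks up the chain of parents, while the square brackets stand in for the dollar operator and look only in the one frame.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```ruby
# Toy model of an R environment: a frame (symbol-value pairs) plus an enclosure.
# Env and its methods are invented for this sketch; this is not R's implementation.
class Env
  attr_reader :frame, :parent

  def initialize(parent = nil)
    @frame = {}
    @parent = parent   # nil plays the role of the empty environment
  end

  def assign(name, value)
    @frame[name] = value
  end

  # Like the dollar operator: look only in this frame, nil if absent.
  def [](name)
    @frame[name]
  end

  # Like get(): walk up the enclosures until a binding is found.
  def get(name)
    env = self
    until env.nil?
      return env.frame[name] if env.frame.key?(name)
      env = env.parent
    end
    raise NameError, "object '#{name}' not found"
  end
end

global = Env.new
global.assign("a", 1234)
global.assign("b", 4567)

my_env = Env.new(global)
my_env.assign("a", 999)

puts my_env.get("a")        # 999, found in my_env's own frame
puts my_env.get("b")        # 4567, found in the parent
puts my_env["b"].inspect    # nil, the frame itself has no b
```
&lt;/pre&gt;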


&lt;h4&gt;Functions and environments&lt;/h4&gt;

&lt;p&gt;Functions have their own environments. This is the key to implementing closures. If you've never heard of a closure, it's just a function packaged up with some state. In fact, some say, &lt;a href="http://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg03277.html"&gt;closures are a poor man's object&lt;/a&gt;, while others insist it's the other way 'round. The &lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html"&gt;R Language Definition&lt;/a&gt; explains the relationship between functions and environments like this:&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html#Function-objects"&gt;Functions (or more precisely, function closures) have three basic components: a formal argument list, a body and an environment. [...] A function's environment is the environment that was active at the time that the function was created. [...] When a function is called, a new environment (called the evaluation environment) is created, whose enclosure is the environment from the function closure. This new environment is initially populated with the unevaluated arguments to the function; as evaluation proceeds, local variables are created within it.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;When a function is &lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html#Evaluation"&gt;evaluated&lt;/a&gt;, R looks in a series of environments for any variables in &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Scope"&gt;scope&lt;/a&gt;. The evaluation environment is first, then the function's enclosing environment, which will be the global environment for functions defined in the workspace. So, the global variable &lt;i&gt;a&lt;/i&gt;, which had the value 1234 last time we looked, can be referenced inside a function.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; f &amp;lt;- function(x) { x + a }
&amp;gt; environment(f)
&amp;lt;environment: R_GlobalEnv&amp;gt;
&amp;gt; f(4321)
[1] 5555
&lt;/pre&gt;

&lt;p&gt;We can change a function's environment if we want to.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; environment(f) &amp;lt;- my.env
&amp;gt; environment(f)
&amp;lt;environment: 0x114a9d940&amp;gt;
&amp;gt; my.env$a
[1] 999
&amp;gt; f(1)
[1] 1000
&lt;/pre&gt;
  
&lt;p&gt;Suppose we wanted a counter to keep track of progress of some kind. That could be written and applied like so:&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; createCounter &amp;lt;- function(value) { function(i) { value &amp;lt;&amp;lt;- value+i} }
&amp;gt; counter &amp;lt;- createCounter(0)
&amp;gt; counter(1)
&amp;gt; a &amp;lt;- counter(0)
&amp;gt; a
[1] 1
&amp;gt; counter(1)
&amp;gt; counter(1)
&amp;gt; a &amp;lt;- counter(1)
&amp;gt; a
[1] 4
&amp;gt; a &amp;lt;- counter(5)
&amp;gt; a
[1] 9
&lt;/pre&gt;

&lt;p&gt;Notice the special &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/assignOps.html"&gt;&amp;lt;&amp;lt;- assignment operator&lt;/a&gt;. If we had used the normal &amp;lt;- assignment operator, we would have created a new variable '&lt;i&gt;value&lt;/i&gt;' in the evaluation environment of the function masking the &lt;i&gt;value&lt;/i&gt; in the function closure environment. That environment disappears as soon as the function returns, sending our new &lt;i&gt;value&lt;/i&gt; into the ether. What we want to do is change the &lt;i&gt;value&lt;/i&gt; in the function closure environment, so that assignments to &lt;i&gt;value&lt;/i&gt; will be persistent across invocations of our counter. Mutable state is generally not the default in functional languages, so we have to use the special assignment operator.&lt;/p&gt;

&lt;p&gt;Just to look under the covers, where is that mutable state? In the counter function's enclosing environment.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&gt; ls(environment(counter))
[1] "value"
&gt; environment(counter)$value
[1] 9
&lt;/pre&gt;

&lt;p&gt;For those who geek out on this stuff, this is an implementation of Paul Graham's &lt;a href="http://www.paulgraham.com/accgen.html"&gt;Accumulator Generator&lt;/a&gt; from his article &lt;a href="http://www.paulgraham.com/icad.html"&gt;Revenge of the Nerds&lt;/a&gt;, which, years ago, I struggled to &lt;a href="http://www.cbare.org/writing/accumulator/accumulator_generator.html"&gt;implement in Java&lt;/a&gt;.&lt;/p&gt;
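
&lt;p&gt;For comparison, here's roughly the same counter in Ruby. Treat it as a sketch; &lt;i&gt;create_counter&lt;/i&gt; is just a name made up for the example. A Ruby lambda shares the local variables of the method that created it, so plain assignment does the job that &amp;lt;&amp;lt;- does in R.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```ruby
# Ruby sketch of the same counter; create_counter is a hypothetical name.
def create_counter(value)
  # The lambda closes over the local variable value, so updates persist between calls.
  lambda { |i| value += i }
end

counter = create_counter(0)
counter.call(1)
counter.call(3)
puts counter.call(5)   # 9
```
&lt;/pre&gt;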

&lt;p&gt;Inspired by Scheme, lexical scoping is R's major point of departure from the S language. Gentleman and Ihaka's papers &lt;a href="http://www.jstor.org/pss/1390807"&gt;R: A Language for Data Analysis and Graphics&lt;/a&gt; (&lt;a href="http://biostat.mc.vanderbilt.edu/twiki/pub/Main/JeffreyHorner/JCGSR.pdf"&gt;pdf&lt;/a&gt;) and &lt;a href="http://www.jstor.org/pss/1390942"&gt;Lexical Scope and Statistical Computing&lt;/a&gt; (&lt;a href="http://www.stat.auckland.ac.nz/~ihaka/downloads/lexical.pdf"&gt;pdf&lt;/a&gt;) describe some of their language design decisions around this point.&lt;/p&gt;

&lt;p&gt;For functions defined in a package, the situation gets a bit more interesting. The various parts of the &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/plot.html"&gt;plot&lt;/a&gt; function are visible below, including a parameter list, (&lt;i&gt;x&lt;/i&gt;, &lt;i&gt;y&lt;/i&gt;, and some &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#The-three-dots-argument"&gt;other junk&lt;/a&gt;), a block of code, elided here, and an environment, which is the namespace for the graphics package. Packages and namespaces are our next topic.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; plot
function (x, y, ...) 
{
  ...blah, blah, blah...
}
&amp;lt;environment: namespace:graphics&amp;gt;
&lt;/pre&gt;


&lt;h4&gt;Packages and namespaces&lt;/h4&gt;

&lt;p&gt;Walking up the chain of environments starting with the global environment, we see the packages loaded into R.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; globalenv()
&amp;lt;environment: R_GlobalEnv&amp;gt;
&amp;gt; g &amp;lt;- globalenv()
&amp;gt; while (environmentName(g) != &amp;#x27;R_EmptyEnv&amp;#x27;) { g &amp;lt;- parent.env(g); cat(str(g, give.attr=F)) }
&amp;lt;environment: 0x100fdf078&amp;gt;
&amp;lt;environment: package:stats&amp;gt;
&amp;lt;environment: package:graphics&amp;gt;
&amp;lt;environment: package:grDevices&amp;gt;
&amp;lt;environment: package:utils&amp;gt;
&amp;lt;environment: package:datasets&amp;gt;
&amp;lt;environment: package:methods&amp;gt;
&amp;lt;environment: 0x101a19f58&amp;gt;
&amp;lt;environment: base&amp;gt;
&amp;lt;environment: R_EmptyEnv&amp;gt;
&lt;/pre&gt;

&lt;p&gt;Oddly, you can't test environments for equality with ==. If you try, it says, "comparison (1) is possible only for atomic and list types". The &lt;i&gt;identical&lt;/i&gt; function does work on environments, but here we test for the end of the chain by name.&lt;/p&gt;

&lt;p&gt;This same information can be had in slightly nicer form using &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/search.html"&gt;&lt;i&gt;search&lt;/i&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; search()
 [1] &amp;quot;.GlobalEnv&amp;quot;        &amp;quot;tools:RGUI&amp;quot;        &amp;quot;package:stats&amp;quot;     &amp;quot;package:graphics&amp;quot; 
 [5] &amp;quot;package:grDevices&amp;quot; &amp;quot;package:utils&amp;quot;     &amp;quot;package:datasets&amp;quot;  &amp;quot;package:methods&amp;quot;  
 [9] &amp;quot;Autoloads&amp;quot;         &amp;quot;package:base&amp;quot;
&lt;/pre&gt;

&lt;p&gt;By now, you can guess how &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/attach.html"&gt;&lt;i&gt;attach&lt;/i&gt;&lt;/a&gt; works. It creates an environment and slots it into the list right after the global environment, then populates it with the objects we're attaching.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&amp;gt; beatles &amp;lt;- list(&amp;#x27;george&amp;#x27;=&amp;#x27;guitar&amp;#x27;,&amp;#x27;ringo&amp;#x27;=&amp;#x27;drums&amp;#x27;,&amp;#x27;paul&amp;#x27;=&amp;#x27;bass guitar&amp;#x27;,&amp;#x27;john&amp;#x27;=&amp;#x27;guitar&amp;#x27;)
&amp;gt; attach(beatles)
&amp;gt; search()
 [1] &amp;quot;.GlobalEnv&amp;quot;        &amp;quot;beatles&amp;quot;           &amp;quot;tools:RGUI&amp;quot;        &amp;quot;package:stats&amp;quot;    
 [5] &amp;quot;package:graphics&amp;quot;  &amp;quot;package:grDevices&amp;quot; &amp;quot;package:utils&amp;quot;     &amp;quot;package:datasets&amp;quot; 
 [9] &amp;quot;package:methods&amp;quot;   &amp;quot;Autoloads&amp;quot;         &amp;quot;package:base&amp;quot;     
&amp;gt; john
[1] &amp;quot;guitar&amp;quot;
&amp;gt; paul
[1] &amp;quot;bass guitar&amp;quot;
&amp;gt; george
[1] &amp;quot;guitar&amp;quot;
&amp;gt; ringo
[1] &amp;quot;drums&amp;quot;
&lt;/pre&gt;

&lt;p&gt;Attaching a package using &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/library.html"&gt;library&lt;/a&gt; adds an entry to the chain of environments. A package can optionally have another environment, a &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Namespaces"&gt;namespace&lt;/a&gt;, whose purpose is to prevent naming clashes between packages and hide internal implementation details. &lt;i&gt;R Internals&lt;/i&gt; explains it like this:&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://cran.r-project.org/doc/manuals/R-ints.html#Environments-and-variable-lookup"&gt;A package &lt;i&gt;pkg&lt;/i&gt; with a name space defines two environments &lt;i&gt;namespace:pkg&lt;/i&gt; and &lt;i&gt;package:pkg&lt;/i&gt;. It is &lt;i&gt;package:pkg&lt;/i&gt; that can be attached and form part of the search path.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;When a namespaced package is attached, a new environment is created and all exported items are copied into it. That's &lt;i&gt;package:pkg&lt;/i&gt; in the quote above and is what you see in the search path. The namespace becomes the environment for the functions in that package. The parent environment of the namespace holds all the imports declared by the package. And the parent of that is a special copy of the base environment whose parent is the global environment.&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-4EOIShdwnmY/Teq1azurvfI/AAAAAAAAC5Y/GdCxBkVmmUI/s1600/environments.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="290" width="320" src="http://3.bp.blogspot.com/-4EOIShdwnmY/Teq1azurvfI/AAAAAAAAC5Y/GdCxBkVmmUI/s320/environments.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;We can see what namespaces are loaded using &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/ns-load.html"&gt;loadedNamespaces&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="codebox"&gt;
&gt; loadedNamespaces()
[1] "base"      "graphics"  "grDevices" "methods"   "stats"     "tools"    
[7] "utils"
&lt;/pre&gt;

&lt;p&gt;What if the same name is used in multiple environments? In general, R walks up the chain of environments and uses the first binding for a symbol it finds. R is smart enough to distinguish functions from other types. Here we try to mask the &lt;i&gt;mean&lt;/i&gt; function, but R can still find it, knowing that we're trying to apply a function.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&gt; z = list(mean='fluffernutter')
&gt; attach(z)
&gt; mean
[1] "fluffernutter"
&gt; mean(c(1,2,3,4))
[1] 2.5
&gt; detach(z)
&lt;/pre&gt;

&lt;p&gt;We can mask a function with another function. Now, the mean of any list of numbers is "flapdoodle".&lt;/p&gt;

&lt;pre class="codebox"&gt;
&gt; z = list(mean=function(x){ return("flapdoodle") })
&gt; attach(z)
The following object(s) are masked from 'package:base':
    mean
&gt; mean(c(4,5,6,7))
[1] "flapdoodle"
&lt;/pre&gt;

&lt;p&gt;The &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/ns-dblcolon.html"&gt;double-colon operator&lt;/a&gt; will let us specify which mean function we want. And, if you like to break the rules, the triple-colon operator lets you reach inside namespaces and touch private non-exported elements.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&gt; base::mean(c(6,7,8,9))
[1] 7.5
&lt;/pre&gt;

&lt;p&gt;So, there you have two fairly advanced language features built on the simple abstraction of environments. Thrown in for free is a nice look at R's functional side.&lt;/p&gt;

&lt;p&gt;Is that everything you wanted to know about environments but were afraid to ask? Be warned that I'm just figuring this stuff out myself. If I've gotten anything bass-ackwards, please let me know. There's more information below, in case you can't get enough.&lt;/p&gt;

&lt;h4&gt;More Information&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;R Language Definition &lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html#Environment-objects"&gt;Environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/environment.html"&gt;Environment Access&lt;/a&gt; functions&lt;/li&gt;
&lt;li&gt;R Internals on &lt;a href="http://cran.r-project.org/doc/manuals/R-ints.html#Environments-and-variable-lookup"&gt;Environments and variable lookup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Scope"&gt;Scope&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://cran.r-project.org/doc/manuals/R-lang.html#Evaluation"&gt;Evaluation&lt;/a&gt; of functions&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/ns-internal.html"&gt;Name Space Internals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/ns-load.html"&gt;Loading and Unloading Name Spaces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/ns-dblcolon.html"&gt;Double Colon and Triple Colon Operators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://cran.r-project.org/doc/manuals/R-exts.html#Package-name-spaces"&gt;Package name spaces&lt;/a&gt; from &lt;i&gt;Writing R Extensions&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cs.uiowa.edu/~luke/R/namespaces/morenames.pdf"&gt;A Simple Implementation of Name Spaces for R&lt;/a&gt;, Luke Tierney, 2003&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.scidav.org/techno/r_environments"&gt;R environments and formula objects&lt;/a&gt; teaches you about "reaching back up the call stack with your zombie programmer hand to eat the brains of the code that called you". Who could resist that?&lt;/li&gt;
&lt;li&gt;For an education on functional programming and closures, see &lt;a href="http://mitpress.mit.edu/sicp/"&gt;SICP&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1508986024451151974?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1508986024451151974/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/environments-in-r.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1508986024451151974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1508986024451151974'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/06/environments-in-r.html' title='Environments in R'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-4EOIShdwnmY/Teq1azurvfI/AAAAAAAAC5Y/GdCxBkVmmUI/s72-c/environments.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4372966665570908563</id><published>2011-05-14T19:38:00.000-07:00</published><updated>2011-11-25T17:24:43.905-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rant'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><title type='text'>HTC Incredible internal memory</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: 
center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-wDj1uw6RnAY/Tc88pdQ-dPI/AAAAAAAAC5A/5HniloJm90Q/s1600/DROID%2BINCREDIBLE%2Bby%2BHTC.png" imageanchor="1" style="clear:right; float:right; margin-left:1em; margin-bottom:1em"&gt;&lt;img border="0" height="200" width="142" src="http://1.bp.blogspot.com/-wDj1uw6RnAY/Tc88pdQ-dPI/AAAAAAAAC5A/5HniloJm90Q/s200/DROID%2BINCREDIBLE%2Bby%2BHTC.png" /&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;My phone, an HTC Droid Incredible, may be hopelessly antiquated by the standards of true mobile hipsters. Still, it came with a generous 8GB internal storage. Too bad the SD card is a chintzy 2GB. These days, you get more than 2 gigs on an &lt;i&gt;abacus&lt;/i&gt;. It seems like Android wants to use internal storage for the OS and apps, reserving the SD card for media, which makes that 8GB/2GB split a little awkward. I filled that 2GB right up with choice sides of Miles and 'Trane in no time. And, what do I need with 8 gigs worth of apps? What am I running, &lt;i&gt;Bloatpad 2.0&lt;/i&gt;? So, anyway, I decided I wanted to use the empty 6 plus gigs on the internal storage for some Thelonious. So, can I do that?&lt;/p&gt;

&lt;p&gt;Well, the &lt;a href="http://www.htc.com/us/support/droid-incredible-verizon/help/multimedia"&gt;Help/How to&lt;/a&gt; thing at HTC says, "...Music only plays audio files saved on the storage card...". Well, I use another media player anyway - Meridian. Then there's an article called &lt;a href="http://stackoverflow.com/questions/2673323/programmitically-accessing-internal-storage-not-sd-card-on-verizon-htc-droid-in"&gt;&lt;i&gt;Programmitically&lt;/i&gt; accessing internal storage (not SD card) on Verizon HTC Droid Incredible (Android)&lt;/a&gt;, which says, "To be quite honest, the internal storage is a joke. Just think of it as a flash drive..."&lt;/p&gt;

&lt;p&gt;But, it turns out, you &lt;i&gt;can&lt;/i&gt; access music and other media on the internal storage. You just have to know that the internal storage is mounted as &lt;b&gt;&lt;i&gt;/emmc&lt;/i&gt;&lt;/b&gt;. Maybe they shoulda called it /WTF.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4372966665570908563?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4372966665570908563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/05/htc-incredible-useless-internal-memory.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4372966665570908563'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4372966665570908563'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/05/htc-incredible-useless-internal-memory.html' title='HTC Incredible internal memory'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-wDj1uw6RnAY/Tc88pdQ-dPI/AAAAAAAAC5A/5HniloJm90Q/s72-c/DROID%2BINCREDIBLE%2Bby%2BHTC.png' height='72' 
width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-5636870903256848268</id><published>2011-05-07T22:05:00.000-07:00</published><updated>2011-06-01T10:29:09.437-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='db'/><title type='text'>Rails 3 MongoDB recipe</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://media.mongodb.org/logo-mongodb.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 217px; height: 90px;" src="http://media.mongodb.org/logo-mongodb.png" border="0" alt="" /&gt;&lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;Here's a quick and dirty recipe for &lt;a href="http://www.mongodb.org/display/DOCS/Rails+3+-+Getting+Started"&gt;getting started&lt;/a&gt; with &lt;a href="http://www.mongodb.org/"&gt;MongoDB&lt;/a&gt; and Rails 3 using &lt;a href="http://mongomapper.com/"&gt;mongo_mapper&lt;/a&gt;. We'll get set up and test out a RESTful web service.&lt;/p&gt;

&lt;p&gt;I installed MongoDB with &lt;a href="http://www.macports.org/"&gt;MacPorts&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="codebox"&gt;
sudo port selfupdate
sudo port install mongodb
&lt;/pre&gt;

&lt;p&gt;Start a rails project. You may want to leave out ActiveRecord using the switch --skip-active-record as suggested by the &lt;a href="http://www.mongodb.org/display/DOCS/Rails+3+-+Getting+Started"&gt;Getting Started&lt;/a&gt; guide. That's optional. You might want to leave it in, if you intend to use a relational DB alongside Mongo.&lt;/p&gt;
&lt;pre class="codebox"&gt;
  rails new &amp;lt;myproject&amp;gt;
&lt;/pre&gt;

&lt;p&gt;Add to Gemfile&lt;/p&gt;
&lt;pre class="codebox"&gt;
source 'http://gemcutter.org'
&lt;/pre&gt;
&lt;pre class="codebox"&gt;
gem 'mongrel'
gem 'mongo_mapper'
gem 'bson'
gem 'bson_ext'
gem 'SystemTimer'
gem 'rails3-generators'
gem 'json'
&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://gembundler.com/"&gt;Bundler&lt;/a&gt; happily installs all this stuff.&lt;/p&gt;
&lt;pre class="codebox"&gt;
bundle install
&lt;/pre&gt;

&lt;p&gt;Create config/initializers/mongodb.rb&lt;/p&gt;
&lt;pre class="codebox"&gt;
MongoMapper.connection = Mongo::Connection.new('localhost', 27017)
MongoMapper.database = "myapp-#{Rails.env}"

if defined?(PhusionPassenger)
   PhusionPassenger.on_event(:starting_worker_process) do |forked|
     MongoMapper.connection.connect if forked
   end
end
&lt;/pre&gt;

&lt;p&gt;Start the MongoDB server, giving it a place to put data files:&lt;/p&gt;
&lt;pre class="codebox"&gt;
mkdir &amp;lt;path&amp;gt;/data
mongod --dbpath &amp;lt;path&amp;gt;/data
&lt;/pre&gt;

&lt;p&gt;Create an entity to be stored in MongoDB. I like to use a database of &lt;i&gt;Nerds&lt;/i&gt; as my test application.&lt;/p&gt;
&lt;pre class="codebox"&gt;
rails generate scaffold Nerd name:string description:string iq:integer --orm=mongo_mapper 
&lt;/pre&gt;

&lt;p&gt;I originally gave description type "text", but that didn't work for me, producing, "NameError in NerdsController#show, uninitialized constant Nerd::Text". So, I edited the model file to look like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
class Nerd
  include MongoMapper::Document         
  key :name, String
  key :description, String
  key :iq, Integer
end
&lt;/pre&gt;

&lt;h3&gt;Getting mongodb objects as XML doesn't work?&lt;/h3&gt;
&lt;p&gt;I saw an error, &lt;i&gt;undefined method `each'&lt;/i&gt;, with mongo_mapper 0.8.6, which was fixed by upgrading to 0.9.0.&lt;/p&gt;

&lt;pre class="codebox"&gt;
Started GET &amp;quot;/nerds/4dc5c41a1ff2367744000004.xml&amp;quot; for 127.0.0.1 at Sat May 07 15:41:54 -0700 2011
  Processing by NerdsController#show as XML
  Parameters: {&amp;quot;id&amp;quot;=&amp;gt;&amp;quot;4dc5c41a1ff2367744000004&amp;quot;}
Completed 200 OK in 9ms (Views: 2.6ms | ActiveRecord: 0.0ms)
Sat May 07 15:41:54 -0700 2011: Read error: #&amp;lt;NoMethodError: undefined method `each&amp;#x27; for #&amp;lt;Nerd:0x1040bfb80&amp;gt;&amp;gt;
&lt;/pre&gt;

&lt;h3&gt;JSON web services in Rails&lt;/h3&gt;
&lt;p&gt;You'll likely want to serve up and accept JSON in HTTP requests. For some crazy reason, Rails doesn't generate code for JSON in the controller, only HTML and XML. You have to add a line to the respond_to code block, like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
class NerdsController &amp;lt; ApplicationController
  # GET /nerds
  # GET /nerds.xml
  def index
    @nerds = Nerd.all

    respond_to do |format|
      format.html # index.html.erb
      format.xml  { render :xml =&amp;gt; @nerds }
      &lt;span style="color:red;"&gt;format.json { render :json =&amp;gt; @nerds }&lt;/span&gt;
    end
  end

  # GET /nerds/1
  # GET /nerds/1.xml
  def show
    @nerd = Nerd.find(params[:id])

    respond_to do |format|
      format.html # show.html.erb
      format.xml  { render :xml =&amp;gt; @nerd }
      &lt;span style="color:red;"&gt;format.json { render :json =&amp;gt; @nerd }&lt;/span&gt;
    end
  end
  ...
end
&lt;/pre&gt;

&lt;p&gt;Use &lt;a href="http://curl.haxx.se/"&gt;curl&lt;/a&gt; to test that this works, asking for JSON with the &lt;i&gt;Accept&lt;/i&gt; header. Leave the header off and you should get HTML back.&lt;/p&gt;

&lt;pre class="codebox"&gt;
curl --request GET -H "accept: application/json" http://localhost:3000/nerds/4dc5c41a1ff2367744000004
&lt;/pre&gt;

&lt;p&gt;OK, now we can serve JSON, how about receiving it in POST requests? A couple more quick additions and we're in business.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# POST /nerds
# POST /nerds.xml
def create
  @nerd = Nerd.new(params[:nerd])

  respond_to do |format|
    if @nerd.save
      format.html { redirect_to(@nerd, :notice =&gt; 'Nerd was successfully created.') }
      format.xml  { render :xml =&gt; @nerd, :status =&gt; :created, :location =&gt; @nerd }
      &lt;span style="color:red"&gt;format.json  { render :json =&gt; @nerd, :status =&gt; :created, :location =&gt; @nerd }&lt;/span&gt;
    else
      format.html { render :action =&gt; "new" }
      format.xml  { render :xml =&gt; @nerd.errors, :status =&gt; :unprocessable_entity }
      &lt;span style="color:red"&gt;format.json  { render :json =&gt; @nerd.errors, :status =&gt; :unprocessable_entity }&lt;/span&gt;
    end
  end
end
&lt;/pre&gt;

&lt;pre class="codebox"&gt;
curl --request POST -H "Content-Type: application/json" -H "Accept: application/json" --data '{"nerd":{"name":"Donald Knuth", "iq":199, "description":"Writes hard books."}}' http://localhost:3000/nerds
&lt;/pre&gt;
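
&lt;p&gt;If you'd rather drive the service from Ruby than from curl, something like the following should be equivalent to the command above. It's a sketch that uses only Net::HTTP and JSON from the standard library, and it assumes the development server is listening on localhost:3000. The &lt;i&gt;send_request&lt;/i&gt; helper is made up for this example; the request is built but not sent until you call it.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```ruby
require 'net/http'
require 'json'
require 'uri'

# Build the same POST as the curl command; the URL assumes the
# development server from this recipe is running on localhost:3000.
uri = URI.parse('http://localhost:3000/nerds')
request = Net::HTTP::Post.new(uri.request_uri)
request['Content-Type'] = 'application/json'
request['Accept'] = 'application/json'
request.body = JSON.generate(
  'nerd' => { 'name' => 'Donald Knuth', 'iq' => 199, 'description' => 'Writes hard books.' }
)

# Hypothetical helper: nothing touches the network until this is called.
def send_request(uri, request)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

# response = send_request(uri, request)
# puts response.code, response.body
```
&lt;/pre&gt;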

&lt;p&gt;If you have an entity named "nerd" it's pretty reasonable to expect the XML representation to look something like &lt;i&gt;&amp;lt;nerd&amp;gt;...&amp;lt;/nerd&amp;gt;&lt;/i&gt;. By analogy to that, they expect your JSON to look like this: &lt;i&gt;{"nerd":{...}}&lt;/i&gt;, which I'm not sure I like. I coded around the issue, which might be unwise, by making the controller's create method look like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
# POST /nerds
# POST /nerds.xml
def create

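  # Rails has already parsed the wrapped form {"nerd":{...}} into params[:nerd].
  # Otherwise, for a JSON request, parse the bare object from the request body.
  if params[:nerd]
    @nerd = Nerd.new(params[:nerd])
  elsif request.content_type == 'application/json'
    @nerd = Nerd.new(JSON.parse(request.body.read))
  else
    @nerd = Nerd.new
  end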

  respond_to do |format|
    if @nerd.save
      format.html { redirect_to(@nerd, :notice =&gt; 'Nerd was successfully created.') }
      format.xml  { render :xml =&gt; @nerd, :status =&gt; :created, :location =&gt; @nerd }
      format.json  { render :json =&gt; @nerd, :status =&gt; :created, :location =&gt; @nerd }
    else
      format.html { render :action =&gt; "new" }
      format.xml  { render :xml =&gt; @nerd.errors, :status =&gt; :unprocessable_entity }
      format.json  { render :json =&gt; @nerd.errors, :status =&gt; :unprocessable_entity }
    end
  end
end
&lt;/pre&gt;

&lt;p&gt;...which you can test with curl like so:&lt;/p&gt;

&lt;pre class="codebox"&gt;
curl --request POST -H "content-type: application/json" --data '{"name":"Bozo", "iq":178, "description":"A very smart clown."}' http://localhost:3000/nerds
&lt;/pre&gt;
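
&lt;p&gt;The idea of accepting both shapes can be isolated from Rails altogether. Here's a minimal sketch in plain Ruby, with a hypothetical helper named &lt;i&gt;nerd_attributes&lt;/i&gt; that isn't part of Rails or MongoMapper. It unwraps {"nerd":{...}} if the wrapper is there and otherwise passes the bare object through.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```ruby
require 'json'

# Hypothetical helper, not part of Rails or MongoMapper: accept either
# {"nerd":{...}} or a bare {...} and return just the attributes.
def nerd_attributes(json_text, root = 'nerd')
  data = JSON.parse(json_text)
  wrapped = data.is_a?(Hash) ? data[root] : nil
  wrapped.is_a?(Hash) ? wrapped : data
end

p nerd_attributes('{"nerd":{"name":"Donald Knuth","iq":199}}')
p nerd_attributes('{"name":"Bozo","iq":178}')
```
&lt;/pre&gt;
&lt;p&gt;Both calls return a flat hash of attributes, which is the shape a model constructor wants.&lt;/p&gt;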

&lt;p&gt;So, there you have it - Rails 3 and MongoDB playing ReSTfully together.&lt;/p&gt;

&lt;p&gt;&lt;span style="font-weight:bold; color:red;"&gt;Warning&lt;/span&gt;: Current release versions of MongoDB have a &lt;a href="http://stackoverflow.com/questions/4667597/understanding-mongodb-bson-document-size-limit"&gt;4MB maximum document size&lt;/a&gt;. This makes some sense as documents are often serialized and deserialized in memory. Apparently, this is being &lt;a href="https://jira.mongodb.org/browse/SERVER-431"&gt;raised in later versions&lt;/a&gt;. Fully streaming APIs would certainly help, but that brings up the question of how much of the XML dog-pile of technologies will (or should) be replicated in JSON. Guess we'll see.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-5636870903256848268?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/5636870903256848268/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/05/rails-3-mongodb-recipe.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5636870903256848268'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5636870903256848268'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/05/rails-3-mongodb-recipe.html' title='Rails 3 MongoDB recipe'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-6147973667315560133</id><published>2011-04-21T09:27:00.000-07:00</published><updated>2011-04-22T08:23:43.452-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='software engineering'/><category scheme='http://www.blogger.com/atom/ns#' term='Bioinformatics'/><title type='text'>Genome Browser's Anonymous</title><content type='html'>&lt;p&gt;As you may know, I'm starting a support group for those afflicted with the tragedy of having written a &lt;a href="/2008/12/browsing-genomes.html"&gt;genome browser&lt;/a&gt;. Mine is called the &lt;a href="http://gaggle.systemsbiology.net/docs/geese/genomebrowser/"&gt;Gaggle Genome Browser&lt;/a&gt;. About the time I was writing it, everyone and their uncle's dog decided to write a genome browser. New instruments with new data types were coming into the lab. Computers had more memory and CPU cores than ever. It seemed like a good idea at the time.&lt;/p&gt;

&lt;p&gt;The Broad Institute's &lt;a href="http://www.broadinstitute.org/software/igv/"&gt;Integrative Genomics Viewer&lt;/a&gt; (shown below) got a &lt;a href="http://www.nature.com/nbt/journal/v29/n1/full/nbt.1754.html"&gt;write-up&lt;/a&gt; in the January issue of Nature Biotechnology. IGV seems particularly well developed for next-gen sequencing data, nicely displaying coverage plots and alignments of short reads, with attention to the nuances of paired-ends.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://3.bp.blogspot.com/-Ah9NZuGMAP0/Ta93M6s2hYI/AAAAAAAAC4A/fDd_7ROgxFg/s1600/View%2Bof%2Baligned%2Breads%2Bat%2B20-kb%2Bresolution.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 241px;" src="http://3.bp.blogspot.com/-Ah9NZuGMAP0/Ta93M6s2hYI/AAAAAAAAC4A/fDd_7ROgxFg/s400/View%2Bof%2Baligned%2Breads%2Bat%2B20-kb%2Bresolution.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5597823925644330370" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;IGV is a Java desktop app that pulls data down from a server component, the &lt;a href="http://www.broadinstitute.org/software/igv/DataServer"&gt;IGV Data Server&lt;/a&gt;. In my case, I cooked up a two-level hierarchy for caching chunks of data in memory, backed by SQLite. It's probably smart to add a server as a third level. IGV's multi-resolution data mode precomputes aggregations for zoomed-out views, in which the data is denser than the pixels available to display it. IGV splits data into "tiles" stored in a custom indexed binary file format. "Hence a single tile at the lowest resolution, which spans the entire genome, has the same memory footprint as a tile at the very high zoom levels, which might span only a few kilobases." My &lt;a href="/2010/07/gaggle-genome-browser.html"&gt;GGB&lt;/a&gt; aggregates on the fly, which hurts performance in zoomed-out views.&lt;/p&gt;
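
&lt;p&gt;To make the tiling idea concrete, here is a minimal Python sketch of precomputed multi-resolution tiles. It is only an illustration of the general technique, not IGV's file format or API: the names make_tile and build_levels, the mean as the aggregate, and the doubling of tiles per zoom level are all my own assumptions. The point it shows is that every tile holds the same small number of values, no matter how much of the genome it spans.&lt;/p&gt;

&lt;pre class="codebox"&gt;
```python
from statistics import fmean


def make_tile(values, start, end, bins):
    """Summarize values[start:end] into at most `bins` mean values."""
    span = end - start
    if span == 0:
        return []
    bins = min(bins, span)
    # integer bin edges that cover the span as evenly as possible
    edges = [start + span * i // bins for i in range(bins + 1)]
    return [fmean(values[lo:hi]) for lo, hi in zip(edges, edges[1:])]


def build_levels(values, bins=4, max_level=3):
    """Precompute tiles for zoom levels 0..max_level (hypothetical layout).

    Level z is cut into 2**z tiles. Level 0 is one tile spanning
    everything; deeper levels span less but keep the same footprint.
    """
    levels = []
    for z in range(max_level + 1):
        count = 2 ** z
        bounds = [len(values) * i // count for i in range(count + 1)]
        levels.append([make_tile(values, lo, hi, bins)
                       for lo, hi in zip(bounds, bounds[1:])])
    return levels


# toy data standing in for a per-base signal such as read coverage
levels = build_levels(list(range(16)), bins=4, max_level=2)
for z, tiles in enumerate(levels):
    print(z, len(tiles), [len(tile) for tile in tiles])
```
&lt;/pre&gt;

&lt;p&gt;A viewer built this way would pick the level whose tiles roughly match the screen width and fetch only the tiles in view, which is the work GGB currently redoes on every zoomed-out redraw.&lt;/p&gt;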

&lt;p&gt;The IGV Data Server seems to derive a lot of its data from the &lt;a href="http://genome.ucsc.edu"&gt;UCSC Genome Browser&lt;/a&gt;, which maintains nicely curated data mapped to genomic coordinates for a bunch of eukaryotes and also &lt;a href="http://microbes.ucsc.edu"&gt;microbes&lt;/a&gt;. One thing I enjoyed hacking on in GGB was integration with R. I wonder if that would be worthwhile for IGV.&lt;/p&gt;

&lt;p&gt;Which functionality belongs on the client and which on the server is debatable. We considered building a browser-based implementation, &lt;a href="/2010/03/protovis-data-visualization-in-browser.html"&gt;experimenting&lt;/a&gt; a bit with the super-cool &lt;a href="http://vis.stanford.edu/protovis/"&gt;protovis&lt;/a&gt; visualization library. We went with desktop. &lt;a href="http://xmap.picr.man.ac.uk/"&gt;X:map&lt;/a&gt; is a nice counterpoint, an interactive web-based genome browser. In their approach, the Google Maps API serves up pre-rendered image tiles, keeping the big data and heavyweight computing tasks on the server. They also have an R and Java program that lets you plot custom data. &lt;a href="http://jbrowse.org/"&gt;JBrowse&lt;/a&gt;, from CSHL, does the rendering in the browser. Putting a data-intensive and graphically interactive app in the browser is still somewhere near the edge of the envelope, but browsers are improving like crazy, as are programming models for this type of development.&lt;/p&gt;

&lt;p&gt;For what it's worth, I like the format of the IGV paper. It concisely covers motivation, what the software does, a few unique features and a couple of figures showing example applications, all as a high-level overview in just two pages. A supplement contains the technical detail of interest to software developers along with more example applications. I like that better than trying to awkwardly shoehorn biology and software engineering together.&lt;/p&gt;

&lt;p&gt;Anyway... Nice work, IGV team! Let me know if you'd like to join the support group. We're here to help.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-6147973667315560133?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/6147973667315560133/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/genome-browsers-anonymous.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6147973667315560133'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6147973667315560133'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/genome-browsers-anonymous.html' title='Genome Browser&apos;s Anonymous'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Ah9NZuGMAP0/Ta93M6s2hYI/AAAAAAAAC4A/fDd_7ROgxFg/s72-c/View%2Bof%2Baligned%2Breads%2Bat%2B20-kb%2Bresolution.gif' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-8735240206815310477</id><published>2011-04-17T20:18:00.001-07:00</published><updated>2011-09-28T12:01:45.655-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='biology'/><category 
scheme='http://www.blogger.com/atom/ns#' term='crackpot theory'/><category scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='Bioinformatics'/><title type='text'>You can't optimize what you can't predict</title><content type='html'>&lt;p&gt;In a &lt;a href="http://www.harlan.harris.name/2011/04/on-analytics-and-related-fields/"&gt;post&lt;/a&gt; about the relationship between predictive analytics and operations research, Harlan Harris says, "&lt;a href="http://www.harlan.harris.name/2011/04/on-analytics-and-related-fields/"&gt;You can't optimize what you can't predict.&lt;/a&gt;" Predictive analytics is using statistical and machine-learning tools on large data sets to find complex relationships in the data and predict future trends. Operations research is the process of optimizing supply chains and industrial systems.&lt;/p&gt;

&lt;a href="http://1.bp.blogspot.com/-KHvgjKlB_jk/Tautn2c81GI/AAAAAAAAC34/zGYWjSrUUqE/s1600/Repressilator.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 281px;" src="http://1.bp.blogspot.com/-KHvgjKlB_jk/Tautn2c81GI/AAAAAAAAC34/zGYWjSrUUqE/s400/Repressilator.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5596757862081418338" /&gt;&lt;/a&gt;
&lt;p style="font-size:30%"&gt;A synthetic oscillatory network of transcriptional regulators, Elowitz and Leibler, Nature, 2000&lt;/p&gt;

&lt;p&gt;It's interesting because the same relationship exists between systems biology and synthetic biology. (At least &lt;a href="http://arstechnica.com/science/news/2010/08/systems-and-synthetic-biology-neither-models-nor-miracles.ars"&gt;we hope it does&lt;/a&gt;.) That is, understanding, modeling and predicting a system will eventually let you bend it towards your own ends. Same techniques, different domains. Systems biology is essentially predictive analytics on biological data. It hopes to build models and discover principles that will guide &lt;a href="http://www.nature.com/scientificamerican/journal/v294/n6/full/scientificamerican0606-44.html"&gt;synthetic biology&lt;/a&gt;, which re-engineers biological systems toward novel and useful functions - everything from cleaning up toxic waste to &lt;a href="http://www.amyrisbiotech.com/"&gt;producing energy&lt;/a&gt;. And the process of building entirely new biological processes inevitably feeds back into better understanding of natural biological systems.&lt;/p&gt;

&lt;p&gt;It would be a great validation of systems biology methods to do a blind analysis of a synthetic biological circuit. Even better would be to predict the behavior of a synthetic system, then build it and see how well we did. If we do that enough times, we can't help but improve our ability to predict and optimize biological systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nature/journal/v403/n6767/full/403335a0.html"&gt;A synthetic oscillatory network of transcriptional regulators&lt;/a&gt;, Elowitz and Leibler, Nature, 2000&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/scientificamerican/journal/v294/n6/full/scientificamerican0606-44.html"&gt;Engineering Life: Building a FAB for Biology&lt;/a&gt; (&lt;a href="http://arep.med.harvard.edu/pdf/BioFab06.pdf"&gt;pdf&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="http://syntheticbiology.org/"&gt;syntheticbiology.org&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nature/journal/v438/n7067/full/nature04342.html"&gt;Foundations for engineering biology&lt;/a&gt; Drew Endy, Nature 2005&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/msb/journal/v1/n1/full/msb4100007.html"&gt;From systems biology to synthetic biology&lt;/a&gt;, George M Church, Molecular Systems Biology, 2005&lt;/li&gt;
&lt;li&gt;&lt;a href="http://dx.doi.org/10.1016/j.copbio.2006.08.001"&gt;Systems biology as a foundation for genome-scale synthetic biology&lt;/a&gt;, Current Opinion in Biotechnology, 2006&lt;/li&gt;
&lt;li&gt;&lt;a href="http://arstechnica.com/science/news/2010/08/systems-and-synthetic-biology-neither-models-nor-miracles.ars"&gt;Neither models nor miracles: a look at synthetic biology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nature.com/nrmicro/journal/v7/n4/abs/nrmicro2107.html"&gt;The role of predictive modelling in rationally re-engineering biological systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.youtube.com/watch?v=PO1vnspIX4c"&gt;Engineering Gene Networks: Integrating Synthetic Biology &amp;amp; Systems Biology&lt;/a&gt;, a video of James Collins speaking at the NIH&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-8735240206815310477?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/8735240206815310477/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/you-cant-optimize-what-you-cant-predict.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8735240206815310477'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8735240206815310477'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/you-cant-optimize-what-you-cant-predict.html' title='You can&apos;t optimize what you can&apos;t predict'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-KHvgjKlB_jk/Tautn2c81GI/AAAAAAAAC34/zGYWjSrUUqE/s72-c/Repressilator.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-971961891346881533</id><published>2011-04-10T17:21:00.000-07:00</published><updated>2011-05-07T22:07:59.528-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='db'/><title 
type='text'>Thinking about CRUD has damaged your karma</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://media.mongodb.org/logo-mongodb.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 217px; height: 90px;" src="http://media.mongodb.org/logo-mongodb.png" border="0" alt="" /&gt;&lt;/a&gt;


&lt;p&gt;I'm spending some time trying out &lt;a href="http://www.mongodb.org/"&gt;MongoDB&lt;/a&gt;. Mongo is a &lt;a href="http://nosql-database.org/"&gt;NoSQL&lt;/a&gt; database that stores documents in a binary variant of JSON called &lt;a href="http://bsonspec.org/"&gt;BSON&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Mongo is often &lt;a href="http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb"&gt;compared&lt;/a&gt; to CouchDB. But, aside from the fact that they both store JSON documents, their approaches are quite different.&lt;/p&gt;

&lt;p&gt;Mongo stores documents in collections, which are vaguely like tables in a SQL database. Mongo queries are partial JSON documents matched against existing documents in the database. Couch builds views with map-reduce and is, in general, a little more conceptually heavy, while Mongo is more straightforward, with direct analogs to most SQL features.&lt;/p&gt;
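
&lt;p&gt;The "partial document" idea is easy to sketch in a few lines of plain Python. This is a toy illustration of query-by-example, not MongoDB's implementation or driver API: the matches and find functions and the sample records are made up for the purpose, and the sketch assumes flat documents and exact equality only, ignoring query operators, dot notation, arrays and indexes.&lt;/p&gt;

&lt;pre class="codebox"&gt;
```python
def matches(query, doc):
    """True if every field in the partial document `query` appears in
    `doc` with an equal value (top-level fields only, an assumption)."""
    return all(key in doc and doc[key] == value
               for key, value in query.items())


def find(collection, query):
    """Return the documents in `collection` that match `query`."""
    return [doc for doc in collection if matches(query, doc)]


# hypothetical collection, purely for illustration
posts = [
    {"title": "Rails 3 MongoDB recipe", "tag": "Ruby", "year": 2011},
    {"title": "Genome browsers", "tag": "Bioinformatics", "year": 2011},
    {"title": "Browsing genomes", "tag": "Bioinformatics", "year": 2008},
]

for doc in find(posts, {"tag": "Bioinformatics", "year": 2011}):
    print(doc["title"])
```
&lt;/pre&gt;

&lt;p&gt;An empty query matches everything, which is roughly the analog of a SQL select with no where clause.&lt;/p&gt;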

&lt;p&gt;One feature I like is the mongo console. It's a full JavaScript interpreter, which means you can easily script bulk updates and maintenance tasks.&lt;/p&gt;

&lt;p&gt;The MongoDB site has a &lt;a href="http://www.mongodb.org/display/DOCS/Quickstart"&gt;quickstart guide&lt;/a&gt; for popular operating systems and a &lt;a href="http://www.mongodb.org/display/DOCS/Tutorial"&gt;tutorial&lt;/a&gt; that will get you started using the console, as well as specific &lt;a href="http://www.mongodb.org/display/DOCS/Drivers"&gt;guides&lt;/a&gt; for loads of client languages.&lt;/p&gt;

&lt;h4&gt;You have not yet reached enlightenment...&lt;/h4&gt;
&lt;p&gt;From within Ruby, we can communicate with MongoDB with the &lt;a href="http://api.mongodb.org/ruby/current/file.TUTORIAL.html"&gt;mongo, bson and bson_ext gems&lt;/a&gt;. One fun way to learn about using MongoDB from Ruby is through the nicely kooky &lt;a href="https://github.com/chicagoruby/MongoDB_Koans"&gt;MongoDB_Koans&lt;/a&gt;, a series of unit tests with small omissions or bugs. You fix the bugs and make the tests pass, while the test harness gently urges "Please meditate on the following code...".&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-971961891346881533?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/971961891346881533/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/thinking-about-crud-has-damaged-your.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/971961891346881533'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/971961891346881533'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/thinking-about-crud-has-damaged-your.html' title='Thinking about CRUD has damaged your karma'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1813674738047258067</id><published>2011-04-07T15:16:00.000-07:00</published><updated>2011-12-10T12:02:11.129-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='bug'/><title type='text'>Installing the Ruby mysql gem</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.mysql.com/common/logos/logo_mysql_sun_a.gif"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 114px; height: 68px;" src="http://www.mysql.com/common/logos/logo_mysql_sun_a.gif" border="0" alt="" /&gt;&lt;/a&gt;

&lt;p&gt;I've had issues installing the Ruby mysql gem a couple of times, so I thought I'd document what finally worked here. I had Rails 2.3.8 running on the pre-installed Ruby 1.8.7 that comes with OS X 10.6. I needed to install Rails 3.x for another project. Having multiple versions of Rails is supposed to work OK, so I just did:&lt;/p&gt;

&lt;pre class="codebox"&gt;
sudo gem install rails
&lt;/pre&gt;

&lt;p&gt;Problems began here, but I flailed rather than keeping careful notes. At some point, I ended up thinking a fresh install of MySQL might help, so I installed version 5.5.9 from the DMG on &lt;a href="http://dev.mysql.com/downloads/mysql/"&gt;dev.mysql.com&lt;/a&gt;. Maybe I shoulda used MacPorts, 'cause that made things worse. After that, neither version of Rails could connect to MySQL and my cubicle neighbors had new respect for my colorful vocabulary. Rails would fail, croaking up this cryptic message:&lt;/p&gt;

&lt;pre class="codebox"&gt;
uninitialized constant MysqlCompat::MysqlRes
&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://stackoverflow.com/questions/1332207/uninitialized-constant-mysqlcompatmysqlres-using-mms2r-gem"&gt;This thread&lt;/a&gt; helped. You need to be careful that MySQL, Ruby and MySQL/Ruby are compiled to a common architecture. You can do this by specifying ARCHFLAGS in the environment.&lt;/p&gt;

&lt;pre class="codebox"&gt;
sudo env ARCHFLAGS="-arch x86_64" gem install --no-rdoc --no-ri mysql -- --with-mysql-config=/usr/local/mysql/bin/mysql_config
&lt;/pre&gt;

&lt;p&gt;Compiling the gem properly is one step. The gem depends at run time on the mysql client library, which it needs to be able to find. Specify that, like so:&lt;/p&gt;

&lt;pre class="codebox"&gt;
export DYLD_LIBRARY_PATH="/usr/local/mysql/lib:$DYLD_LIBRARY_PATH"
&lt;/pre&gt;

&lt;p&gt;With the combination of these two clues, both versions of Rails seem to work happily.&lt;/p&gt;

&lt;h4&gt;Diagnosing problems with Ruby and Gems&lt;/h4&gt;

&lt;p&gt;A couple of key commands for debugging &lt;a href="http://rubygems.org/"&gt;RubyGems&lt;/a&gt; are &lt;i&gt;gem list&lt;/i&gt; and &lt;i&gt;gem env&lt;/i&gt;. Also, &lt;i&gt;gem uninstall&lt;/i&gt; for removing the wreckage of failed attempts. Some recommend RVM, which I may try out some time. Also, I use the deprecated practice of installing gems with sudo. I guess I should learn to install them in my user directory.&lt;/p&gt;

&lt;p&gt;The Ruby that comes with OS X 10.6, at least on my machine, looks like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
$ /usr/bin/ruby --version
ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]

$ file /usr/bin/ruby
/usr/bin/ruby: Mach-O universal binary with 3 architectures
/usr/bin/ruby (for architecture x86_64): Mach-O 64-bit executable x86_64
/usr/bin/ruby (for architecture i386): Mach-O executable i386
/usr/bin/ruby (for architecture ppc7400): Mach-O executable ppc
&lt;/pre&gt;

&lt;p&gt;Installing a fresh Ruby from MacPorts probably wasn't necessary, but that didn't stop me:&lt;/p&gt;

&lt;pre class="codebox"&gt;
$ which ruby
/opt/local/bin/ruby

$ /opt/local/bin/ruby --version
ruby 1.8.7 (2011-02-18 patchlevel 334) [i686-darwin10]

$ file /opt/local/bin/ruby 
/opt/local/bin/ruby: Mach-O 64-bit executable x86_64
&lt;/pre&gt;
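
&lt;p&gt;Since the whole problem comes down to finding an architecture that Ruby and the MySQL client library share, the comparison can be scripted. The Python sketch below parses the text that &lt;i&gt;file&lt;/i&gt; prints and intersects the results. The function name architectures is made up, and the parsing assumes output shaped like the two listings above; other versions of &lt;i&gt;file&lt;/i&gt; may word things differently.&lt;/p&gt;

&lt;pre class="codebox"&gt;
```python
import re


def architectures(file_output):
    """Return the set of architecture names found in `file` output.

    Assumes the formats shown in this post: universal binaries list
    "(for architecture NAME)", thin binaries end the line with NAME.
    """
    fat = re.findall(r"\(for architecture (\w+)\)", file_output)
    if fat:
        return set(fat)
    return {line.split()[-1]
            for line in file_output.splitlines() if "Mach-O" in line}


# output copied from the two listings above
SYSTEM_RUBY = """\
/usr/bin/ruby: Mach-O universal binary with 3 architectures
/usr/bin/ruby (for architecture x86_64): Mach-O 64-bit executable x86_64
/usr/bin/ruby (for architecture i386): Mach-O executable i386
/usr/bin/ruby (for architecture ppc7400): Mach-O executable ppc
"""
MACPORTS_RUBY = "/opt/local/bin/ruby: Mach-O 64-bit executable x86_64\n"

shared = architectures(SYSTEM_RUBY).intersection(architectures(MACPORTS_RUBY))
print(sorted(shared))
```
&lt;/pre&gt;

&lt;p&gt;Running the same check against the MySQL client library would show whether the value passed in ARCHFLAGS is one that every piece actually supports.&lt;/p&gt;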

&lt;p&gt;I hope this helps someone. This is a pain, but the same thing in Python is worse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://geryit.com/blog/2011/01/installing-mysql-with-rails-on-mac-os-x-snow-leopard/"&gt;Ruby on Rails With Apache/MySQL on Mac OS X&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1813674738047258067?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1813674738047258067/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/installing-ruby-mysql-gem.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1813674738047258067'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1813674738047258067'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/installing-ruby-mysql-gem.html' title='Installing the Ruby mysql gem'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2746347878489598641</id><published>2011-04-04T19:46:00.000-07:00</published><updated>2011-04-04T20:40:13.234-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='crackpot theory'/><title type='text'>Art house video games</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-K_2IvPpU354/TZqOZDtODGI/AAAAAAAAC3w/kaun8krm6-o/s1600/myst.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 
10px;cursor:pointer; cursor:hand;width: 200px; height: 167px;" src="http://3.bp.blogspot.com/-K_2IvPpU354/TZqOZDtODGI/AAAAAAAAC3w/kaun8krm6-o/s200/myst.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5591938448476736610" /&gt;&lt;/a&gt;

&lt;p&gt;Where are the art house video games? I remember reading that the novel was once considered a time waster for idlers well beneath the level of serious art. &lt;a href="http://books.google.com/books?id=0SYVAAAAYAAJ"&gt;What is art&lt;/a&gt;, anyway? TV spent decades in &lt;a href="http://www.theshallowsbook.com/nicholascarr/Nicholas_Carrs_The_Shallows.html"&gt;the shallows&lt;/a&gt; before growing artistic pretensions. These days, you can take a &lt;a href="http://www.slate.com/id/2245788/"&gt;university class about The Wire&lt;/a&gt;. Maybe soon we'll be able to take a class in Halo studies, or the semiotics of Grand Theft Auto?&lt;/p&gt;

&lt;p&gt;Silly, maybe, but games show a lot more potential than, say, Twitter. Unless someone starts tweeting profound insights in haiku. Have you ever seen a Facebook page you'd describe as raw, edgy, or deep?&lt;/p&gt;

&lt;p&gt;Games are, at least, amenable to a &lt;i&gt;Lord of the Rings&lt;/i&gt;-style quest where the main point is to explore a rich fantasy world. What games are lacking, so far as I know, is the ability to be transformative. How does the writer develop characters when the protagonist, or protagonists, are real people with a will of their own? To induce a change - growth, learning - in a character outside of the writer's control... that would be the real trick.&lt;/p&gt;

&lt;a href="http://4.bp.blogspot.com/-vlQ6MWflmW4/TZqNsVhWmlI/AAAAAAAAC3o/Y1-jahEzbCI/s1600/gta.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 112px;" src="http://4.bp.blogspot.com/-vlQ6MWflmW4/TZqNsVhWmlI/AAAAAAAAC3o/Y1-jahEzbCI/s200/gta.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5591937680164690514" /&gt;&lt;/a&gt;

&lt;p&gt;But the potential of games is there as well. The visuals and audio are already well developed. Interactivity with the game world and the shared experience of multiplayer games is where the untapped potential lies. The medium may be the message, but to succeed on an artistic level games need a message. They need more to say than, "Let's blow shit up!"&lt;/p&gt;

&lt;p&gt;Know any games that rise to the level of real art? Put your nominations in comments...&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2746347878489598641?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2746347878489598641/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/art-house-video-games.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2746347878489598641'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2746347878489598641'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/04/art-house-video-games.html' title='Art house video games'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-K_2IvPpU354/TZqOZDtODGI/AAAAAAAAC3w/kaun8krm6-o/s72-c/myst.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-7938198303694654906</id><published>2011-03-21T21:53:00.000-07:00</published><updated>2011-03-24T20:55:54.181-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category 
scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for Introductory Statistics 6, Simulations</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 54px; height: 80px;" src="http://3.bp.blogspot.com/-ki_slMYNMkk/TYguOdEL7_I/AAAAAAAAC3c/qHy28OqoZvI/s200/UsingR.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5586766163608334322" /&gt;&lt;/a&gt;

&lt;p&gt;R can easily generate random samples from a whole library of probability distributions. We might want to do this to gain insight into the distribution's shape and properties. A tricky aspect of statistics is that results like the central limit theorem come with caveats, such as "...for sufficiently large &lt;i&gt;n&lt;/i&gt;...". Getting a feel for how large is sufficiently large, or better yet, testing it, is the purpose of the simulations in chapter 6 of John Verzani's &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The central limit theorem says that the mean of a sample drawn from any parent population with finite variance will tend towards being normally distributed as n, the size of the sample, grows larger. The sample size needed for the normal distribution to reasonably approximate the distribution of the sample mean varies from one type of parent distribution to another.&lt;/p&gt;

&lt;p&gt;A general form of a test that allows us to eyeball the accuracy of the approximation would look something like the code below, which generates 300 samples from any of R's random number generation functions (rnorm, runif, rbinom, rexp, rt, rgeom, etc.) and takes the mean of each, scaling the sample means to a mean of zero and standard deviation of one. In a two-panel plot, the function first plots a histogram and a density curve along with the standard normal curve for comparison, and then does a quantile-quantile plot comparing the sample means to a normal distribution.&lt;/p&gt;

&lt;pre class="codebox"&gt;
plot_sample_means &amp;lt;- function(f_sample, n, m=300,title=&amp;quot;Histogram&amp;quot;, ...) {

  # preallocate a vector to hold our m sample means
  means &amp;lt;- double(m)

  # generate m samples of size n and store their means
  for(i in seq_len(m)) means[i] &amp;lt;- mean(f_sample(n,...))

  # scale sample means to plot against standard normal
  scaled_means &amp;lt;- scale(means)

  # set up a two panel plot
  par(mfrow=c(1,2))
  par(mar=c(5,2,5,1)+0.1)

  # plot histogram and density of scaled means
  hist(scaled_means, prob=T, col=&amp;quot;light grey&amp;quot;, border=&amp;quot;grey&amp;quot;, main=NULL, ylim=c(0,0.4))
  lines(density(scaled_means))

  # overlay the standard normal curve in blue for comparison
  curve(dnorm(x,0,1), -3, 3, col=&amp;#x27;blue&amp;#x27;, add=T)

  # adjust margins and draw the quantile-quantile plot
  par(mar=c(5,1,5,2)+0.1)
  qqnorm(means, main=&amp;quot;&amp;quot;)

  # return margins to normal and go back to one panel
  par(mar=c(5,4,4,2)+0.1)
  par(mfrow=c(1,1))

  # add a title
  par(omi=c(0,0,0.75,0))
  title(paste(title, &amp;quot;, n=&amp;quot;, n, sep=&amp;quot;&amp;quot;), outer=T)
  par(omi=c(0,0,0,0))

  # return unscaled means (without printing)
  return(invisible(means))
}
&lt;/pre&gt;

&lt;p&gt;This function shows off some of the goodness that the R programming language adopts from the functional school of programming languages. First, note that it's a function that takes another function as an argument. Like most modern languages, R treats &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Writing-your-own-functions"&gt;functions&lt;/a&gt; as first-class objects. Also note the use of the ellipsis to pass arbitrary arguments onward to the f_sample function. All of the &lt;i&gt;r_&lt;/i&gt; functions take &lt;i&gt;n&lt;/i&gt; as a parameter. But, aside from that, the parameters differ. The ability to handle situations like this makes R's ellipsis a bit more powerful than the similar-looking &lt;i&gt;varargs&lt;/i&gt; functions in C-style languages.&lt;/p&gt;
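
&lt;p&gt;For comparison, the same pattern (a sampler passed in as an argument, with its extra parameters forwarded untouched) can be sketched in Python, where keyword-argument forwarding plays roughly the role of R's ellipsis. This is a rough sketch rather than a port of the function above: sample_means and skewness are names I made up, Python's standard random functions return one value per call so the sampler is called n times, and instead of plotting it prints the skewness of the sample means, which should be near zero once the normal approximation is good.&lt;/p&gt;

&lt;pre class="codebox"&gt;
```python
import random
import statistics


def sample_means(draw, n, m=300, **params):
    """Return m sample means, each taken over n calls to draw(**params).

    `draw` is any function returning one random value per call;
    extra keyword arguments are forwarded to it unchanged.
    """
    return [statistics.fmean(draw(**params) for _ in range(n))
            for _ in range(m)]


def skewness(values):
    """Mean cubed z-score: about 0 for a symmetric distribution."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)


random.seed(42)  # arbitrary seed so that reruns give the same numbers
for n in (6, 12, 48):
    means = sample_means(random.expovariate, n, m=5000, lambd=1.0)
    print(n, round(skewness(means), 2))
```
&lt;/pre&gt;

&lt;p&gt;For an exponential parent, theory says the skewness of the mean of &lt;i&gt;n&lt;/i&gt; draws is 2 divided by the square root of &lt;i&gt;n&lt;/i&gt;, so the printed values should shrink slowly as &lt;i&gt;n&lt;/i&gt; grows. Swapping in random.uniform with a=0.0, b=1.0 gives values close to zero even for small &lt;i&gt;n&lt;/i&gt;.&lt;/p&gt;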

&lt;p&gt;Note what happens for &lt;i&gt;n&lt;/i&gt;=1. In this case, each sample mean is just a single draw, so the histogram shows the parent distribution itself. As &lt;i&gt;n&lt;/i&gt; increases, we can watch the distribution of the sample mean morph from the parent's shape into the normal. For instance, here's a series of plots of sample means drawn from the uniform distribution. We start at &lt;i&gt;n&lt;/i&gt;=1, which looks flat as expected. At &lt;i&gt;n&lt;/i&gt;=2 we already get a pretty good fit to the normal curve, except at the tails. The &lt;i&gt;n&lt;/i&gt;=10 case fits the normal closely.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; plot_sample_means(runif, n=1, title="Sample means from uniform distribution")
&lt;/pre&gt;

&lt;a href="http://4.bp.blogspot.com/--vvZM4lrS-g/TYgtCcZEqdI/AAAAAAAAC2s/q7dH7qabyok/s1600/sample_means_unif_n_1.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 235px;" src="http://4.bp.blogspot.com/--vvZM4lrS-g/TYgtCcZEqdI/AAAAAAAAC2s/q7dH7qabyok/s400/sample_means_unif_n_1.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5586764857757444562" /&gt;&lt;/a&gt;

&lt;pre class="codebox"&gt;
&amp;gt; plot_sample_means(runif, n=2, title="Sample means from uniform distribution")
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-sx0AtvpF-s8/TYgtCXvrwrI/AAAAAAAAC20/eIrPtZeyBlk/s1600/sample_means_unif_n_2.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 235px;" src="http://3.bp.blogspot.com/-sx0AtvpF-s8/TYgtCXvrwrI/AAAAAAAAC20/eIrPtZeyBlk/s400/sample_means_unif_n_2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5586764856510104242" /&gt;&lt;/a&gt;

&lt;pre class="codebox"&gt;
&amp;gt; plot_sample_means(runif, n=10, title="Sample means from uniform distribution")
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-QloSld7JM80/TYgtC6ylqwI/AAAAAAAAC28/RdaUuS-UPGc/s1600/sample_means_unif_n_10.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 235px;" src="http://3.bp.blogspot.com/-QloSld7JM80/TYgtC6ylqwI/AAAAAAAAC28/RdaUuS-UPGc/s400/sample_means_unif_n_10.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5586764865917528834" /&gt;&lt;/a&gt;

&lt;p&gt;Trying the same trick with other distributions yields different results. The exponential distribution takes a while to lose its skew. Here are plots for &lt;i&gt;n&lt;/i&gt;=6, 12, and 48.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; plot_sample_means(rexp, n=6, title="Sample means from the exponential distribution", rate=1)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-W9488hU4HPU/TYgth6WTB5I/AAAAAAAAC3E/qiS-VlBdDb8/s1600/sample_means_exp_n_6.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 235px;" src="http://1.bp.blogspot.com/-W9488hU4HPU/TYgth6WTB5I/AAAAAAAAC3E/qiS-VlBdDb8/s400/sample_means_exp_n_6.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5586765398374811538" /&gt;&lt;/a&gt;

&lt;pre class="codebox"&gt;
&amp;gt; plot_sample_means(rexp, n=12, title="Sample means from the exponential distribution", rate=1)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-TD67dwLDOOE/TYgtiXeQlgI/AAAAAAAAC3M/wRRYA1zAcmY/s1600/sample_means_exp_n_12.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 235px;" src="http://4.bp.blogspot.com/-TD67dwLDOOE/TYgtiXeQlgI/AAAAAAAAC3M/wRRYA1zAcmY/s400/sample_means_exp_n_12.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5586765406192834050" /&gt;&lt;/a&gt;

&lt;pre class="codebox"&gt;
&amp;gt; plot_sample_means(rexp, n=48, title="Sample means from the exponential distribution", rate=1)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-Vd68L6ukcs4/TYgtiWwCKMI/AAAAAAAAC3U/8CMYFSuSjzQ/s1600/sample_means_exp_n_48.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 235px;" src="http://2.bp.blogspot.com/-Vd68L6ukcs4/TYgtiWwCKMI/AAAAAAAAC3U/8CMYFSuSjzQ/s400/sample_means_exp_n_48.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5586765405998950594" /&gt;&lt;/a&gt;
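
&lt;p&gt;As a rough check on how slowly the skew fades, we can lean on a standard result that isn't from the book: the mean of &lt;i&gt;n&lt;/i&gt; independent exponential draws follows a gamma distribution with shape &lt;i&gt;n&lt;/i&gt;, whose skewness is 2/√&lt;i&gt;n&lt;/i&gt;. A quick sketch for the three sample sizes plotted above:&lt;/p&gt;

&lt;pre class="codebox"&gt;
n = c(6, 12, 48)
# skewness of the sampling distribution of the mean; 0 would be symmetric
skew = 2 / sqrt(n)
round(skew, 3)
&lt;/pre&gt;

&lt;p&gt;Since the skewness shrinks with the square root of &lt;i&gt;n&lt;/i&gt;, halving it takes four times as many draws per sample.&lt;/p&gt;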

&lt;p&gt;I'm guessing that Verzani's purpose with the simulations in this chapter is for the reader to get some intuitive sense of the properties of these distributions and their relationships, as an alternative to delving too deeply into theory. This lets us stick to applying statistical tools, while giving us some handle on how things might go wrong.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-7938198303694654906?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/7938198303694654906/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/using-r-for-introductory-statistics-6.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/7938198303694654906'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/7938198303694654906'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/using-r-for-introductory-statistics-6.html' title='Using R for Introductory Statistics 6, Simulations'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-ki_slMYNMkk/TYguOdEL7_I/AAAAAAAAC3c/qHy28OqoZvI/s72-c/UsingR.jpeg' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1815722033370694297</id><published>2011-03-18T17:15:00.000-07:00</published><updated>2011-03-19T22:15:49.763-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='db'/><title type='text'>Indexes are good</title><content type='html'>&lt;p&gt;I have a MySQL database, on top of which we're trying to do a little data mining. I'm going to transpose domains, to protect the guilty. So, here's the (transposed) schema:&lt;/p&gt;

&lt;pre class="codebox"&gt;
mysql&gt; describe shoppingbaskets;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
  ...etc...                    
| count_items       | int(11) |
+-------------------+---------+

mysql&gt; describe shoppingbasket_items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| item_id           | int(11) |
| shoppingbasket_id | int(11) |
+-------------------+---------+

mysql&gt; describe items;
+-------------------+---------+
| Field             | Type    |
+-------------------+---------+
| id                | int(11) |
  ...etc...
+-------------------+---------+
&lt;/pre&gt;

&lt;p&gt;There are 155,588 shopping baskets and 3,153,517 items. Baskets typically hold about 20 items.&lt;/p&gt;

&lt;p&gt;It was kinda pokey to count the number of items in a basket on the fly, so I added the count_items column to shoppingbaskets to cache that information. To populate that column, I cooked up a little query like so:&lt;/p&gt;

&lt;pre class="codebox"&gt;mysql&gt; update shoppingbaskets b set count_items = (select count(*) from shoppingbasket_items bi where bi.shoppingbasket_id=b.id);&lt;/pre&gt;

&lt;p&gt;So, I popped that in and waited... Looking at the query, it seems to me that it would be linear in the number of shopping baskets. But, with no index on shoppingbasket_id, I guess it's scanning all 3 million rows of shoppingbasket_items for each basket.&lt;/p&gt;

&lt;p&gt;Hmmm, how long was this going to take? I tried a few small test cases and came up with this bit of &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; code:&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; qt = read.table(&amp;#x27;query_timing.txt&amp;#x27;, sep=&amp;quot;\t&amp;quot;, quote=&amp;quot;&amp;quot;, header=T)
&amp;gt; qt
    n seconds
1   1    0.57
2   2    6.71
3  10   55.20
4  11   61.76
5  20  116.72
6  50  297.83
7  80  481.17
8 100  606.04
9 118  709.70
&amp;gt; plot(seconds ~ n, data=qt, main=&amp;quot;Time to count items in n baskets&amp;quot;)
&amp;gt; model &amp;lt;- lm(seconds ~ n, data=qt)
&amp;gt; model

Call:
lm(formula = seconds ~ n, data = qt)

Coefficients:
(Intercept)            n  
     -5.331        6.081  

&amp;gt; abline(model, col=&amp;#x27;blue&amp;#x27;, lty=2)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-qS9jRyClQ20/TYP268KQrKI/AAAAAAAAC2k/Rse6kNsMU1Y/s1600/time_to_count_items_in_n_baskets.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;" src="http://1.bp.blogspot.com/-qS9jRyClQ20/TYP268KQrKI/AAAAAAAAC2k/Rse6kNsMU1Y/s400/time_to_count_items_in_n_baskets.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5585579455311555746" /&gt;&lt;/a&gt;

&lt;p&gt;Nicely linear, OK. But at 6 seconds per basket and 155,588 baskets, we're looking at about 11 days. &lt;em&gt;mysqladmin shutdown&lt;/em&gt;! OK, now let's start up a brain cell or two.&lt;/p&gt;
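
&lt;p&gt;Here's that back-of-the-envelope estimate as a sketch, refitting the line to the timings shown above and extrapolating to the full table. It assumes the linear trend holds all the way out to 155,588 baskets, which a handful of small test cases can't really guarantee.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# timings copied from the query_timing.txt listing above
qt = data.frame(n=c(1, 2, 10, 11, 20, 50, 80, 100, 118),
                seconds=c(0.57, 6.71, 55.20, 61.76, 116.72, 297.83, 481.17, 606.04, 709.70))
model = lm(seconds ~ n, data=qt)

# extrapolate to every basket, then convert seconds to days
total_seconds = predict(model, newdata=data.frame(n=155588))
days = unname(total_seconds) / (60 * 60 * 24)
days
&lt;/pre&gt;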

&lt;h4&gt;Duh... index&lt;/h4&gt;

&lt;pre class="codebox"&gt;
create index idx_by_shoppingbasket_id on shoppingbasket_items (shoppingbasket_id);
&lt;/pre&gt;

&lt;p&gt;Now filling the entire count_items column for all 155,588 rows takes 11.06 seconds. But, now that we can quickly count items on the fly, why bother caching the counts? Column dropped. Problem solved.&lt;/p&gt;

&lt;p&gt;Conclusion: Indexes are good. Note to self: Don't forget to think now and then, you knucklehead. &lt;a href="http://www2.sqlonrails.org/"&gt;This web page describes exactly my situation&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1815722033370694297?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1815722033370694297/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/indexes-are-good.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1815722033370694297'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1815722033370694297'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/indexes-are-good.html' title='Indexes are good'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-qS9jRyClQ20/TYP268KQrKI/AAAAAAAAC2k/Rse6kNsMU1Y/s72-c/time_to_count_items_in_n_baskets.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-9024183693401798665</id><published>2011-03-13T13:00:00.000-07:00</published><updated>2011-03-13T14:59:39.390-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for Introductory Statistics, The Geometric distribution</title><content type='html'>&lt;p&gt;We've already seen two discrete probability distributions, the &lt;b&gt;binomial&lt;/b&gt; and the &lt;b&gt;hypergeometric&lt;/b&gt;. The &lt;a href="http://digitheadslabnotebook.blogspot.com/2011/02/using-r-for-introductory-statistics.html"&gt;binomial distribution&lt;/a&gt; describes the number of successes in a series of independent trials &lt;i&gt;with replacement&lt;/i&gt;. The &lt;a href="http://digitheadslabnotebook.blogspot.com/2011/02/using-r-for-introductory-statistics_21.html"&gt;hypergeometric distribution&lt;/a&gt; describes the number of successes in a series of draws &lt;i&gt;without replacement&lt;/i&gt;, so the trials are no longer independent. Chapter 6 of Using R introduces the &lt;b&gt;geometric distribution&lt;/b&gt; - the time to first success in a series of independent trials.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-pG0S1DAuBYk/TX0wyaKzvFI/AAAAAAAAC2U/uh3da0Dyp5I/s1600/geom_dist.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;" src="http://4.bp.blogspot.com/-pG0S1DAuBYk/TX0wyaKzvFI/AAAAAAAAC2U/uh3da0Dyp5I/s400/geom_dist.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5583672755585530962" /&gt;&lt;/a&gt;

&lt;p&gt;Specifically, the probability the first success occurs after &lt;i&gt;k&lt;/i&gt; failures is:&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-OcmavX_7MX4/TX0qbJNKG6I/AAAAAAAAC18/MF5ejSxN31M/s1600/p_geometric.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 173px; height: 21px;" src="http://2.bp.blogspot.com/-OcmavX_7MX4/TX0qbJNKG6I/AAAAAAAAC18/MF5ejSxN31M/s200/p_geometric.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5583665758825225122" /&gt;&lt;/a&gt;

&lt;p&gt;Note that this formulation is consistent with R's &lt;i&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Geometric.html"&gt;[r|d|p|q]geom&lt;/a&gt;&lt;/i&gt; functions, while the book defines the distribution slightly differently as the probability that the first success occurs on the &lt;i&gt;k&lt;/i&gt;th trial, changing the formula to:&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-wzhivY7qakg/TX0qbeFpjhI/AAAAAAAAC2E/FDrDoADYbzs/s1600/p_geom_2.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 190px; height: 21px;" src="http://2.bp.blogspot.com/-wzhivY7qakg/TX0qbeFpjhI/AAAAAAAAC2E/FDrDoADYbzs/s200/p_geom_2.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5583665764430876178" /&gt;&lt;/a&gt;

&lt;p&gt;We'll use the first formula, so k ∈ {0, 1, 2, ...}, where 0 means no failures - success on the first try. The intuition is that the probability of failure on a single trial is (&lt;i&gt;1-p&lt;/i&gt;), so the probability of &lt;i&gt;k&lt;/i&gt; failures in a row is (&lt;i&gt;1-p&lt;/i&gt;) to the &lt;i&gt;k&lt;/i&gt;th power. Multiplying by &lt;i&gt;p&lt;/i&gt;, the probability that the next trial succeeds, gives the formula.&lt;/p&gt;
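
&lt;p&gt;A quick sanity check of the first formula against R's dgeom, as a sketch with &lt;i&gt;p&lt;/i&gt;=1/2 and the first few values of &lt;i&gt;k&lt;/i&gt; picked arbitrarily:&lt;/p&gt;

&lt;pre class="codebox"&gt;
p = 1/2
k = 0:5
# k failures in a row, then one success
by_hand = (1 - p)^k * p
# compare with R's built-in density for the geometric distribution
all.equal(by_hand, dgeom(k, prob=p))
&lt;/pre&gt;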

&lt;p&gt;Let's generate 100 random samplings where the probability of success on any given trial is 1/2, like we were repeatedly flipping a coin and recording how many heads we got before we got a tail.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; sample &amp;lt;- rgeom(100, 1/2)
&amp;gt; summary(sample)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     0.0     0.9     1.0     5.0 
&amp;gt; sd(sample)
[1] 1.184922
&amp;gt; hist(sample, breaks=seq(-0.5,6.5, 1), col=&amp;#x27;light grey&amp;#x27;, border=&amp;#x27;grey&amp;#x27;, xlab=&amp;quot;&amp;quot;)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-om_4OYPftfM/TX0wyB9aBMI/AAAAAAAAC2M/mu4dyXZNo5Y/s1600/hist_geom.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;" src="http://2.bp.blogspot.com/-om_4OYPftfM/TX0wyB9aBMI/AAAAAAAAC2M/mu4dyXZNo5Y/s400/hist_geom.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5583672749086868674" /&gt;&lt;/a&gt;

&lt;p&gt;As expected, we get success on the first try about half the time, and the frequency drops in half for every increment of k after that.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-MwPOHS-QU4o/TX05VIztCrI/AAAAAAAAC2c/NIEdF1VQEnY/s1600/expected_and_var.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/-MwPOHS-QU4o/TX05VIztCrI/AAAAAAAAC2c/NIEdF1VQEnY/s200/expected_and_var.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5583682148313664178" /&gt;&lt;/a&gt;

&lt;p&gt;The median is 0, because about 1/2 the samples are 0. The mean is, of course, higher because of the one-sidedness of the distribution. The mean of our sample is 0.9, which is not too far from the expected value of 1. Likewise, the standard deviation is not far from the theoretical value of √2 or 1.414214.&lt;/p&gt;
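
&lt;p&gt;Those theoretical values can be reproduced from the standard formulas for the failures-before-first-success form of the geometric distribution: mean (1-&lt;i&gt;p&lt;/i&gt;)/&lt;i&gt;p&lt;/i&gt; and variance (1-&lt;i&gt;p&lt;/i&gt;)/&lt;i&gt;p&lt;/i&gt;². A sketch of the arithmetic, assuming those formulas:&lt;/p&gt;

&lt;pre class="codebox"&gt;
p = 1/2
expected = (1 - p) / p          # mean number of failures before the first success
std_dev = sqrt((1 - p) / p^2)   # square root of the variance
c(expected, std_dev)
&lt;/pre&gt;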

&lt;p&gt;This is part of an ultra-slow-motion reading of John Verzani's &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt;. Notes on previous chapters can be found here:&lt;/p&gt;

&lt;p&gt;Chapters 1 and 2&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/04/using-r-for-introductory-statistics.html"&gt;Univariate data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Chapter 3&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/05/using-r-for-introductory-statistics-31.html"&gt;Categorical data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/06/using-r-for-introductory-statistics-32.html"&gt;Comparing independent samples&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics-33.html"&gt;Relationships in numeric data, correlation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics.html"&gt;Simple linear regression&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Chapter 4&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/12/using-r-for-introductory-statistics.html"&gt;Multivariate data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/01/using-r-for-introductory-statistics.html"&gt;Model formulae&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Chapter 5&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2011/01/using-r-for-introductory-statistics_23.html"&gt;Basic probability&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/02/using-r-for-introductory-statistics.html"&gt;Probability distributions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/02/using-r-for-introductory-statistics_21.html"&gt;Hypergeometric distribution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-9024183693401798665?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/9024183693401798665/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/using-r-for-introductory-statistics.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/9024183693401798665'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/9024183693401798665'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/using-r-for-introductory-statistics.html' title='Using R for Introductory Statistics, The Geometric distribution'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-pG0S1DAuBYk/TX0wyaKzvFI/AAAAAAAAC2U/uh3da0Dyp5I/s72-c/geom_dist.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-6166941779290360995</id><published>2011-03-01T21:11:00.000-08:00</published><updated>2011-11-09T19:26:35.373-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='links'/><title type='text'>Learning data science skills</title><content 
type='html'>&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/R8pTOY7zSZI/AAAAAAAAAxU/VuGWdTuknPE/s1600-h/Sclr+1533.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_dbECP0yvozc/R8pTOY7zSZI/AAAAAAAAAxU/VuGWdTuknPE/s200/Sclr+1533.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5173038628664986002" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to &lt;a href="http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286"&gt;Hal Varian&lt;/a&gt; and just about everyone these days, the hot skills to have are some combination of programming, &lt;a href="http://www.nytimes.com/2009/08/06/technology/06stats.html"&gt;statistics&lt;/a&gt;, machine learning, and visualization. Here's a pile of resources that'll help you get some mad data science skills.&lt;/p&gt;

&lt;h4&gt;Programming&lt;/h4&gt;
&lt;p&gt;There seem to be a few main platforms widely used for data-intensive programming. &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; is a statistical environment that is to statisticians what MATLAB is to engineers. It's a weird beast, but it's open source and very powerful, plus it has a great community. Python also makes a strong showing, with the help of &lt;a href="http://numpy.scipy.org/"&gt;NumPy&lt;/a&gt;, &lt;a href="http://scipy.org/"&gt;SciPy&lt;/a&gt; and &lt;a href="http://matplotlib.sourceforge.net/"&gt;matplotlib&lt;/a&gt;. An intriguing new entry is the combination of the Lisp dialect &lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt; and &lt;a href="http://incanter.org/"&gt;Incanter&lt;/a&gt;. All these tools mix numerical libraries with functional and scripting programming styles in varying proportions. You'll also want to look into Hadoop, to do your big data analytics map-reduce style in the cloud.&lt;/p&gt;

&lt;h4&gt;Statistics&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;John Verzani's &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt;, which I'm &lt;a href="/2011/02/using-r-for-introductory-statistics_21.html"&gt;working my way through&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Machine Learning&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Toby Segaran's &lt;a href="http://oreilly.com/catalog/9780596529321"&gt;Programming Collective Intelligence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/"&gt;The Elements of Statistical Learning: Data Mining, Inference, and Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bishop's &lt;a href="http://research.microsoft.com/en-us/um/people/cmbishop/prml/"&gt;Pattern Recognition and Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cs.cmu.edu/~tom/mlbook.html"&gt;Machine Learning, Tom Mitchell&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;table&gt;&lt;tr&gt;
&lt;td&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://oreilly.com/catalog/9780596529321"&gt;&lt;img style="margin:10px 0px; border: none; width: 153px; height: 200px;" src="http://2.bp.blogspot.com/-9VxFXqjwDxQ/TW3Xpg3kbsI/AAAAAAAAC1k/N5g-qzFKPI8/s200/Book%2Bcover%2Bof%2BProgramming%2BCollective%2BIntelligence.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5579352621579529922" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/"&gt;&lt;img style="margin:10px 0px; border: none; width: 132px; height: 200px;" src="http://3.bp.blogspot.com/-a6mCMhmb-4I/TW3Xpw8ENkI/AAAAAAAAC1s/sNt3_Bw9Lg4/s200/CoverII_small.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5579352625893357122" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://research.microsoft.com/en-us/um/people/cmbishop/prml/"&gt;&lt;img style="margin:10px 0px; border: none; width: 148px; height: 200px;" src="http://3.bp.blogspot.com/-4Bmsknd9I2w/TW3Xp0cAA7I/AAAAAAAAC10/2Zd93md_W1o/s200/Pattern%2Brecognition%2Band%2Bmachine%2Blearning%2B%255BBook%255D.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5579352626832606130" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;h4&gt;Visualization&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Tufte's books, especially &lt;a href="http://www.edwardtufte.com/tufte/books_vdqi"&gt;The Visual Display of Quantitative Information&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.processing.org/learning/"&gt;Processing&lt;/a&gt;, along with Ben Fry's book, &lt;a href="http://oreilly.com/catalog/9780596514556"&gt;Visualizing Data&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://hci.stanford.edu/jheer/"&gt;Jeffrey Heer&lt;/a&gt;'s papers, especially &lt;a href="http://vis.stanford.edu/papers/infovis-design-patterns"&gt;Software Design Patterns for Information Visualization&lt;/a&gt;. Heer is one of the creators of several toolkits: Prefuse, Flare and Protovis.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://fellinlovewithdata.com/guides/7-classic-foundational-vis-papers"&gt;7 Classic Foundational Vis Papers&lt;/a&gt; and &lt;a href="http://visualizeit.wordpress.com/2009/06/05/seminal-information-visualization-papers/"&gt;Seminal information visualization papers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Classes&lt;/h4&gt;

&lt;blockquote&gt;&lt;a href="http://blog.revolutionanalytics.com/2011/02/course-machine-learning-with-r.html"&gt;Starting on March 5 at the Hacker Dojo in Mountain View (CA), Mike Bowles and Patricia Hoffmann will present a course on Machine Learning where R will be the "lingua franca" for looking at homework problems, discussing them and comparing different solution approaches. The class will begin at the level of elementary probability and statistics and from that background survey a broad array of machine learning techniques including: Unsupervised Learning, Clustering Techniques, and Fault Detection.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="http://blog.revolutionanalytics.com/2011/02/r-cou.html"&gt;R courses from Statistics.com&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;&lt;strong&gt;Feb 11&lt;/strong&gt;:&amp;#0160;&amp;#0160;&lt;a href="http://www.statistics.com/courses/using-r/modelingR" target="_self"&gt;Modeling in R&lt;/a&gt; (Sudha Purohit --&amp;#0160;more details after the jump)&lt;br /&gt;&lt;strong&gt;Mar 4&lt;/strong&gt;:&amp;#0160;&amp;#0160;&lt;a href="http://www.statistics.com/index.php?page=R#page=page-1" target="_self"&gt;Introduction to R - Data Handling&lt;/a&gt;&amp;#0160;(Paul Murrell)&lt;br /&gt;&lt;strong&gt;Apr 15&lt;/strong&gt;:&amp;#0160;&amp;#0160;&lt;a href="http://www.statistics.com/index.php?page=Rprogramming#page=page-1" target="_self"&gt;Programming in R&lt;/a&gt; (Hadley Wickham)&lt;br /&gt;&lt;strong&gt;Apr 29&lt;/strong&gt;:&amp;#0160;&amp;#0160;&lt;a href="http://www.statistics.com/courses/using-r/graphicsR/#page=page-1" target="_self"&gt;Graphics in R&lt;/a&gt; (Paul Murrell)&lt;br /&gt;&lt;strong&gt;May 20&lt;/strong&gt;:&amp;#0160;&amp;#0160;&lt;a href="http://www.statistics.com/index.php?page=Rstatistics#page=page-1" target="_self"&gt;Introduction to R – Statistical Analysis&lt;/a&gt; (John Verzani)&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="http://strataconf.com/strata2011/public/schedule/detail/17164"&gt;Data bootcamp&lt;/a&gt; (&lt;a href="https://github.com/drewconway/strata_bootcamp"&gt;slides and code&lt;/a&gt;) from the &lt;a href="http://www.sauria.com/blog/2011/02/07/strata-2011/"&gt;Strata Conference&lt;/a&gt;. Tutorials covering a handful of example problems using R and python.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plotting data on maps&lt;/li&gt;
&lt;li&gt;Classifying emails&lt;/li&gt;
&lt;li&gt;A classification problem in image analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="http://www.stat.cmu.edu/~cshalizi/"&gt;Cosma Shalizi&lt;/a&gt; at CMU teaches a class: &lt;a href="http://www.stat.cmu.edu/~cshalizi/402/"&gt;Undergraduate Advanced Data Analysis&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;More resources&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.autonlab.org/tutorials/"&gt;A great list of machine learning tutorials&lt;/a&gt; by Andrew Moore.&lt;/li&gt;

&lt;li&gt;There are so many classes, books and &lt;a href="/2008/03/lecture-videos-online.html"&gt;lecture videos online&lt;/a&gt; these days, your only limit is the rate at which you can absorb it.&lt;/li&gt;

&lt;li&gt;Hadley Wickham's &lt;a href="https://github.com/hadley/devtools/wiki/data-philosophy"&gt;A philosophy of clean data&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://abhishek-tiwari.com/2010/10/why-it-is-best-time-to-be-bioinformatician.html"&gt;Abhishek Tiwari&lt;/a&gt; points us to a Quora thread: &lt;a href="http://www.quora.com/How-do-I-become-a-data-scientist"&gt;How do I become a data scientist?&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Drew Conway's &lt;a href="http://www.drewconway.com/zia/?p=2378"&gt;Data Science Venn Diagram&lt;/a&gt;, which he expands on in &lt;a href="http://www.drewconway.com/zia/wp-content/uploads/2011/04/IQT-Quarterly_Spring-2011_Conway.pdf"&gt;Data science in the US intelligence community&lt;/a&gt;. I like Conway's emphasis on the scientific method and hypothesis testing. Drew is coming out with a book soon, &lt;a href="http://shop.oreilly.com/product/0636920018483.do"&gt;Machine Learning for Hackers&lt;/a&gt;, that sounds promising.&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.quora.com/Machine-Learning/What-are-some-good-resources-for-learning-about-machine-learning-Why"&gt;Good resources for learning about machine learning&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.manning.com/pharrington/"&gt;Machine Learning in Action&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-6166941779290360995?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/6166941779290360995/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/learning-data-science-skills.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6166941779290360995'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6166941779290360995'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/03/learning-data-science-skills.html' title='Learning data science skills'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_dbECP0yvozc/R8pTOY7zSZI/AAAAAAAAAxU/VuGWdTuknPE/s72-c/Sclr+1533.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2083754931948115134</id><published>2011-02-21T12:03:00.000-08:00</published><updated>2011-02-22T06:10:48.100-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for Introductory Statistics, 
Chapter 5, hypergeometric distribution</title><content type='html'>&lt;p&gt;This is a little digression from Chapter 5 of &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt; that led me to the hypergeometric distribution.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-y0dtC-jnmWU/TWLGBPzQi1I/AAAAAAAAC1M/uGWUE_pfTfQ/s1600/hypergeometric_dist.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 320px;" src="http://1.bp.blogspot.com/-y0dtC-jnmWU/TWLGBPzQi1I/AAAAAAAAC1M/uGWUE_pfTfQ/s320/hypergeometric_dist.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5576237013361789778" /&gt;&lt;/a&gt;

&lt;p&gt;&lt;b&gt;Question 5.13&lt;/b&gt; A sample of 100 people is drawn from a population of 600,000. If it is known that 40% of the population has a specific attribute, what is the probability that 35 or fewer in the sample have that attribute?&lt;/p&gt;

&lt;p&gt;I'm pretty sure that you're supposed to reason that 600,000 is sufficiently large that the draws from the population are close enough to independent. The answer is then computed like so:&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; pbinom(35,100,0.4)
[1] 0.1794694
&lt;/pre&gt;

&lt;p&gt;Although this is close enough for practical purposes, the real way to answer this question is with the &lt;a href="http://stats.stackexchange.com/questions/7408/sampling-from-a-fixed-population"&gt;hypergeometric distribution&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://en.wikipedia.org/wiki/Hypergeometric_distribution"&gt;The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of k draws from a finite population &lt;i&gt;without&lt;/i&gt; replacement, just as the binomial distribution describes the number of successes for draws &lt;i&gt;with&lt;/i&gt; replacement.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;The situation is usually described in terms of balls and urns. There are &lt;i&gt;N&lt;/i&gt; balls in an urn, &lt;i&gt;m&lt;/i&gt; white balls and &lt;i&gt;n&lt;/i&gt; black balls. We draw &lt;i&gt;k&lt;/i&gt; balls without replacement. &lt;i&gt;X&lt;/i&gt; represents the number of white balls drawn.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-sc1ArfNmjec/TWLGA8Zu8WI/AAAAAAAAC1E/VstJX28aUtk/s1600/hypergeometric_equation.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 174px; height: 51px;" src="http://2.bp.blogspot.com/-sc1ArfNmjec/TWLGA8Zu8WI/AAAAAAAAC1E/VstJX28aUtk/s320/hypergeometric_equation.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5576237008154456418" /&gt;&lt;/a&gt;

&lt;p&gt;R gives us the function &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Hypergeometric.html"&gt;phyper&lt;/a&gt;(q, m, n, k, lower.tail = TRUE, log.p = FALSE), which does indeed show that our approximation was close enough.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; phyper(35,240000,360000, 100)
[1] 0.1794489
&lt;/pre&gt;
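
&lt;p&gt;As a sanity check, the mass function can also be summed directly. This is a minimal sketch, not part of the book's solution: it assumes the formula above is the usual choose(m, x) * choose(n, k - x) / choose(N, k), the variable names are mine, and it works on the log scale with &lt;i&gt;lchoose&lt;/i&gt; because choose(600000, 100) overflows a double.&lt;/p&gt;

&lt;pre class="codebox"&gt;
N &amp;lt;- 600000   # population size
m &amp;lt;- 240000   # members with the attribute (white balls)
n &amp;lt;- N - m    # members without it (black balls)
k &amp;lt;- 100      # sample size
x &amp;lt;- 0:35     # outcomes counted by P(X &amp;lt;= 35)

# log of choose(m, x) * choose(n, k - x) / choose(N, k), then back-transform
log_pmf &amp;lt;- lchoose(m, x) + lchoose(n, k - x) - lchoose(N, k)
by_formula &amp;lt;- sum(exp(log_pmf))

# the same sum using the built-in mass function
by_dhyper &amp;lt;- sum(dhyper(x, m, n, k))

c(by_formula, by_dhyper, phyper(35, m, n, k))
&lt;/pre&gt;

&lt;p&gt;All three numbers should agree up to floating point error.&lt;/p&gt;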

&lt;p&gt;Since we're down with OCD, let's explore a bit further. Our population is defined and not too huge, so let's just try it empirically. First, create our population.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; pop &amp;lt;- rep(c(0,1),c(360000, 240000))
&amp;gt; length(pop)
[1] 600000
&amp;gt; mean(pop)
[1] 0.4
&amp;gt; sd(pop)
[1] 0.4898984
&lt;/pre&gt;

&lt;p&gt;Next, generate a boatload of samples and see how many of them have 35 or fewer of the special members.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; sums &amp;lt;- sapply(1:200000, function(x) { sum(sample(pop,100))})
&amp;gt; sum(sums &amp;lt;= 35) / 200000
[1] 0.17935
&lt;/pre&gt;
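
&lt;p&gt;The same experiment can be sketched without materializing the population, since &lt;i&gt;rhyper&lt;/i&gt; draws hypergeometric counts directly. The seed and the variable names below are my own choices. The last step estimates the Monte Carlo standard error, which gives a feel for how close a proportion simulated from 200,000 samples can be expected to land.&lt;/p&gt;

&lt;pre class="codebox"&gt;
set.seed(42)                                        # arbitrary seed, for reproducibility only
draws &amp;lt;- rhyper(200000, 240000, 360000, 100)   # special members in each sample of 100
p_hat &amp;lt;- mean(draws &amp;lt;= 35)                    # proportion of samples with 35 or fewer
se &amp;lt;- sqrt(p_hat * (1 - p_hat) / 200000)       # standard error of that proportion
c(estimate = p_hat, std_error = se)
&lt;/pre&gt;

&lt;p&gt;The standard error comes out a little under 0.001, so a simulated 0.17935 is consistent with the 0.1794489 from phyper.&lt;/p&gt;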

&lt;p&gt;Pretty close to our computed results. I thought I might be able to compute an answer using the central limit theorem, using the distribution of sample means, which should be approximately normal.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-VhTMXLRkRi4/TWLGAkBAN_I/AAAAAAAAC08/pUWWpU1aeYI/s1600/sampling_distribution.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 317px; height: 38px;" src="http://3.bp.blogspot.com/-VhTMXLRkRi4/TWLGAkBAN_I/AAAAAAAAC08/pUWWpU1aeYI/s320/sampling_distribution.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5576237001608280050" /&gt;&lt;/a&gt;

&lt;pre class="codebox"&gt;
&amp;gt; means &amp;lt;- sapply(1:2000, function(x) { mean(sample(pop,100))})
&amp;gt; mean(means)
[1] 0.40154
&amp;gt; sd(means)
[1] 0.0479998
&amp;gt; curve(dnorm(x, 0.4, sd(pop)/sqrt(100)), 0.2, 0.6, col=&amp;#x27;blue&amp;#x27;)
&amp;gt; lines(density(means), lty=2)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-Jz9qSHKcSp4/TWLGBQZitgI/AAAAAAAAC1U/82mFyobBeZE/s1600/sample_means_hyper.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 320px;" src="http://4.bp.blogspot.com/-Jz9qSHKcSp4/TWLGBQZitgI/AAAAAAAAC1U/82mFyobBeZE/s320/sample_means_hyper.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5576237013522363906" /&gt;&lt;/a&gt;

&lt;p&gt;Shouldn't I be able to compute how many of my samples will have 35 or fewer special members? This seems to be a ways off, but I don't know why. Maybe it's just the error due to approximating a discrete distribution with a continuous one?&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; pnorm(0.35, 0.4, sd(pop)/sqrt(100))
[1] 0.1537173
&lt;/pre&gt;

&lt;p&gt;This fudge gets us closer, but still not as close as our initial approximation.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; pnorm(0.355, 0.4, sd(pop)/sqrt(100))
[1] 0.1791634
&lt;/pre&gt;

&lt;p&gt;If anyone knows what's up with this, that's what comments are for. Help me out.&lt;/p&gt;
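
&lt;p&gt;One possible explanation, offered as a sketch rather than a definitive answer: the count of special members is an integer, so the event "35 or fewer" corresponds to everything below 35.5 on a continuous scale. Evaluating the normal curve at 0.35 leaves out half a bar of probability, and moving the cutoff to 0.355 is the usual continuity correction. Whatever gap remains would then be the error of the normal approximation itself. The code below restates this on the scale of counts, using the theoretical binomial standard deviation sqrt(n p (1 - p)) in place of sd(pop); the variable names are mine.&lt;/p&gt;

&lt;pre class="codebox"&gt;
n &amp;lt;- 100
p &amp;lt;- 0.4
mu &amp;lt;- n * p                      # expected number of special members
sigma &amp;lt;- sqrt(n * p * (1 - p))   # binomial standard deviation of the count

uncorrected &amp;lt;- pnorm(35, mu, sigma)   # cutoff at 35
corrected &amp;lt;- pnorm(35.5, mu, sigma)   # cutoff halfway between 35 and 36
c(uncorrected, corrected, pbinom(35, n, p))
&lt;/pre&gt;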

&lt;h4&gt;Notes on Using R for Introductory Statistics&lt;/h4&gt;
&lt;p&gt;Chapters 1 and 2&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/04/using-r-for-introductory-statistics.html"&gt;Univariate data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Chapter 3&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/05/using-r-for-introductory-statistics-31.html"&gt;Categorical data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/06/using-r-for-introductory-statistics-32.html"&gt;Comparing independent samples&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics-33.html"&gt;Relationships in numeric data, correlation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics.html"&gt;Simple linear regression&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Chapter 4&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/12/using-r-for-introductory-statistics.html"&gt;Multivariate data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/01/using-r-for-introductory-statistics.html"&gt;Model formulae&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Chapter 5&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2011/01/using-r-for-introductory-statistics_23.html"&gt;Basic probability&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/02/using-r-for-introductory-statistics.html"&gt;Probability distributions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2083754931948115134?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2083754931948115134/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/using-r-for-introductory-statistics_21.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2083754931948115134'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2083754931948115134'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/using-r-for-introductory-statistics_21.html' title='Using R for Introductory Statistics, Chapter 5, hypergeometric distribution'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-y0dtC-jnmWU/TWLGBPzQi1I/AAAAAAAAC1M/uGWUE_pfTfQ/s72-c/hypergeometric_dist.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4246038148906898424</id><published>2011-02-13T17:26:00.000-08:00</published><updated>2011-02-21T10:48:50.541-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rant'/><title type='text'>The Tiger Mom and A Clockwork Orange</title><content type='html'>&lt;blockquote&gt;True 
discipline is doing what you want.&lt;/blockquote&gt;

&lt;p&gt;A wise friend once told me that. Amy Chua, better known as the &lt;b&gt;Tiger Mother&lt;/b&gt;, wrote about discipline (from a different point of view) in &lt;a href="http://online.wsj.com/article/SB10001424052748704111504576059713528698754.html"&gt;Why Chinese Mothers Are Superior&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;What Chinese parents understand is that nothing is fun until you're good at it. To get good at anything you have to work, and children on their own never want to work, which is why it is crucial to override their preferences. [...] Tenacious practice, practice, practice is crucial for excellence; rote repetition is underrated in America. Once a child starts to excel at something -- whether it's math, piano, pitching or ballet -- he or she gets praise, admiration and satisfaction. This builds confidence and makes the once not-fun activity fun. This in turn makes it easier for the parent to get the child to work even more.&lt;/blockquote&gt;

&lt;p&gt;For what it's worth, Chua's &lt;a href="http://www.amazon.com/Battle-Hymn-Tiger-Mother-Chua/dp/1594202842"&gt;book&lt;/a&gt; is apparently &lt;a href="http://articles.sfgate.com/2011-01-13/entertainment/27026230_1_chinese-parents-asian-american-jeff-yang"&gt;less strident&lt;/a&gt; and more nuanced than the WSJ article. Anyway, like her methods or not, I have a lot of sympathy for a parent trying to teach her kids about delayed gratification, that you can do difficult things if you try, and that &lt;b&gt;hard work pays off&lt;/b&gt;.&lt;/p&gt;

&lt;p&gt;If it's true that mastering a complex skill takes &lt;a href="http://www.nytimes.com/2006/05/07/magazine/07wwln_freak.html"&gt;10,000 hours of practice&lt;/a&gt;, then the persistence to push through those hours is a fairly important lesson to learn early. Recent research has caused a reappraisal of how much talent arises from innate genius versus how much is the product of effort, practice and persistence.&lt;/p&gt;

&lt;p&gt;What brought to mind my old friend's remark about discipline was &lt;a href="http://paulbuchheit.blogspot.com/2011/02/two-paths-to-success.html"&gt;Paul Buchheit's take&lt;/a&gt;: motivation can be either intrinsic or extrinsic. Amy Chua is teaching her kids to be &lt;i&gt;extrinsically&lt;/i&gt; motivated, to respond to the praise and admiration of others. You do it because you are told to. You put your energy into chasing the approval of external authorities. In contrast, he describes &lt;i&gt;intrinsic&lt;/i&gt; motivation like this:&lt;/p&gt;

&lt;blockquote&gt;To the greatest extent possible, do whatever is most fun, interesting, and personally rewarding (and not evil).&lt;/blockquote&gt;

&lt;p&gt;Follow your heart, as hippy moms tell their children. Buchheit says, "I'm kind of lazy, or maybe I lack will power or discipline or something. Either way, it's very difficult for me to do anything that I don't feel like doing." Sounds familiar. "The intrinsic path to success is to focus on being the person that you are, and put all of your energy and drive into being the best possible version of yourself."&lt;/p&gt;

&lt;p&gt;The difference between intrinsic and extrinsic motivation is easily recognized in the moral dimension. In &lt;i&gt;A Clockwork Orange&lt;/i&gt;, Anthony Burgess imagines the transfer of aesthetic sense from creation to violence. Deprived of outlet, creativity turns destructive. The main thrust of the story is an examination of attempts to impose an external morality by force versus growing an internal morality.&lt;/p&gt;

&lt;p&gt;Paul Buchheit was the &lt;b&gt;software developer&lt;/b&gt; who originated Google's Gmail. For myself, and I'm sure lots of others, a key attraction to programming was the ability to create in a powerful medium without asking anyone's permission. The creative freedom, the feeling that the authorities hadn't (yet) figured out how to lock things down, was incredibly inspiring.&lt;/p&gt;

&lt;p&gt;It's easy to see how that attraction to technology meshes with &lt;a href="http://www.danpink.com/"&gt;Daniel Pink&lt;/a&gt;'s &lt;a href="http://www.ted.com/talks/dan_pink_on_motivation.html"&gt;elements of motivation&lt;/a&gt; — &lt;b&gt;autonomy&lt;/b&gt;, &lt;b&gt;mastery&lt;/b&gt;, and &lt;b&gt;purpose&lt;/b&gt;. If you've got root and a compiler, you've got autonomy. And it's all about a pissing contest of mastery. (This might partially explain the gender ratio in the field.) And technology is rife with appeals to higher purpose, from the open-source movement to the digital media that helped fuel the revolutions in Tunisia and Egypt.&lt;/p&gt;

&lt;p&gt;The Tiger Mom demands mastery before autonomy, leaving purpose firmly in the hands of the parent. Buchheit puts autonomy first, trusting in a natural sense of direction to lead to mastery and purpose. In terms of Maslow's hierarchy of needs, Amy Chua has the "esteem" level covered, but stops short of the top level - intrinsic self-directed creativity.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Maslow's_Hierarchy_of_Needs.svg/500px-Maslow's_Hierarchy_of_Needs.svg.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 300px;" src="http://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Maslow's_Hierarchy_of_Needs.svg/500px-Maslow's_Hierarchy_of_Needs.svg.png" border="0" alt="" /&gt;&lt;/a&gt;

&lt;p&gt;&lt;a href="http://thelastpsychiatrist.com/2010/08/this_is_why_the_american_dream.html"&gt;Some&lt;/a&gt; suggest that American society erects a border fence at the entry to the highest level. Well, nobody ever mentions why it's always drawn as a pyramid, but there's probably a reason. Not everyone gets to be at the top. Usually, that's the realm only of the &lt;a href="http://www.theatlantic.com/magazine/archive/2011/01/the-rise-of-the-new-global-elite/8343/"&gt;elite&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;David Cain of &lt;a href="http://www.raptitude.com/"&gt;Raptitude&lt;/a&gt; writes, under the title &lt;i&gt;&lt;a href="http://www.raptitude.com/2011/01/how-to-make-trillions-of-dollars/"&gt;How to Make Trillions of Dollars&lt;/a&gt;&lt;/i&gt;, that the fundamentals of being a self-directed person are these:&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://www.raptitude.com/2011/01/how-to-make-trillions-of-dollars/"&gt;Creativity. Curiosity. Resilience to distraction. Patience with others.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;Cain defines self-reliance as &amp;ldquo;an unswerving willingness to take responsibility for your life, regardless of who had a hand in making it the way it is&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;Again, we're basically talking about the top levels of the pyramid. But, I like the addition of resilience to distraction. Discipline is a hard sell in a culture that promotes immediate gratification and nonstop indulgence. It's hard to hear your own voice over the clamor of consumer culture and expectations from family, boss and everyone else. Hearing it is nearly impossible while drowning in distractions like Twitter and Facebook. And here's something else to remember about online amusements:&lt;/p&gt;

&lt;blockquote&gt;If you're not paying, you're not the customer; you're the product.&lt;/blockquote&gt;

&lt;p&gt;At a recent &lt;a href="http://strataconf.com/strata2011"&gt;data mining conference&lt;/a&gt; I saw rooms full of marketers ready to slice, dice and mash up your personal data to more precisely target advertising. Resisting this attack is an essential skill of modern life. A healthy cynicism is a necessary defense mechanism. Hearing yourself think is only going to get harder.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://measuringmeasures.com/blog/2010/3/5/consumerism-kills.html"&gt;flawed idea implicit in consumer culture&lt;/a&gt; is &lt;i&gt;what you consume is what you are&lt;/i&gt;. But, valuing consuming over creating or doing is inevitably a dead end. The reason I'm not so hot on the iPad is that it's a device for consuming. The old Macs were (marketed as) tools for programmers, graphic artists, musicians and film-makers -- in other words, doers, builders and creators.&lt;/p&gt;

&lt;p&gt;Purpose has to come from values. The recent travails of the financial sector show what happens when motivations or at least incentives become disconnected from morals and values.&lt;/p&gt;

&lt;p&gt;Of course, lots of technology has a purpose no higher than selling golf clubs on the internet. And technology, itself, can be a distraction. It's easy to get caught up in a rat race of the latest whizzy buzzword-laden language, tool or application framework du jour. For years, I've had a half-joking theory that the true purpose of the internet is to absorb the excess productivity of mankind.&lt;/p&gt;

&lt;p&gt;Creative, conceptual work driven by autonomy, mastery and purpose pursued with uninterrupted concentration. That begins to answer the question: how do we get some motivation, apply it to something good and inspire the same in those around us, especially our ungrateful screaming offspring?&lt;/p&gt;

&lt;p&gt;You can argue one way or another about whether a 13-year-old has the foresight to be intrinsically motivated. I certainly didn't have the wherewithal at that age to set a long-term goal. And pushing intrinsic motivation on your kids sounds to me something like imposing democracy by force. It doesn't make that much sense.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-Clxs5Nd3Fz0/TViFeV3j4pI/AAAAAAAAC0s/NWIL5xkKSYc/s1600/underachiever.gif"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 207px; height: 260px;" src="http://3.bp.blogspot.com/-Clxs5Nd3Fz0/TViFeV3j4pI/AAAAAAAAC0s/NWIL5xkKSYc/s320/underachiever.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5573351295183741586" /&gt;&lt;/a&gt;

&lt;p&gt;It takes determination to undertake exhausting, frustrating efforts whose payoff is distant and uncertain. Most things that are worth doing are hard. The drive and courage to try anyway is what makes real progress possible. That doesn't come easily, and neither does the judgement necessary to gauge what is worthwhile against the scale of your own values.&lt;/p&gt;

&lt;p&gt;From my current position, I don't particularly want to lecture anyone on how to succeed in life. I respond negatively to coercion and can be a champion slacker, both of which were much to the detriment of my academic career. Still, here it is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Value creation over consumption.&lt;/li&gt;
&lt;li&gt;Surround yourself with creators.&lt;/li&gt;
&lt;li&gt;Do what you love, do it a lot, and do it hard.&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4246038148906898424?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4246038148906898424/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/tiger-mom-and-clockwork-orange.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4246038148906898424'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4246038148906898424'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/tiger-mom-and-clockwork-orange.html' title='The Tiger Mom and A Clockwork Orange'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Clxs5Nd3Fz0/TViFeV3j4pI/AAAAAAAAC0s/NWIL5xkKSYc/s72-c/underachiever.gif' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-6872969038868711380</id><published>2011-02-08T22:15:00.000-08:00</published><updated>2011-02-08T23:21:59.685-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for 
Introductory Statistics, Chapter 5, Probability Distributions</title><content type='html'>&lt;p&gt;In Chapter 5 of &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt; we get a brief introduction to probability and, as part of that, a few common &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions"&gt;probability distributions&lt;/a&gt;. Specifically, the normal, binomial, exponential and lognormal distributions make an appearance.&lt;/p&gt;

&lt;p&gt;For each &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html"&gt;distribution&lt;/a&gt;, R provides four functions whose names start with the letters &lt;b&gt;&lt;i&gt;d&lt;/i&gt;&lt;/b&gt;, &lt;b&gt;&lt;i&gt;p&lt;/i&gt;&lt;/b&gt;, &lt;b&gt;&lt;i&gt;q&lt;/i&gt;&lt;/b&gt; or &lt;b&gt;&lt;i&gt;r&lt;/i&gt;&lt;/b&gt; followed by the family name of the distribution. For example, &lt;i&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Normal.html"&gt;rnorm&lt;/a&gt;&lt;/i&gt; produces random numbers drawn from a normal distribution. The letters stand for:&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;td&gt;&lt;b&gt;d&lt;/b&gt;&lt;/td&gt;&lt;td&gt;density/mass function&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;b&gt;p&lt;/b&gt;&lt;/td&gt;&lt;td&gt;probability (cumulative distribution function) P(X &amp;lt;= x)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;b&gt;q&lt;/b&gt;&lt;/td&gt;&lt;td&gt;quantiles, given q, the smallest x such that P(X &amp;lt;= x) &amp;gt;= q&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;b&gt;r&lt;/b&gt;&lt;/td&gt;&lt;td&gt;random number generation&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
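
&lt;p&gt;For instance, the four functions for the normal family look like this, with one discrete quantile thrown in. It's only an illustration; the particular arguments are arbitrary.&lt;/p&gt;

&lt;pre class="codebox"&gt;
dnorm(0)               # density of the standard normal at 0
pnorm(1.96)            # P(X &amp;lt;= 1.96), about 0.975
qnorm(0.975)           # the inverse of pnorm, about 1.96
rnorm(5)               # five random draws
qbinom(0.5, 10, 0.5)   # discrete case: smallest x with P(X &amp;lt;= x) &amp;gt;= 0.5
&lt;/pre&gt;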

&lt;h4&gt;Normal&lt;/h4&gt;
&lt;p&gt;The Gaussian or &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Normal.html"&gt;normal distribution&lt;/a&gt; has a prominent place due to the central limit theorem. It is widely used to model natural phenomena like variations in height or weight as well as noise and error. The &lt;b&gt;68-95-99.7 rule&lt;/b&gt; says:&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TVIxdGEn3EI/AAAAAAAAC0M/poqBjQCEx1Y/s1600/normal.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 117px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TVIxdGEn3EI/AAAAAAAAC0M/poqBjQCEx1Y/s200/normal.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5571570064926694466" /&gt;&lt;/a&gt;

&lt;ul&gt; 
&lt;li&gt;68% of the data falls within 1 standard deviation of the mean&lt;/li&gt; 
&lt;li&gt;95% of the data falls within 2 standard deviations of the mean&lt;/li&gt; 
&lt;li&gt;99.7% of the data falls within 3 standard deviations of the mean&lt;/li&gt; 
&lt;/ul&gt;
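
&lt;p&gt;The rule can be checked with &lt;i&gt;pnorm&lt;/i&gt;, since the probability of landing within k standard deviations of the mean is pnorm(k) - pnorm(-k). A quick sketch, with variable names of my own:&lt;/p&gt;

&lt;pre class="codebox"&gt;
k &amp;lt;- 1:3
within &amp;lt;- pnorm(k) - pnorm(-k)   # probability within k standard deviations
round(within, 4)
# [1] 0.6827 0.9545 0.9973
&lt;/pre&gt;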

&lt;p&gt;To plot a normal distribution, define some points &lt;i&gt;x&lt;/i&gt;, and use &lt;i&gt;dnorm&lt;/i&gt; to generate the density at those points.&lt;/p&gt;

&lt;pre class="codebox"&gt;
x &amp;lt;- seq(-3,3,0.1)
plot(x=x, y=dnorm(x, mean=0, sd=1), type=&amp;#x27;l&amp;#x27;)
&lt;/pre&gt;

&lt;h4&gt;Binomial&lt;/h4&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TVIxpjYw7mI/AAAAAAAAC0U/_1QH4mHxfvY/s1600/binom.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 120px;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TVIxpjYw7mI/AAAAAAAAC0U/_1QH4mHxfvY/s200/binom.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5571570278954233442" /&gt;&lt;/a&gt;

&lt;p&gt;A Bernoulli trial is an experiment which can have one of two possible outcomes. Independent repeated Bernoulli trials give rise to the &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Binomial.html"&gt;Binomial&lt;/a&gt; distribution, which is the probability distribution of the number of successes in &lt;i&gt;n&lt;/i&gt; independent Bernoulli trials. Although the binomial distribution is discrete, in the limit as n gets larger, it approaches the normal distribution.&lt;/p&gt;

&lt;pre class="codebox"&gt;
x &amp;lt;- seq(0,20,1)
plot(x=x, y=dbinom(x,20,0.5))
&lt;/pre&gt;
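
&lt;p&gt;To see the approach to the normal, one can overlay a normal density with the same mean, n p, and standard deviation, sqrt(n p (1 - p)). This is a sketch reusing the n = 20, p = 0.5 example above; the variable names are mine.&lt;/p&gt;

&lt;pre class="codebox"&gt;
n &amp;lt;- 20
p &amp;lt;- 0.5
x &amp;lt;- 0:n
binom_mass &amp;lt;- dbinom(x, n, p)
normal_dens &amp;lt;- dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p)))

plot(x, binom_mass, type=&amp;#x27;h&amp;#x27;)       # binomial probabilities as spikes
lines(x, normal_dens, col=&amp;#x27;blue&amp;#x27;)   # normal density with matching mean and sd
max(abs(binom_mass - normal_dens))            # largest pointwise gap
&lt;/pre&gt;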

&lt;h4&gt;Uniform&lt;/h4&gt;
&lt;p&gt;A &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Uniform.html"&gt;Uniform&lt;/a&gt; distribution just says that all allowable values are equally likely, which comes up in dice or cards. Uniform distributions come in either continuous or discrete flavors.&lt;/p&gt;
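
&lt;p&gt;In R, &lt;i&gt;runif&lt;/i&gt; covers the continuous case, and &lt;i&gt;sample&lt;/i&gt; can stand in for the discrete one. A small sketch, with a six-sided die as the assumed example:&lt;/p&gt;

&lt;pre class="codebox"&gt;
u &amp;lt;- runif(5, min=0, max=1)              # five draws from the continuous uniform on [0, 1]
rolls &amp;lt;- sample(1:6, 10, replace=TRUE)   # ten rolls of a fair six-sided die
table(rolls)
punif(0.25, min=0, max=1)                    # P(X &amp;lt;= 0.25) is 0.25
&lt;/pre&gt;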

&lt;h4&gt;Log-normal&lt;/h4&gt;
&lt;p&gt;The &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Lognormal.html"&gt;log-normal&lt;/a&gt; distribution is a probability distribution of a random variable whose logarithm is normally distributed. If X is a random variable with a normal distribution, then Y = exp(X) has a log-normal distribution. Analogously to the central limit theorem, the &lt;i&gt;product&lt;/i&gt; of many independent random variables multiplied together tends toward a lognormal distribution. It can be used to model continuous random quantities whose distribution is skewed and non-negative, for example income or survival.&lt;/p&gt;

&lt;pre class="codebox"&gt;
samples &amp;lt;- rlnorm(100, meanlog=0, sdlog=1)
par(fig=c(0,1,0,0.35))
boxplot(samples, horizontal=T, bty=&amp;quot;n&amp;quot;, xlab=&amp;quot;log-normal distribution&amp;quot;)
par(fig=c(0,1,0.25,1), new=T)
s &amp;lt;- seq(0,max(samples),0.1)
d &amp;lt;- dlnorm(s, meanlog=0, sdlog=1)
hist(samples, prob=T, main=&amp;quot;&amp;quot;, col=gray(0.9), ylim=c(0,max(d)))
lines(density(samples), lty=2)
curve(dlnorm(x, meanlog=0, sdlog=1), lwd=2, add=T)
rug(samples)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TVIx0OukeCI/AAAAAAAAC0c/TuQ2fUm4Qms/s1600/log-normal.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 381px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TVIx0OukeCI/AAAAAAAAC0c/TuQ2fUm4Qms/s400/log-normal.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5571570462387107874" /&gt;&lt;/a&gt;
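
&lt;p&gt;The claim about products can be illustrated by simulation. In this sketch the factors are uniform on [0.5, 1.5], an arbitrary choice of mine, and so are the seed and the variable names. If the products are roughly log-normal, their logarithms should look roughly normal.&lt;/p&gt;

&lt;pre class="codebox"&gt;
set.seed(1)   # arbitrary seed, for reproducibility only
products &amp;lt;- replicate(10000, prod(runif(50, 0.5, 1.5)))   # each is a product of 50 factors
logs &amp;lt;- log(products)
hist(logs, prob=T, main=&amp;quot;&amp;quot;, col=gray(0.9))
curve(dnorm(x, mean(logs), sd(logs)), lwd=2, add=T)   # normal with matching mean and sd
&lt;/pre&gt;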

&lt;h4&gt;Exponential&lt;/h4&gt;
&lt;p&gt;The &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Exponential.html"&gt;exponential&lt;/a&gt; distribution is used to model the time interval between successive random events such as time between failures arising from constant failure rates. The following plot is generated by essentially the same code as above.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TVIx9khndpI/AAAAAAAAC0k/MJXimOQiaD4/s1600/exp2.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 381px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TVIx9khndpI/AAAAAAAAC0k/MJXimOQiaD4/s400/exp2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5571570622857180818" /&gt;&lt;/a&gt;
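
&lt;p&gt;A sketch of that adaptation, assuming rate=1, which may not be the rate used for the plot above:&lt;/p&gt;

&lt;pre class="codebox"&gt;
samples &amp;lt;- rexp(100, rate=1)
par(fig=c(0,1,0,0.35))
boxplot(samples, horizontal=T, bty=&amp;quot;n&amp;quot;, xlab=&amp;quot;exponential distribution&amp;quot;)
par(fig=c(0,1,0.25,1), new=T)
d &amp;lt;- dexp(seq(0,max(samples),0.1), rate=1)   # density values, used to set the y range
hist(samples, prob=T, main=&amp;quot;&amp;quot;, col=gray(0.9), ylim=c(0,max(d)))
lines(density(samples), lty=2)
curve(dexp(x, rate=1), lwd=2, add=T)
rug(samples)
&lt;/pre&gt;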


&lt;h4&gt;More on probability distributions&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;R's d, p, q and r methods for families of &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html"&gt;distributions&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;An Introduction to R, Chapter 8: &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions"&gt;Probability distributions&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;CRAN Task View: &lt;a href="http://cran.r-project.org/web/views/Distributions.html"&gt;Probability Distributions&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;John D Cook's map of the relationships between &lt;a href="http://www.johndcook.com/distribution_chart.html"&gt;probability distributions&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;



&lt;h4&gt;More Using R for Introductory Statistics&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/04/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapters 1 and 2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/05/using-r-for-introductory-statistics-31.html"&gt;Using R for Introductory Statistics 3.1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/06/using-r-for-introductory-statistics-32.html"&gt;Using R for Introductory Statistics 3.2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics-33.html"&gt;Using R for Introductory Statistics 3.3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapter 3.4&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/12/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapter 4&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/01/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapter 4, Model Formulae&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/01/using-r-for-introductory-statistics_23.html"&gt;Using R for Introductory Statistics, Chapter 5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-6872969038868711380?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/6872969038868711380/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/using-r-for-introductory-statistics.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6872969038868711380'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6872969038868711380'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/using-r-for-introductory-statistics.html' title='Using R for Introductory Statistics, Chapter 5, Probability Distributions'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_dbECP0yvozc/TVIxdGEn3EI/AAAAAAAAC0M/poqBjQCEx1Y/s72-c/normal.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-5498718270462818274</id><published>2011-02-01T21:37:00.000-08:00</published><updated>2011-02-01T22:10:42.315-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' 
term='software engineering'/><category scheme='http://www.blogger.com/atom/ns#' term='clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Annotated source code</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TUj1Bj3xcyI/AAAAAAAAC0A/gTw2chFjGTk/s1600/scribe_full.gif"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 112px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TUj1Bj3xcyI/AAAAAAAAC0A/gTw2chFjGTk/s200/scribe_full.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5568970346401461026" /&gt;&lt;/a&gt;

&lt;p&gt;We programmers are told that &lt;a href="http://www.spinellis.gr/codereading/"&gt;reading code&lt;/a&gt; is a &lt;a href="http://www.skorks.com/2010/05/why-i-love-reading-other-peoples-code-and-you-should-too/"&gt;good idea&lt;/a&gt;. It may be &lt;i&gt;good for you&lt;/i&gt;, but it's hard work. &lt;a href="https://github.com/jashkenas"&gt;Jeremy Ashkenas&lt;/a&gt; has come up with a simple tool that makes it easier: &lt;a href="http://jashkenas.github.com/docco/"&gt;docco&lt;/a&gt;. Ashkenas is also behind &lt;a href="http://documentcloud.github.com/underscore/"&gt;underscore.js&lt;/a&gt; and &lt;a href="http://jashkenas.github.com/coffee-script/"&gt;CoffeeScript&lt;/a&gt;, a dialect of JavaScript in which docco is written.&lt;/p&gt;

&lt;p&gt;Interesting ways to mix prose and code have appealed to me ever since I first discovered &lt;b&gt;Mathematica's live notebook&lt;/b&gt;, which lets you author documents that combine executable source code, typeset text and interactive graphics. For those who remember the early '90s chiefly for their potty training, running Mathematica on the NeXT pizza boxes was like a trip to the future. Combining the quick cycles of a read-eval-print loop with complete word processing and mathematical typesetting encourages you to keep lovely notes on your thinking and your trials and errors.&lt;/p&gt;

&lt;p&gt;Along the same lines, there's &lt;a href="http://www.stat.umn.edu/~charlie/Sweave/"&gt;Sweave&lt;/a&gt; for R and &lt;a href="http://www.sagemath.org/"&gt;Sage&lt;/a&gt; for Python.&lt;/p&gt;

&lt;p&gt;Likewise, one of the great innovations of Java was &lt;a href="http://download.oracle.com/javase/6/docs/api/index.html"&gt;Javadoc&lt;/a&gt;. Javadoc doesn't get nearly enough credit for the success of Java as a language. It made powerful APIs like the collections classes a snap to use and even helped navigate the byzantine complexities of Swing and AWT.&lt;/p&gt;

&lt;p&gt;These days, automated documentation is expected for any language. Nice examples are: &lt;a href="http://ruby-doc.org/"&gt;RubyDoc&lt;/a&gt;, &lt;a href="http://www.scala-lang.org/api/"&gt;scaladoc&lt;/a&gt;, &lt;a href="http://www.haskell.org/haddock/"&gt;Haddock&lt;/a&gt; (for Haskell). &lt;a href="http://www.doxygen.org/"&gt;Doxygen&lt;/a&gt; works with a number of languages. Python has &lt;a href="http://docs.python.org/library/pydoc.html"&gt;pydoc&lt;/a&gt;, but in practice seems to rely more on the &lt;a href="http://docs.python.org/library/index.html"&gt;library reference&lt;/a&gt;. Anyway, there are &lt;a href="http://en.wikipedia.org/wiki/Comparison_of_documentation_generators"&gt;a bunch&lt;/a&gt;, and if your favorite language doesn't have one, start coding now.&lt;/p&gt;

&lt;p&gt;The grand-daddy of these ideas is Donald Knuth's &lt;span style="font-weight:bold;"&gt;literate programming&lt;/span&gt;.&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://www.literateprogramming.com/"&gt;&lt;p&gt;I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: "Literate Programming."&lt;/p&gt;

&lt;p&gt;Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.&lt;/p&gt;

&lt;p&gt;The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.&lt;/p&gt;&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;Indeed, Ashkenas references Knuth, calling docco "quick-and-dirty, hundred-line-long, literate-programming".&lt;/p&gt;

&lt;p&gt;This goodness needs to come to more languages. There's a Ruby port called &lt;a href="http://rtomayko.github.com/rocco/"&gt;rocco&lt;/a&gt; by Ryan Tomayko. And for Clojure there's &lt;a href="http://fogus.me/fun/marginalia/"&gt;marginalia&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I love the quick-and-dirty aspect, which will be the key to encouraging programmers to write more documentation that &lt;a href="http://jashkenas.github.com/docco/"&gt;looks like this&lt;/a&gt;. I hope they build docco, or something like it, into GitHub. Maybe one day there will be a Norton Anthology of annotated source code.&lt;/p&gt;

&lt;h4&gt;Vaguely related&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://timeless.judofyr.net/literate-programming.html"&gt;literate-programming.rb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.americanscientist.org/issues/id.3489,y.0,no.,content.true,page.1,css.print/issue.aspx"&gt;The Semicolon Wars&lt;/a&gt; Every programmer knows there is one true programming language. A new one every week&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-5498718270462818274?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/5498718270462818274/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/annotated-source-code.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5498718270462818274'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/5498718270462818274'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/02/annotated-source-code.html' title='Annotated source code'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_dbECP0yvozc/TUj1Bj3xcyI/AAAAAAAAC0A/gTw2chFjGTk/s72-c/scribe_full.gif' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3997724039033699249</id><published>2011-01-23T13:07:00.000-08:00</published><updated>2011-01-24T15:31:38.820-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for Introductory Statistics, Chapter 
5</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TTyZJCwLssI/AAAAAAAACzk/9rau33GO7TQ/s1600/dice.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 134px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TTyZJCwLssI/AAAAAAAACzk/9rau33GO7TQ/s200/dice.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5565491620159926978" /&gt;&lt;/a&gt;

&lt;p&gt;Any good stats book has to cover a bit of basic probability. That's the purpose of Chapter 5 of &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt;, starting with a few definitions:&lt;/p&gt;

&lt;dl&gt;
  &lt;dt&gt;Random variable&lt;/dt&gt;
  &lt;dd&gt;A random number drawn from a population. A random variable is a variable for which we define a range of possible values and a probability distribution. The probability distribution specifies the probability that the variable assumes any value in its range.&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;The range of a discrete random variable is a finite (or countably infinite) list of values. For any value &lt;i&gt;k&lt;/i&gt; in the range, &lt;i&gt;0 &amp;#8804; P(X=k) &amp;#8804; 1&lt;/i&gt;. The sum of &lt;i&gt;P(X=k)&lt;/i&gt; over all values &lt;i&gt;k&lt;/i&gt; in the range is 1.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TTktTIJdqkI/AAAAAAAACyk/sIUtdoSchiQ/s1600/sum_probabilities_is_1.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 139px; height: 40px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TTktTIJdqkI/AAAAAAAACyk/sIUtdoSchiQ/s200/sum_probabilities_is_1.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5564528621220899394" /&gt;&lt;/a&gt;

&lt;p&gt;R's &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/sample.html"&gt;sample&lt;/a&gt; function implements these definitions. For example, let's ask R to create a six-sided die for us.&lt;/p&gt;

&lt;pre class="codebox"&gt;
p.die &amp;lt;- rep(1/6,6)
sum(p.die)
&lt;/pre&gt;

&lt;p&gt;Now, let's roll it 10 times.&lt;/p&gt;

&lt;pre class="codebox"&gt;
die &amp;lt;- 1:6
sample(die, size=10, prob=p.die, replace=T)
&lt;/pre&gt;

&lt;p&gt;Now, let's roll 1000 dice and plot the results.&lt;/p&gt;

&lt;pre class="codebox"&gt;
s &amp;lt;- table(sample(die, size=1000, prob=p.die, replace=T))
lbls = sprintf(&amp;quot;%0.1f%%&amp;quot;, s/sum(s)*100)
barX &amp;lt;- barplot(s, ylim=c(0,200))
text(x=barX, y=s+10, label=lbls)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TTktwTFEcII/AAAAAAAACys/wgNmWR1pR0s/s1600/rolling_dice.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 287px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TTktwTFEcII/AAAAAAAACys/wgNmWR1pR0s/s400/rolling_dice.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5564529122371465346" /&gt;&lt;/a&gt;

&lt;dl&gt;
  &lt;dt&gt;Expected value&lt;/dt&gt;
  &lt;dd&gt;Expected value (or population mean) of a discrete random variable &lt;i&gt;X&lt;/i&gt; is the weighted average of the values in the range of &lt;i&gt;X&lt;/i&gt;.&lt;/dd&gt;
&lt;/dl&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TTkuARhhn0I/AAAAAAAACy0/p1F4AhsC10A/s1600/expected_value.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 184px; height: 40px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TTkuARhhn0I/AAAAAAAACy0/p1F4AhsC10A/s200/expected_value.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5564529396831854402" /&gt;&lt;/a&gt;

&lt;p&gt;In the case of our six-sided die, the expected value is 3.5, computed like so:&lt;/p&gt;
&lt;pre class="codebox"&gt;sum(die*p.die)&lt;/pre&gt;

&lt;p&gt;Things change a bit when we move from &lt;b&gt;discrete&lt;/b&gt; to &lt;b&gt;continuous&lt;/b&gt; random variables. A &lt;b&gt;continuous&lt;/b&gt; random variable is described by a probability density function. If &lt;i&gt;f(x)&lt;/i&gt; is the probability density of a random variable &lt;i&gt;X&lt;/i&gt;, &lt;i&gt;P(X&amp;#8804;b)&lt;/i&gt; is the area under &lt;i&gt;f(x)&lt;/i&gt; and to the left of &lt;i&gt;b&lt;/i&gt;. The total area under &lt;i&gt;f(x)&lt;/i&gt; is 1. &lt;i&gt;f(x) &amp;#8805; 0&lt;/i&gt; for all possible values of &lt;i&gt;X&lt;/i&gt;. &lt;i&gt;P(X&amp;gt;b) = 1 - P(X&amp;#8804;b)&lt;/i&gt;. The expected value is the balance point of the density, its center of mass. The point with exactly half of the total area under &lt;i&gt;f(x)&lt;/i&gt; on each side is the median, which coincides with the expected value when the density is symmetric but generally not otherwise.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TTkuPr9-szI/AAAAAAAACy8/DdgGg9RoNIE/s1600/p_x_lteq_b.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 299px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TTkuPr9-szI/AAAAAAAACy8/DdgGg9RoNIE/s400/p_x_lteq_b.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5564529661628560178" /&gt;&lt;/a&gt;
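
&lt;p&gt;As a rough numerical check of these properties, here is a small sketch using the same log-normal density as the figure (meanlog=0, sdlog=1, with &lt;i&gt;b&lt;/i&gt;=2). The variable names are only illustrative. It uses R's integrate to measure areas and compares the result with the built-in distribution function plnorm. It also shows that, for a skewed density like this one, the expected value and the median are different points.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# density from the figure: log-normal with meanlog=0, sdlog=1
f &amp;lt;- function(x) dlnorm(x, meanlog=0, sdlog=1)
b &amp;lt;- 2

# total area under the density, which should be 1
total.area &amp;lt;- integrate(f, lower=0, upper=Inf)$value

# P(X at most b) as the area to the left of b, and the same from the built-in CDF
p.left &amp;lt;- integrate(f, lower=0, upper=b)$value
p.cdf &amp;lt;- plnorm(b, meanlog=0, sdlog=1)

# P(X greater than b) is the complement
p.right &amp;lt;- 1 - p.left

# expected value: the integral of x * f(x), analytically exp(1/2) for this density
ev &amp;lt;- integrate(function(x) x * f(x), lower=0, upper=Inf)$value

# median: the point with half of the area on each side, exp(0) = 1 here
med &amp;lt;- qlnorm(0.5, meanlog=0, sdlog=1)

c(total.area=total.area, p.left=p.left, p.cdf=p.cdf, p.right=p.right, ev=ev, med=med)
&lt;/pre&gt;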

&lt;p&gt;A &lt;b&gt;random sample&lt;/b&gt; is a sequence of independent identically distributed random variables. A value derived from a random sample, such as sample mean, sample standard deviation, etc. is called a &lt;b&gt;statistic&lt;/b&gt;. When we compute statistics of samples, our hope is that the sample statistic is not too far off from the equivalent measurement of the whole population.&lt;/p&gt;

&lt;p&gt;The interesting thing is that derived statistics are also random variables. If we roll our die &lt;i&gt;n&lt;/i&gt; times, we have taken a random sample of size &lt;i&gt;n&lt;/i&gt;. That sample can be summarized by computing its mean, denoted by &lt;i&gt;X bar&lt;/i&gt;.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TTo-1cvGGNI/AAAAAAAACzE/rCcQPXXNnww/s1600/x_bar.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 200px; height: 18px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TTo-1cvGGNI/AAAAAAAACzE/rCcQPXXNnww/s200/x_bar.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5564829377537382610" /&gt;&lt;/a&gt;

&lt;p&gt;The sample mean is, itself, a random variable with its own distribution. So, let's take a look at that distribution. Because we'll use it later, let's define a function that generates a bunch of samples of size &lt;i&gt;n&lt;/i&gt; and computes their means. It returns a vector of sample means.&lt;/p&gt;

&lt;pre class="codebox"&gt;
generate.sample.means &amp;lt;- function(n) {
  sample.means &amp;lt;- numeric()
  for (i in 1:1000) { 
    sample.means &amp;lt;- append(sample.means, sum(sample(die, size=n, prob=p.die, replace=T))/n)
  }
  return (sample.means)
}
&lt;/pre&gt;
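
&lt;p&gt;The loop above grows a vector one element at a time, which works fine at this scale. As an alternative sketch, the same idea can be written more compactly with replicate. The function name generate.sample.means.v, the reps parameter and the seed are my own choices for illustration. The die is redefined so that the block runs on its own.&lt;/p&gt;

&lt;pre class="codebox"&gt;
die &amp;lt;- 1:6
p.die &amp;lt;- rep(1/6, 6)

# hypothetical variant: draw reps samples of size n and return their means
generate.sample.means.v &amp;lt;- function(n, reps=1000) {
  replicate(reps, mean(sample(die, size=n, prob=p.die, replace=TRUE)))
}

set.seed(42)  # arbitrary seed, only to make the run repeatable
m &amp;lt;- generate.sample.means.v(100)
length(m)
mean(m)
&lt;/pre&gt;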

&lt;pre class="codebox"&gt;
sample.means &amp;lt;- generate.sample.means(100)
plot(density(sample.means), main=&amp;quot;Distribution of sample means&amp;quot;,xlab=&amp;quot;sample mean&amp;quot;, col=&amp;quot;orange&amp;quot;)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TTp-NbjdJdI/AAAAAAAACzU/DXRT3cPQSNw/s1600/sample_means.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 264px;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TTp-NbjdJdI/AAAAAAAACzU/DXRT3cPQSNw/s400/sample_means.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5564899058769536466" /&gt;&lt;/a&gt;

&lt;p&gt;Not coincidentally, it fits pretty closely to a normal distribution (dashed blue line). Its mean is about the same as that of the parent population, namely right around 3.5. The standard deviation of the sample means can be derived by dividing the standard deviation of the parent population by the square root of the sample size: &lt;i&gt;&amp;sigma; / &amp;radic; n&lt;/i&gt;.&lt;/p&gt;

&lt;p&gt;Let's compare the mean and &lt;i&gt;sd&lt;/i&gt; of our sample with predicted values. We compute the standard deviation of our parent population of die rolls by squaring the deviation from the mean for each possible value, averaging those squares and taking the square root. Dividing that by the square root of our sample size gives the predicted &lt;i&gt;sd&lt;/i&gt; of our sample means, about 0.17, which closely matches the actual &lt;i&gt;sd&lt;/i&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&gt; mean(sample.means)
[1] 3.49002
&gt; sd(sample.means)
[1] 0.1704918
&gt; sqrt(sum( (1:6-3.5)^2 ) / 6) / sqrt(100)
[1] 0.1707825
&lt;/pre&gt;

&lt;p&gt;To overlay the normal distribution on the plot above, we used R's dnorm function like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
x = seq(3,4,0.01)
lines(x=x,y=dnorm(x,mean=3.5,sd=0.1707825), col=rgb(0x33,0x66,0xAA,0x90,maxColorValue=255), type="l", lty=2)
&lt;/pre&gt;

&lt;p&gt;Inspection of the formula for the standard deviation of sample means supports our common sense intuition that a bigger sample will more likely reflect the whole population. In particular, as the size of our sample goes up, our estimated mean is more likely to be closer to the parent population mean. This idea is known as the &lt;b&gt;law of large numbers&lt;/b&gt;. We can show that it works by creating similar plots with increasing &lt;i&gt;n&lt;/i&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
sample.means &amp;lt;- generate.sample.means(100)
plot(density(sample.means), main=&amp;quot;Distribution of sample means&amp;quot;, xlab=&amp;quot;sample mean&amp;quot;, col=&amp;quot;yellow&amp;quot;, xlim=c(3.2,3.8), ylim=c(0,8))
sample.means &amp;lt;- generate.sample.means(500)
lines(density(sample.means), col=&amp;quot;orange&amp;quot;)
sample.means &amp;lt;- generate.sample.means(1000)
lines(density(sample.means), col=&amp;quot;red&amp;quot;)
legend(3.6,7,c(&amp;quot;n=100&amp;quot;,&amp;quot;n=500&amp;quot;,&amp;quot;n=1000&amp;quot;), fill=c(&amp;quot;yellow&amp;quot;, &amp;quot;orange&amp;quot;, &amp;quot;red&amp;quot;))
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TTp-jQar1JI/AAAAAAAACzc/C-aY7TO5WHQ/s1600/sample_means_converging.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 264px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TTp-jQar1JI/AAAAAAAACzc/C-aY7TO5WHQ/s400/sample_means_converging.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5564899433737082002" /&gt;&lt;/a&gt;
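
&lt;p&gt;The plots show the narrowing visually. As a rough numerical companion, the sketch below compares the observed standard deviation of 1000 sample means with the predicted &lt;i&gt;&amp;sigma; / &amp;radic; n&lt;/i&gt; for each &lt;i&gt;n&lt;/i&gt;. It redefines the die so it runs on its own; the seed and variable names are arbitrary choices of mine.&lt;/p&gt;

&lt;pre class="codebox"&gt;
die &amp;lt;- 1:6
p.die &amp;lt;- rep(1/6, 6)

# standard deviation of the parent population of die rolls
sigma &amp;lt;- sqrt(sum((die - 3.5)^2 * p.die))

set.seed(1)  # arbitrary seed, only to make the run repeatable
ns &amp;lt;- c(100, 500, 1000)

# observed sd of 1000 sample means for each sample size
observed &amp;lt;- sapply(ns, function(n) {
  sd(replicate(1000, mean(sample(die, size=n, prob=p.die, replace=TRUE))))
})

# predicted sd of the sample means
predicted &amp;lt;- sigma / sqrt(ns)

round(cbind(n=ns, observed, predicted), 4)
&lt;/pre&gt;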

&lt;p&gt;What we've just seen in action is the &lt;b&gt;central limit theorem&lt;/b&gt;, which states that for any parent population with mean &lt;i&gt;&amp;mu;&lt;/i&gt; and finite standard deviation &lt;i&gt;&amp;sigma;&lt;/i&gt;, the sampling distribution of the sample mean for large &lt;i&gt;n&lt;/i&gt; is approximately a normal distribution with mean &lt;i&gt;&amp;mu;&lt;/i&gt; and standard deviation &lt;i&gt;&amp;sigma; / &amp;radic; n&lt;/i&gt;. Putting that in terms of the standard normal distribution &lt;i&gt;Z&lt;/i&gt; gives:&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TTp-FI5baEI/AAAAAAAACzM/Xq9TfHCbJk8/s1600/central_limit_theorem.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 200px; height: 40px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TTp-FI5baEI/AAAAAAAACzM/Xq9TfHCbJk8/s200/central_limit_theorem.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5564898916322469954" /&gt;&lt;/a&gt;

&lt;p&gt;It's a little surprising that the normal distribution arises out of the means of samples from a discrete uniform distribution. More surprisingly, samples from &lt;i&gt;any&lt;/i&gt; parent distribution with finite variance give rise to the normal distribution in the same way. Next, we'll look at several widely used families of distributions and R's functions for working with them.&lt;/p&gt;
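
&lt;p&gt;To try that claim on a parent distribution that looks nothing like a bell curve, here is a minimal sketch using the exponential distribution, which is strongly skewed and has mean and standard deviation both equal to 1/rate. The choice of distribution, sample size and seed is mine, not the book's. The sample means are standardized as in the formula above and compared with the standard normal density.&lt;/p&gt;

&lt;pre class="codebox"&gt;
set.seed(2011)  # arbitrary seed, only to make the run repeatable
n &amp;lt;- 100
rate &amp;lt;- 1       # exponential parent: mean 1/rate, sd 1/rate

# 1000 sample means, each from a sample of size n
means &amp;lt;- replicate(1000, mean(rexp(n, rate=rate)))

# standardize: subtract the parent mean, divide by sigma / sqrt(n)
z &amp;lt;- (means - 1/rate) / ((1/rate) / sqrt(n))

# should be near 0 and 1 respectively
mean(z)
sd(z)

plot(density(z), main="Standardized sample means, exponential parent", xlab="z", col="orange")
curve(dnorm(x), add=TRUE, lty=2, col="blue")
&lt;/pre&gt;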

&lt;h4&gt;Notes&lt;/h4&gt;

&lt;p&gt;The graph above shaded in orange serving as an example probability density function is produced with the following R code:&lt;/p&gt;

&lt;pre class="codebox"&gt;
plot(x=c(seq(0,5,0.01)), y=c(dlnorm(x=seq(0,5,0.01),meanlog=0,sdlog=1)), type="l", xlab='', ylab='', yaxt="n", xaxt="n", bty="n")
polygon(x=c(seq(0,2,0.01),2), y=c(dlnorm(x=seq(0,2,0.01),meanlog=0,sdlog=1),0), col="orange")
mtext('b', at=c(2), side=1)
text(0.6,0.2,"P(X≤b)")
abline(h=0)  # horizontal line at y=0 to stand in for the x axis
&lt;/pre&gt;

&lt;h4&gt;More Using R for Introductory Statistics&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="/2010/04/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapters 1 and 2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/05/using-r-for-introductory-statistics-31.html"&gt;Using R for Introductory Statistics 3.1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/06/using-r-for-introductory-statistics-32.html"&gt;Using R for Introductory Statistics 3.2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics-33.html"&gt;Using R for Introductory Statistics 3.3&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/08/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapter 3.4&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2010/12/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapter 4&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="/2011/01/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapter 4, Model Formulae&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3997724039033699249?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3997724039033699249/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/01/using-r-for-introductory-statistics_23.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3997724039033699249'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3997724039033699249'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/01/using-r-for-introductory-statistics_23.html' title='Using R for Introductory Statistics, Chapter 5'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_dbECP0yvozc/TTyZJCwLssI/AAAAAAAACzk/9rau33GO7TQ/s72-c/dice.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-6504452388717551877</id><published>2011-01-09T22:40:00.000-08:00</published><updated>2011-01-09T23:06:44.314-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title 
type='text'>Using R for Introductory Statistics, Chapter 4, Model Formulae</title><content type='html'>&lt;p&gt;Several R functions take &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html"&gt;model formulae&lt;/a&gt; as parameters. Model formulae are symbolic expressions. They define a relationship between variables rather than an arithmetic expression to be evaluated immediately. Model formulae are defined with the &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/tilde.html"&gt;tilde operator&lt;/a&gt;. A simple model formula looks like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;response ~ predictor&lt;/pre&gt;

&lt;p&gt;Functions that accept formulae typically also take a &lt;b&gt;&lt;i&gt;data&lt;/i&gt;&lt;/b&gt; argument to specify a data frame in which to look up model variables and a &lt;b&gt;&lt;i&gt;subset&lt;/i&gt;&lt;/b&gt; argument to select certain rows in the data frame.&lt;/p&gt;

&lt;p&gt;We've already seen model formulae used for &lt;a href="/2010/08/using-r-for-introductory-statistics.html"&gt;simple linear regression&lt;/a&gt; and with &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/plot.formula.html"&gt;plot&lt;/a&gt; and &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/boxplot.html"&gt;boxplot&lt;/a&gt;, to show that &lt;a href="/2010/12/using-r-for-introductory-statistics.html"&gt;American cars are heavy gas guzzlers&lt;/a&gt;. Two common forms of formula are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;y ~ x where x and y are numeric&lt;/li&gt;
&lt;li&gt;x ~ f where x is numeric and f is a factor&lt;/li&gt;
&lt;/ul&gt;
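
&lt;p&gt;As a quick illustration of both forms, along with the &lt;i&gt;data&lt;/i&gt; and &lt;i&gt;subset&lt;/i&gt; arguments mentioned above, here is a small sketch. It uses the mtcars data frame that ships with R, which is my choice for a self-contained example and not a dataset from the book.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# y ~ x with both numeric: regress fuel economy on weight
fit &amp;lt;- lm(mpg ~ wt, data=mtcars)
coef(fit)

# same formula, restricted to 4-cylinder cars with the subset argument
fit4 &amp;lt;- lm(mpg ~ wt, data=mtcars, subset=cyl == 4)
coef(fit4)

# x ~ f with numeric x and factor f: one box per number of cylinders
boxplot(mpg ~ factor(cyl), data=mtcars, xlab="cylinders", ylab="mpg")
&lt;/pre&gt;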

&lt;p&gt;The &lt;i&gt;Lattice&lt;/i&gt; graphics package can accept more complicated model formulas of this form:&lt;/p&gt;

&lt;pre class="codebox"&gt;response ~ predictor | condition&lt;/pre&gt;

&lt;p&gt;We'll try this out with a dataset called &lt;i&gt;kid.weights&lt;/i&gt; from the &lt;i&gt;UsingR&lt;/i&gt; package. We get age, weight, height and gender for 250 kids ranging from 3 months to 12 years old.&lt;/p&gt;
&lt;pre class="codebox"&gt;
library(UsingR)
library(lattice)
dim(kid.weights)
[1] 250   4
&lt;/pre&gt;

&lt;p&gt;We expect weight and height to be related, but we're wondering if this relationship changes over time as kids grow. Often, when we want to condition on a quantitative variable (like age), we turn it into a categorical variable by binning. Here, we'll create 4 bins by taking age in 3 year intervals.&lt;/p&gt;

&lt;pre class="codebox"&gt;
age.classes = cut(kid.weights$age/12, 3*(0:4))
unique(age.classes)
[1] (3,6]  (6,9]  (9,12] (0,3] 
Levels: (0,3] (3,6] (6,9] (9,12]
&lt;/pre&gt;

&lt;p&gt;With age as a factor, we can express our question as the model formula:&lt;/p&gt;

&lt;pre class="codebox"&gt;height ~ weight | age.classes&lt;/pre&gt;

&lt;p&gt;The lattice graphics function &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/lattice/html/xyplot.html"&gt;xyplot&lt;/a&gt; accepts this kind of formula and draws a panel for each level of the conditioning variable. The panels contain scatterplots of the response and predictor, in this case height and weight, divided into subsets by the conditioning variable. The book shows a little trick that lets us customize xyplot, adding a regression line to each scatterplot.&lt;/p&gt;

&lt;pre class="codebox"&gt;
plot.regression = function(x,y) {
  panel.xyplot(x,y)
  panel.abline(lm(y~x))
}
&lt;/pre&gt;

&lt;p&gt;We pass the helper function &lt;i&gt;plot.regression&lt;/i&gt; as a custom panel function in xyplot.&lt;/p&gt;

&lt;pre class="codebox"&gt;
xyplot( height ~ weight | age.classes, data=kid.weights, panel=plot.regression)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TSqvXva5LjI/AAAAAAAACyM/ypRVPkQ3JOw/s1600/height_weight_age.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TSqvXva5LjI/AAAAAAAACyM/ypRVPkQ3JOw/s400/height_weight_age.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5560449512343416370" /&gt;&lt;/a&gt;

&lt;p&gt;There's quite a bit more to model formulae, but that's all I've figured out so far.&lt;/p&gt;

&lt;h4&gt;More on formulae&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Chapter 11 &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Statistical-models-in-R"&gt;Statistical models in R&lt;/a&gt; from &lt;i&gt;An Introduction to R&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wiener.math.csi.cuny.edu/st/stRmanual/ModelFormula.html"&gt;R's model formula&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/AsIs.html"&gt;I&lt;/a&gt;() can be used to insulate arithmetic expressions within formulae.&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-6504452388717551877?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/6504452388717551877/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/01/using-r-for-introductory-statistics.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6504452388717551877'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6504452388717551877'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2011/01/using-r-for-introductory-statistics.html' title='Using R for Introductory Statistics, Chapter 4, Model Formulae'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_dbECP0yvozc/TSqvXva5LjI/AAAAAAAACyM/ypRVPkQ3JOw/s72-c/height_weight_age.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-7719671436088463406</id><published>2010-12-27T12:24:00.000-08:00</published><updated>2010-12-27T12:53:44.958-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='seattle'/><category scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Bioinformatics'/><title type='text'>Cloud bioinformatics</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TRj2TLSU3vI/AAAAAAAACx8/PdJuKI1aTmc/s1600/cloud_computing_kitchen_sink.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 286px;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TRj2TLSU3vI/AAAAAAAACx8/PdJuKI1aTmc/s400/cloud_computing_kitchen_sink.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5555460949669109490" /&gt;&lt;/a&gt;

&lt;p&gt;Personally, the thing I love about cloud computing is never having to ask permission. There's no ops guy or pointy-haired boss between me and the launch-instance button. As lovely as that is, the cloud is also a powerful tool for &lt;b&gt;scientific computing&lt;/b&gt;, particularly bioinformatics.&lt;/p&gt;

&lt;p&gt;Next-gen sequencing, which can produce gigabytes per day, is one factor pushing bioinformatics into the cloud. Data analysis is now the major bottleneck for sequencing-based experiments. Labs are finding that generating sequencing data is becoming cheaper than processing it. According to the &lt;a href="http://labs.pathology.wisc.edu/oconnor/"&gt;Dave O’Connor Lab&lt;/a&gt; at the University of Wisconsin's Department of Pathology and Laboratory Medicine, "&lt;span style="font-style:italic;"&gt;There is a real disconnect between the ability to collect next-generation sequence data (easy) and the ability to analyze it meaningfully (hard)&lt;/span&gt;."&lt;/p&gt;

&lt;p&gt;O'Connor's group works with &lt;a href="http://www.labkey.com/"&gt;LabKey Software&lt;/a&gt;, a Seattle-based bioinformatics software company founded by the &lt;a href="http://fhcrc.org/"&gt;Fred Hutchinson Cancer Research Center&lt;/a&gt;. LabKey develops open-source data management software for proteomics, flow cytometry, plate-based assay, and HIV vaccine study data, described in a &lt;a href="http://www.labkey.com/news/news-releases/nr-11-30-2010"&gt;presentation&lt;/a&gt; by Lead Developer Adam Rauch. Their technology stack seems to include: Java, Spring, GWT, Lucene and Guava (aka Google Collections). LabKey integrates with the impressive &lt;a href="http://galaxy.psu.edu/"&gt;Galaxy&lt;/a&gt; genomics workflow system and the &lt;a href="http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP"&gt;Trans-Proteomic Pipeline (TPP)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A good part of modern biology boils down to &lt;strong&gt;mining biological data&lt;/strong&gt;, with the goal of correlating sequence, transcription or peptides to outputs like function, phenotype or disease. Machine learning and statistical modeling tend toward long-running CPU-intensive jobs that get run intermittently as new data arrives, making them ideal candidates for the cloud.&lt;/p&gt;

&lt;p&gt;Amazon's &lt;a href="http://aws.amazon.com/ec2/"&gt;EC2&lt;/a&gt; seems to be better positioned than either Microsoft's Azure or Google's &lt;a href="http://code.google.com/appengine/"&gt;AppEngine&lt;/a&gt; for scientific computing. Amazon has been ahead of the curve in seeing the &lt;a href="http://www.xconomy.com/national/2010/07/06/amazon-with-rented-server-space-in-the-cloud-sees-opportunity-in-genomic-data-overload/"&gt;opportunity in genomic data overload&lt;/a&gt;. Microsoft has made some welcome efforts to attract scientific computing, including the &lt;a href="http://blogs.msdn.com/b/msr_er/archive/2010/07/09/microsoft-biology-foundation-available-for-free-download.aspx"&gt;Microsoft Biology Foundation&lt;/a&gt; and &lt;a href="http://blogs.msdn.com/b/windowsazure/archive/2010/02/04/microsoft-and-nsf-announce-client-cloud-computing-project-to-accelerate-scientific-discovery-and-foster-collaborative-research.aspx"&gt;grants for scientific computing in Azure&lt;/a&gt;. But they're fighting a headwind arising from proprietary licensing and a closed ecosystem. Oddly, considering Google's reputation for openness, AppEngine looks surprisingly restrictive. Research computing typically involves building and installing binaries, programming in an odd patchwork of languages, and running long, CPU-intensive tasks, none of which is particularly welcome on AppEngine. Maybe Google has a better offering in the works?&lt;/p&gt;

&lt;p&gt;It's worth noting that open-source software works without friction in cloud environments, while many proprietary vendors have been slow to adapt their licensing models to on-demand scaling. For example, lots of folks are using R for machine learning in the cloud, while MATLAB is still bogged down in &lt;a href="http://aws.typepad.com/aws/2008/11/parallel-comput.html?cid=139736866"&gt;licensing issues&lt;/a&gt;. With proprietary licenses, the not-having-to-ask-permission aspect is lost.&lt;/p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://radar.oreilly.com/2010/08/points-of-control-the-web-20-s.html"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 312px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TRj7-dPVQSI/AAAAAAAACyE/OkuR_fCyD7Q/s400/points_of_control_map.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5555467190780903714" /&gt;&lt;/a&gt;

&lt;p&gt;According to &lt;a href="http://www.xconomy.com/seattle/"&gt;Xconomy&lt;/a&gt;, Seattle has a &lt;a href="http://www.xconomy.com/seattle/2010/08/03/seattles-growing-advantage-in-the-cloud/"&gt;growing advantage in the cloud&lt;/a&gt;. There are several Seattle companies operating in the bioinformatics and cloud spaces. Sage Bionetworks, also linked to FHCRC, was founded by Eric Schadt, &lt;a href="http://www.pacificbiosciences.com/aboutus/leadership_team/executives/eric_e_schadt"&gt;also of Pacific Biosciences&lt;/a&gt;, and &lt;a href="http://www.xconomy.com/seattle/2009/08/06/stephen-friend-leaving-high-powered-merck-gig-lights-the-fire-for-open-source-biology-movement/"&gt;Stephen Friend&lt;/a&gt; former founder of Rosetta Inpharmatics. &lt;a href="http://www.revolutionanalytics.com/"&gt;Revolution Analytics&lt;/a&gt; sells a scalable variant of R for all kinds of applications including life sciences. &lt;a href="/2010/01/analytics-in-seattle.html"&gt;Seattle hosts a lot of activity in analytics&lt;/a&gt;, cloud computing and biotechnology, which will keep Seattle on the technology map for some time to come.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-7719671436088463406?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/7719671436088463406/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/12/cloud-bioinformatics.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/7719671436088463406'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/7719671436088463406'/><link rel='alternate' 
type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/12/cloud-bioinformatics.html' title='Cloud bioinformatics'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_dbECP0yvozc/TRj2TLSU3vI/AAAAAAAACx8/PdJuKI1aTmc/s72-c/cloud_computing_kitchen_sink.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4325928617471395210</id><published>2010-12-12T17:35:00.000-08:00</published><updated>2011-06-17T11:34:52.558-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for Introductory Statistics, Chapter 4</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TQV5SBTa2KI/AAAAAAAACxk/KgtUFtpRZ50/s1600/ferrari-dino-246-gt-5w.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 150px; height:100px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TQV5SBTa2KI/AAAAAAAACxk/KgtUFtpRZ50/s400/ferrari-dino-246-gt-5w.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5549975466298497186" /&gt;&lt;/a&gt;

&lt;p&gt;Chapter 4 of &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt; gets us started working with multivariate data. The question is: what are the relationships among the variables? One way to go about answering it is by pairwise comparison of variables. Another technique is to divide the data into categories by the values of some variables and analyze the remaining variables within each category. Different facets of the data can be encoded with color, shape and position to create visualizations that show graphically the relationships between several variables.&lt;/p&gt;

&lt;p&gt;Taking variables one or two at a time, we can rely on our previous experience and apply our toolbox of &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/04/using-r-for-introductory-statistics.html"&gt;univariate&lt;/a&gt; and &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics-31.html"&gt;bivariate&lt;/a&gt; &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics-32.html"&gt;techniques&lt;/a&gt;, such as histograms, &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics-33.html"&gt;correlation&lt;/a&gt; and &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics.html"&gt;linear regression&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can also hold some variables constant and analyze the remaining variables in that context. Often, this involves conditioning on a categorical variable, as we did in Chapter 3 by &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics-33.html"&gt;splitting marathon finishing time into gender and age classes&lt;/a&gt;. As another example, the distribution of top speeds of Italian sports cars describes a dependent variable, top speed, conditioned on two categorical variables, country of origin (Italy) and category (sports car). Because they're so familiar, cars make a great example.&lt;/p&gt;

&lt;p&gt;R comes with a dataset called &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html"&gt;mtcars&lt;/a&gt; based on Motor Trend road tests for 32 cars in the 1973-74 model year. They recorded 11 statistics about each model of car. We can get a quick initial look using the &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/pairs.html"&gt;pairs&lt;/a&gt; function which plots a thumbnail scatterplot for every pair of variables. Pairs is designed to work on numbers, but they've coded categorical values as integers in this data, so it works.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; names(mtcars)
 [1] &amp;quot;mpg&amp;quot;  &amp;quot;cyl&amp;quot;  &amp;quot;disp&amp;quot; &amp;quot;hp&amp;quot;   &amp;quot;drat&amp;quot; &amp;quot;wt&amp;quot;   &amp;quot;qsec&amp;quot; &amp;quot;vs&amp;quot;   &amp;quot;am&amp;quot;   &amp;quot;gear&amp;quot; &amp;quot;carb&amp;quot;
&amp;gt; pairs(mtcars)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TQV5RUoyLtI/AAAAAAAACxE/bYSIfD8f9NU/s1600/pairs.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 333px;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TQV5RUoyLtI/AAAAAAAACxE/bYSIfD8f9NU/s400/pairs.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5549975454308511442" /&gt;&lt;/a&gt;

&lt;p&gt;Question 4.7 asks us to describe any trends relating weight, fuel efficiency, and number of cylinders. They also make the distinction between American made cars and imports, another categorical value along with cylinders. Let's make a two-panel plot. First, we'll make a &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/boxplot.html"&gt;boxplot&lt;/a&gt; comparing the distribution of mileage for imports and domestics. In the second panel, we'll combine all four variables.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# make a copy of mtcars &amp;#x27;cause we&amp;#x27;re going to add a column
cars &amp;lt;- mtcars

# add origin column as a factor to cars data frame
imports &amp;lt;- c(1:3, 8:14, 18:21, 26:28, 30:32)
origin &amp;lt;- rep(&amp;quot;domestic&amp;quot;, nrow(mtcars))
origin[imports] &amp;lt;- &amp;quot;import&amp;quot;
cars$origin &amp;lt;- factor(origin, levels=c(&amp;#x27;import&amp;#x27;, &amp;#x27;domestic&amp;#x27;))

# make a vector of colors to color the data points
us.col &amp;lt;- rgb(0,0,255,192, maxColorValue=255)
im.col &amp;lt;- rgb(0,255,0,192, maxColorValue=255)
col &amp;lt;- rep(us.col, nrow(cars))
col[imports] &amp;lt;- im.col

# set up a two panel plot
par(mfrow=c(1,2))
par(mar=c(5,4,5,1)+0.1)

# draw boxplot in first panel
boxplot(mpg ~ origin, data=cars, col=c(im.col, us.col), outpch=19, outcol=c(us.col, im.col), ylab=&amp;quot;mpg&amp;quot;)
grid(nx=NA, ny=NULL)

# draw scatterplot in second panel
par(mar=c(5,0.5,5,2)+0.1)
plot(mpg~wt, data=cars, col=col, yaxt=&amp;#x27;n&amp;#x27;, pch=as.character(cars$cyl), xlab=&amp;quot;weight (thousands of lbs)&amp;quot;)
grid(nx=NA, ny=NULL)

# fit a line describing mpg as a function of weight
res &amp;lt;- lm(mpg ~ wt, data=cars)
abline(res, col=rgb(255,0,0,64, maxColorValue=255), lty=2)

# return parameters to defaults
par(mar=c(5,4,4,2)+0.1)
par(mfrow=c(1,1))
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TQV5Rrr_5ZI/AAAAAAAACxM/kKzIXXHe3xY/s1600/mpg_3.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 333px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TQV5Rrr_5ZI/AAAAAAAACxM/kKzIXXHe3xY/s400/mpg_3.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5549975460496008594" /&gt;&lt;/a&gt;

&lt;p&gt;Domestics fare worse, but why? For one thing, the most fuel-efficient cars are all light imports with 4-cylinder engines. Domestic cars are heavier with bigger engines and get worse mileage. Other factors are certainly involved, but weight does a pretty good job of explaining fuel consumption, with a correlation coefficient of about -0.87.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; cor(cars$wt, cars$mpg)
[1] -0.8676594
&lt;/pre&gt;

&lt;p&gt;Of course, fuel economy is not all there is to life on the open road. What about speed? We have quarter mile times, which probably measure acceleration better than top speed. We might hunt for variables that explain quarter mile performance by doing scatterplots and looking for correlation. The single factor that best correlates with &lt;i&gt;qsec&lt;/i&gt; is horsepower. But a combination of factors does noticeably better: the power-to-weight ratio.&lt;/p&gt;

&lt;pre class="codebox"&gt;
rm(origin)
attach(cars)

# transmission type encoded by color
palette &amp;lt;- c(&amp;quot;#23809C&amp;quot;,&amp;quot;#7A1305&amp;quot;)
col &amp;lt;- sapply(cars$am, function(x) palette[as.integer(x)+1])

# 5 panels
par(mfrow=c(1,5), omi=c(0,0.6,0.5,0.2))
par(mar=c(5,0.5,4,0.5)+0.1)

# weight
plot(qsec ~ wt, ylab=&amp;quot;&amp;quot;, xlab=&amp;quot;weight&amp;quot;, col= col, pch=as.character(cars$cyl))
mtext(sprintf(&amp;quot;%.3f&amp;quot;,cor(qsec,wt)), side=3)
grid(nx=NA, ny=NULL)

# displacement
plot(qsec ~ disp, ylab=&amp;quot;&amp;quot;, yaxt=&amp;#x27;n&amp;#x27;, xlab=&amp;quot;displacement&amp;quot;, col= col, pch=as.character(cars$cyl))
mtext(sprintf(&amp;quot;%.3f&amp;quot;,cor(qsec,disp)), side=3)
axis(2,labels=F, tick=T)
grid(nx=NA, ny=NULL)

# displacement / weight
plot(qsec ~ I(disp/wt), ylab=&amp;quot;&amp;quot;, yaxt=&amp;#x27;n&amp;#x27;, xlab=&amp;quot;disp/wt&amp;quot;, col= col, pch=as.character(cars$cyl))
mtext(sprintf(&amp;quot;%.3f&amp;quot;,cor(qsec,disp/wt)), side=3)
axis(2,labels=F, tick=T)
grid(nx=NA, ny=NULL)

# power
plot(qsec ~ hp, ylab=&amp;quot;&amp;quot;, yaxt=&amp;#x27;n&amp;#x27;, xlab=&amp;quot;hp&amp;quot;, col= col, pch=as.character(cars$cyl))
mtext(sprintf(&amp;quot;%.3f&amp;quot;,cor(qsec,hp)), side=3)
axis(2,labels=F, tick=T)
grid(nx=NA, ny=NULL)

# power / weight
plot(qsec ~ I(hp/wt), ylab=&amp;quot;&amp;quot;, yaxt=&amp;#x27;n&amp;#x27;, xlab=&amp;quot;hp/wt&amp;quot;, col= col, pch=as.character(cars$cyl))
mtext(sprintf(&amp;quot;%.3f&amp;quot;,cor(qsec,hp/wt)), side=3)
axis(2,labels=F, tick=T)
grid(nx=NA, ny=NULL)

# restore defaults
par(mar=c(5,4,4,2)+0.1)
par(mfrow=c(1,1), omi=c(0,0,0,0))

# add titles and legend
title(&amp;quot;What factors influence acceleration?&amp;quot;)
mtext(&amp;quot;quarter mile time in seconds&amp;quot;, side=2, padj=-4)
legend(x=88,y=21.5,c(&amp;#x27;automatic&amp;#x27;,&amp;#x27;manual&amp;#x27;), fill=palette[c(1,2)], cex=0.6)

detach(cars)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TQV5RxbhGTI/AAAAAAAACxU/NeAtgeCBMP8/s1600/acceleration_transmission.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 277px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TQV5RxbhGTI/AAAAAAAACxU/NeAtgeCBMP8/s400/acceleration_transmission.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5549975462037494066" /&gt;&lt;/a&gt;

&lt;p&gt;The numbers above the scatterplots are correlation with &lt;i&gt;qsec&lt;/i&gt;. Using position, character, color, multi-part graphs and ratios we've managed to visualize 5 variables, which show increasingly good correlation with the variable we're trying to explain.&lt;/p&gt;

&lt;p&gt;Here's one theory that might emerge from staring at this data: 4-bangers with automatic transmission are slow. Here's another theory: there's an error in the data. Look at the slow 4-cylinder way at the top. Its quarter mile is nearly three seconds longer than the next slowest car. An outlier like that seems to need an explanation. That car, according to &lt;i&gt;mtcars&lt;/i&gt;, is the Mercedes 230. But the 230 is listed right next to the 240D - D for diesel. The 240D is a solid car. Many are still running. But they're famously slow. What are the odds that the times for these two cars got transposed?&lt;/p&gt;

&lt;p&gt;We can check if the data supports our theories about how cars work. For example, we might guess that car makers are likely to put bigger engines in heavier cars to maintain adequate performance at the expense of gas mileage. Comparing weight with numbers of cylinders in the scatterplot above supports this idea. Displacement measures the total volume of the cylinders and that must be closely related to the number of cylinders. Try &lt;span style="font-family:Courier,monospace;"&gt;plot(disp ~ as.factor(cyl), data=mtcars)&lt;/span&gt;. Displacement and carburetors are big determinants of the horsepower of an engine. Statisticians might be horrified, but try this: &lt;span style="font-family:Courier,monospace;"&gt;plot(hp ~ I(disp * carb), data=mtcars)&lt;/span&gt;. Multiplying displacement by carburetion is a quick and dirty hack, but in this case, it seems to work out well.&lt;/p&gt;

&lt;p&gt;Chapter 4 introduces parts of R's &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/01/r-type-system.html"&gt;type system&lt;/a&gt;, specifically, lists and data.frames, along with &lt;a href="http://digitheadslabnotebook.blogspot.com/2009/07/select-operations-on-r-data-frames.html"&gt;subsetting operations&lt;/a&gt; and the apply family of functions. I don't go into it here because that was the first thing I learned about R and if you're a programmer, you'll probably want to do the same. One thing the book doesn't cover at all, so far as I can tell, is clustering.&lt;/p&gt;

&lt;p&gt;Take a look at the plots above and see if you can't see the cars clustering into familiar categories: the two super-fast sports cars, the three land-yacht luxury cars weighing in at over 5000 pounds apiece, the 8-cylinder muscle cars, and the 4-cylinder econo-beaters. We don't have to do this by eye, because R has several clustering algorithms built in.&lt;/p&gt;

&lt;p&gt;Hierarchical clustering (&lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html"&gt;hclust&lt;/a&gt;) works by repeatedly merging the two most similar items or existing clusters together. Determining 'most similar' requires a measure of distance. In general, coming up with a good distance metric takes careful thought specific to the problem at hand, but let's live dangerously.&lt;/p&gt;

&lt;pre class="codebox"&gt;
&amp;gt; plot(hclust(dist(scale(mtcars))), main=&amp;#x27;Hierarchical clustering of cars&amp;#x27;, sub=&amp;#x27;1973-74 model year&amp;#x27;, xlab=&amp;#x27;Cars&amp;#x27;, ylab=&amp;quot;&amp;quot;)
&lt;/pre&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TQV5SNPZM1I/AAAAAAAACxc/iLEfQw56zV8/s1600/hclust.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 328px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TQV5SNPZM1I/AAAAAAAACxc/iLEfQw56zV8/s400/hclust.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5549975469502837586" /&gt;&lt;/a&gt;

&lt;p&gt;Not bad at all. The &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/scale.html"&gt;scale&lt;/a&gt; function helps out here by putting the columns on a common center and scale so the &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/dist.html"&gt;dist&lt;/a&gt; function ends up giving equal weight to each variable.&lt;/p&gt;

&lt;p&gt;It's easy to analyze the bejeezes out of something you already know, like cars. The trick is to get a similar level of insight out of something entirely new.&lt;/p&gt;

&lt;h4&gt;Links to more...&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.compbiome.com/2010/12/r-basic-r-skills-splitting-and-plotting.html"&gt;Basic R skills: splitting and plotting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4325928617471395210?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4325928617471395210/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/12/using-r-for-introductory-statistics.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4325928617471395210'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4325928617471395210'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/12/using-r-for-introductory-statistics.html' title='Using R for Introductory Statistics, Chapter 4'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_dbECP0yvozc/TQV5SBTa2KI/AAAAAAAACxk/KgtUFtpRZ50/s72-c/ferrari-dino-246-gt-5w.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-537020864842506636</id><published>2010-11-27T23:24:00.000-08:00</published><updated>2011-10-17T12:19:46.332-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='links'/><title type='text'>Git cheat sheet</title><content type='html'>&lt;p&gt;I'm trying to wrap my head around Git, Linus Torvalds's complicated 
but powerful distributed version control system. Here are some quick notes and a wad of links:&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Configure&lt;/b&gt;&lt;/p&gt;
&lt;pre class="codebox"&gt;
&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-config.html"&gt;git config&lt;/a&gt; --global user.name "John Q. Hacker"
&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-config.html"&gt;git config&lt;/a&gt; --global user.email "jqhacker@somedomain.com"
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Start a new empty repository&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-init.html"&gt;git init&lt;/a&gt;&lt;/pre&gt;

&lt;pre class="codebox"&gt;
mkdir fooalicious
cd fooalicious
git init
touch README
git add README
git commit -m 'first commit'
git remote add origin git@github.com:nastyhacks/foo.git
git push -u origin master
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Create a local copy of a remote repository&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-clone.html"&gt;git clone&lt;/a&gt; [remote-repository]&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Commit to local repository&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-commit.html"&gt;git commit&lt;/a&gt; -a -m "my message"&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Review previous commits&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-log.html"&gt;git log&lt;/a&gt; --name-only&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;See what branches exist&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-branch.html"&gt;git branch&lt;/a&gt; -v&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Switch to a different branch&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-checkout.html"&gt;git checkout&lt;/a&gt; [branch you want to switch to]&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Create a new branch and switch to it&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-checkout.html"&gt;git checkout&lt;/a&gt; -b [name of new branch]&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Merge&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-merge.html"&gt;git merge&lt;/a&gt; mybranch&lt;/pre&gt;
&lt;p&gt;merge the development in the branch "mybranch" into the current branch.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Show remote repositories tracked&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-remote.html"&gt;git remote -v&lt;/a&gt;&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Track a remote repository&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-remote.html"&gt;git remote add&lt;/a&gt; --track master origin git@github.com:jqhacker/foo.git&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Retrieve from a remote repository&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-fetch.html"&gt;git fetch&lt;/a&gt;&lt;/pre&gt;
&lt;p&gt;Git fetch grabs changes from the remote repository and puts them in your repository's object database. It also fetches branches from the remote repository and stores them as remote-tracking branches. Your own local branches are left untouched until you merge. (&lt;a href="http://stackoverflow.com/questions/3419658/understanding-git-fetch-then-merge"&gt;see this&lt;/a&gt;.)&lt;/p&gt;
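&lt;p&gt;To see that fetch updates the remote-tracking branch without moving your local branch, here's a minimal sketch that can be run in a throwaway directory. The repository names (&lt;i&gt;upstream&lt;/i&gt;, &lt;i&gt;mine&lt;/i&gt;), the file name and the user identity are all made up for illustration.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```shell
set -e  # stop at the first failing command

# Hypothetical sandbox: an "upstream" repo and a clone of it called "mine"
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q upstream
cd upstream
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "one" > notes.txt
git add notes.txt
git commit -q -m "first commit"
branch=$(git rev-parse --abbrev-ref HEAD)   # master or main, depending on git's config
cd ..

git clone -q upstream mine

# Someone commits to upstream after we cloned
cd upstream
echo "two" >> notes.txt
git commit -q -a -m "second commit"
upstream_head=$(git rev-parse HEAD)
cd ../mine

# Fetch: the remote-tracking branch moves, the local branch does not
before=$(git rev-parse HEAD)
git fetch -q origin
after=$(git rev-parse HEAD)
tracking=$(git rev-parse "origin/$branch")

echo "local HEAD before fetch: $before"
echo "local HEAD after fetch:  $after"
echo "origin/$branch is now:   $tracking"

# Commits that were fetched but not yet merged
git log --oneline "$branch..origin/$branch"
```
&lt;/pre&gt;
&lt;p&gt;The local HEAD should be the same before and after the fetch, while origin's branch points at the new upstream commit. A later &lt;i&gt;git merge&lt;/i&gt; of the remote-tracking branch brings the local branch up to date.&lt;/p&gt;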

&lt;p&gt;&lt;b&gt;Fetch and merge from a remote repository&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-pull.html"&gt;git pull&lt;/a&gt;&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Push to a remote repository&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/git-push.html"&gt;git push&lt;/a&gt;&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Pull changes from another fork&lt;/b&gt;&lt;/p&gt;
&lt;pre class="codebox"&gt;
git checkout -b otherguy-master master
git fetch https://github.com/otherguy/foo.git master
# a fetch from a URL leaves the fetched commit in FETCH_HEAD
git merge FETCH_HEAD

git checkout master
git merge otherguy-master
git push origin master
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Resolve merge conflict in favor of us/them&lt;/b&gt;&lt;/p&gt;
&lt;pre class="codebox"&gt;
git checkout --theirs another.txt
git checkout --ours some.file.txt
&lt;/pre&gt;
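&lt;p&gt;For context, here's a sketch of a whole conflicted merge resolved this way, assuming a scratch repository in a temp directory. The branch name &lt;i&gt;feature&lt;/i&gt;, the file contents and the user identity are invented for the example. During a merge, "ours" is the branch you're on and "theirs" is the branch being merged in.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```shell
set -e  # stop at the first failing command

# Hypothetical scratch repository
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Common ancestor
echo "original" > some.file.txt
echo "original" > another.txt
git add some.file.txt another.txt
git commit -q -m "base"
main_branch=$(git rev-parse --abbrev-ref HEAD)

# "Their" side changes both files on a branch
git checkout -q -b feature
echo "theirs" > some.file.txt
echo "theirs" > another.txt
git commit -q -a -m "their change"

# "Our" side changes the same lines
git checkout -q "$main_branch"
echo "ours" > some.file.txt
echo "ours" > another.txt
git commit -q -a -m "our change"

# The merge stops with conflicts; "|| true" keeps set -e from aborting
git merge feature > /dev/null 2> /dev/null || true

# Resolve one file each way, then stage and conclude the merge
git checkout --theirs another.txt
git checkout --ours some.file.txt
git add another.txt some.file.txt
git commit -q -m "merge feature: theirs for another.txt, ours for some.file.txt"

echo "another.txt:   $(cat another.txt)"
echo "some.file.txt: $(cat some.file.txt)"
```
&lt;/pre&gt;
&lt;p&gt;Note that checking out a side only rewrites the file in the working tree. The conflict isn't marked as resolved until the file is staged with &lt;i&gt;git add&lt;/i&gt;.&lt;/p&gt;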

&lt;p&gt;&lt;b&gt;Diff between local working directory and remote tracking branch&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Say you're working with Karen on a project. She adds some nifty features to the source
file &lt;i&gt;nifty_files/our_code.py&lt;/i&gt;. You'd like to diff your local working copy against hers to see
the changes, and prepare to merge them in. First, make sure you have a remote set up
for Karen's repo.&lt;/p&gt;
&lt;pre class="codebox"&gt;
git remote add karen git://github.com/karen/our_project.git
git remote -v
&lt;/pre&gt;

&lt;p&gt;The results ought to look something like this:&lt;/p&gt;
&lt;pre class="codebox"&gt;
karen git://github.com/karen/our_project.git (fetch)
karen git://github.com/karen/our_project.git (push)
origin git@github.com:cbare/our_project.git (fetch)
origin git@github.com:cbare/our_project.git (push)
&lt;/pre&gt;

&lt;p&gt;Next, fetch Karen's changes into your local repo. Git can't do a diff across the network, so we
have to get a local copy of Karen's commits stored in a remote tracking branch.&lt;/p&gt;
&lt;pre class="codebox"&gt;
git fetch karen
&lt;/pre&gt;

&lt;p&gt;Now, we can do our diff.&lt;/p&gt;
&lt;pre class="codebox"&gt;
git diff karen/master:nifty_files/our_code.py nifty_files/our_code.py
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Fixing a messed up working tree&lt;/b&gt;&lt;/p&gt;
&lt;pre&gt;git reset --hard HEAD&lt;/pre&gt;
&lt;p&gt;return the entire working tree and index to the last committed state, discarding uncommitted changes to tracked files&lt;/p&gt;
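&lt;p&gt;A quick sketch of what a hard reset does and doesn't touch, again in a throwaway repository with made-up file names and identity: staged and unstaged edits to tracked files are thrown away, while untracked files are left alone.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```shell
set -e  # stop at the first failing command

# Hypothetical scratch repository
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "good" > tracked.txt
git add tracked.txt
git commit -q -m "known good state"

# Make a mess: a staged edit to a tracked file plus a new untracked file
echo "broken" > tracked.txt
git add tracked.txt
echo "scratch" > untracked.txt

# Throw away all uncommitted changes to tracked files
git reset -q --hard HEAD

tracked_content=$(cat tracked.txt)
echo "tracked.txt now contains: $tracked_content"

# Whatever is still listed here was not touched by the reset
git status --short
```
&lt;/pre&gt;
&lt;p&gt;There's no undo for the discarded edits, so it's worth a glance at &lt;i&gt;git status&lt;/i&gt; before running this for real.&lt;/p&gt;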

&lt;p&gt;&lt;b&gt;Shorthand naming&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Branches, remote-tracking branches, and tags are all references to commits. Git allows shorthand, so you mostly use shorthand rather than full names:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The branch "test" is short for "refs/heads/test".&lt;/li&gt;
&lt;li&gt;The tag "v2.6.18" is short for "refs/tags/v2.6.18".&lt;/li&gt;
&lt;li&gt;"origin/master" is short for "refs/remotes/origin/master".&lt;/li&gt;
&lt;/ul&gt;
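&lt;p&gt;You can ask Git to expand a shorthand name with &lt;i&gt;git rev-parse --symbolic-full-name&lt;/i&gt;. Here's a small sketch in a scratch repository. The branch and tag names just mirror the list above, and the identity is a placeholder. Remote-tracking names are left out to keep it short.&lt;/p&gt;
&lt;pre class="codebox"&gt;
```shell
set -e  # stop at the first failing command

# Hypothetical scratch repository with one commit, a branch and a tag
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "hello" > README
git add README
git commit -q -m "first commit"
git checkout -q -b test
git tag v2.6.18

# Expand the shorthand names to full reference names
full_branch=$(git rev-parse --symbolic-full-name test)
full_tag=$(git rev-parse --symbolic-full-name v2.6.18)
echo "test    is short for $full_branch"
echo "v2.6.18 is short for $full_tag"

# Short and full names resolve to the same commit
short_commit=$(git rev-parse test)
long_commit=$(git rev-parse refs/heads/test)
echo "$short_commit"
echo "$long_commit"
```
&lt;/pre&gt;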


&lt;h4&gt;Links&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/user-manual.html"&gt;Git User’s Manual&lt;/a&gt;
Mostly cryptic but occasionally very helpful. See:
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#public-repositories"&gt;public repositories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#resolving-a-merge"&gt;resolving a merge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#git-quick-start"&gt;Quick Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#glossary"&gt;Git Glossary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.kernel.org/pub/software/scm/git/docs/gittutorial.html"&gt;Git Tutorial&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;github's &lt;a href="http://help.github.com/git-cheat-sheets/"&gt;Git cheat sheets&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;github on &lt;a href="http://help.github.com/remotes/"&gt;Working with remotes&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://book.git-scm.com/"&gt;The Git Community Book&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.eecs.harvard.edu/~cduan/technical/git/"&gt;Understanding Git Conceptually&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://git.or.cz/course/svn.html"&gt;Git - SVN Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://tomayko.com/writings/the-thing-about-git"&gt;The thing about Git&lt;/a&gt; by Ryan Tomayko&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.gitready.com/"&gt;git ready&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://stackoverflow.com/questions/1138990/git-equivalent-of-svn-status-u"&gt;git equivalent of svn status -u&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://longair.net/blog/2009/04/16/git-fetch-and-merge/"&gt;git: fetch and merge, don’t pull&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zrusin.blogspot.com/2007/09/git-cheat-sheet.html"&gt;Git Cheat Sheet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ontwik.com/git-github/mastering-git-basics-by-tom-preston-werner/"&gt;Mastering Git Basics&lt;/a&gt; video presentation by Tom Preston-Werner of GitHub&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-537020864842506636?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/537020864842506636/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/11/git-cheat-sheet.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/537020864842506636'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/537020864842506636'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/11/git-cheat-sheet.html' title='Git cheat sheet'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3455629402697991401</id><published>2010-11-15T00:35:00.000-08:00</published><updated>2010-11-23T07:55:04.197-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rant'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><title type='text'>Tech Industry Gossip</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TODxEwQUEpI/AAAAAAAACw8/lauo1-yhIpQ/s1600/20101023_wbc910.gif"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 179px; 
height: 320px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TODxEwQUEpI/AAAAAAAACw8/lauo1-yhIpQ/s320/20101023_wbc910.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5539692605640807058" /&gt;&lt;/a&gt;
&lt;blockquote style="font-style: italic;"&gt;&lt;a href="http://twitter.com/#!/phil_nash"&gt;Welcome to the new decade: Java is a restricted platform, Google is evil, Apple is a monopoly and Microsoft are the underdogs&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;I mostly try to ignore tech &lt;a href="http://www.cringely.com/"&gt;industry&lt;/a&gt; &lt;a href="http://techcrunch.com/"&gt;gossip&lt;/a&gt;. But there's been a lot of it lately, and underlying the shenanigans are big changes in the computing landscape.&lt;/p&gt;

&lt;p&gt;First, there's the flurry of lawsuits. This is illustrated nicely by the Economist in &lt;i&gt;&lt;a href="http://www.economist.com/node/17309237?story_id=17309237"&gt;The great patent battle&lt;/a&gt;&lt;/i&gt;. There are several &lt;a href="http://gizmodo.com/5656913/whos-suing-who-in-the-mobile-industry"&gt;similar&lt;/a&gt; &lt;a href="http://www.informationisbeautiful.net/2010/whos-suing-whom-in-the-telecoms-trade/"&gt;graphs&lt;/a&gt; of the &lt;a href="http://www.techdirt.com/blog/wireless/articles/20101007/22591311328/meet-the-patent-thicket-who-s-suing-who-for-smartphone-patents.shtml"&gt;patent thicket&lt;/a&gt; floating around. IP law is increasingly being used as a tool to lock customers in and competitors out. We can expect the (sarcastically named) "Citizens United" ruling on campaign finance to result in more of this particular kind of antisocial behavior.&lt;/p&gt;

&lt;p&gt;Google and Facebook are in a pitched battle over your personal data and &lt;a href="http://techcrunch.com/2010/09/01/google-making-extraordinary-counteroffers-to-stop-flow-of-employees-to-facebook/"&gt;engineering talent&lt;/a&gt;. Google engineers, apparently, are trying to jump over to Facebook prior to what promises to be a huge IPO.&lt;/p&gt;

&lt;p&gt;Apple caused quite a kerfuffle by &lt;a href="http://www.theregister.co.uk/2010/10/21/apple_threatens_to_kill_java_on_the_mac/"&gt;deprecating Java on Mac OS X&lt;/a&gt;. After remaining ominously silent for weeks, Oracle seems to have lined up both &lt;a href="http://blogs.oracle.com/henrik/2010/11/oracle_and_apple_announce_openjdk_project_for_osx.html"&gt;Apple&lt;/a&gt; and &lt;a href="http://www.infoq.com/news/2010/10/ibm-joins-openjdk"&gt;IBM&lt;/a&gt; behind OpenJDK. Apache Harmony looks to be a casualty of this maneuvering. Harmony, probably not coincidentally, is the basis for parts of Google's Android and &lt;a href="http://gigaom.com/2010/11/12/google-throws-the-kitchen-sink-at-oracle-in-android-java-suit/"&gt;Oracle is suing Google&lt;/a&gt; over Android's use of Java technology.&lt;/p&gt;

&lt;p&gt;Microsoft seems to be waning in importance along with the desktop in general. Ray Ozzie, Chief Architect since 2005, announced that he was leaving, following Robbie Bach of the Xbox division and Stephen Elop, now running Nokia. I spoke with one MS marketing guy who said of Ozzie, "Lost him? I'd say we got rid of him!" A less-noticed departure, that of Jython and IronPython creator &lt;a href="http://hugunin.net/microsoft_farewell.html"&gt;Jim Hugunin&lt;/a&gt;, may also be telling. Profitable stagnation seems to be the game plan there.&lt;/p&gt;

&lt;p&gt;Adobe's been struggling to the point where the NYTimes asked &lt;a href="http://bits.blogs.nytimes.com/2010/10/22/where-does-adobe-go-from-here/?ref=technology"&gt;where does Adobe go from here?&lt;/a&gt; They took a beating over Flash performance, and rumors circulated briefly of a buyout by Microsoft.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://blogs.hbr.org/bigshift/2010/09/cloud-computings-stormy-future.html"&gt;cloud&lt;/a&gt; is where a lot of the action in software development is moving. Mobile has been growing in importance by leaps and bounds ever since the launch of the iPhone. Cloud computing and consumer devices like smart phones, tablets, and even Kindles are complementary to a certain extent. The &lt;a href="http://mndoci.com/2008/04/16/the-logistics-and-economics-of-cloud-computing/"&gt;economics of cloud computing&lt;/a&gt; are hard to argue with. (See &lt;a href="http://www.mvdirona.com/jrh/work/"&gt;James Hamilton&lt;/a&gt;'s &lt;a href="http://www.mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Velocity20100623.pdf"&gt;slides&lt;/a&gt; and &lt;a href="http://www.youtube.com/watch?v=kHW-ayt_Urk"&gt;video&lt;/a&gt; on data centers.)&lt;/p&gt;

&lt;p&gt;Another part of what's changing is a swing of the pendulum away from openness and back towards the &lt;a href="http://www.econtalk.org/archives/2010/10/hazlett_on_appl.html"&gt;walled gardens&lt;/a&gt; that most of us thought were left behind in the ashes of CompuServe and AOL. Ironically enough, Apple has become the poster child of walled gardens, with iTunes and the App Store. The mobile carriers are even more so. And the cloud infrastructures of both Microsoft's Azure and (to a lesser degree) Google's App Engine are proprietary. Out of the big 3, Amazon's EC2 is, by far, the most open. &lt;a href="http://www.newyorker.com/reporting/2010/09/20/100920fa_fact_vargas"&gt;Mark Zuckerberg says&lt;/a&gt;, "I’m trying to make the world a more open place." But, to Tim Berners-Lee, &lt;a href="http://www.scientificamerican.com/article.cfm?id=long-live-the-web"&gt;Facebook and Apple threaten the internet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There's plenty of money in serving the bulk population. That's why Walmart is so huge. My fear is that in a rush to provide "sugar water" to consumers, the computing industry will neglect the creative people that made the industry so vibrant. But, not to worry. Ray Ozzie's essay &lt;a href="http://ozzie.net/docs/dawn-of-a-new-day/"&gt;Dawn of a new Day&lt;/a&gt; does a nice job of putting into perspective the embarrassment of riches that technology has yielded. We're just at the beginning of figuring out what to do with it all.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3455629402697991401?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3455629402697991401/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/11/tech-industry-gossip.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3455629402697991401'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3455629402697991401'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/11/tech-industry-gossip.html' title='Tech Industry Gossip'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/_dbECP0yvozc/TODxEwQUEpI/AAAAAAAACw8/lauo1-yhIpQ/s72-c/20101023_wbc910.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3277240931968028396</id><published>2010-10-19T23:09:00.000-07:00</published><updated>2010-10-21T14:02:28.451-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='messaging'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><title type='text'>Message queues with Python</title><content type='html'>&lt;p&gt;A while back, I wanted to build a web front end for a long-running python script. I started with a basic front end using &lt;a href="http://www.djangoproject.com/"&gt;Django&lt;/a&gt;. Django is a pleasantly straight-forward web framework, quite similar to Rails, easy to learn (with the help of the excellent and free &lt;a href="http://www.djangobook.com/"&gt;Django book&lt;/a&gt;), and generally trouble-free. &lt;a href="http://pylonshq.com/"&gt;Pylons&lt;/a&gt; is an alternate choice.&lt;/p&gt;

&lt;p&gt;Because the computation was fairly resource intensive, I thought to introduce a queue. The web-app could then concentrate on collecting the necessary input from the user and dumping a job on the queue, leaving the heavy lifting to a worker process (or several). We'd redirect the user to a status page where s/he could monitor progress and get results upon completion. Sounds simple enough, right? I figured my worker processes would look something like this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
big_chunka_data = load_big_chunka_data()
mo_data = load_mo_data()
queue = init_queue(&amp;quot;http://myserver.com:12345&amp;quot;, &amp;quot;user&amp;quot;, &amp;quot;pw&amp;quot;, &amp;quot;etc&amp;quot;)

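# loop until a shutdown message arrives
while True:
    try:
        message = queue.block_until_we_can_take_a_message()
        # hypothetical convention: a message carrying a shutdown key stops the worker
        if message.get(&amp;#x27;shutdown&amp;#x27;):
            break
        big_computation(message[&amp;#x27;param&amp;#x27;],
                        message[&amp;#x27;foo&amp;#x27;],
                        big_chunka_data,
                        mo_data)
    except Exception as e:
        # log and keep going, so one bad message doesn't kill the worker
        log_errors(e)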
&lt;/pre&gt;

&lt;p&gt;...and the whole pile-o-junk would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TL8u74_jYZI/AAAAAAAACwY/AryoD6QHmd8/s1600/workers_and_queue.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 199px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TL8u74_jYZI/AAAAAAAACwY/AryoD6QHmd8/s400/workers_and_queue.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5530190473880363410" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start up a handful of workers and a nice load balancing effect comes for free. Slow or heavily loaded workers will take fewer jobs, while faster workers take more. I was also hoping for a good answer to the question, "What happens when one of our workers dies?"&lt;/p&gt;
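
&lt;p&gt;As a rough illustration of that worker pattern, here is a minimal in-process sketch using only the Python 3 standard library (threads and queue.Queue) instead of a real message broker. The SHUTDOWN sentinel, the worker names, and the squaring step that stands in for big_computation are all assumptions made up for this example, not part of any queueing library.&lt;/p&gt;

&lt;pre class="codebox"&gt;
import queue
import threading

SHUTDOWN = None  # hypothetical sentinel meaning "no more work"

def worker(name, jobs, results):
    while True:
        message = jobs.get()  # blocks until a message is available
        if message is SHUTDOWN:
            break
        try:
            # stand-in for big_computation
            results.put((name, message["i"] ** 2))
        except Exception as e:
            # report the error instead of letting the worker die
            results.put((name, e))

def run(n_jobs=10, n_workers=3):
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=("w%d" % k, jobs, results))
               for k in range(n_workers)]
    for t in threads:
        t.start()
    for i in range(n_jobs):
        jobs.put({"i": i})
    for _ in threads:
        jobs.put(SHUTDOWN)  # one sentinel per worker
    for t in threads:
        t.join()
    # collect one result per job; which worker handled which job will vary
    return sorted(results.get()[1] for _ in range(n_jobs))

print(run())
&lt;/pre&gt;

&lt;p&gt;Each worker pulls the next job only when it is idle, which is where the load balancing comes from. A thread-based toy like this says nothing about what happens when a worker process dies, which is one reason to look at a real broker.&lt;/p&gt;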

&lt;h4&gt;Options&lt;/h4&gt;

&lt;p&gt;There are a ridiculous number of message queues to choose from. I looked at &lt;a href="http://kr.github.com/beanstalkd/"&gt;beanstalk&lt;/a&gt;, which is nice and simple, but its Python binding, &lt;a href="http://github.com/sophacles/pybeanstalk"&gt;pybeanstalk&lt;/a&gt;, seems to be out of date. There's &lt;a href="http://gearman.org/"&gt;gearman&lt;/a&gt;, from &lt;a href="http://danga.com/"&gt;Danga&lt;/a&gt;, the source of &lt;a href="http://memcached.org/"&gt;memcached&lt;/a&gt;. That looked fairly straightforward as well, although be careful to get the &lt;a href="http://github.com/samuel/python-gearman"&gt;newer Python binding&lt;/a&gt;. Python, itself, now offers the &lt;a href="http://docs.python.org/library/multiprocessing.html"&gt;multiprocessing module&lt;/a&gt;, which has a queue.&lt;/p&gt;

&lt;p&gt;One intriguing option is &lt;a href="http://www.zeromq.org/"&gt;ZeroMQ&lt;/a&gt; (aka 0MQ). It's message queueing without a queue. It's brokerless, meaning there's no external queue server process. Messages are routed in common MQ patterns right down at the network level. Of course, if you want store and forward, you're on your own for the persistence part. Still, very cool... Python bindings for ZeroMQ are found in &lt;a href="http://github.com/zeromq/pyzmq"&gt;pyzmq&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Several people on the &lt;a href="http://lists.seapig.org/pipermail/seattle-python/2010-July/003392.html"&gt;Seattle Python mailing list&lt;/a&gt; recommended &lt;a href="http://celeryproject.org/"&gt;Celery&lt;/a&gt;. After a (superficial) look, Celery seemed too RPC-ish for my taste. I'm probably being uptight, but when using a queue, I'd rather think in terms of sending a message than calling a function. That seems more decoupled and avoids making assumptions about the structure of the conversation and what's on the other side. I should probably lighten up. Celery is built on top of RabbitMQ, although it supports &lt;a href="http://ask.github.com/celery/tutorials/otherqueues.html"&gt;other options&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;RabbitMQ and Carrot&lt;/h4&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TL8vH1PhGGI/AAAAAAAACwg/Rg3O0o8lqxw/s1600/rabbitmq_logo.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 81px; height: 81px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TL8vH1PhGGI/AAAAAAAACwg/Rg3O0o8lqxw/s200/rabbitmq_logo.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5530190679032010850" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.rabbitmq.com/"&gt;RabbitMQ&lt;/a&gt;, now part of the &lt;a href="http://www.springsource.com/"&gt;SpringSource&lt;/a&gt; empire (in turn owned by VMWare), aims to compete with &lt;a href="http://activemq.apache.org/"&gt;Apache ActiveMQ&lt;/a&gt; as a full on enterprise messaging system based on the &lt;a href="http://www.amqp.org/"&gt;AMQP&lt;/a&gt; spec. I installed RabbitMQ using MacPorts, where you'll notice that RabbitMQ pulls in an absurd amount of dependencies.&lt;/p&gt;

&lt;pre class="codebox"&gt;
sudo port selfupdate
sudo port install rabbitmq-server
&lt;/pre&gt;

&lt;p&gt;For getting Python to talk to RabbitMQ, &lt;a href="http://github.com/ask/carrot"&gt;Carrot&lt;/a&gt; is a nice option. It was a bit confusing at first, but some nice folks on the &lt;a href="http://groups.google.com/group/carrot-users/browse_thread/thread/e470ec686726780b/35db85d0559758d5#35db85d0559758d5"&gt;carrot-users mailing list&lt;/a&gt; set me straight. Apparently, Carrot's author is working on a rewrite called &lt;a href="http://github.com/ask/kombu"&gt;Kombu&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's what worked for me. A producer sends Python dictionary objects, which get turned into JSON. My example code is only slightly modified from &lt;a href="http://ask.github.com/carrot/introduction.html#creating-a-connection"&gt;Creating a Connection&lt;/a&gt; in the &lt;a href="http://ask.github.com/carrot/"&gt;Carrot documentation&lt;/a&gt;. You'll need a little RabbitMQ terminology to understand the connection methods.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;b&gt;queues&lt;/b&gt; are addresses of receivers&lt;/li&gt;
&lt;li&gt;&lt;b&gt;exchanges&lt;/b&gt; are routers with their own process&lt;/li&gt;
&lt;li&gt;&lt;b&gt;virtual hosts&lt;/b&gt; are the unit of security&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Producer&lt;/h4&gt;
&lt;pre class="codebox"&gt;
from carrot.connection import BrokerConnection
from carrot.messaging import Publisher

conn = BrokerConnection(hostname=&amp;quot;localhost&amp;quot;, port=5672,
                          userid=&amp;quot;guest&amp;quot;, password=&amp;quot;guest&amp;quot;,
                          virtual_host=&amp;quot;/&amp;quot;)

publisher = Publisher(connection=conn,
                    exchange=&amp;quot;feed&amp;quot;, routing_key=&amp;quot;importer&amp;quot;)

for i in range(30):
   publisher.send({&amp;quot;name&amp;quot;:&amp;quot;foo&amp;quot;, &amp;quot;i&amp;quot;:i})
publisher.close()
&lt;/pre&gt;

&lt;p&gt;The consumers print out the messages as they arrive, then sleep for a bit to simulate long-running tasks. I tested by starting two consumers, one with a longer sleep time. Then I started a producer and saw that the slower consumer got fewer messages, which is what I expected. Note that setting prefetch_count to 1 is necessary to achieve this low-budget load balancing effect.&lt;/p&gt;

&lt;h4&gt;Consumer&lt;/h4&gt;
&lt;pre class="codebox"&gt;
import time
import sys
from carrot.connection import BrokerConnection
from carrot.messaging import Consumer

# supply an integer argument for sleep time to simulate long-running tasks
if (len(sys.argv) &amp;gt; 1):
    sleep_time = int(sys.argv[1])
else:
    sleep_time = 1

connection = BrokerConnection(hostname=&amp;quot;localhost&amp;quot;, port=5672,
                          userid=&amp;quot;guest&amp;quot;, password=&amp;quot;guest&amp;quot;,
                          virtual_host=&amp;quot;/&amp;quot;)

consumer = Consumer(connection=connection, queue=&amp;quot;feed&amp;quot;,
                    exchange=&amp;quot;feed&amp;quot;, routing_key=&amp;quot;importer&amp;quot;)

def import_feed_callback(message_data, message):
    print &amp;quot;-&amp;quot; * 80
    print message_data
    print message
    message.ack()
    print &amp;quot;-&amp;quot; * 80
    time.sleep(sleep_time)

consumer.register_callback(import_feed_callback)
consumer.qos(prefetch_count=1)

consumer.consume() 
while True: 
    connection.drain_events()
&lt;/pre&gt;
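
&lt;p&gt;To see why prefetch_count matters, here is a small simulation in plain Python with no broker involved. It compares two dispatch policies: handing messages out in strict rotation regardless of how busy each consumer is (roughly what happens when a broker pushes messages eagerly with no prefetch limit), and giving each message to whichever consumer frees up first (roughly what prefetch_count=1 achieves). The function names and the sleep times of 1 and 3 are assumptions made up for this sketch.&lt;/p&gt;

&lt;pre class="codebox"&gt;
import heapq

def fair_dispatch(n_messages, sleep_times):
    # Each consumer holds at most one unacknowledged message, so the
    # next message goes to whichever consumer becomes free first.
    free_at = [(0, w) for w in range(len(sleep_times))]
    heapq.heapify(free_at)
    counts = [0] * len(sleep_times)
    for _ in range(n_messages):
        t, w = heapq.heappop(free_at)
        counts[w] += 1
        heapq.heappush(free_at, (t + sleep_times[w], w))
    return counts

def round_robin(n_messages, sleep_times):
    # Messages are assigned in strict rotation, ignoring consumer speed.
    counts = [0] * len(sleep_times)
    for i in range(n_messages):
        counts[i % len(sleep_times)] += 1
    return counts

# 30 messages, one consumer sleeping 1 time unit and one sleeping 3
print(fair_dispatch(30, [1, 3]))
print(round_robin(30, [1, 3]))
&lt;/pre&gt;

&lt;p&gt;Under the first policy the slow consumer ends up with far fewer messages than the fast one, which matches what I saw with the two real consumers. Under strict rotation both get the same share and the slow one simply falls behind.&lt;/p&gt;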

&lt;p&gt;The project remains incomplete and I'm not at all ready to say this is the best way to go about it. It's just the first thing I got working. RabbitMQ seems maybe a little heavy for a simple task queue, but it's also well supported and documented.&lt;/p&gt;

&lt;p&gt;It seems like this sort of thing is less mature in the Python world than in Java. It's moving fast though.&lt;/p&gt;

&lt;h4&gt;Links, links, links&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.eaipatterns.com/toc.html"&gt;Enterprise Integration Patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.artima.com/weblogs/viewpost.jsp?thread=299551"&gt;Threads, processes and concurrency in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://stomp.codehaus.org/"&gt;Stomp&lt;/a&gt; and it's python binding &lt;a href="http://packages.python.org/stompy/introduction.html"&gt;stompy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ask.github.com/carrot/"&gt;Carrot documentation&lt;/a&gt; and &lt;a href="http://groups.google.com/group/carrot-users"&gt;mailing list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://robertpogorzelski.com/blog/2009/09/10/rabbitmq-celery-and-django/"&gt;RabbitMQ, Celery and Django&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blogs.digitar.com/jjww/2009/01/rabbits-and-warrens/"&gt;Rabbits and Warrens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://bhavin.directi.com/selecting-a-message-queue-amqp-or-zeromq/"&gt;Selecting a Message Queue - AMQP or ZeroMQ&lt;/a&gt; is insightful and has lots of good links.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span style="color:red;"&gt;Obsolete note&lt;/span&gt;: The version of &lt;a href="http://www.macports.org/ports.php?by=name&amp;substr=rabbitmq"&gt;RabbitMQ in MacPorts&lt;/a&gt; (1.7.2) at the time was a version behind and broken. I had to dig through the compiler error log and add a closing paren in line 100 of rabbit_exchange.erl.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3277240931968028396?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3277240931968028396/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/message-queues-with-python.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3277240931968028396'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3277240931968028396'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/message-queues-with-python.html' title='Message queues with Python'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_dbECP0yvozc/TL8u74_jYZI/AAAAAAAACwY/AryoD6QHmd8/s72-c/workers_and_queue.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-6704770216303683494</id><published>2010-10-05T20:22:00.000-07:00</published><updated>2010-10-05T20:28:25.328-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Computer science'/><title type='text'>Economics meets computer science</title><content type='html'>&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TKvsJ4GZzkI/AAAAAAAACpQ/1OS1IHfSj_Y/s1600/supply_and_demand.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 200px; height: 194px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TKvsJ4GZzkI/AAAAAAAACpQ/1OS1IHfSj_Y/s200/supply_and_demand.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5524769022322265666" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I saw an interesting lecture by &lt;a href="http://www.cc.gatech.edu/~vazirani/"&gt;Vijay Vazirani&lt;/a&gt; on the cross-over between computer science and economics: &lt;a href="http://www.cs.washington.edu/htbin-post/mvis/mvis?ID=959"&gt;The "Invisible Hand of the Market": Algorithmic Ratification and the Digital Economy&lt;/a&gt;. Starting with Adam Smith, he covered the development of algorithmic theories of markets and market equilibrium.&lt;/p&gt;

&lt;p&gt;The first step was a proof that simple market models have equilibria. A very computational next question, then, is, "OK, these equilibria exist. How hard is it to find them?" Enter &lt;span style="font-weight:bold;"&gt;complexity theory&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;Apparently, algorithmic game theory has been used to derive &lt;a href="http://arxiv.org/abs/1007.4586"&gt;Equilibrium Pricing of Digital Goods via a New Market Model&lt;/a&gt; and was applied to cook up the pricing algorithm for &lt;a href="http://research.google.com/pubs/archive/35113.pdf"&gt;Google's TV ads&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Attempts to compute equilibrium prices led &lt;a href="http://en.wikipedia.org/wiki/Irving_Fisher"&gt;Irving Fisher&lt;/a&gt; to develop his "&lt;a href="http://priceless-the-book.blogspot.com/2009/06/price-machine-of-irving-fisher.html"&gt;Price Machine&lt;/a&gt;" in the early 1890s, which was essentially a hydraulic computer.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/TKvsmJ5InvI/AAAAAAAACpY/VGLqma5y9ws/s1600/fisher_water_computer.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 288px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TKvsmJ5InvI/AAAAAAAACpY/VGLqma5y9ws/s320/fisher_water_computer.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5524769508134788850" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vazirani wrote a book on &lt;a href="http://books.google.com/books?id=EILqAmzKgYIC&amp;dq=Approximation+Algorithms&amp;printsec=frontcover&amp;source=bn&amp;hl=en&amp;ei=aMurTJDxG5CosAPH9vnVAw&amp;sa=X&amp;oi=book_result&amp;ct=result&amp;resnum=4&amp;ved=0CDIQ6AEwAw#v=onepage&amp;q&amp;f=false"&gt;approximation algorithms&lt;/a&gt;. I kept expecting him to introduce the idea that, while finding equilibria is very hard, markets approximate a solution. Maybe, given a set of conditions we could show that the approximation was within certain bounds of the true optimum. But, he didn't.&lt;/p&gt;

&lt;p&gt;A few unanswered questions come to mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can irrationality be modeled?&lt;/li&gt;
&lt;li&gt;Do big differences in wealth or other asymmetries illustrate any divergence between the model and the real world?&lt;/li&gt;
&lt;li&gt;Even if markets reach equilibrium in some theoretical long run, in the real world they're constantly buffeted by exogenous shocks, and therefore disequilibrium must be pretty common. He answered a question like this saying that no attempt to incorporate dynamics has been very successful.&lt;/li&gt;
&lt;li&gt;Digital marketplaces like eBay or Google's Ad auction must provide a ridiculous treasure trove of data to be mined for empirical economics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Q&amp;amp;A session degenerated into politics pretty quickly. People asked about how the financial crisis might have challenged his models. One questioner asked about "animal spirits". Vazirani defended the neoclassical line and stated that their algorithms had "ratified the free market". The questioner responded that "The markets have ratified P.T. Barnum". Another audience member added "...because of government interference". It's surprising that people aren't able to differentiate between a theoretical result based on a carefully constructed model and political opinions in the real world, but that stuff belongs &lt;a href="http://pragmaticpoliticaleconomy.blogspot.com/"&gt;over here&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-6704770216303683494?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/6704770216303683494/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/economics-meets-computer-science.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6704770216303683494'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/6704770216303683494'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/economics-meets-computer-science.html' title='Economics meets computer science'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_dbECP0yvozc/TKvsJ4GZzkI/AAAAAAAACpQ/1OS1IHfSj_Y/s72-c/supply_and_demand.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-8405609423099205684</id><published>2010-10-04T09:48:00.000-07:00</published><updated>2010-10-04T10:01:09.366-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='crackpot theory'/><title type='text'>Worse than linear performance in Rails ActiveRecords with associations</title><content type='html'>&lt;p&gt;The other day I was doing a task that required iterating through 142 thousand ActiveRecord objects and computing simple aggregate statistics. Taking convenience over efficiency, I thought I'd just whip up a script to run in the Ruby on Rails script runner. I usually put little print-line debugging statements in programs like that as a poor man's progress bar. I noticed that as the script ran, it seemed to get slower and slower.&lt;/p&gt;

&lt;p&gt;My ActiveRecord objects are linked up as shown below. The names are changed to protect the guilty, so I might be neglecting something important.&lt;/p&gt;

&lt;pre class="codebox"&gt;
class Basket &amp;lt; ActiveRecord::Base
  has_many :basket_items
  has_many :items, :through =&amp;gt; :basket_items
end

class Item &amp;lt; ActiveRecord::Base
  belongs_to :category
  has_many :basket_items
  has_many :baskets, :through =&amp;gt; :basket_items
end
&lt;/pre&gt;

&lt;pre class="codebox"&gt;
# record the start so elapsed time can be printed as a progress indicator
start_time = Time.now
Basket.find(:all).each do |basket|
  puts "basket.id = #{basket.id-20000}\t#{Time.now - start_time}" if basket.id % 1000 == 0
end
&lt;/pre&gt;

&lt;p&gt;This part is super fast, 'cause nothing much is done inside the loop other than printing. I don't know exactly how Rails does the conversion from DB fields to objects. That bit seems to be taking place outside the loop. Anyway, what happens if we access the associated items?&lt;/p&gt;

&lt;pre class="codebox"&gt;
start_time = Time.now
counter = 0
Basket.find(:all).each do |basket|
  puts "basket.id = #{basket.id-20000}\t#{Time.now - start_time}" if basket.id % 1000 == 0
  # touching basket.items triggers the lazy load of the association
  counter += basket.items.length
end
&lt;/pre&gt;

&lt;p&gt;In the second case, we trigger the lazy-load of the list of items. Baskets average 18.7 items. The distribution of item count within the data is more or less random and, on average, flat. Now, we see the timings below.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TKoHqHSQWTI/AAAAAAAACo8/bK_Wt2ATNv8/s1600/ruby.perf.count.items.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TKoHqHSQWTI/AAAAAAAACo8/bK_Wt2ATNv8/s400/ruby.perf.count.items.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5524236313014851890" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other words, this is an &lt;em&gt;n&lt;/em&gt;x&lt;em&gt;m&lt;/em&gt; operation, but &lt;em&gt;m&lt;/em&gt; (the number of items) is more or less constant. I can't guess why this wouldn't be linear. Garbage collection? Should that level off? Maybe, we're just seeing the cost of maintaining the heap? Items are also associated with baskets. Maybe Rails is spending time fixing up that association?&lt;/p&gt;

&lt;p&gt;The real script, only a little more complex than the one above, ran in about 30 hours. I realize this is all a little half-baked and I don't intend to chase it down further, but I'm hoping to &lt;a href="http://xkcd.com/356/"&gt;nerd snipe&lt;/a&gt; someone into figuring it out. I'm probably missing a lot and leaving out too much.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-8405609423099205684?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/8405609423099205684/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/worse-than-linear-performance-in-rails.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8405609423099205684'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8405609423099205684'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/worse-than-linear-performance-in-rails.html' title='Worse than linear performance in Rails ActiveRecords with associations'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_dbECP0yvozc/TKoHqHSQWTI/AAAAAAAACo8/bK_Wt2ATNv8/s72-c/ruby.perf.count.items.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4612668938843562285</id><published>2010-10-02T10:06:00.000-07:00</published><updated>2011-04-10T17:31:32.952-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='reference'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><category scheme='http://www.blogger.com/atom/ns#' term='db'/><title type='text'>CouchDB and R</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TJwypGSCoZI/AAAAAAAACos/WDRNFUNmW54/s1600/Apache+CouchDB:+Relax.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 175px; height: 150px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TJwypGSCoZI/AAAAAAAACos/WDRNFUNmW54/s200/Apache+CouchDB:+Relax.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5520342924891693458" /&gt;&lt;/a&gt;

&lt;p&gt;Here are some quick crib notes on getting R talking to CouchDB using Couch's &lt;a href="http://wiki.apache.org/couchdb/HTTP_Document_API"&gt;ReSTful HTTP API&lt;/a&gt;. We'll do it in two different ways. First, we'll construct HTTP calls with &lt;a href="http://www.omegahat.org/RCurl/"&gt;RCurl&lt;/a&gt;, then move on to the &lt;a href="http://github.com/wactbprot/R4CouchDB"&gt;R4CouchDB&lt;/a&gt; package for a higher level interface. I'll assume you've already &lt;a href="/2010/09/geting-started-with-couchdb.html"&gt;gotten started with CouchDB&lt;/a&gt; and are familiar with the basic &lt;a href="/2009/02/rest-representational-state-transfer.html"&gt;ReST&lt;/a&gt; actions: GET, PUT, POST, and DELETE.&lt;/p&gt;

&lt;p&gt;First install &lt;a href="http://www.omegahat.org/RCurl/"&gt;RCurl&lt;/a&gt; and &lt;a href="http://www.omegahat.org/RJSONIO/"&gt;RJSONIO&lt;/a&gt;. You'll have to download the tar.gz's if you're on a Mac. For the second part, we'll need to install &lt;a href="http://github.com/wactbprot/R4CouchDB/"&gt;R4CouchDB&lt;/a&gt;, which depends on the previous two. I checked it out from GitHub and used R CMD INSTALL.&lt;/p&gt;

&lt;h3&gt;ReST with RCurl&lt;/h3&gt;

&lt;h4&gt;Ping server&lt;/h4&gt;

&lt;pre class="codebox"&gt;
library(RCurl)  # provides getURL, curlPerform and basicTextGatherer
getURL(&amp;quot;http://localhost:5984/&amp;quot;)
[1] &amp;quot;{\&amp;quot;couchdb\&amp;quot;:\&amp;quot;Welcome\&amp;quot;,\&amp;quot;version\&amp;quot;:\&amp;quot;1.0.1\&amp;quot;}\n&amp;quot;
&lt;/pre&gt;

&lt;p&gt;That's nice, but we want to get the result back as a real R data structure. Try this:&lt;/p&gt;

&lt;pre class="codebox"&gt;
library(RJSONIO)  # provides fromJSON and toJSON
welcome &amp;lt;- fromJSON(getURL(&amp;quot;http://localhost:5984/&amp;quot;))
welcome$version
[1] &amp;quot;1.0.1&amp;quot;
&lt;/pre&gt;

&lt;p&gt;Sweet!&lt;/p&gt;

&lt;h4&gt;PUT&lt;/h4&gt;

&lt;p&gt;One way to add a new record is with HTTP PUT.&lt;/p&gt;

&lt;pre class="codebox"&gt;
bozo = list(name=&amp;quot;Bozo&amp;quot;, occupation=&amp;quot;clown&amp;quot;, shoe.size=100)
getURL(&amp;quot;http://localhost:5984/testing123/bozo&amp;quot;,
       customrequest=&amp;quot;PUT&amp;quot;,
       httpheader=c(&amp;#x27;Content-Type&amp;#x27;=&amp;#x27;application/json&amp;#x27;),
       postfields=toJSON(bozo))
[1] &amp;quot;{\&amp;quot;ok\&amp;quot;:true,\&amp;quot;id\&amp;quot;:\&amp;quot;bozo\&amp;quot;,\&amp;quot;rev\&amp;quot;:\&amp;quot;1-70f5f59bf227d2d715c214b82330c9e5\&amp;quot;}\n&amp;quot;
&lt;/pre&gt;

&lt;p&gt;Notice that RCurl has no high-level PUT function, so you have to fake it using the &lt;em&gt;customrequest&lt;/em&gt; parameter. I'd never have figured that out without an example from &lt;a href="http://github.com/wactbprot/R4CouchDB"&gt;R4CouchDB&lt;/a&gt;'s source. The &lt;a href="http://curl.haxx.se/libcurl/c/"&gt;API of libcurl&lt;/a&gt; is odd, I have to say, and RCurl mostly just reflects it right into R.&lt;/p&gt;

&lt;p&gt;If you don't like the idea of sending a put request with a get function, you could use RCurl's &lt;em&gt;curlPerform&lt;/em&gt;. Trouble is, &lt;em&gt;curlPerform&lt;/em&gt; returns an integer status code rather than the response body. You're supposed to provide an R function to collect the response body text. Not really worth the bother, unless you're getting into some of the advanced tricks described in the paper, &lt;em&gt;&lt;a href="http://www.omegahat.org/RCurl/RCurlJSS.pdf"&gt;R as a Web Client - the RCurl package&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
bim &amp;lt;-  list(
  name=&amp;quot;Bim&amp;quot;, 
  occupation=&amp;quot;clown&amp;quot;,
  tricks=c(&amp;quot;juggling&amp;quot;, &amp;quot;pratfalls&amp;quot;, &amp;quot;mocking Bolsheviks&amp;quot;))
reader = basicTextGatherer()
curlPerform(
  url = &amp;quot;http://localhost:5984/testing123/bim&amp;quot;,
  httpheader = c(&amp;#x27;Content-Type&amp;#x27;=&amp;#x27;application/json&amp;#x27;),
  customrequest = &amp;quot;PUT&amp;quot;,
  postfields = toJSON(bim),
  writefunction = reader$update
)
reader$value()
&lt;/pre&gt;

&lt;h4&gt;GET&lt;/h4&gt;

&lt;p&gt;Now that there's something in there, how do we get it back? That's super easy.&lt;/p&gt;

&lt;pre class="codebox"&gt;
bozo2 &amp;lt;- fromJSON(getURL(&amp;quot;http://localhost:5984/testing123/bozo&amp;quot;))
bozo2
$`_id`
[1] &amp;quot;bozo&amp;quot;

$`_rev`
[1] &amp;quot;1-646331b58ee010e8df39b5874b196c02&amp;quot;

$name
[1] &amp;quot;Bozo&amp;quot;

$occupation
[1] &amp;quot;clown&amp;quot;

$shoe.size
[1] 100
&lt;/pre&gt;

&lt;h4&gt;PUT again for updating&lt;/h4&gt;

&lt;p&gt;Updating is done by using PUT on an existing document. For example, let's give Bozo some mad skillz:&lt;/p&gt;

&lt;pre class="codebox"&gt;
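# bozo2 came back from the GET above, so it already carries _id and _rev;
# CouchDB wants the current _rev before it will accept an update
bozo2$skills &amp;lt;- c(&amp;quot;pranks&amp;quot;, &amp;quot;honking nose&amp;quot;)  # illustrative values
getURL(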
  &amp;quot;http://localhost:5984/testing123/bozo&amp;quot;,
  customrequest=&amp;quot;PUT&amp;quot;,
  httpheader=c(&amp;#x27;Content-Type&amp;#x27;=&amp;#x27;application/json&amp;#x27;),
  postfields=toJSON(bozo2))
&lt;/pre&gt;

&lt;h4&gt;POST&lt;/h4&gt;

&lt;p&gt;If you POST to the database, you're adding a document and letting CouchDB assign its _id field.&lt;/p&gt;

&lt;pre class="codebox"&gt;
bender = list(
  name=&amp;#x27;Bender&amp;#x27;,
  occupation=&amp;#x27;bending&amp;#x27;,
  species=&amp;#x27;robot&amp;#x27;)
response &amp;lt;- fromJSON(getURL(
  &amp;#x27;http://localhost:5984/testing123/&amp;#x27;,
  customrequest=&amp;#x27;POST&amp;#x27;,
  httpheader=c(&amp;#x27;Content-Type&amp;#x27;=&amp;#x27;application/json&amp;#x27;),
  postfields=toJSON(bender)))
response
$ok
[1] TRUE

$id
[1] &amp;quot;2700b1428455d2d822f855e5fc0013fb&amp;quot;

$rev
[1] &amp;quot;1-d6ab7a690acd3204e0839e1aac01ec7a&amp;quot;
&lt;/pre&gt;

&lt;h4&gt;DELETE&lt;/h4&gt;

&lt;p&gt;For DELETE, you pass the doc's revision number in the query string. Sorry, Bender.&lt;/p&gt;

&lt;pre class="codebox"&gt;
response &amp;lt;- fromJSON(getURL(&amp;quot;http://localhost:5984/testing123/2700b1428455d2d822f855e5fc0013fb?rev=1-d6ab7a690acd3204e0839e1aac01ec7a&amp;quot;,
  customrequest=&amp;quot;DELETE&amp;quot;))
&lt;/pre&gt;
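
&lt;p&gt;Copying the id and revision by hand gets old fast. Here's a minimal sketch of building that URL from the POST response instead. The helper &lt;em&gt;couch.doc.url&lt;/em&gt; is a made-up name for illustration, not part of RCurl or R4CouchDB, and the usage lines assume the &lt;em&gt;response&lt;/em&gt; list returned by the POST example is still around.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# hypothetical helper: URL of a document, optionally pinned to a revision
couch.doc.url &amp;lt;- function(server, db, id, rev=NULL) {
  url &amp;lt;- paste(server, db, URLencode(id, reserved=TRUE), sep=&amp;quot;/&amp;quot;)
  if (!is.null(rev))
    url &amp;lt;- paste(url, &amp;quot;?rev=&amp;quot;, URLencode(rev, reserved=TRUE), sep=&amp;quot;&amp;quot;)
  url
}

# usage, assuming response holds the id and rev returned by the POST:
# getURL(couch.doc.url(&amp;quot;http://localhost:5984&amp;quot;, &amp;quot;testing123&amp;quot;,
#                      response$id, response$rev),
#        customrequest=&amp;quot;DELETE&amp;quot;)
&lt;/pre&gt;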

&lt;h3&gt;CRUD with R4CouchDB&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://github.com/wactbprot/R4CouchDB"&gt;R4CouchDB&lt;/a&gt; provides a layer on top of the techniques we've just described.&lt;/p&gt;

&lt;p&gt;R4CouchDB uses a slightly strange idiom. You pass a &lt;em&gt;cdb&lt;/em&gt; object, really just a list of parameters, into every R4CouchDB call and every call returns that object again, maybe modified. Results are returned in &lt;em&gt;cdb$res&lt;/em&gt;. Maybe they did this because R passes arguments by value. Here's how you would initialize the object.&lt;/p&gt;

&lt;pre class="codebox"&gt;
cdb &amp;lt;- cdbIni()
cdb$serverName &amp;lt;- &amp;quot;localhost&amp;quot;
cdb$port &amp;lt;- 5984
cdb$DBName=&amp;quot;testing123&amp;quot;
&lt;/pre&gt;
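
&lt;p&gt;To see why catching the return value matters, here's a tiny sketch in plain R, no R4CouchDB required. The function &lt;em&gt;set.id&lt;/em&gt; is invented for the demonstration. A function that modifies its argument only changes a local copy, so the caller has to reassign the result.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# invented stand-in for an R4CouchDB-style call: modify the list and hand it back
set.id &amp;lt;- function(cdb, id) {
  cdb$id &amp;lt;- id
  cdb
}

params &amp;lt;- list(serverName=&amp;quot;localhost&amp;quot;, port=5984)
set.id(params, &amp;quot;bozo&amp;quot;)            # result discarded; params is untouched
is.null(params$id)                # TRUE
params &amp;lt;- set.id(params, &amp;quot;bozo&amp;quot;)  # reassign to keep the change
params$id                         # &amp;quot;bozo&amp;quot;
&lt;/pre&gt;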

&lt;h4&gt;Create&lt;/h4&gt;

&lt;pre class="codebox"&gt;
fake.data &amp;lt;- list(
  state=&amp;#x27;WA&amp;#x27;,
  population=6664195,
  state.bird=&amp;#x27;Lady GaGa&amp;#x27;)
cdb$dataList &amp;lt;- fake.data
cdb$id &amp;lt;- &amp;#x27;fake.data&amp;#x27;  ## optional, otherwise an ID is generated
cdb &amp;lt;- cdbAddDoc(cdb)

cdb$res
$ok
[1] TRUE

$id
[1] &amp;quot;fake.data&amp;quot;

$rev
[1] &amp;quot;1-14bc025a194e310e79ac20127507185f&amp;quot;
&lt;/pre&gt;

&lt;h4&gt;Read&lt;/h4&gt;

&lt;pre class="codebox"&gt;
cdb$id &amp;lt;- &amp;#x27;bozo&amp;#x27;
cdb &amp;lt;- cdbGetDoc(cdb)

bozo &amp;lt;- cdb$res
bozo
$`_id`
[1] &amp;quot;bozo&amp;quot;
... etc.
&lt;/pre&gt;

&lt;h4&gt;Update&lt;/h4&gt;

&lt;p&gt;First we take the document id and rev from the existing document. Then, save our revised document back to the DB.&lt;/p&gt;

&lt;pre class="codebox"&gt;
cdb$id &amp;lt;- bozo$`_id`
cdb$rev &amp;lt;- bozo$`_rev`
bozo = list(
  name=&amp;quot;Bozo&amp;quot;,
  occupation=&amp;quot;assassin&amp;quot;,
  shoe.size=100,
  skills=c(
    &amp;#x27;pranks&amp;#x27;,
    &amp;#x27;honking nose&amp;#x27;,
    &amp;#x27;kung fu&amp;#x27;,
    &amp;#x27;high explosives&amp;#x27;,
    &amp;#x27;sniper&amp;#x27;,
    &amp;#x27;lock picking&amp;#x27;,
    &amp;#x27;safe cracking&amp;#x27;))
cdb$dataList &amp;lt;- bozo  ## the document goes in via cdb, as with cdbAddDoc
cdb &amp;lt;- cdbUpdateDoc(cdb)
&lt;/pre&gt;

&lt;h4&gt;Delete&lt;/h4&gt;

&lt;p&gt;Shortly thereafter, Bozo mysteriously disappeared.&lt;/p&gt;

&lt;pre class="codebox"&gt;
cdb$id &amp;lt;- &amp;#x27;bozo&amp;#x27;  ## bozo was rebuilt above without an _id field
cdb &amp;lt;- cdbDeleteDoc(cdb)
&lt;/pre&gt;

&lt;h3&gt;More on ReST and CouchDB&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;One issue you'll probably run into is that unfortunately &lt;a href="http://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript"&gt;JSON left out NaN and Infinity&lt;/a&gt;. And, of course, only R knows about NAs.&lt;/li&gt;

&lt;li&gt;One-off ReST calls are easy using curl from the command line, as described in &lt;a href="http://blogs.plexibus.com/2009/01/15/rest-esting-with-curl/"&gt;REST-esting with cURL&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;I flailed about quite a bit trying to figure out the best way to do &lt;a href="/2010/09/how-to-send-http-put-request-from-r.html"&gt;HTTP with R&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;I originally thought R4CouchDB was part of a &lt;a href="http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:nosql_interface"&gt;Google Summer of Code project to support NoSQL DBs in R&lt;/a&gt;. Dirk Eddelbuettel clarified that R4CouchDB was developed independently. In any case, the schema-less approach fits nicely with R's philosophy of exploratory data analysis.&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4612668938843562285?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4612668938843562285/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/couchdb-and-r.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4612668938843562285'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4612668938843562285'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/10/couchdb-and-r.html' title='CouchDB and R'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_dbECP0yvozc/TJwypGSCoZI/AAAAAAAACos/WDRNFUNmW54/s72-c/Apache+CouchDB:+Relax.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-668162914855685897</id><published>2010-09-27T11:20:00.001-07:00</published><updated>2010-10-02T10:31:10.088-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rant'/><category scheme='http://www.blogger.com/atom/ns#' term='interoperability'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>How to send an HTTP PUT request from R</title><content 
type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TKDgttCiq5I/AAAAAAAACo0/Rb109jj2Zq4/s1600/http_logo.jpeg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 136px; height: 90px;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TKDgttCiq5I/AAAAAAAACo0/Rb109jj2Zq4/s200/http_logo.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5521660218945219474" /&gt;&lt;/a&gt;

&lt;p&gt;I wanted to get R talking to &lt;a href="/2010/09/geting-started-with-couchdb.html"&gt;CouchDB&lt;/a&gt;. CouchDB is a &lt;a href="http://nosql-database.org/"&gt;NoSQL&lt;/a&gt; database that stores JSON documents and exposes a &lt;a href="/2009/02/rest-representational-state-transfer.html"&gt;ReSTful&lt;/a&gt; API over HTTP. So, I needed to issue the basic HTTP requests: GET, POST, PUT, and DELETE from within R. Specifically, to get started, I wanted to add documents to the database using PUT.&lt;/p&gt;

&lt;p&gt;There's a CRAN package called &lt;a href="http://cran.r-project.org/web/packages/httpRequest/index.html"&gt;httpRequest&lt;/a&gt;, which I thought would do the trick. This wound up being a dead end. There's a better way. Skip to the RCurl section unless you want to snicker at my hapless flailing.&lt;/p&gt;

&lt;h4&gt;Stuff that's totally beside the point&lt;/h4&gt;

&lt;blockquote&gt;As Edison once said, "Failures? Not at all. We've learned several thousand things that won't work."&lt;/blockquote&gt;

&lt;p&gt;The httpRequest package is very incomplete, which is fair enough for a package at version 0.0.8. They implement only basic get and post and multipart post. Both post methods seem to expect name/value pairs in the body of the POST, whereas accessing web services typically requires XML or JSON in the request body. And, if I'm interpreting the HTTP spec right, these methods mishandle termination of response bodies.&lt;/p&gt;

&lt;p&gt;Given this shaky foundation to start with, I implemented my own PUT function. While I eventually got it working for my specific purpose, I don't recommend going that route. HTTP, especially 1.1, is a complex protocol and implementing it is tricky. As I said, I believe the httpRequest methods, which send HTTP/1.1 in their request headers, get it wrong.&lt;/p&gt;

&lt;p&gt;Specifically, they read the HTTP response with a loop like one of the following:&lt;/p&gt;

&lt;pre class="codebox"&gt;
repeat{
  ss &amp;lt;- read.socket(fp,loop=FALSE)
  output &amp;lt;- paste(output,ss,sep=&amp;quot;&amp;quot;)
  if(regexpr(&amp;quot;\r\n0\r\n\r\n&amp;quot;,ss)&amp;gt;-1) break()
  if (ss == &amp;quot;&amp;quot;) break()
}
&lt;/pre&gt;

&lt;pre class="codebox"&gt;
repeat{
 ss &amp;lt;- rawToChar(readBin(scon, &amp;quot;raw&amp;quot;, 2048))
 output &amp;lt;- paste(output,ss,sep=&amp;quot;&amp;quot;)
 if(regexpr(&amp;quot;\r\n0\r\n\r\n&amp;quot;,ss)&amp;gt;-1) break()
 if(ss == &amp;quot;&amp;quot;) break()
 #if(proc.time()[3] &amp;gt; start+timeout) break()
}
&lt;/pre&gt;

&lt;p&gt;Notice that they're counting on a blank line, a zero followed by a blank line or the server closing the connection to signal the end of the response body. I dunno where the zero thing comes from or why we should count on it not being broken up during reading. Looking through &lt;a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html"&gt;RFC2616&lt;/a&gt; we find this description of an HTTP message:&lt;/p&gt;

&lt;pre class="codebox"&gt;
generic-message = start-line
                  *(message-header CRLF)
                  CRLF
                  [ message-body ]
&lt;/pre&gt;

&lt;p&gt;While the headers section ends with a blank line, the message body is not required to end in anything in particular. The &lt;a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4"&gt;part of the spec that refers to message length&lt;/a&gt; lists 5 ways that a message may be terminated, 4 of which are not "server closes connection". None of them are "a blank line". HTTP 1.1 was specifically designed this way so web browsers could download a page and all its images using the same open connection.&lt;/p&gt;
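
&lt;p&gt;For what it's worth, the zero in those loops appears to be the terminating zero-length chunk of the chunked transfer coding (section 3.6.1 of the RFC), which only shows up when the server sends &lt;em&gt;Transfer-Encoding: chunked&lt;/em&gt;. Each chunk is a hexadecimal size line, the data, and a CRLF; a size of zero marks the end. Here's a rough sketch of decoding such a body. The name &lt;em&gt;decode.chunked&lt;/em&gt; is mine, it ignores trailers, and it assumes the body is already in hand as a single-byte string so that characters and bytes coincide.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# sketch: decode a chunked body held in a single-byte string
decode.chunked &amp;lt;- function(body) {
  out &amp;lt;- character(0)
  repeat {
    eol &amp;lt;- as.integer(regexpr(&amp;quot;\r\n&amp;quot;, body, fixed=TRUE))
    if (eol &amp;lt; 0) stop(&amp;quot;truncated chunk header&amp;quot;)
    # the size line is hexadecimal, optionally followed by &amp;quot;;extension&amp;quot;
    size &amp;lt;- strtoi(sub(&amp;quot;;.*$&amp;quot;, &amp;quot;&amp;quot;, substr(body, 1, eol - 1)), base=16L)
    if (is.na(size)) stop(&amp;quot;bad chunk size&amp;quot;)
    if (size == 0) break   # last chunk: the zero those loops look for
    start &amp;lt;- eol + 2       # first character after the size line and its CRLF
    out &amp;lt;- c(out, substr(body, start, start + size - 1))
    # drop the chunk data and the CRLF that follows it
    body &amp;lt;- substr(body, start + size + 2, nchar(body))
  }
  paste(out, collapse=&amp;quot;&amp;quot;)
}

decode.chunked(&amp;quot;5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n&amp;quot;)   # &amp;quot;hello world&amp;quot;
&lt;/pre&gt;

&lt;p&gt;Matching on the raw text of each read, as the loops above do, can still miss the terminator when it straddles two reads, which is one more reason to decode against the accumulated body.&lt;/p&gt;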

&lt;p&gt;For my PUT implementation, I fell back to HTTP 1.0, where I could at least count on the connection closing at the end of the response. Even then, socket operations in R are confusing, at least for a clueless newbie such as myself.&lt;/p&gt;

&lt;p&gt;One set of socket operations consists of: &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/utils/html/make.socket.html"&gt;&lt;em&gt;make.socket&lt;/em&gt;&lt;/a&gt;, &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.socket.html"&gt;&lt;em&gt;read.socket&lt;/em&gt;/&lt;em&gt;write.socket&lt;/em&gt;&lt;/a&gt; and &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/utils/html/close.socket.html"&gt;&lt;em&gt;close.socket&lt;/em&gt;&lt;/a&gt;. Of these functions, the &lt;a href="http://cran.r-project.org/doc/manuals/R-data.html"&gt;R Data Import/Export&lt;/a&gt; guide states, "For new projects it is suggested that socket connections are used instead."&lt;/p&gt;

&lt;p&gt;OK, &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html"&gt;socket connections&lt;/a&gt;, then. Now we're looking at: &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html"&gt;&lt;em&gt;socketConnection&lt;/em&gt;&lt;/a&gt;, &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/readLines.html"&gt;&lt;em&gt;readLines&lt;/em&gt;&lt;/a&gt;, and &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/writeLines.html"&gt;&lt;em&gt;writeLines&lt;/em&gt;&lt;/a&gt;. Actually, tons of IO methods in R can accept connections: &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/readBin.html"&gt;&lt;em&gt;readBin&lt;/em&gt;/&lt;em&gt;writeBin&lt;/em&gt;&lt;/a&gt;, &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/readChar.html"&gt;&lt;em&gt;readChar&lt;/em&gt;/&lt;em&gt;writeChar&lt;/em&gt;&lt;/a&gt;, &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/cat.html"&gt;&lt;em&gt;cat&lt;/em&gt;&lt;/a&gt;, &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/scan.html"&gt;&lt;em&gt;scan&lt;/em&gt;&lt;/a&gt; and the &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html"&gt;&lt;em&gt;read.table&lt;/em&gt;&lt;/a&gt; methods among others.&lt;/p&gt;

&lt;p&gt;At one point, I was trying to use the &lt;em&gt;Content-Length&lt;/em&gt; header to properly determine the length of the response body. I would read the header lines using &lt;em&gt;readLines&lt;/em&gt;, parse those to find &lt;em&gt;Content-Length&lt;/em&gt;, then I tried reading the response body with &lt;em&gt;readChar&lt;/em&gt;. By the name, I got the impression that &lt;em&gt;readChar&lt;/em&gt; was like &lt;em&gt;readLines&lt;/em&gt; but one character at a time. According to some &lt;a href="http://www.mail-archive.com/r-help@r-project.org/msg110778.html"&gt;helpful tips I got on the r-help mailing list&lt;/a&gt; this is not the case. Apparently, &lt;em&gt;readChar&lt;/em&gt; is for binary mode connections, which seems odd to me. I didn't chase this down any further, so I still don't know how you would properly use &lt;em&gt;Content-Length&lt;/em&gt; with the R socket functions.&lt;/p&gt;
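
&lt;p&gt;Going by that tip, one approach that might work is to open the connection in binary mode, read the header lines with &lt;em&gt;readLines&lt;/em&gt;, then pull exactly &lt;em&gt;Content-Length&lt;/em&gt; bytes with &lt;em&gt;readBin&lt;/em&gt;. The sketch below has not been tried against a live server. The function name &lt;em&gt;read.http.response&lt;/em&gt; is mine, and a &lt;em&gt;rawConnection&lt;/em&gt; over a canned response stands in for the socket.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# sketch: read status line, headers, then exactly Content-Length bytes of body
read.http.response &amp;lt;- function(con) {
  status &amp;lt;- readLines(con, n=1)
  headers &amp;lt;- character(0)
  repeat {
    line &amp;lt;- readLines(con, n=1)
    if (length(line) == 0 || line == &amp;quot;&amp;quot;) break   # blank line ends the headers
    # header names are case-insensitive, so store them lower-cased
    headers[tolower(sub(&amp;quot;:.*$&amp;quot;, &amp;quot;&amp;quot;, line))] &amp;lt;- sub(&amp;quot;^[^:]*:\\s*&amp;quot;, &amp;quot;&amp;quot;, line)
  }
  n &amp;lt;- as.integer(headers[&amp;quot;content-length&amp;quot;])
  # body is NULL when no Content-Length header was sent
  body &amp;lt;- if (is.na(n)) NULL else rawToChar(readBin(con, &amp;quot;raw&amp;quot;, n=n))
  list(status=status, headers=headers, body=body)
}

# canned response standing in for a binary-mode socket connection
canned &amp;lt;- paste(&amp;quot;HTTP/1.1 201 Created\r\n&amp;quot;,
                &amp;quot;Content-Type: application/json\r\n&amp;quot;,
                &amp;quot;Content-Length: 11\r\n&amp;quot;,
                &amp;quot;\r\n&amp;quot;,
                &amp;quot;{\&amp;quot;ok\&amp;quot;:true}&amp;quot;, sep=&amp;quot;&amp;quot;)
con &amp;lt;- rawConnection(charToRaw(canned))
response &amp;lt;- read.http.response(con)
close(con)
response$body
&lt;/pre&gt;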

&lt;p&gt;Falling back to HTTP 1.0, we can just call readLines 'til the server closes the connection. In an amazing, but not recommended, feat of beating a dead horse until you actually get somewhere, I finally came up with the following code, with a couple variations commented out:&lt;/p&gt;

&lt;pre class="codebox"&gt;
http.put &amp;lt;- function(host, path, data.to.send, content.type=&amp;quot;application/json&amp;quot;, port=80, verbose=FALSE) {

  if(missing(path))
    path &amp;lt;- &amp;quot;/&amp;quot;
  if(missing(host))
    stop(&amp;quot;No host URL provided&amp;quot;)
  if(missing(data.to.send))
    stop(&amp;quot;No data to send provided&amp;quot;)

  content.length &amp;lt;- nchar(data.to.send, type=&amp;quot;bytes&amp;quot;)  # Content-Length counts bytes, not characters

  header &amp;lt;- NULL
  header &amp;lt;- c(header,paste(&amp;quot;PUT &amp;quot;, path, &amp;quot; HTTP/1.0\r\n&amp;quot;, sep=&amp;quot;&amp;quot;))
  header &amp;lt;- c(header,&amp;quot;Accept: */*\r\n&amp;quot;)
  header &amp;lt;- c(header,paste(&amp;quot;Content-Length: &amp;quot;, content.length, &amp;quot;\r\n&amp;quot;, sep=&amp;quot;&amp;quot;))
  header &amp;lt;- c(header,paste(&amp;quot;Content-Type: &amp;quot;, content.type, &amp;quot;\r\n&amp;quot;, sep=&amp;quot;&amp;quot;))
  request &amp;lt;- paste(c(header, &amp;quot;\r\n&amp;quot;, data.to.send), sep=&amp;quot;&amp;quot;, collapse=&amp;quot;&amp;quot;)

  if (verbose) {
    cat(&amp;quot;Sending HTTP PUT request to &amp;quot;, host, &amp;quot;:&amp;quot;, port, &amp;quot;\n&amp;quot;)
    cat(request, &amp;quot;\n&amp;quot;)
  }

  con &amp;lt;- socketConnection(host=host, port=port, open=&amp;quot;w+&amp;quot;, blocking=TRUE, encoding=&amp;quot;UTF-8&amp;quot;)
  on.exit(close(con))

  writeLines(request, con)

  response &amp;lt;- list()

  # read whole HTTP response and parse afterwards
  # lines &amp;lt;- readLines(con)
  # write(lines, stderr())
  # flush(stderr())
  # 
  # # parse response and construct a response &amp;#x27;object&amp;#x27;
  # response$status = lines[1]
  # first.blank.line = which(lines==&amp;quot;&amp;quot;)[1]
  # if (!is.na(first.blank.line)) {
  #   header.kvs = strsplit(lines[2:(first.blank.line-1)], &amp;quot;:\\s*&amp;quot;)
  #   response$headers &amp;lt;- sapply(header.kvs, function(x) x[2])
  #   names(response$headers) &amp;lt;- sapply(header.kvs, function(x) x[1])
  # }
  # response$body = paste(lines[(first.blank.line+1):length(lines)])

  response$status &amp;lt;- readLines(con, n=1)
  if (verbose) {
    write(response$status, stderr())
    flush(stderr())
  }
  response$headers &amp;lt;- character(0)
  repeat{
    ss &amp;lt;- readLines(con, n=1)
    if (verbose) {
      write(ss, stderr())
      flush(stderr())
    }
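    if (length(ss) == 0 || ss == &amp;quot;&amp;quot;) break
    # split on the first colon only, so values containing colons (e.g. Date) survive
    key &amp;lt;- sub(&amp;quot;:.*$&amp;quot;, &amp;quot;&amp;quot;, ss)
    response$headers[key] &amp;lt;- sub(&amp;quot;^[^:]*:\\s*&amp;quot;, &amp;quot;&amp;quot;, ss)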
  }
  response$body = readLines(con)
  if (verbose) {
    write(response$body, stderr())
    flush(stderr())
  }

  # doesn&amp;#x27;t work. something to do with encoding?
  # readChar is for binary connections??
  # if (any(names(response$headers)==&amp;#x27;Content-Length&amp;#x27;)) {
  #   content.length &amp;lt;- as.integer(response$headers[&amp;#x27;Content-Length&amp;#x27;])
  #   response$body &amp;lt;- readChar(con, nchars=content.length)
  # }

  return(response)
}
&lt;/pre&gt;

&lt;p&gt;After all that suffering, which was undoubtedly good for my character, I found an easier way.&lt;/p&gt;

&lt;h4&gt;RCurl&lt;/h4&gt;

&lt;p&gt;Duncan Temple Lang's &lt;a href="http://www.omegahat.org/RCurl/"&gt;RCurl&lt;/a&gt; is an R wrapper for &lt;a href="http://curl.haxx.se/"&gt;libcurl&lt;/a&gt;, which provides robust support for HTTP 1.1. The paper &lt;a href="http://www.omegahat.org/RCurl/RCurlJSS.pdf"&gt;R as a Web Client - the RCurl package&lt;/a&gt; lays out a strong case that wrapping an existing C library is a better way to get good HTTP support into R. RCurl works well and seems capable of everything needed to communicate with web services of all kinds. The API, mostly inherited from libcurl, is dense and a little confusing. Even given the docs and paper for RCurl and the docs for libcurl, I don't think I would have figured out PUT.&lt;/p&gt;

&lt;p&gt;Luckily, at that point I found &lt;a href="http://github.com/wactbprot/R4CouchDB"&gt;R4CouchDB&lt;/a&gt;, an R package built on &lt;a href="http://www.omegahat.org/RCurl/"&gt;RCurl&lt;/a&gt; and &lt;a href="http://www.omegahat.org/RJSONIO/"&gt;RJSONIO&lt;/a&gt;. At the time, I took R4CouchDB to be part of a Google Summer of Code effort, &lt;a href="http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:nosql_interface"&gt;NoSQL interface for R&lt;/a&gt;, through which high-level APIs were developed for several NoSQL DBs; it was in fact developed independently. Either way, I had finally stumbled across the answer to my problem.&lt;/p&gt;

&lt;p&gt;I'm mainly documenting my misadventures here. In the next installment, &lt;a href="/2010/10/couchdb-and-r.html"&gt;CouchDB and R&lt;/a&gt;, we'll see what actually worked. In the meantime, is there a conclusion from all this fumbling?&lt;/p&gt;

&lt;h4&gt;My point if I have one&lt;/h4&gt;

&lt;p&gt;HTTP is so universal that a high quality implementation should be a given for any language. HTTP-based APIs are being used by databases, message queues, and cloud computing services. And let's not forget plain old-fashioned web services. Mining and analyzing these data sources is something lots of people are going to want to do in R.&lt;/p&gt;

&lt;p&gt;Others have stumbled over similar issues. There are threads on r-help about &lt;a href="http://www.mail-archive.com/r-help@r-project.org/msg85952.html"&gt;hanging socket reads&lt;/a&gt;, &lt;a href="http://www.mail-archive.com/r-help@r-project.org/msg106404.html"&gt;R with CouchDB&lt;/a&gt;, and &lt;a href="http://www.mail-archive.com/r-help@r-project.org/msg99772.html"&gt;getting R to talk over Stomp&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;RCurl gets us pretty close. It could use high-level methods for PUT and DELETE and a high-level POST amenable to web-service use cases. More importantly, this stuff needs to be easier to find without sending the clueless noob running down blind alleys. RCurl is greatly superior to httpRequest, but that's not obvious without trying it or looking at the source. At minimum, it would be great to add a section on HTTP and web-services with RCurl to the &lt;a href="http://cran.r-project.org/doc/manuals/R-data.html"&gt;R Data Import/Export guide&lt;/a&gt;. And finally, take it from the fool: trying to roll your own HTTP (1.1 especially) is a fool's errand.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-668162914855685897?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/668162914855685897/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/09/how-to-send-http-put-request-from-r.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/668162914855685897'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/668162914855685897'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/09/how-to-send-http-put-request-from-r.html' title='How to send an HTTP PUT request from R'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_dbECP0yvozc/TKDgttCiq5I/AAAAAAAACo0/Rb109jj2Zq4/s72-c/http_logo.jpeg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-1970374573585833236</id><published>2010-09-23T22:06:00.000-07:00</published><updated>2011-04-10T17:32:05.645-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='software engineering'/><category scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='db'/><title type='text'>Getting started with CouchDB</title><content type='html'>&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/TJwypGSCoZI/AAAAAAAACos/WDRNFUNmW54/s1600/Apache+CouchDB:+Relax.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 175px; height: 150px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TJwypGSCoZI/AAAAAAAACos/WDRNFUNmW54/s200/Apache+CouchDB:+Relax.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5520342924891693458" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm investigating using &lt;a href="http://couchdb.apache.org/"&gt;CouchDB&lt;/a&gt; for a data mining application. CouchDB is a schema-less document-oriented database that stores JSON documents and uses JavaScript as a query language. You write queries in the form of &lt;a href="http://labs.google.com/papers/mapreduce.html"&gt;map-reduce&lt;/a&gt;. Applications connect to the database over a ReSTful HTTP API. So, Couch is a creature of the web in a lot of ways.&lt;/p&gt;

&lt;p&gt;What I have in mind (eventually) is sharding a collection of documents across several instances of CouchDB, each running on its own node. Then, I want to run distributed map-reduce queries over the whole collection of documents. But, I'm just a beginner, so we're going to start off with the basics. The &lt;a href="http://wiki.apache.org/couchdb/"&gt;CouchDB wiki&lt;/a&gt; has a ton of getting started material.&lt;/p&gt;

&lt;p&gt;CouchDB's &lt;a href="http://wiki.apache.org/couchdb/Installation"&gt;installation&lt;/a&gt; instructions cover several options for &lt;a href="http://wiki.apache.org/couchdb/Installation"&gt;installing on Mac OS X&lt;/a&gt;, as well as other OS's. I used &lt;a href="http://www.macports.org/"&gt;MacPorts&lt;/a&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
sudo port selfupdate
sudo port install couchdb
&lt;/pre&gt;

&lt;p&gt;Did I remember to update my port definitions the first time through? Of f-ing course not. Port tries to be helpful, but it's a little late sometimes with the warnings. Anyway, now that it's installed, let's start it up. I came across &lt;a href="http://www.imacusers.com/couchdb-on-mac-os-105-via-macports/"&gt;CouchDB on Mac OS 10.5 via MacPorts&lt;/a&gt; which tells you how to start CouchDB using Apple's &lt;a href="http://developer.apple.com/macosx/launchd.html"&gt;launchctl&lt;/a&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
sudo launchctl load /opt/local/Library/LaunchDaemons/org.apache.couchdb.plist
sudo launchctl start org.apache.couchdb
&lt;/pre&gt;

&lt;p&gt;To verify that it's up and running, type:&lt;/p&gt;

&lt;pre class="codebox"&gt;
curl http://localhost:5984/
&lt;/pre&gt;

&lt;p&gt;...which should return something like:&lt;/p&gt;

&lt;pre class="codebox"&gt;
{"couchdb":"Welcome","version":"1.0.1"}
&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://localhost:5984/_utils/"&gt;Futon&lt;/a&gt;, the web-based management tool for CouchDB, can be reached at &lt;a href="http://localhost:5984/_utils/"&gt;http://localhost:5984/_utils/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Being a nerd, I tried to run Futon's test suite. After the tests failed, I found this: &lt;a href="http://github.com/janl/couchdbx-core/issues/#issue/13"&gt;The tests run only(!) in a separate browser and that browser needs to be Firefox.&lt;/a&gt; Maybe that's been dealt with by now.&lt;/p&gt;

&lt;p&gt;Let's create a test database and add some bogus records like these:&lt;/p&gt;

&lt;pre class="codebox"&gt;
{
   "_id": "3f8e4c80b3e591f9f53243bfc8158abf",
   "_rev": "1-896ed7982ecffb9729a4c79eac9ef08a",
   "description": "This is a bogus description of a test document in a couchdb database.",
   "foo": true,
   "bogosity": 99.87526349
}

{
   "_id": "f02148a1a2655e0ed25e61e8cee71695",
   "_rev": "1-a34ffd2bf0ef6c5530f78ac5fbd586de",
   "foo": true,
   "bogosity": 94.162327,
   "flapdoodle": "Blither blather bonk. Blah blabber jabber jigaboo splat. Pickle plop dribble quibble."
}

{
   "_id": "9c24d1219b651bfeb044a0162857f8ab",
   "_rev": "1-5dd2f82c03f7af2ad24e726ea1c26ed4",
   "foo": false,
   "bogosity": 88.334,
   "description": "Another bogus document in CouchDB."
}
&lt;/pre&gt;

&lt;p&gt;When I first looked at CouchDB, I thought &lt;a href="http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views"&gt;Views&lt;/a&gt; were more or less equivalent to SQL queries. That's not quite true, but I'll get to that later. For now, let's try a couple in Futon. First, we'll just use a map function, no reducer. Let's filter our docs by bogosity. We want really bogus documents.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Map Function&lt;/b&gt;&lt;/p&gt;
&lt;pre class="codebox"&gt;
function(doc) {
  if (doc.bogosity &gt; 95.0)
    emit(null, doc);
}
&lt;/pre&gt;

&lt;p&gt;Now, let's throw in a reducer. This mapper emits the bogosity value for all docs. The reducer takes their sum.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Map Function&lt;/b&gt;&lt;/p&gt;
&lt;pre class="codebox"&gt;
function(doc) {
  emit(null, doc.bogosity);
}
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Reduce Function&lt;/b&gt;&lt;/p&gt;
&lt;pre class="codebox"&gt;
function (key, values, rereduce) {
  return sum(values);
}
&lt;/pre&gt;
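
&lt;p&gt;To get a feel for what happens with these two functions, here's a rough sketch that runs them outside CouchDB over the three bogus documents above. The &lt;i&gt;emit&lt;/i&gt; and &lt;i&gt;sum&lt;/i&gt; stand-ins and the &lt;i&gt;runView&lt;/i&gt; driver are assumptions made up for illustration, not CouchDB's actual view server, which is free to call the reducer on subsets of rows and then again on the partial results.&lt;/p&gt;

&lt;pre class="codebox"&gt;
// Hypothetical stand-ins for the helpers CouchDB provides to view functions.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }
function sum(values) {
  return values.reduce(function (total, v) { return total + v; }, 0);
}

// The map and reduce functions from the view above.
var mapFn = function (doc) { emit(null, doc.bogosity); };
var reduceFn = function (key, values, rereduce) { return sum(values); };

// Simplified driver: map every doc, then reduce all emitted values in one pass.
function runView(docs) {
  rows = [];
  docs.forEach(mapFn);
  var keys = rows.map(function (r) { return r.key; });
  var values = rows.map(function (r) { return r.value; });
  return reduceFn(keys, values, false);
}

// Only the fields the view cares about, copied from the sample records.
var docs = [
  { foo: true, bogosity: 99.87526349 },
  { foo: true, bogosity: 94.162327 },
  { foo: false, bogosity: 88.334 }
];
var totalBogosity = runView(docs);
console.log(totalBogosity.toFixed(3));  // prints 282.372
&lt;/pre&gt;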

&lt;p&gt;It's a fun little exercise to try to take the average. That's tricky because, for example, &lt;i&gt;ave&lt;/i&gt;(&lt;i&gt;ave&lt;/i&gt;(a,b), &lt;i&gt;ave&lt;/i&gt;(c)) is not necessarily the same as &lt;i&gt;ave&lt;/i&gt;(a,b,c). That matters because the reducer needs to be free to operate on subsets of the values emitted from the mapper, then combine the partial results. The wiki doc &lt;a href="http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views"&gt;Introduction to CouchDB views&lt;/a&gt; explains the requirements on the map and reduce functions. There's a great &lt;a href="http://labs.mudynamics.com/wp-content/uploads/2009/04/icouch.html"&gt;interactive emulator and tutorial on CouchDB and map-reduce&lt;/a&gt; that will get you a bit further writing views.&lt;/p&gt;
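
&lt;p&gt;One way to make the average behave (a sketch of a common trick, not necessarily the canonical answer) is to have the reducer return a running sum and count rather than the average itself. Sums and counts can be combined safely when the reducer is called again with &lt;i&gt;rereduce&lt;/i&gt; set to true, and the division happens at the very end. The &lt;i&gt;sum&lt;/i&gt; helper is re-implemented here and the chunking is made up, just so the example runs outside CouchDB.&lt;/p&gt;

&lt;pre class="codebox"&gt;
// Stand-in for the sum() helper available to CouchDB view functions.
function sum(values) {
  return values.reduce(function (total, v) { return total + v; }, 0);
}

// Reducer that returns a combinable {sum, count} pair.
function reduceAvg(keys, values, rereduce) {
  if (rereduce) {
    // values are partial results from earlier calls to this reducer
    return {
      sum: sum(values.map(function (p) { return p.sum; })),
      count: sum(values.map(function (p) { return p.count; }))
    };
  }
  // values are the raw numbers emitted by the map function
  return { sum: sum(values), count: values.length };
}

// Pretend the values 1, 2 and 6 were reduced in two chunks, then combined.
var partialA = reduceAvg(null, [1, 2], false);
var partialB = reduceAvg(null, [6], false);
var combined = reduceAvg(null, [partialA, partialB], true);

var average = combined.sum / combined.count;  // (1 + 2 + 6) / 3
var naive = ((1 + 2) / 2 + 6) / 2;            // ave(ave(a,b), ave(c))
console.log(average, naive);                  // prints 3 3.75
&lt;/pre&gt;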

&lt;p&gt;One fun fact about CouchDB's views is that they're stored in CouchDB as design documents, which are just regular JSON like everything else. This is in contrast to SQL where a query is a completely different thing from the data. (OK, yes, I've heard of stored procs.)&lt;/p&gt;
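
&lt;p&gt;For the curious, here is roughly what such a design document looks like, built up as a JavaScript object. The &lt;i&gt;_design/&lt;/i&gt; prefix on the id and the &lt;i&gt;views&lt;/i&gt; field holding the map and reduce functions as strings follow CouchDB's conventions as I understand them; the names &lt;i&gt;bogus&lt;/i&gt; and &lt;i&gt;total_bogosity&lt;/i&gt; are made up for this example.&lt;/p&gt;

&lt;pre class="codebox"&gt;
// Sketch of a design document holding the sum-of-bogosity view.
// "bogus" and "total_bogosity" are hypothetical names.
var designDoc = {
  _id: "_design/bogus",
  language: "javascript",
  views: {
    total_bogosity: {
      // view functions are stored as strings of source code
      map: "function(doc) { emit(null, doc.bogosity); }",
      reduce: "function(keys, values, rereduce) { return sum(values); }"
    }
  }
};

// Serialized, it is plain JSON like any other document in the database.
var json = JSON.stringify(designDoc, null, 2);
console.log(json);
&lt;/pre&gt;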

&lt;p&gt;That's the basics. At this point, a couple questions arise:&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;How do you do parameterized queries? For example, what if I wanted to let a user specify a cut-off for bogosity at run time?&lt;/li&gt;
 &lt;li&gt;How do I more fully get my head around these map-reduce "queries"?&lt;/li&gt;
 &lt;li&gt;Can CouchDB do distributed map-reduce like Hadoop?&lt;/li&gt;
&lt;/ul&gt;
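
&lt;p&gt;A possible answer to the first question, offered as a sketch rather than something tested here: if the map function emits bogosity as the key, the cut-off can be supplied at query time through the view's &lt;i&gt;startkey&lt;/i&gt; parameter instead of being hard-coded in the view. The database name &lt;i&gt;test&lt;/i&gt;, the design document &lt;i&gt;bogus&lt;/i&gt; and the view &lt;i&gt;by_bogosity&lt;/i&gt; are assumed names.&lt;/p&gt;

&lt;pre class="codebox"&gt;
// Build the URL for querying a view with a run-time cut-off.
// Assumes a view whose map function does: emit(doc.bogosity, null);
// "test", "bogus" and "by_bogosity" are hypothetical names.
function viewUrl(cutoff) {
  var base = "http://localhost:5984/test/_design/bogus/_view/by_bogosity";
  // view keys are JSON, so encode the cut-off as JSON before URL-escaping
  return base + "?startkey=" + encodeURIComponent(JSON.stringify(cutoff));
}

var url = viewUrl(95);
console.log(url);
// http://localhost:5984/test/_design/bogus/_view/by_bogosity?startkey=95
&lt;/pre&gt;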

&lt;p&gt;There's more to design documents than views. Both &lt;a href="http://wiki.apache.org/couchdb/Formatting_with_Show_and_List"&gt;&lt;span style="font-style:italic;"&gt;_show&lt;/span&gt; and &lt;span style="font-style:italic;"&gt;_list&lt;/span&gt; functions&lt;/a&gt; let you transform documents. List functions use a cursor-like iterator that enables on-the-fly filtering and aggregating, too. Apparently, there are plans for &lt;span style="font-style:italic;"&gt;_update&lt;/span&gt; and &lt;span style="font-style:italic;"&gt;_filter&lt;/span&gt; functions as well. I'll have to do some more reading and hacking and leave those for later.&lt;/p&gt;

&lt;h4&gt;Links&lt;/h4&gt;
&lt;ul&gt;
 &lt;li&gt;&lt;a href="http://wiki.apache.org/couchdb/"&gt;CouchDB wiki&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://guide.couchdb.org/"&gt;CouchDB: The Definitive Guide&lt;/a&gt;, an O'Reilly book on CouchDB (free online)&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://blog.stackoverflow.com/2009/06/podcast-59/"&gt;Stack Overflow podcast&lt;/a&gt; with Damien Katz, developer of CouchDB&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://planet.couchdb.org/"&gt;Planet Couch&lt;/a&gt;: a CouchDB blog&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-1970374573585833236?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/1970374573585833236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/09/geting-started-with-couchdb.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1970374573585833236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/1970374573585833236'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/09/geting-started-with-couchdb.html' title='Geting started with CouchDB'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_dbECP0yvozc/TJwypGSCoZI/AAAAAAAACos/WDRNFUNmW54/s72-c/Apache+CouchDB:+Relax.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-960883839914065531</id><published>2010-08-31T16:51:00.000-07:00</published><updated>2010-08-31T18:54:45.620-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><title type='text'>Probability processor</title><content type='html'>&lt;p&gt;MIT's Technology Review reports on &lt;a 
href="http://www.technologyreview.com/computing/26055/?a=f"&gt;A New Kind of Microchip&lt;/a&gt;, a probability-based processor designed to speed up statistical computations.&lt;/p&gt;

&lt;p&gt;The chip works with electrical signals that represent probabilities instead of 1s and 0s, using building blocks known as Bayesian NAND gates. “Whereas a conventional NAND gate outputs a "1" if neither of its inputs match, the output of a Bayesian NAND gate represents the odds that the two input probabilities match. This makes it possible to perform calculations that use probabilities as their input and output.”&lt;/p&gt;

&lt;p&gt;“This is not digital computing in the traditional sense,” says Ben Vigoda, founder of &lt;a href="http://www.lyricsemiconductor.com/"&gt;Lyric Semiconductor&lt;/a&gt;. “We are looking at processing where the values can be between a zero and a one.” (from Wired article &lt;i&gt;&lt;a href="http://www.wired.com/gadgetlab/2010/08/flash-error-correction-chip/"&gt;Probabilistic Chip Promises Better Flash Memory, Spam Filtering&lt;/a&gt;&lt;/i&gt;) Vigoda's &lt;i&gt;&lt;a href="http://pubs.media.mit.edu/pubs/papers/03.07.vigoda.pdf"&gt;Analog Logic: Continuous-Time Analog Circuits for Statistical Signal Processing&lt;/a&gt;&lt;/i&gt; probably spells it all out, if you've got the fortitude to read it. For us light-weights, there's a video &lt;a href="http://www.theinquirer.net/inquirer/news/1728981/video-lyric-semiconductor-explains-probability-chip"&gt;Lyric Semiconductor explains its probability chip&lt;/a&gt;. It's super-cool that he mentions genomics as a potential application.&lt;/p&gt;

&lt;p&gt;Computing has been steadily moving towards more specialized coprocessors, for example the vector capabilities of graphics chips (GPU's). Wouldn't it be neat to have a stats coprocessor alongside your general purpose CPU? (Or inside it like an FPU?) How about a cell processor configuration where you'd get an assortment of CPU cores, graphic/vector GPU cores and probability processor cores?&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-960883839914065531?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/960883839914065531/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/08/probability-processor.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/960883839914065531'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/960883839914065531'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/08/probability-processor.html' title='Probability processor'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3972792946804279878</id><published>2010-08-21T15:04:00.000-07:00</published><updated>2010-08-21T22:11:26.975-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for Introductory Statistics, Chapter 3.4</title><content type='html'>&lt;p&gt;...a continuing journey through &lt;a href="http://books.google.com/books?id=jwolc192c5kC&amp;lpg=PP1&amp;dq=Using%20R%20for%20Introductory%20Statistics&amp;pg=PP1#v=onepage&amp;q&amp;f=false"&gt;Using R for Introductory Statistics&lt;/a&gt;, by &lt;a href="http://wiener.math.csi.cuny.edu/verzani/"&gt;John Verzani&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Simple linear regression&lt;/h4&gt;

&lt;p&gt;Linear regression is a kooky term for fitting a line to some data. This odd bit of terminology can be blamed on &lt;a href="http://en.wikipedia.org/wiki/Francis_Galton"&gt;Sir Francis Galton&lt;/a&gt;, a prolific Victorian scientist and traveler who saw it as related to his concept of regression toward the mean. Calling it a linear model is a little more straightforward, and linear modeling through the &lt;i&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html"&gt;lm&lt;/a&gt;&lt;/i&gt; function is bread-and-butter to R.&lt;/p&gt;

&lt;p&gt;For example, let's look at the data set &lt;i&gt;diamond&lt;/i&gt; to see if there's a linear relationship between weight and cost of diamonds.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/THBOi-5HYuI/AAAAAAAACnU/40QkJ9QhPGI/s1600/diamonds.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 298px; border:none;" src="http://1.bp.blogspot.com/_dbECP0yvozc/THBOi-5HYuI/AAAAAAAACnU/40QkJ9QhPGI/s320/diamonds.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5507988707179193058" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;pre class="codebox"&gt;
library(UsingR)  # provides the diamond data set
f = price ~ carat
plot(f, data=diamond, pch=5,
     main="Price of diamonds predicted by weight")
res = lm(f, data=diamond)
abline(res, col='blue')
&lt;/pre&gt;

&lt;p&gt;We start by creating the formula &lt;i&gt;f&lt;/i&gt; using the strange looking &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/tilde.html"&gt;tilde operator&lt;/a&gt;. That tells the R interpreter that we're defining a symbolic &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html"&gt;formula&lt;/a&gt;, rather than an expression to be evaluated immediately. So, our definition of formula &lt;i&gt;f&lt;/i&gt; says, "price is a function of carat". In the &lt;i&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html"&gt;plot&lt;/a&gt;&lt;/i&gt; statement, the formula is evaluated in the context given by &lt;i&gt;data=diamond&lt;/i&gt;, so that the variables in our formula have values. That gives us the scatter plot. Now let's fit a line using &lt;i&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html"&gt;lm&lt;/a&gt;&lt;/i&gt;, context again given by &lt;i&gt;data=diamond&lt;/i&gt;, and render the resulting object as a line using &lt;i&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/abline.html"&gt;abline&lt;/a&gt;&lt;/i&gt;. Looks spiffy, but what just happened?&lt;/p&gt;

&lt;p&gt;The equation of a line that we learned in high school is:&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/THBOuro1PtI/AAAAAAAACnc/_XCBtTbC5ts/s1600/line.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 91px; height: 19px;border:none;" src="http://2.bp.blogspot.com/_dbECP0yvozc/THBOuro1PtI/AAAAAAAACnc/_XCBtTbC5ts/s400/line.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5507988908169051858" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Minimizing squared error over our sample gives us estimates of the slope and intercept. The book presents this without derivation, which is a shame.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/THBO2WNiPfI/AAAAAAAACnk/ocVw05QL8wk/s1600/lm_slope.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 208px; height: 45px;border:none;" src="http://2.bp.blogspot.com/_dbECP0yvozc/THBO2WNiPfI/AAAAAAAACnk/ocVw05QL8wk/s400/lm_slope.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5507989039856369138" /&gt;&lt;/a&gt;&lt;br /&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/THBO9b1dpqI/AAAAAAAACns/-ZKI4Ic91Ms/s1600/lm_intercept.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 103px; height: 24px;border:none;" src="http://1.bp.blogspot.com/_dbECP0yvozc/THBO9b1dpqI/AAAAAAAACns/-ZKI4Ic91Ms/s400/lm_intercept.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5507989161625101986" /&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Maybe later, I'll get brave and try to insert a derivation here.&lt;/p&gt;
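
&lt;p&gt;In the meantime, here's a sketch of how the derivation goes, written as TeX source. The symbols are my own assumption and may not match the images above: &lt;i&gt;\beta_0&lt;/i&gt; is the intercept, &lt;i&gt;\beta_1&lt;/i&gt; the slope, and bars denote sample means. Write the sum of squared errors as a function of the two coefficients, set both partial derivatives to zero, and solve.&lt;/p&gt;

&lt;pre class="codebox"&gt;
% sum of squared errors for the line y = \beta_0 + \beta_1 x
S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

% derivative with respect to the intercept
\frac{\partial S}{\partial \beta_0}
  = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  \quad\Rightarrow\quad
  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

% derivative with respect to the slope
\frac{\partial S}{\partial \beta_1}
  = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0

% substitute \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}) \right] = 0

% solve for the slope; the second form follows because
% \sum_i (y_i - \bar{y}) = 0 and \sum_i (x_i - \bar{x}) = 0
\hat{\beta}_1
  = \frac{\sum_{i} x_i (y_i - \bar{y})}{\sum_{i} x_i (x_i - \bar{x})}
  = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2}
&lt;/pre&gt;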

&lt;h4&gt;Examples&lt;/h4&gt;

&lt;p&gt;There's a popular linear model that applies to dating, which goes like this: It's OK for a man to date a younger woman if her age is at least half the man's age plus seven. In other words, this:&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/THBPMbgorKI/AAAAAAAACn0/2FDk2OiOkqA/s1600/dating_ages.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 108px; height: 39px;border:none;" src="http://4.bp.blogspot.com/_dbECP0yvozc/THBPMbgorKI/AAAAAAAACn0/2FDk2OiOkqA/s400/dating_ages.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5507989419235781794" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apparently, I should be dating a 27 year old. Let me go ask my wife if that's OK. In the meantime, let's see how our rule compares to results of a survey asking the proper cutoff for dating for various ages.&lt;/p&gt;

&lt;pre class="codebox"&gt;
plot(jitter(too.young$Male), jitter(too.young$Female),
     main="Appropriate ages for dating",
     xlab="Male age", ylab="Female age")
abline(7,1/2, col='red')
res &lt;- lm(Female ~ Male, data=too.young)
abline(res, col='blue', lty=2)
legend(15,45, legend=c("half plus 7 rule",
       "Estimated from survey data"),
       col=c('red', 'blue'), lty=c(1,2))
&lt;/pre&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/THBPbm6XPsI/AAAAAAAACn8/LdjzjLXXv5M/s1600/appropriate_ages_for_dating.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;border:none;" src="http://4.bp.blogspot.com/_dbECP0yvozc/THBPbm6XPsI/AAAAAAAACn8/LdjzjLXXv5M/s400/appropriate_ages_for_dating.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5507989679994519234" /&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;That's a nice correspondence. On second thought, this is statistical proof that my daughter is not allowed to leave the house 'til she's 30.&lt;/p&gt;

&lt;p&gt;Somehow related to that is the data set &lt;i&gt;Animals&lt;/i&gt;, comparing weights of body and brain for several animals. The basic scatterplot not revealing much, we put the data on a log scale and find that it looks much better. As near as I can tell, the &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/AsIs.html"&gt;&lt;i&gt;I&lt;/i&gt; or &lt;i&gt;AsIs&lt;/i&gt; function&lt;/a&gt; does something like the opposite of the tilde operator: inside a formula, it tells R to evaluate the enclosed expression as ordinary arithmetic rather than interpreting it as formula syntax. The general gist is to transform our data to log scale then apply linear modeling.&lt;/p&gt;

&lt;pre class="codebox"&gt;
library(MASS)  # Animals data set and lqs()
f = I(log(brain)) ~ I(log(body))
plot(f, data=Animals,
     main="Animals: brains vs. bodies",
     xlab="log body weight", ylab="log brain weight")
res = lm(f, data=Animals)
abline(res, col='brown')
&lt;/pre&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/THBPl5qp70I/AAAAAAAACoE/ciAhvNTm2ko/s1600/animals_brains_vs_bodies.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;border:none;" src="http://3.bp.blogspot.com/_dbECP0yvozc/THBPl5qp70I/AAAAAAAACoE/ciAhvNTm2ko/s400/animals_brains_vs_bodies.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5507989856827600706" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the problem is, the line doesn't seem to fit very well. Those three outliers on the right edge have high body weights but less than expected going on upstairs. That seems to unduly influence the linear model away from the main trend. R contains some alternative algorithms for fitting a line to data. The function &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/lqs.html"&gt;lqs&lt;/a&gt; is more resistant to outliers, like the large but pea-brained creatures in this example.&lt;/p&gt;

&lt;pre class="codebox"&gt;
res.lqs = lqs(f, data=Animals)
abline(res.lqs, col='green', lty=2)
&lt;/pre&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dbECP0yvozc/THBP5XRW04I/AAAAAAAACoM/T-uKll0Jl18/s1600/animals_brains_vs_bodies_2.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;border:none;" src="http://2.bp.blogspot.com/_dbECP0yvozc/THBP5XRW04I/AAAAAAAACoM/T-uKll0Jl18/s400/animals_brains_vs_bodies_2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5507990191192068994" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's better. Finally, you might use &lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/graphics/html/identify.html"&gt;&lt;i&gt;identify&lt;/i&gt;&lt;/a&gt; to solve the mystery of the knuckleheaded beasts.&lt;/p&gt;

&lt;pre class="codebox"&gt;
with(Animals, identify(log(body), log(brain), n=3, labels=rownames(Animals)))
&lt;/pre&gt;

&lt;p&gt;Problem 3.31 is about replicate measurements, which might be a good idea where measurement error, noisy data, or other random variation is present. We follow the by now familiar procedure of defining our formula, doing a scatterplot, building our linear model, and finally plotting it over the scatterplot.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_dbECP0yvozc/THBQCiwMeWI/AAAAAAAACoU/55k3Th2cMBI/s1600/breakdown_vs_voltage.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;border:none;" src="http://1.bp.blogspot.com/_dbECP0yvozc/THBQCiwMeWI/AAAAAAAACoU/55k3Th2cMBI/s400/breakdown_vs_voltage.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5507990348893026658" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are then asked to look at the variance of measurements at each particular voltage. To do that, we'll first split our data.frame up by voltage. The result is a list of vectors, one per voltage level.&lt;/p&gt;

&lt;pre class="codebox"&gt;
breakdown.by.voltage = split(breakdown$time, breakdown$voltage)
str(breakdown.by.voltage)
List of 7
 $ 26: num [1:3] 5.8 1580 2323
 $ 28: num [1:5] 69 108 110 426 1067
 $ 30: num [1:11] 7.7 17 20 21 22 43 47 139 144 175 ...
 $ 32: num [1:15] 0.27 0.4 0.69 0.79 2.75 3.9 9.8 14 16 27 ...
 $ 34: num [1:19] 0.19 0.78 0.96 1.31 2.78 3.16 4.15 4.67 4.85 6.5 ...
 $ 36: num [1:15] 0.35 0.59 0.96 0.99 1.69 1.97 2.07 2.58 2.71 2.9 ...
 $ 38: num [1:7] 0.09 0.39 0.47 0.73 1.13 1.4 2.38
&lt;/pre&gt;

&lt;p&gt;Next, let's compute the variance for each component of the above list and build a data.frame out of it.&lt;/p&gt;

&lt;pre class="codebox"&gt;
var.by.voltage = data.frame(voltage=names(breakdown.by.voltage),
                            variance=sapply(breakdown.by.voltage, FUN=var))
&lt;/pre&gt;

&lt;p&gt;This split-apply-combine pattern looks familiar. It's basically a &lt;a href="http://digitheadslabnotebook.blogspot.com/2009/12/sql-group-by-in-r.html"&gt;SQL &lt;i&gt;group by&lt;/i&gt; in R&lt;/a&gt;. It's also the basis for Hadley Wickham's &lt;a href="http://had.co.nz/plyr/"&gt;plyr&lt;/a&gt; library. Plyr's ddply function takes breakdown, a data.frame, and splits it on values of the voltage column. For each part, it computes the variance in the time column, then assembles the results back into a data.frame.&lt;/p&gt;

&lt;pre class="codebox"&gt;
library(plyr)
ddply(breakdown, .(voltage), .fun=function(df) {var(df$time)})
&lt;/pre&gt;

&lt;p&gt;While that's not directly related to linear modeling, this kind of exploratory data manipulation is what R is made for.&lt;/p&gt;

&lt;p&gt;More fun&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.duke.edu/~rnau/regintro.htm"&gt;Introduction to linear regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Who in their right mind wouldn't love the &lt;a href="http://www.codecogs.com/latex/eqneditor.php"&gt;online TeX equation editor&lt;/a&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Previous episodes of Using R for Introductory Statistics&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://digitheadslabnotebook.blogspot.com/2010/04/using-r-for-introductory-statistics.html"&gt;Chapters 1 &amp;amp; 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://digitheadslabnotebook.blogspot.com/2010/05/using-r-for-introductory-statistics-31.html"&gt;Chapter 3.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://digitheadslabnotebook.blogspot.com/2010/06/using-r-for-introductory-statistics-32.html"&gt;Chapter 3.2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics-33.html"&gt;Chapter 3.3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3972792946804279878?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3972792946804279878/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3972792946804279878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3972792946804279878'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics.html' title='Using R for Introductory Statistics, Chapter 3.4'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_dbECP0yvozc/THBOi-5HYuI/AAAAAAAACnU/40QkJ9QhPGI/s72-c/diamonds.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-2346525235824246693</id><published>2010-08-11T07:36:00.000-07:00</published><updated>2010-08-21T15:27:47.232-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='stats'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title 
type='text'>Using R for Introductory Statistics 3.3</title><content type='html'>&lt;p&gt;...continuing our way through John Verzani's &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt;. Previous installments: &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/04/using-r-for-introductory-statistics.html"&gt;chapt1&amp;amp;2&lt;/a&gt;, &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/05/using-r-for-introductory-statistics-31.html"&gt;chapt3.1&lt;/a&gt;, &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/06/using-r-for-introductory-statistics-32.html"&gt;chapt3.2&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Relationships in numeric data&lt;/h4&gt;

&lt;p&gt;If two data series have a natural pairing (x&lt;sub&gt;1&lt;/sub&gt;,y&lt;sub&gt;1&lt;/sub&gt;),...,(x&lt;sub&gt;n&lt;/sub&gt;,y&lt;sub&gt;n&lt;/sub&gt;), then we can ask, &amp;ldquo;What (if any) is the relationship between the two variables?&amp;rdquo; Scatterplots and correlation are first-line ways of assessing a bivariate data set.&lt;/p&gt;

&lt;h4&gt;Pearson's correlation&lt;/h4&gt;

&lt;p&gt;Pearson's correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. It ranges from 1 for perfectly correlated variables to -1 for perfectly anticorrelated variables; 0 means uncorrelated. Independent variables have a correlation coefficient close to 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables. [see &lt;a href="http://en.wikipedia.org/wiki/Correlation_and_dependence"&gt;wikipedia entry on correlation&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TGK124QjUcI/AAAAAAAACm0/23_smPKsjrw/s1600/pearsons_correlation.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 35px;border:none;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TGK124QjUcI/AAAAAAAACm0/23_smPKsjrw/s400/pearsons_correlation.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5504161649019539906" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Question 3.19 concerns a sampling of 1000 New York Marathon runners. We're asked whether we expect a correlation between age and finishing time.&lt;/p&gt;

&lt;pre class="codebox"&gt;attach(nym.2002)
cor(age, time)
[1] 0.1898672
cor(age, time, method="spearman")
[1] 0.1119944
&lt;/pre&gt;

&lt;p&gt;We discover a low correlation - good news for us wheezing old geezers. A scatterplot might show something. And we have the gender of each runner, so let's use that, too.&lt;/p&gt;

&lt;p&gt;First, let's set ourselves up for a two panel plot.&lt;/p&gt;
&lt;pre class="codebox"&gt;
par(mfrow=c(2,1))
par(mar=c(2,4,4,2)+0.1)
&lt;/pre&gt;

&lt;p&gt;Next let's set up colors - pink for ladies, blue for guys - and throw in some transparency because a lot of data points are on top of each other.&lt;/p&gt;

&lt;pre class="codebox"&gt;
blue = rgb(0,0,255,64, maxColorValue=255)
pink = rgb(255,192,203,128, maxColorValue=255)

color &amp;lt;- rep(blue, length(gender))
color[gender==&amp;#x27;Female&amp;#x27;] &amp;lt;- pink
&lt;/pre&gt;

&lt;p&gt;In the first panel, draw the scatter plot.&lt;/p&gt;
&lt;pre class="codebox"&gt;
plot(time, age, col=color, pch=19, main=&amp;quot;NY Marathon&amp;quot;, ylim=c(18,80), xlab=&amp;quot;&amp;quot;)
&lt;/pre&gt;

&lt;p&gt;And in the second panel, break it down by gender. It's a well kept secret that &lt;a href="http://svn.r-project.org/R/trunk/src/library/graphics/R/boxplot.R"&gt;outcol and outpch&lt;/a&gt; can be used to set the color and shape of the outliers in a boxplot.&lt;/p&gt;
&lt;pre class="codebox"&gt;
par(mar=c(5,4,1,2)+0.1)
boxplot(time ~ gender, horizontal=T, col=c(pink, blue), outpch=19, outcol=c(pink, blue), xlab=&amp;quot;finishing time (minutes)&amp;quot;)
&lt;/pre&gt;

&lt;p&gt;Now return our settings to normal for good measure.&lt;/p&gt;
&lt;pre class="codebox"&gt;
par(mar=c(5,4,4,2)+0.1)
par(mfrow=c(1,1))
&lt;/pre&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TGK2EjwIu8I/AAAAAAAACm8/Kh4Q9TCaZM8/s1600/ny_marathon_finishing_times.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 396px;border:none;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TGK2EjwIu8I/AAAAAAAACm8/Kh4Q9TCaZM8/s400/ny_marathon_finishing_times.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5504161884033039298" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sure enough, there doesn't seem to be much correlation between age and finishing time. Gender has an effect, although I'm sure the elite female runners would have little trouble dusting my slow booty off the trail.&lt;/p&gt;

&lt;p&gt;It looks like we have fewer data points for women. Let's check that. We can use &lt;span style="font-style:italic;"&gt;&lt;a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html"&gt;table&lt;/a&gt;&lt;/span&gt; to count the number of times each level of a factor occurs, or in other words, count the number of males and females.&lt;/p&gt;

&lt;pre class="codebox"&gt;
table(gender)
gender
Female   Male 
   292    708
&lt;/pre&gt;

&lt;p&gt;I'm still a little skeptical of our previous result - the low correlation between age and finishing time. Let's look at the data binned by decade.&lt;/p&gt;

&lt;pre class="codebox"&gt;
bins &amp;lt;- cut(age, include.lowest=T, breaks=c(20,30,40,50,60,70,100), right=F, labels=c(&amp;#x27;20s&amp;#x27;,&amp;#x27;30s&amp;#x27;,&amp;#x27;40s&amp;#x27;,&amp;#x27;50s&amp;#x27;,&amp;#x27;60s&amp;#x27;,&amp;#x27;70+&amp;#x27;))
boxplot(time ~ bins, col=colorRampPalette(c(&amp;#x27;green&amp;#x27;,&amp;#x27;yellow&amp;#x27;,&amp;#x27;brown&amp;#x27;))(6), ylim=c(570,140))
&lt;/pre&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dbECP0yvozc/TGK2NZIW1nI/AAAAAAAACnE/CWVcPI6-lDM/s1600/ny_marathon_binned_by_age.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;border:none;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TGK2NZIW1nI/AAAAAAAACnE/CWVcPI6-lDM/s400/ny_marathon_binned_by_age.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5504162035800659570" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like you're not washed up as a runner until your 50s. Things go downhill from there, but it doesn't look very linear, so we shouldn't be too surprised about our low &lt;em&gt;r&lt;/em&gt;.&lt;/p&gt;
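
&lt;p&gt;To see how the binning step treats ages that sit right on a boundary, here is a small sketch. The &lt;em&gt;ages&lt;/em&gt; vector is made up for illustration; the breaks and labels are the ones used above.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# Hypothetical ages chosen to sit on or next to the bin boundaries
ages &amp;lt;- c(20, 29, 30, 49, 50, 69, 70, 100)
bins &amp;lt;- cut(ages, include.lowest=TRUE, breaks=c(20,30,40,50,60,70,100), right=FALSE,
            labels=c('20s','30s','40s','50s','60s','70+'))
# Show each age next to the bin it was assigned
data.frame(ages, bins)
&lt;/pre&gt;

&lt;p&gt;Because &lt;em&gt;right=FALSE&lt;/em&gt; makes the bins closed on the left, 30 goes to the 30s and 50 to the 50s, while &lt;em&gt;include.lowest=TRUE&lt;/em&gt; keeps the top value of 100 inside the last bin. An age below 20 would fall outside every bin and come back as NA.&lt;/p&gt;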

&lt;p&gt;Coarser bins, old and young using 50 as our cut-off, reveal that there's really no correlation in the younger group. In the older group, we're starting to see some correlation. I suppose you could play with the numbers to find an optimum cut-off that maximized the difference in correlation. Not sure what the point of that would be.&lt;/p&gt;

&lt;pre class="codebox"&gt;
y &amp;lt;- nym.2002[age&amp;lt;50,]
cor(y$age,y$time)
[1] -0.01148919
cor(y$age,y$time, method=&amp;#x27;spearman&amp;#x27;)
[1] -0.01512368
&lt;/pre&gt;

&lt;pre class="codebox"&gt;
o &amp;lt;- nym.2002[age&amp;gt;=50,]
cor(o$age, o$time)
[1] 0.3813543
cor(o$age, o$time, method=&amp;#x27;spearman&amp;#x27;)
[1] 0.1980635
&lt;/pre&gt;
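
&lt;p&gt;A rough way to convince yourself that a flat-then-rising relationship produces this pattern is to simulate one. The sketch below uses entirely synthetic data, not the marathon sample: the sample size, the 260-minute baseline, the slope of 3 minutes per year after 50 and the noise level are all assumptions picked for illustration.&lt;/p&gt;

&lt;pre class="codebox"&gt;
set.seed(42)  # arbitrary seed so the sketch is repeatable
n &amp;lt;- 1000
sim.age &amp;lt;- sample(20:80, n, replace=TRUE)
# Made-up model: flat until 50, then 3 minutes slower per year, plus noise
sim.time &amp;lt;- 260 + 3 * pmax(sim.age - 50, 0) + rnorm(n, sd=40)
young &amp;lt;- sim.age &amp;lt; 50
cor(sim.age, sim.time)                  # whole sample
cor(sim.age[young], sim.time[young])    # younger group
cor(sim.age[!young], sim.time[!young])  # older group
&lt;/pre&gt;

&lt;p&gt;In the simulated younger group the correlation should hover near zero, and in the older group it should be clearly positive, which is the same shape as the result above.&lt;/p&gt;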

&lt;p&gt;I ran a marathon once in my life. I think I was 30 and my time was a pokey 270 minutes or so. My knees hurt for days afterwards, so I'm not sure I'd try it again. I do want to do a half, though. Gotta get back in shape for that...&lt;/p&gt;

&lt;h4&gt;More on &lt;a href="http://books.google.com/books?id=jwolc192c5kC&amp;amp;lpg=PP1&amp;amp;dq=Using%20R%20for%20Introductory%20Statistics&amp;amp;pg=PP1#v=onepage&amp;amp;q&amp;amp;f=false"&gt;Using R for Introductory Statistics&lt;/a&gt;&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://digitheadslabnotebook.blogspot.com/2010/06/using-r-for-introductory-statistics-32.html"&gt;Using R for Introductory Statistics 3.2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://digitheadslabnotebook.blogspot.com/2010/05/using-r-for-introductory-statistics-31.html"&gt;Using R for Introductory Statistics, 3.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://digitheadslabnotebook.blogspot.com/2010/04/using-r-for-introductory-statistics.html"&gt;Using R for Introductory Statistics, Chapters 1 and 2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-2346525235824246693?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/2346525235824246693/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics-33.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2346525235824246693'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/2346525235824246693'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/08/using-r-for-introductory-statistics-33.html' title='Using R for Introductory Statistics 3.3'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_dbECP0yvozc/TGK124QjUcI/AAAAAAAACm0/23_smPKsjrw/s72-c/pearsons_correlation.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-3233904632054334212</id><published>2010-07-25T22:37:00.000-07:00</published><updated>2010-08-21T15:28:49.283-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Java'/><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Bioinformatics'/><title type='text'>Gaggle Genome Browser</title><content type='html'>&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_dbECP0yvozc/TE0e0O6vVTI/AAAAAAAACl0/-8P3vtorSzA/s1600/don_quixote.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 100px; height: 167px;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TE0e0O6vVTI/AAAAAAAACl0/-8P3vtorSzA/s200/don_quixote.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5498084602795742514" /&gt;&lt;/a&gt;There's a certain windmill I've been tilting towards for a couple of years now. It's known as the &lt;a href="http://gaggle.systemsbiology.net/docs/geese/genomebrowser/"&gt;Gaggle Genome Browser&lt;/a&gt; and we've published a paper on it called &lt;i&gt;&lt;a href="http://www.biomedcentral.com/1471-2105/11/382/"&gt;Integration and visualization of systems biology data in context of the genome&lt;/a&gt;&lt;/i&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://gaggle.systemsbiology.net/docs/geese/genomebrowser/"&gt;Gaggle Genome Browser&lt;/a&gt; is a cross-platform desktop program, based on Java and &lt;a href="http://www.sqlite.org/"&gt;SQLite&lt;/a&gt;, for interactively visualizing high-density genomic data. It joins heterogeneous data by location on the genome to create information-rich visualizations of genome organization, transcription and its regulation. As always, a key feature is interoperability with other bioinformatics apps through the &lt;a href="http://gaggle.systemsbiology.net/"&gt;Gaggle&lt;/a&gt; framework.&lt;/p&gt;

&lt;p&gt;Here it is displaying some tiling microarray data for &lt;i&gt;Sulfolobus solfataricus&lt;/i&gt;. Click for a bigger graphic. The reference sample is shown in blue circles overlaid with segmentation in red. Eight time-points along a growth curve are plotted as a heat map - red indicating increased transcription relative to the reference; green indicating decreased transcription. We also show &lt;a href="http://pfam.sanger.ac.uk/"&gt;Pfam&lt;/a&gt; domains, predicted operons, and some &lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/15752202"&gt;previously&lt;/a&gt; &lt;a href="http://www3.interscience.wiley.com/journal/118659628/abstract?CRETRY=1&amp;amp;SRETRY=0"&gt;observed&lt;/a&gt; non-coding RNAs, several of which we were able to confirm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.google.com/leaf?id=0B4tVEMf-vR20NzFiYTJlODEtNTQ5MS00ZWU1LWJmZDAtYjAwZjFiMDMzOGUz"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 245px;border:none;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TE0T3QZKO_I/AAAAAAAACls/RzY7cT4AgkY/s400/genome_browser_sso_small.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5498072560103472114" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the features I'm most proud of is the integration with &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, a tactic also being used by &lt;a href="http://www.tm4.org/mev/"&gt;MeV&lt;/a&gt;. At this point it's only partially complete. There's quite a bit more that could be done with it, and I'm looking for time (or help!) to finish.&lt;/p&gt;

&lt;p&gt;The past couple of years have seen a whole crop of new genome browsers. See the entry &lt;a href="http://digitheadslabnotebook.blogspot.com/2008/12/browsing-genomes.html"&gt;browsing genomes&lt;/a&gt; for a partial list. One reason is a new generation of lab hardware and techniques, including ChIP-chip, tiling arrays and high-throughput next-generation sequencing. Another is the ever-changing landscape in computing.&lt;/p&gt;

&lt;p&gt;The Gaggle Genome Browser is lacking polish in some places. There's plenty yet to be done. Maybe later, I'll write up some lessons learned and mistakes made, but for now, I'm happy to have it published and out there.&lt;/p&gt;

&lt;p&gt;Read more about the biology here:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/19536208"&gt;Prevalence of transcription promoters within archaeal operons and coding sequences.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out the screencast by OpenHelix here:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href="http://blog.openhelix.eu/?p=4862"&gt;Tip of the Week: Gaggle Genome Browser&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-3233904632054334212?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/3233904632054334212/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/07/gaggle-genome-browser.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3233904632054334212'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/3233904632054334212'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/07/gaggle-genome-browser.html' title='Gaggle Genome Browser'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_dbECP0yvozc/TE0e0O6vVTI/AAAAAAAACl0/-8P3vtorSzA/s72-c/don_quixote.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4219385577669405702</id><published>2010-07-20T23:16:00.000-07:00</published><updated>2010-11-17T09:25:28.262-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='software engineering'/><category scheme='http://www.blogger.com/atom/ns#' 
term='links'/><title type='text'>How to design good APIs</title><content type='html'>&lt;p&gt;A long time ago, I asked a bunch of programming gurus how to go about designing an API. Several gave answers that boiled down to the unsettling advice, "Try to get it right the first time," to which a super-guru then added, "...but you'll never get it right the first time." With that zen wisdom in mind, here's a pile of resources that may help get it slightly less wrong.&lt;/p&gt;

&lt;p&gt;Joshua Bloch, designer of the Java collection classes and author of &lt;a href="http://books.google.com/books?id=ka2VUBqHiWkC&amp;amp;dq=effective+java&amp;amp;printsec=frontcover&amp;amp;source=bn&amp;amp;hl=en&amp;amp;ei=An5GTL7lIYXWtQOTuonVDQ&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=4&amp;amp;ved=0CD0Q6AEwAw"&gt;Effective Java&lt;/a&gt;, gives a Google tech-talk called &lt;a href="http://www.youtube.com/watch?v=aAb7hSCtvGw"&gt;How to Design a Good API &amp;amp; Why it Matters&lt;/a&gt;. Video for &lt;a href="http://www.infoq.com/presentations/effective-api-design"&gt;another version of the same talk&lt;/a&gt; is available on &lt;a href="http://www.infoq.com/"&gt;InfoQ&lt;/a&gt;. He starts off with the observation that, "Good programming is modular. Module boundaries are APIs."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Characteristics of a Good API&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Easy to learn&lt;/li&gt;
&lt;li&gt;Easy to use, even without documentation&lt;/li&gt;
&lt;li&gt;Hard to misuse&lt;/li&gt;
&lt;li&gt;Easy to read and maintain code that uses it&lt;/li&gt;
&lt;li&gt;Sufficiently powerful to satisfy requirements&lt;/li&gt;
&lt;li&gt;Easy to extend&lt;/li&gt;
&lt;li&gt;Appropriate to audience&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Michi Henning, in &lt;a href="http://cacm.acm.org/magazines/2009/5/24646-api-design-matters/fulltext"&gt;API Design Matters&lt;/a&gt;, Communications of the ACM, May 2009, observes that, "An API is a user interface. APIs should be designed from the perspective of the caller."&lt;/p&gt;

&lt;blockquote&gt;Much of software development is about creating abstractions, and APIs are the visible interfaces to these abstractions. Abstractions reduce complexity because they throw away irrelevant detail and retain only the information that is necessary for a particular job. Abstractions do not exist in isolation; rather, we layer abstractions on top of each other. [...] This hierarchy of abstraction layers is an immensely powerful and useful concept. Without it, software as we know it could not exist because programmers would be completely overwhelmed by complexity.&lt;/blockquote&gt;

&lt;p&gt;Because you'll get it wrong the first time, and just because things change, you'll have to &lt;a href="https://netfiles.uiuc.edu/dig/papers/API_Evolution.pdf"&gt;evolve APIs&lt;/a&gt;. Breaking clients is unpleasant, but "Backward compatibility erodes APIs over time." &lt;/p&gt;

&lt;p&gt;My own little bit of wisdom is this: Performance characteristics are often part of the API. Unless stated otherwise, the caller will assume that a function will complete quickly. For example, it often seems like a good idea to make remote method calls look just like local method calls. This is a bad idea, because you can't abstract away time.&lt;/p&gt;


&lt;h4&gt;Links&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Joshua Bloch's &lt;a href="http://lcsd05.cs.tamu.edu/slides/keynote.pdf"&gt;slides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.infoq.com/articles/API-Design-Joshua-Bloch"&gt;Bumper-Sticker API Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Jasmin Blanchette’s &lt;a href="http://chaos.troll.no/~shausman/api-design/api-design.pdf"&gt;The Little Manual of API Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;SE Radio podcast &lt;a href="http://www.se-radio.net/podcast/2009-08/episode-143-api-design-jim-des-rivieres"&gt;API Design with Jim des Rivieres&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;From the Eclipse folks, &lt;a href="http://wiki.eclipse.org/Evolving_Java-based_APIs"&gt;Evolving Java-based APIs&lt;/a&gt; and &lt;a href="http://www.slideshare.net/moberhuber/eclipsecon-2010-api-design-and-evolution-tutorial"&gt;API Design and Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://queue.acm.org/detail.cfm?id=1071731"&gt;Programmers are People, Too&lt;/a&gt;, by Ken Arnold: programming language and API designers can learn a lot from the field of human-factors design&lt;/li&gt;
&lt;li&gt;John Resig on &lt;a href="http://video.google.com/videoplay?docid=-474821803269194441#"&gt;Best Practices in Javascript Library Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://sites.google.com/site/yacoset/Home/api-design-tips"&gt;API Design Tips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mndoci.com/2010/01/02/apis-are-powerful-platforms/"&gt;APIs are powerful platforms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wonderfullyflawed.com/2009/07/02/get-your-api-right.html"&gt;Get your API right&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html"&gt;How Do I Make This Hard to Misuse?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4219385577669405702?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4219385577669405702/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/07/how-to-design-good-apis.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4219385577669405702'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4219385577669405702'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/07/how-to-design-good-apis.html' title='How to design good APIs'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4661990001630438117</id><published>2010-07-01T22:02:00.000-07:00</published><updated>2010-07-06T10:41:05.608-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rant'/><title type='text'>Science funding and productivity</title><content type='html'>&lt;p&gt;These are interesting times for the practice and funding of science. The traditional model of fee-for-subscription peer-reviewed academic journals is &lt;a href="http://www.daniel-lemire.com/blog/archives/2010/06/10/academic-publishing-is-archaic/"&gt;looking more and more outdated&lt;/a&gt;. 
Scientific funding is increasingly competitive and dependent on salesmanship and networking rather than scientific merit.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://chronicle.com/article/We-Must-Stop-the-Avalanche-of/65890/"&gt;We Must Stop the Avalanche of Low-Quality Research&lt;/a&gt; argues that scientists are drowning in a sea of mediocre papers that &lt;a href="http://www.johndcook.com/blog/2010/06/23/write-only-articles/"&gt;nobody reads&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In economic terms, attention is the scarce resource. Electronic publishing is dirt cheap, so it makes sense to publish even weak or negative results. But human attention is expensive and the peer review process is time-consuming and unfunded. There needs to be a better mechanism for ranking the quality and importance of papers, so that scarce attention can be allocated efficiently.&lt;/p&gt;

&lt;p&gt;Certainly, counting papers is as poor a metric of scientific output as counting lines of code is of programmer productivity.&lt;/p&gt;

&lt;p&gt;Scientists should be scientists, not fundraisers. &lt;a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000197"&gt;Real Lives and White Lies in the Funding of Scientific Research&lt;/a&gt; details the tyranny of grant applications.&lt;/p&gt;

&lt;p&gt;One proposed improvement is a track system, in which a researcher would be placed into a funding category and reviewed for productivity every five years and moved up or down to higher or lower tracks accordingly. Emphasis would shift from plans to outcomes.&lt;/p&gt;

&lt;p&gt;Stanford bioengineering professor Steven Quake makes a similar point in the New York Times:&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://opinionator.blogs.nytimes.com/2009/02/10/guest-column-letting-scientists-off-the-leash/"&gt;As we consider the monumental challenges facing our generation — climate change, energy needs and health care — and look to science for solutions, it would behoove us to remember that it is almost impossible to predict where the next great discoveries will be made — and thus we should invest broadly and let scientists off their leashes.&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;One has to wonder how well science funding will hold up in the face of the gaping government deficits in most western countries.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_dbECP0yvozc/TC1z_7039zI/AAAAAAAAClc/wJxu16ynzrw/s1600/us_budget.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 298px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TC1z_7039zI/AAAAAAAAClc/wJxu16ynzrw/s400/us_budget.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5489171063062918962" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, &lt;a href="http://www.washingtonpost.com/wp-dyn/content/story/2010/06/28/ST2010062800373.html?sid=ST2010062800373"&gt;China is becoming a scientific superpower&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_dbECP0yvozc/TC10dDDu_SI/AAAAAAAAClk/YWCR4kOogIg/s1600/national_research_funding.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 297px;" src="http://2.bp.blogspot.com/_dbECP0yvozc/TC10dDDu_SI/AAAAAAAAClk/YWCR4kOogIg/s400/national_research_funding.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5489171563220499746" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;&lt;a href="http://www.washingtonpost.com/wp-dyn/content/story/2010/06/28/ST2010062800373.html?sid=ST2010062800373"&gt;Luo Minmin, 37, a neurobiologist, returned to China six years ago after getting his PhD from the University of Pennsylvania and completing a postdoctoral research stint at Duke. Luo said he has a big budget at NIBS and greater research freedom than he would have in the United States. "If I had stayed in America, the chances of making a discovery would have been lower," he said. "Here, people are willing to take risks. They give you money, and essentially you can do whatever you want."&lt;/a&gt;&lt;/blockquote&gt;

&lt;h4&gt;Related links&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.aaas.org/spp/rd/presentations/"&gt;Figures on science funding&lt;/a&gt; come from the AAAS's science and policy program.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.academicproductivity.com/2010/the-future-of-the-journal-by-anita-de-waard/"&gt;The Future of the Journal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.boingboing.net/2010/06/13/peer-review-provides.html"&gt;Peer review provides £209,976,000 public subsidy to commercial publishers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nogoodreason.typepad.co.uk/no_good_reason/2010/06/the-return-on-peer-review.html"&gt;The return on peer review&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4661990001630438117?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4661990001630438117/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/07/science-funding-and-productivity.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4661990001630438117'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4661990001630438117'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/07/science-funding-and-productivity.html' title='Science funding and productivity'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_dbECP0yvozc/TC1z_7039zI/AAAAAAAAClc/wJxu16ynzrw/s72-c/us_budget.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-4826391669287387422</id><published>2010-06-06T20:42:00.000-07:00</published><updated>2010-08-10T06:02:34.704-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='UsingR'/><category scheme='http://www.blogger.com/atom/ns#' term='books'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Using R for Introductory 
Statistics 3.2</title><content type='html'>&lt;p&gt;...continuing my sloth-like progress through John Verzani's &lt;a href="http://wiener.math.csi.cuny.edu/UsingR/"&gt;Using R for Introductory Statistics&lt;/a&gt;. Previous installments: Chapters &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/04/using-r-for-introductory-statistics.html"&gt;1 and 2&lt;/a&gt; and &lt;a href="http://digitheadslabnotebook.blogspot.com/2010/05/using-r-for-introductory-statistics-31.html"&gt;3.1&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Comparing independent samples&lt;/h3&gt;

&lt;p&gt;Boxplots provide a visual comparison between two or more distributions. For problem 3.8, we're asked to compare the reaction times of cell phone users versus a control group, to test the theory that using a cell phone while driving is a bad idea. Comparing the centers and spreads can be done with the following &lt;a href="http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/boxplot.html"&gt;boxplot&lt;/a&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
boxplot(time ~ control, reaction.time, names=c('control', 'phone'),
  col='gray',
  ylab='reaction time in seconds',
  main='Reaction time with cell phone usage')
&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://4.bp.blogspot.com/_dbECP0yvozc/TAxthrwHwLI/AAAAAAAAClE/hWIzt3OIqQ4/s1600/reaction.time.boxplot.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 377px;" src="http://4.bp.blogspot.com/_dbECP0yvozc/TAxthrwHwLI/AAAAAAAAClE/hWIzt3OIqQ4/s400/reaction.time.boxplot.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5479875272050720946" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tilde operator, &lt;em&gt;~&lt;/em&gt;, is used to define a &lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models"&gt;model formula&lt;/a&gt;, which is something I aspire to understand someday but currently am clueless about.&lt;/p&gt;
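
&lt;p&gt;For boxplots, at least, the formula can be read as "time, grouped by control". Here is a minimal sketch of that reading, assuming a tiny made-up data frame &lt;em&gt;d&lt;/em&gt; in place of the real reaction.time data. It compares the formula interface with doing the grouping by hand using &lt;em&gt;split&lt;/em&gt;.&lt;/p&gt;

&lt;pre class="codebox"&gt;
# Hypothetical stand-in for reaction.time: two groups of made-up times
d &amp;lt;- data.frame(
  time    = c(1.31, 1.42, 1.38, 1.55, 1.47, 1.60),
  control = c('C', 'C', 'C', 'T', 'T', 'T'))
# Formula interface: response on the left of ~, grouping variable on the right
by.formula &amp;lt;- boxplot(time ~ control, data=d, plot=FALSE)
# The same grouping done by hand
by.split &amp;lt;- boxplot(split(d$time, d$control), plot=FALSE)
# Both routes should give the same five-number summaries per group
all.equal(by.formula$stats, by.split$stats)
&lt;/pre&gt;

&lt;p&gt;Formulas mean more than this in model-fitting functions, so treat the sketch as a reading aid for boxplot only.&lt;/p&gt;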

&lt;p&gt;Looking at the same data as a density plot might give a better picture of each distribution.&lt;/p&gt;

&lt;pre class="codebox"&gt;
plot(density(reaction.time$time[reaction.time$control=='T']),
  main="Reaction time with cell phone usage",
  xlab="reaction time in seconds")
lines(density(reaction.time$time[reaction.time$control=='C']), lty=2)
&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://1.bp.blogspot.com/_dbECP0yvozc/TAxtugfwHzI/AAAAAAAAClM/3n-UsrlNy9g/s1600/reaction.time.density.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 400px;" src="http://1.bp.blogspot.com/_dbECP0yvozc/TAxtugfwHzI/AAAAAAAAClM/3n-UsrlNy9g/s400/reaction.time.density.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5479875492367572786" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Still, boxplots are nice because they give you a sense of the center, range, dispersion, and skew of a sample in a compact and comparable form. Plus, you can plot several boxplots side-by-side.&lt;/p&gt;

&lt;pre class="codebox"&gt;
boxplot(morley$Speed ~ morley$Expt,
  col='light grey', xlab='Experiment #',
  ylab="speed (km/s minus 299,000)",
  main="Michelson–Morley experiment")
mtext("speed of light data")
# true speed of light, 299,792.458 km/s, on the same scale (minus 299,000)
sol &amp;lt;- 792.458
abline(h=sol, col='red')
&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://3.bp.blogspot.com/_dbECP0yvozc/TAxt7yeJsII/AAAAAAAAClU/ZuhaQ6vYalM/s1600/michelson-morley.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 330px;" src="http://3.bp.blogspot.com/_dbECP0yvozc/TAxt7yeJsII/AAAAAAAAClU/ZuhaQ6vYalM/s400/michelson-morley.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5479875720530997378" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Problem 3.11 uses R's &lt;em&gt;morley&lt;/em&gt; data set. Despite the name, R's documentation describes it as Michelson's 1879 measurements of the speed of light, not data from the 1887 &lt;a href="http://en.wikipedia.org/wiki/Michelson%E2%80%93Morley_experiment"&gt;Michelson-Morley experiment&lt;/a&gt;, which attempted to find variations in the speed of light due to earth's motion through the aether, believed at the time to be the medium through which light waves traveled. The correct value for the speed of light is shown in red.&lt;/p&gt;

&lt;p&gt;And finally, whadya know, this stuff came in handy for some (probably not very rigorous) &lt;a href="http://gaggle.systemsbiology.net/docs/geese/genomebrowser/perf/"&gt;performance analysis&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-4826391669287387422?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/4826391669287387422/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/06/using-r-for-introductory-statistics-32.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4826391669287387422'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/4826391669287387422'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/06/using-r-for-introductory-statistics-32.html' title='Using R for Introductory Statistics 3.2'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_dbECP0yvozc/TAxthrwHwLI/AAAAAAAAClE/hWIzt3OIqQ4/s72-c/reaction.time.boxplot.png' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-8031828975636306709</id><published>2010-05-31T22:20:00.000-07:00</published><updated>2010-06-30T10:06:45.741-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='software engineering'/><category scheme='http://www.blogger.com/atom/ns#' term='software architecture'/><title type='text'>How many distinct paradigms of programming are there?</title><content type='html'>&lt;p&gt;I learned a few styles of programming in school. At least, I heard of them somewhere in passing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Procedural&lt;/li&gt;
&lt;li&gt;Object-oriented&lt;/li&gt;
&lt;li&gt;Functional&lt;/li&gt;
&lt;li&gt;Logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, I came across a thread on Hacker News, &lt;a href="http://news.ycombinator.com/item?id=1352292"&gt;Ask HN: Do you recognize this approach to programming?&lt;/a&gt;. The answer seems to be something called &lt;i&gt;functional reactive programming&lt;/i&gt;. So, I started wondering, "How many distinct styles or paradigms of programming are there?"&lt;/p&gt;

&lt;p&gt;That led me to Peter Van Roy's &lt;a href="http://www.info.ucl.ac.be/~pvr/paradigmsDIAGRAMeng107.pdf"&gt;The principal programming paradigms&lt;/a&gt; and &lt;a href="http://www.info.ucl.ac.be/~pvr/paradigms.html"&gt;Programming Paradigms for Dummies&lt;/a&gt;. PVR is a co-author of &lt;a href="http://ctm.info.ucl.ac.be/"&gt;Concepts, Techniques, and Models of Computer Programming&lt;/a&gt;, which I've heard described as, "If you liked SICP, you'll like this."&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.info.ucl.ac.be/~pvr/book.html"&gt;&lt;img width="200" height="249" src="http://www.info.ucl.ac.be/people/PVR/BookEnglish.jpg" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5964816804623588850-8031828975636306709?l=digitheadslabnotebook.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://digitheadslabnotebook.blogspot.com/feeds/8031828975636306709/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/05/how-many-distinct-paradigms-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8031828975636306709'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5964816804623588850/posts/default/8031828975636306709'/><link rel='alternate' type='text/html' href='http://digitheadslabnotebook.blogspot.com/2010/05/how-many-distinct-paradigms-of.html' title='How many distinct paradigms of programming are there?'/><author><name>Christopher Bare</name><uri>http://www.blogger.com/profile/01570188379488941406</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_dbECP0yvozc/SU2g-GpT8lI/AAAAAAAABi8/GIRitIOr4zo/S220/south_park_christopher_bare.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5964816804623588850.post-7567515972480281584</id><published>2010-05-26T17:14:00.000-07:00</published><updated>2010-05-26T17:15:46.743-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rant'/><category scheme='http://www.blogger.com/atom/ns#' term='crackpot theory'/><title type='text'>Attention, Intelligence, Creativity and Flow</title><content 
type='html'>&lt;p&gt;Most coders are aware of the importance of &lt;b&gt;pure uninterrupted concentration&lt;/b&gt;. Creative work of any kind requires focused attention. The state of &lt;em&gt;flow&lt;/em&gt; happens when the spotlight of attention is completely focused on an activity. A piece by &lt;a href="http://www.jonahlehrer.com/"&gt;Jonah Lehrer&lt;/a&gt;, &lt;a href="http://scienceblogs.com/cortex/2010/04/attention_and_intelligence.php"&gt;Attention and Intelligence&lt;/a&gt;, reminded me of that happy bubble so easily popped by meetings, spouses and pointy-haired bosses. He writes, "Our mind has strict cognitive limitations - selective attention helps us compensate."&lt;/p&gt;

&lt;p&gt;Herbert Simon said, "A wealth of information creates a poverty of attention."&lt;/p&gt;

&lt;p&gt;William James famously wrote, "Everyone knows what attention is... It implies withdrawal from some things in order to deal effectively with others."&lt;/p&gt;

&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Mihaly_Csikszentmihalyi"&gt;Mihaly Csikszentmihalyi&lt;/a&gt; tells us, "Flow describes a state of experience that is engrossing, intrinsically rewarding and outside the parameters of worry and boredom." The components of &lt;a href="http://en.wikipedia.org/wiki/Flow_(psychology)"&gt;Flow&lt;/a&gt; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clear goals.&lt;/li&gt;
&lt;li&gt;Attention is focused on a limited stimulus field. There is full concentration, complete involvement. Focus of awareness is narrowed down to the activity itself.&lt;/li&gt;
&lt;li&gt;A loss of self-consciousness; action and awareness merge.&lt;/li&gt;
&lt;li&gt;Immediate feedback; behavior can be adjusted as needed.&lt;/li&gt;
&lt;li&gt;Balance between ability and challenge.&lt;/li&gt;
&lt;li&gt;A sense of control and serenity; freedom from worry about failure. &lt;/li&gt;
&lt;li&gt;Timelessness; thoroughly focused on the present.&lt;/li&gt;
&lt;li&gt;Intrinsic motivation; the experience becomes its own reward, resulting in effortlessness of action.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Challenge_vs_skill.svg/300px-Challenge_vs_skill.svg.png"&gt;&lt;/img&gt;&lt;/p&gt;

&lt;p&gt;Not entirely unrelated is &lt;a href="http://www.ted.com/talks/dan_pink_on_motivation.html"&gt;Daniel Pink's analysis of motivation&lt;/a&gt;, which he breaks down into:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Autonomy&lt;/strong&gt;: The urge to direct our own lives&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mastery&lt;/strong&gt;: The desire to get better and better at something that matters&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: The yearning to do what we do in the service of something larger than ourselves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From &lt;em&gt;Palm Sunday&lt;/em&gt; by Kurt Vonnegut:&lt;/p&gt;
&lt;blockquote&gt;
Most of my adult life has been spent in bringing to some kind of order sheets of paper eight and a half inches wide and eleven inches long. This severely limited activity has allowed me to ignore many a storm. It has also caused many of the worst storms I ignored. My mates have often been angered by how much attention I pay to paper and how little attention I pay to them.&lt;/blockquote&gt;

&lt;blockquote&gt;I can only reply that the secret to success in every human endeavor is total concentration. Ask any great athlete.&lt;/blockquote&gt;

&lt;blockquote&gt;To put it another way: Sometimes I don&amp;apos;t consider myself very good at life, so I hide in my profession.&lt;/blockquote&gt;

&lt;blockquote&gt;I know what Delilah really did to Samson to make him as weak as a baby. She didn&amp;apos;t have to cut his hair off. All she had to do was break his concentration.&lt;/blockquote&gt;

&lt;p&gt;All this is a long way of saying multitasking sucks, or as someone with a fistful of yen might say, "What was that? This is not a charade. We need total concentration."&lt;/p&gt;

&lt;h4&gt;Links&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;TED talk &lt;a href="http://www.ted.com/talks/mihaly_csikszentmihalyi_on_flow.html"&gt;Mihaly Csikszentmihalyi on flow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://books.google.com/books?id=axRaiRhQ2CwC&amp;pg=PA47&amp;lpg=PA47&amp;dq=programming+uninterrupted+concentration"&gt;Effective leadership in adventure programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Maslow's_hierarchy_of_needs"&gt;Maslow's Hierarchy of Needs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.nytimes.com/2009/10/15/health/15chen.html?pagewanted=all"&gt;How Mindfulnes
