Digithead's Lab Notebook

Thursday, July 11, 2019

The New Digithead's Lab Notebook

For more state-of-the-art technological insight...

head on over to...

The New Digithead's Lab Notebook

Monday, June 05, 2017

Postgres lockup on ALTER TABLE

You can learn a lot when you break things.

I had a bit of fun and panic with Postgres the other day, while trying to do the equivalent of this innocuous-seeming drop column query:


ALTER TABLE aschema.sometable DROP COLUMN IF EXISTS unwanted_column;

So innocuous was this query that the column didn't even exist in production. It had already been dropped. I was checking in an Alembic migration to keep staging and dev DBs in line with prod. Harmless, right?

The deploy went live and things started going from green to red. Uh oh.

Well, it turns out that even a non-consequential ALTER TABLE operation takes a table lock. ...which should be totally fine, unless some long-running process has taken out an AccessShareLock and is sitting on it. There's a nice write-up on this exact situation here:

The [ALTER TABLE operation] tries to take the ACCESS EXCLUSIVE it needs, and it queues up behind the first lock. Now all future lock requests queue up behind the waiting ACCESS EXCLUSIVE request. Conceptually, incoming lock requests which are compatible with the already-granted lock could jump over the waiting ACCESS EXCLUSIVE and be granted out of turn, but that is not how PostgreSQL does it.

...in other words, screeching halt - database frozen. Crap!
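
To make the queueing behaviour concrete, here's a toy model in Python. It's only a sketch, not PostgreSQL's actual lock manager: it knows just two lock modes, and the names LockQueue, request and release are made up for illustration. The rule it encodes is the one from the quote: a request is granted only if it is compatible with everything already granted and nothing is waiting ahead of it.

```python
# Toy model of a per-table lock queue. Only two of PostgreSQL's lock modes
# are modelled; LockQueue, request and release are hypothetical names.
SHARE = "AccessShareLock"          # taken by a plain SELECT
EXCLUSIVE = "AccessExclusiveLock"  # taken by ALTER TABLE ... DROP COLUMN


def compatible(a, b):
    """In this two-mode model, only SHARE with SHARE can coexist."""
    return a == SHARE and b == SHARE


class LockQueue:
    def __init__(self):
        self.granted = []  # list of (session, mode)
        self.waiting = []  # FIFO list of (session, mode)

    def request(self, session, mode):
        """Grant at once only if nobody is queued ahead and the mode is
        compatible with every lock already granted; otherwise enqueue."""
        ok = not self.waiting and all(compatible(mode, m) for _, m in self.granted)
        (self.granted if ok else self.waiting).append((session, mode))
        return ok

    def release(self, session):
        """Drop the session's locks (or cancel its wait), then grant
        requests from the head of the queue for as long as they fit."""
        self.granted = [(s, m) for s, m in self.granted if s != session]
        self.waiting = [(s, m) for s, m in self.waiting if s != session]
        while self.waiting:
            _, mode = self.waiting[0]
            if not all(compatible(mode, g) for _, g in self.granted):
                break
            self.granted.append(self.waiting.pop(0))


q = LockQueue()
q.request("long_report", SHARE)     # granted: the long-running reader
q.request("migration", EXCLUSIVE)   # waits behind the reader
q.request("web_request", SHARE)     # waits too, stuck behind the migration
q.release("migration")              # stands in for cancelling the ALTER TABLE
```

After the last line, web_request is granted alongside long_report, which is the "unfreeze" that cancelling the blocked ALTER TABLE buys you.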

Postgres committer Josh Berkus wrote a pair of articles on the topic (ALTER TABLE and downtime, part 1 and ALTER TABLE and downtime, part 2). Great, now what?

Luckily, Postgres provides means to find and kill errant tasks. So, let's find and cancel the blocked ALTER TABLE query and/or the guilty party holding the shared lock.

In case you're ever in the same situation, here are a few key PostgreSQL incantations:

What locks are being held?


select d.datname, t.relname, l.relation, l.locktype, page, virtualtransaction, pid, mode, granted
from pg_locks l
left join pg_stat_all_tables t on l.relation=t.relid
left join pg_database d on l.database=d.oid
order by d.datname, relation asc;

What locks are being held on a particular table?


select * from pg_locks where granted and relation = 'my_table'::regclass \x\g\x

Who's breaking my database and what do they think they're doing?


select * from pg_stat_activity where pid=1234567 \x\g\x
-- note the pid; it is the argument to pg_cancel_backend(pid)

See what you get when you break my database?


select pg_cancel_backend(1234567);

...and finally, thanks to our friend pg_cancel_backend, our DB is unfrozen and we can go track down who had that lock held for so long.

Tuesday, May 24, 2016

Docker cheat sheet

If you're using Docker, here's a nice Docker cheat sheet. I've collected a few choice bits of Docker magic here.

Docker comes with a point-n-click way to start a shell with docker hooks attached. Here's an easier way:


eval "$(docker-machine env default)"

Terminology

Docker terminology has spawned some confusion. For instance: images vs. containers and registry vs. repository. Luckily, there's help, for example this Stack Overflow post by a brilliant, but under-appreciated, hacker on the difference between images and containers.

Registry - a service that stores image repositories
Repository - a set of Docker images, usually versions of the same application
Image - an immutable snapshot of a running container. An image consists of layers of file system changes stacked up on top of a base image.
Container - a runtime instance of an image

Working with Docker

We don't need to ssh into the container. Maybe you could call this "shelling" into a container?


docker run --rm -it ubuntu:latest bash

... more here later. In the meantime, see wsargent's cheat sheet ...

The Dockerfile is a version-controllable artifact that automates the creation of a customized image from a base image. There's plenty of good advice in the guide to Best practices for writing Dockerfiles.


docker build -t myimagename .

Clean

You should almost always run with the --rm flag to avoid leaving shrapnel around. But, that's easy to forget. If there are lots of old containers hanging around, this bit of magic will help:


docker ps -a | grep 'weeks ago' | awk '{print $1}' | xargs --no-run-if-empty docker rm
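
If the pipeline looks cryptic, here's roughly what the grep and awk stages do, sketched in Python. The function name stale_container_ids and the sample output are made up for illustration, and nothing here actually calls Docker.

```python
def stale_container_ids(ps_output, marker="weeks ago"):
    """Mimic `grep 'weeks ago' | awk '{print $1}'` on `docker ps -a` output."""
    ids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if marker in line and fields:
            ids.append(fields[0])  # first column is the container ID
    return ids


# Made-up sample output, for illustration only.
SAMPLE = """\
CONTAINER ID   IMAGE           COMMAND   CREATED        STATUS                   NAMES
4c01db0b339c   ubuntu:latest   "bash"    3 weeks ago    Exited (0) 3 weeks ago   sad_kirch
d7886598dbe2   redis:latest    "redis"   2 hours ago    Up 2 hours               cache
"""

ids = stale_container_ids(SAMPLE)
# The argument list you could hand to subprocess.run(); nothing is executed here.
# Skipping the command when there are no IDs plays the role of --no-run-if-empty.
command = ["docker", "rm"] + ids if ids else None
```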

Images can pile up, too. In What are Docker <none>:<none> images?, Shishir Mahajan shows how to clean up dangling images:


docker rmi $(docker images -f "dangling=true" -q)

Tuesday, May 03, 2016

Topic Modeling with LDA

Rob McDaniel gave a nice presentation on the flaming-hot topic of topic analysis yesterday evening, hosted by Seattle metastartup Pitchbook. Grab slides and code from the GitHub repo.

Rob is interested in using NLP to discern the level of objectivity or bias in text. As an example, he took the transcripts of the debates of this year's presidential campaign. Here's part of what he did with them:

For more, have a look at the post on Semantic analysis of GOP debates.

Interesting tidbits:

Wikipedia is a source of documents labeled as not objective.
Movie reviews are a source of documents labeled by rating, number of stars.
Topic cohesion measures how well a given document stays "on-topic" or even "on-message".
KL divergence is an entropy-based measure of the relatedness of topics.
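
For the KL divergence item, a minimal sketch of the textbook definition, treating a topic as a probability distribution over the same vocabulary. This is not code from Rob's repo; the function name and the toy distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions over the same words.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two toy "topics" over a two-word vocabulary.
topic_a = [0.5, 0.5]
topic_b = [0.25, 0.75]

same = kl_divergence(topic_a, topic_a)     # identical topics diverge by zero
forward = kl_divergence(topic_a, topic_b)
backward = kl_divergence(topic_b, topic_a)  # not symmetric: differs from forward
```

Because it isn't symmetric, KL divergence is a measure of how far one topic is from another rather than a true distance.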

There was an interesting side discussion of the orthogonality of topic modeling and word embedding (word2vec).

Some of the sources Rob mentioned were Tethne and one of its tutorials, as well as a pair of papers, Introduction to Probabilistic Topic Models and Probabilistic Topic Models, both by David Blei.

Sunday, May 01, 2016

Future looks bright

Way back in December of 2008, Python 3.0 was released. Seven years later, Python 3 is finally gaining traction.

Why the wait? Broken backward compatibility? Lagging library support? Print's annoying new parentheses? Well, coders are a cranky lot, often not fond of change. Some even thought Python 3 would be the end of Python, suffering a fate similar to Perl 6. There was a bit of controversy:

The story holds some lessons.

Initially, the Python core developers seem to have imagined that people would take the plunge sooner or later, migrating to Python 3 and never looking back. But, there was something of a chicken-and-egg problem. Library maintainers were confronted with the unappealing prospect of supporting two code bases for some transition period of unknown duration. Python application developers were reluctant to upgrade until all their dependencies had made the move. Libraries and applications saw the majority of their peers waiting to upgrade. So, nothing happened. Python 3 adoption languished.

It took a while, but the Python community came up with a better solution. Tools like future and six enable a single code base to support both Python 2 and Python 3. Mostly this works by backporting the new Python 3 idioms to Python 2 allowing existing code to adopt new idioms at a measured pace while maintaining backwards compatibility. This strategy makes a ton of sense in a language as dynamic as Python. And with it, the deadlock was broken and migration could get underway. End of Life for Python 2 has been moved out to 2020, so those shims will be in place for a long time.
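
To give a flavor of how such shims work, here's a hand-rolled sketch. The names PY2, text_type and iteritems mirror ones that six provides, but this is not six's source, and describe is a made-up function standing in for application code.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)

    def iteritems(d):
        return d.iteritems()
else:
    text_type = str

    def iteritems(d):
        return iter(d.items())


def describe(counts):
    """Application code written once, against the shim rather than the interpreter."""
    pairs = sorted(iteritems(counts))
    return ", ".join(text_type("{0}={1}").format(k, v) for k, v in pairs)


summary = describe({"b": 2, "a": 1})
```

The version check lives in one place, and everything else imports the compatible names from there.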

Contrast the Python 3 story with that of Java where a very high level of backwards compatibility is the standard. At this point, Java folks may be congratulating themselves on making the right call. But, backwards compatibility comes at a cost. There's no denying that those constraints have limited the evolution of the language. Java generics are marred by type erasure. Widely recognized warts in the standard libraries persist - for example, the inconsistency of string.length(), array.length, and collection.size(). Swing is as clunky as ever. Date support in the standard library has only recently been upgraded after years of criticism. Advanced practitioners like those behind Spring resort to esoteric means like byte-code manipulation to extend the capabilities of the language.

At the same time, the JVM became a laboratory for programming language design. Scala is essentially an evolved Java. Clojure brings LISPy metaprogramming to the JVM. These advanced JVM languages are a big influence on modern Java.

So, finally, Python 3 is humming along. The long incubation period has not been without advantages. Python 3.5 is very polished. Most prominent libraries support it. Usage is accelerating. Core developers promise Python 4.0 won't be like Python 3.0.

What lessons will the Python community and that of other technologies take away from the Python 3 experience? That breaking backwards compatibility is a terrible, terrible thing that should never be contemplated again? That's a progress-limiting point of view. I hope there are lessons in how to manage substantial change and how to avoid fracturing the community in the process.

Careful attention needs to be paid to a smooth transition. Working code should continue to work while experiments continue to push forward with new and potentially better idioms. Those that offer a real improvement will, in time, take over, but gradually, not as a step function. Periods of stability may alternate with accelerated change. End-users should be firmly in control of how cutting-edge they want to be. An essential ingredient is respect for differing rates of change, differing tolerance for risk, etc. There are those who thrive on the new and those who count on the tried and true. But, living breathing technologies need to evolve.

“In the long run every program becomes rococo, then rubble.”

...but not yet! ...can't wait 'til PyCon in a few weeks!

Sunday, November 22, 2015

Rdio was too good to last

When Apple killed Lala, there was a bad guy with motive, opportunity and a smoking gun. In the case of Rdio's recently announced demise, it's different. A postmortem by Casey Newton of the Verge explains Why Rdio died. The team had the design chops, engineering talent and love of music to create a fantastic product but lacked the business and marketing savvy to make it pay off.

Rdio's star feature was certainly design. The site enabled individuals to express themselves and relate to others - for their enthusiasm to feed off one another. You could follow people with intersecting tastes. New listening was suggested algorithmically based on what was "On the rise in your network". You could comment and respond, lovingly curate playlists and follow activity by your musical soul-mates. The social dimension is key to something as personal and tribal as music.

If you wanted to take art pop seriously, do a deep dive into electronic music, exhaustively survey the Zappa catalog, or peruse the archives of 5 years of Sunday Jazz Brunch selections, your people were there. Any platform will play music you already know; Rdio was a place to explore.

In some ways, credit belongs to the users themselves - those who shoehorned rich conversations into a relatively bare-bones comment feature, repurposing shared playlists as the equivalent of discussion forums. In one case, Community Playlist the Trilogy had some 3,522 comments and 116 collaborators.

It had, in a word, community.

Exodus

Refugees looking for a new musical home can find lots of resources on the Rdio lover's slack channel, including a compilation of tools for exporting playlists and other digital assets. A Python script by Jesse Mullan (playlist_helper), which will soon become the official data exporter, worked nicely for me.

There were several calls for a platform-neutral place for the community to live, independent of which streaming service folks end up migrating to. Some nascent possibilities are The Playlist or Hatchet. Last.fm fills that role for me, at least for now. Maybe the Rdio Lover's slack channel will survive beyond the transition period.

Alternatives

Which service comes the closest to Rdio? Users on the slack channel have compiled a helpful guide Rdio features compared to the competitors. Roughly in order of how interesting they look to me, the main contenders are:

Why?

In A Eulogy for Rdio in the Atlantic, Robinson Meyer calls Rdio "a better streaming service in most every way". So, why did a great service with an intensely loyal following fail?

The economics of digital music are tricky. None of the streaming services are really making money. Rdio's $1.5M in monthly revenue, corresponding to perhaps 150,000 paying users, and $100-150k in advertising couldn't cover their costs of roughly $4M mainly for 140 employees and royalties. This explains the nasty pile-up of $220 million in debt. Music has been called Too free to be expensive. Too expensive to be free.

Pandora is buying Rdio's intellectual property and taking on some of the talent with the intention of introducing their own on-demand streaming service some time in 2016. Interestingly, customer data was not part of that transaction.

One message I hope no one takes away is that community doesn't matter. It's one of the few ways for streaming services to differentiate themselves. Without it, you feel "solitary, lonely and probed" in the characteristic phrasing of CAW a.k.a. The Aquatic Ape.

So long

So, we're left with a reminder that the best doesn't always win. No doubt, us "snobby album purists" will find or coopt another platform on which to indulge our musical obsessions. Keep in touch, music peeps: http://www.last.fm/user/cbare.

“To all of you that have expanded my musical experience for the past six or so years - thank you, thank you thank you.”

“Man, I'm gonna miss my playlists.”

“I spent A LOT of time here.”

“Thank you all for introducing me to some truly great music (and some truly terrible, which I enjoyed nearly as much).”

“from the start it has been my most active social network”

“perhaps the kindest community in online music.”

“People left comments on albums, and, lo and behold, the writing was good and interesting.”

“you all have been invaluable in helping me not just discover new music, but in helping me open my mind to new kinds of music.”

“such a welcoming and amazing crew of fellow travelers”

...parting comments from Rdio users collected by fangoguagua.

Tuesday, August 04, 2015

Hacking Zebrafish thoughts

The last lab from Scalable Machine Learning with Spark features a guest lecture by Jeremy Freeman, a professor of neuroscience at Janelia Farm Research Campus.

Vladimirov et al., 2014

His group produced this gorgeous video of a living zebrafish brain. Little fish thoughts sparkle away, made visible by a technique called light-sheet fluorescence microscopy, in which proteins that light up when the neurons fire are engineered into the fish.

The lab covers principal component analysis in a lively way. Principal components are extracted from time-series data and mapped onto an HSV color wheel and used to color an image of the zebrafish brain. In the process, we use some fun matrix manipulation to aggregate the time series data in two different ways - by time relative to the start of a visual stimulus and by the directionality of the stimulus (shown below).
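
The color-wheel step can be sketched with nothing but the standard library. This is my own guess at the idea, not the lab's code: treat a pixel's scores on the top two principal components as a point in the plane, turn its angle into a hue and its distance from the origin into brightness. The function name pc_to_rgb and the max_radius parameter are assumptions.

```python
import colorsys
import math


def pc_to_rgb(pc1, pc2, max_radius=1.0):
    """Map a pair of principal-component scores to an (r, g, b) tuple in 0..1.

    Angle around the origin becomes hue; distance from the origin becomes
    brightness, clipped at max_radius.
    """
    angle = math.atan2(pc2, pc1)                # -pi .. pi
    hue = (angle + math.pi) / (2 * math.pi)     # 0 .. 1
    value = min(math.hypot(pc1, pc2) / max_radius, 1.0)
    return colorsys.hsv_to_rgb(hue, 1.0, value)


dark = pc_to_rgb(0.0, 0.0)      # no signal on either component: black
strong = pc_to_rgb(1.0, 0.0)    # full brightness, hue set by the angle
clipped = pc_to_rgb(10.0, 0.0)  # same angle, brightness clipped at 1.0
```

Pixels with similar response profiles land at similar angles and so get similar colors, which is what makes the resulting brain image readable.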

The whole series of labs from the Spark classes was nicely done, but this was an especially fun way to finish it out.

Check out the Freeman Lab's papers:

Light-sheet functional imaging in fictively behaving zebrafish, Vladimirov et al., 2014
Mapping brain activity at scale with cluster computing Freeman et al., 2014

Tuesday, July 21, 2015

Machine learning on music data

The 3rd lab from Scalable Machine Learning with Spark has you predict the year a song was published based on features from the Million Song Dataset. How much farther could you take machine analysis of music? Music has so much structure that's so apparent to our ears. Wouldn't it be cool to be able to parse out that structure algorithmically? Turns out, you can.

from Music Information Retrieval - Juan P Bello

Apparently The International Society for Music Information Retrieval (ISMIR) is the place to go for this sort of thing. A few papers, based on minutes of rigorous research (aka random googling):

In addition to inferring a song's internal structure, you might want to relate its acoustic features to styles, moods or time periods (as we did in the lab). For that, you'll want music metadata from sources like:

There's a paper on the Million Song Dataset by two researchers at Columbia's EE department and two more at the Echo Nest.

Even Google is interested in the topic: Sharing Learned Latent Representations For Music Audio Classification And Similarity.

Tangentially related, a group out of Cambridge and Stanford say Musical Preferences are Linked to Cognitive Styles. I fear what my musical tastes would reveal about my warped cognitive style.

Wednesday, July 08, 2015

Scalable Machine Learning with Spark class on edX

Introduction to Big Data with Apache Spark is an online class hosted on edX that just finished. Its follow-up Scalable Machine Learning with Spark just got started.

If you want to learn Spark - and who doesn't? - sign up.

Spark is a successor to Hadoop that comes out of the AMPLab at Berkeley. It's faster for many operations due to keeping data in memory, and the programming model feels more flexible in comparison to Hadoop's rigid framework. The AMPLab provides a suite of related tools including support for machine learning, graphs, SQL and streaming. While Hadoop is most at home with batch processing, Spark is a little better suited to interactive work.

The first class was quick and easy, covering Spark and RDDs through PySpark. No brain stretching on the order of Daphne Koller's Probabilistic Graphical Models to be found here. The lectures stuck to the "applied" aspects, but that's OK. You can always hit the papers to go deeper. The labs were fun and effective at getting you up to speed:

Labs for the first class:

Word count, the hello world of map-reduce
Analysis of web server log files
Entity resolution using a bag-of-words approach
Collaborative filtering on a movie ratings database. Apparently, I should watch these: Seven Samurai, Annie Hall, Akira, Stop Making Sense, Chungking Express.
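
For a taste of the first lab, here is word count in the map-reduce shape, written in plain Python rather than PySpark so it runs anywhere. It's a sketch of the idea, not the lab's solution; in Spark the same shape is usually spelled with flatMap, map and reduceByKey on an RDD.

```python
from collections import Counter
from functools import reduce


def word_count(lines):
    """Count words across lines with an explicit map step and reduce step."""
    # "map": each line independently becomes its own tally of words
    mapped = (Counter(line.lower().split()) for line in lines)
    # "reduce": merge the per-line tallies pairwise into one
    return reduce(lambda a, b: a + b, mapped, Counter())


counts = word_count(["to be or", "not to be"])
```

The point of the shape is that the map step needs no shared state and the merge is associative, so both can be spread across machines.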

The second installment looks to be very cool, delving deeper into MLlib, the AMPLab's machine learning library for Spark. Its labs cover:

Musicology: predict the release year of a song given a set of audio features
Prediction of click-through rates
Neuroimaging Analysis on brain activity of zebrafish (which I suspect is the phrase "Just keep swimming" over and over) done in collaboration with Jeremy Freeman of the Janelia Research Campus.

The labs for both classes are authored as IPython notebooks in the amazingly cool Jupyter framework where prose, graphics and executable code combine to make a really nice learning environment.

Echoing my own digital hoarder tendencies, the first course is liberally peppered with links, which I've dutifully culled and categorized for your clicking compulsion:

The Data Science Process

In case you're still wondering what data scientists actually do, here it is according to...

Jim Gray

Capture
Curate
Communicate

Ben Fry

Acquire
Parse
Filter
Mine
Represent
Refine
Interact

Jeff Hammerbacher

Identify problem
Instrument data sources
Collect data
Prepare data (integrate, transform, clean, filter, aggregate)
Build model
Evaluate model
Communicate results

...and don't forget: Jeffrey Leek and Hadley Wickham.

Tuesday, June 02, 2015

Beyond PEP 8 -- Best practices for beautiful intelligible code

I didn't really mean to become a Python programmer. I was on my way to something with a little more rocket-science feel. R, Scala, Haskell, maybe. But, since I'm here, I may as well learn something about how to do it right. In this respect, I've become a fan of Raymond Hettinger.

Python coders will enjoy and benefit from Raymond's excellent talk given at PyCon 2015 about Python style, Beyond PEP 8 -- Best practices for beautiful intelligible code.

"Who should PEP-8-ify code? The author. PEP 8 unto thyself not unto others."

To Hettinger, PEP-8 is not a weapon for bludgeoning rival developers into submission. Going beyond PEP 8 is about paying attention to the stuff that really matters - using languages features like magic methods, properties, iterators and context managers. Business logic should be clear and float to the top. In short, writing beautiful idiomatic Pythonic code.
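
A small sketch of what those language features look like together. The Playlist class and the muted helper are made up for this post, not taken from the talk.

```python
from contextlib import contextmanager


class Playlist:
    """A thin wrapper that behaves like a native Python sequence."""

    def __init__(self, tracks):
        self._tracks = list(tracks)  # (title, seconds) pairs

    def __len__(self):               # magic method: len(playlist)
        return len(self._tracks)

    def __getitem__(self, index):    # magic method: indexing and iteration
        return self._tracks[index]

    @property
    def total_seconds(self):         # property: reads like an attribute
        return sum(seconds for _, seconds in self._tracks)


@contextmanager
def muted(log):
    """Context manager: setup and teardown wrap the business logic."""
    log.append("mute")
    try:
        yield
    finally:
        log.append("unmute")


playlist = Playlist([("So What", 562), ("Naima", 261)])
log = []
with muted(log):
    titles = [title for title, _ in playlist]
```

With the plumbing tucked into the class and the context manager, the last three lines read like the business logic they are.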

There are plenty more videos from PyCon 2015 where that one came from.

Monday, March 09, 2015

Extended Lake Union Loop

The standard running loop around Lake Union is a touch over 6 miles. With the addition of a side loop around Portage Bay, you can bring it up to 8 and a half, taking in a bit of UW's campus and crossing over the cut into Montlake. Sticking to the water's edge keeps the terrain nice and flat, but if you want some climbing, head up into Capitol Hill via Interlaken park.

Here, I've factored in a stop at PCC for a cold drink.

Tuesday, January 27, 2015

Haskell class wrap-up

[From the old-posts-that-I've-sat-on-for-entirely-too-long-for-no-apparent-reason department...]

Back in December, I finished FP101x, Introduction to Functional Programming. I'm stoked that I finally learned me a (little) Haskell, after wanting to get around to it for so long.

The first part of the course was very straight-forward covering the basics of programming in the functional style. But the difficulty ramped up quickly.

A couple of labs were particularly mind-bending, and not just for me, judging by the message boards. Both were based on Functional Pearl papers and featured monads prominently. The first was on monad parser combinators and the second was based on A Poor Man's Concurrency Monad. Combining concurrency (of a simple kind), monads and continuation passing is a lot to throw at people at once.
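
For anyone who hasn't met parser combinators, here's the core idea transliterated into Python. It's a loose sketch under my own naming (unit, bind, item, sat, many1), not the lab's Haskell: a parser is a function from a string to a list of (value, remaining input) pairs, and an empty list means failure.

```python
def unit(value):
    """A parser that consumes nothing and succeeds with value."""
    return lambda s: [(value, s)]


def item(s):
    """Consume one character, or fail on empty input."""
    return [(s[0], s[1:])] if s else []


def bind(parser, f):
    """Run parser, feed each result to f to get the next parser, run that."""
    def run(s):
        return [pair for value, rest in parser(s) for pair in f(value)(rest)]
    return run


def sat(pred):
    """Consume one character only if it satisfies pred."""
    return bind(item, lambda c: unit(c) if pred(c) else (lambda s: []))


def many1(parser):
    """One or more repetitions, taking the first (greedy) result each time."""
    def run(s):
        results = parser(s)
        if not results:
            return []
        value, rest = results[0]
        more = run(rest)
        if more:
            values, rest_after = more[0]
            return [([value] + values, rest_after)]
        return [([value], rest)]
    return run


# Digits, then a conversion step sequenced with bind.
number = bind(many1(sat(str.isdigit)), lambda ds: unit(int("".join(ds))))

parsed = number("42abc")
failed = number("abc")
```

The bind function is the monadic part: it threads the remaining input from one parser to the next so that the code using it never has to.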

The abrupt shift to more challenging material is part of a philosophy of "teaching the students to fish for themselves". So is introducing new material in the labs rather than in the lectures. This style of teaching alienated a number of students. It's not my favorite, but I can roll with it.

Just be aware that the course requires some self-directed additional reading, and don't flail around trying to solve the homeworks without sufficient information.

More Haskell

Now that the class is over, I'd like to find time to continue learning Haskell:

Finish Learn You a Haskell
Write yourself a Scheme in 48 hours
99 Haskell Problems
Typeclassopedia
Chris Allen, aka bitemyapp has a really great How to learn Haskell guide as well as some interesting thoughts about furthering your functional education
The Haskell Road to Logic, Maths and Programming

One reason I wanted to learn Haskell is to be able to read some of the Haskell-ish parts of the programming languages literature:

Why Functional Programming Matters - John Hughes
Theorems for free! - Philip Wadler
The essence of functional programming - Philip Wadler
A history of Haskell: being lazy with class - Paul Hudak, John Hughes, Simon Peyton Jones, Philip Wadler

Monday, January 12, 2015

Brave Genius

Brave Genius is an unlikely dual biography of a biologist and a writer who shared a friendship and a common philosophy. Both were active in the French resistance to the German Occupation and both would later receive a Nobel prize. Sean B. Carroll forges an inspiring story from seemingly incongruous elements: the desperate defiance of a few in an occupied country, the exhilarating pursuit of an open scientific question, and a lonely stand on the moral high ground.

In 1940, Jacques Monod was a newly married father of twins and a researcher at the Sorbonne. Albert Camus, having already published a couple of books of essays, departed his native Algeria for France in March of that year to find work.

On May 10, 1940, German troops crossed into Holland and Belgium. Panzers raced towards the Atlantic coast, severing Allied lines and stranding French and British troops in the Low Countries. French defenses collapsed and Germans arrived in an undefended Paris on June 14. The armistice signed on June 22nd marked the beginning of four years of occupation.

During those years, Camus edited and wrote for the underground newspaper Combat urging resistance to the occupation. As the tide of the war turned, Monod organized sabotage attacks and armed resistance ahead of the approaching liberators.

“I have always believed that if people who placed their hopes in the human condition were mad, those who despaired of events were cowards. Henceforth, there will be only one honorable choice: to wager everything on the belief that in the end words will prove stronger than bullets.” Camus, Combat (November 30, 1946)

François Jacob, André Lwoff and Jacques Monod were awarded a Nobel prize in 1965 for their work on the control of gene expression, elucidating the regulation of the lac operon by which bacteria switch on metabolism of the sugar lactose.

In his writing, Camus confronts the absurdity of the human search for clarity and meaning in a world that offers only indifference. The attempt to derive meaning and morality without resort to mysticism links Camus's philosophy to Monod's scientific work, which provided some of the first direct evidence that life is mechanistic rather than the result of some magical "vital force" and that its workings could be understood.

“The scientific approach reveals to Man that he is an accident, almost a stranger in the universe.” Monod, in On Values in the Age of Science (1969)

“One of the great problems of philosophy, is the relationship between the realm of knowledge and the realm of values. Knowledge is what is; values are what ought to be. I would say that all traditional philosophies up to and including Marxism have tried to derive the 'ought' from the 'is.' My point of view is that this is impossible.” Monod

Carroll, a biologist himself, embeds philosophy and science into the personal lives of his protagonists and the geopolitical events unfolding around them. Both men did brilliant work in the darkest of times, and did so not by retreating but by fully engaging at great risk with the struggles that faced them. The book serves as a warning of what happens when good people overlook the malfeasance of their leaders, but also as confirmation of the resilience of intellect, creativity and humanity.

In the review paper Genetic Regulatory Mechanisms in the Synthesis of Proteins, published in 1961, François Jacob, André Lwoff and Jacques Monod synthesized the state of the art of molecular biology at the time.
What is Life?, written in 1943-44 by the physicist Erwin Schrödinger, inspired a generation of scientists by conjecturing some unknown "aperiodic solid" containing "in some kind of code-script the entire pattern of the individual's future development and of its functioning in the mature state."
The Birth of the Operon is François Jacob's essay on "night science".
Chance and Necessity, Jacques Monod's 1971 book: "For some time now, the unpleasant idea has been dawning on mankind that it may owe its existence to nothing but a roll of some cosmological dice."
Sean B. Carroll gave a wonderfully entertaining lecture based on another of his books, Remarkable Creatures
Jacques Monod and the origins of molecular biology, a review of the book Origins of molecular biology – A tribute to Jacques Monod.

Sunday, January 04, 2015

The Master Switch

The Master Switch: The Rise and Fall of Information Empires was described as "essential reading" by my boss's boss. If you're at all interested in the interplay of technology, economics and politics, I think you'll agree.

Author Tim Wu is the originator of the term "net neutrality" and a law professor at Columbia. He has written a fast-forward history of the information technology industry focusing on the people and corporations that have, over time, controlled the commanding heights of the information economy. The book examines the cartels that held sway over telephone, radio, film, and television leading up to the question of whether the internet will also come to fall under similar domination.

The cycle is the author's term for the progression of any given technology from the wide-open wild-west early days through a process of integration and consolidation to an end state of oligopoly or monopoly. This stasis eventually gets disrupted by newer technology or government intervention, leading to another open phase and a new round of the cycle, empires rising and falling in the process. "The one-time revolutionaries always become the next generation of dictators. That's why we need, in technology, another generation of revolutionaries to upend them."[1]

Open vs. closed systems

The book revolves around the virtues and vices of open and closed systems. Open systems are more adaptable and democratic but have trouble matching the stability, security and efficiency of closed systems. Open systems embrace the advantages of decentralization as espoused in different ways by Friedrich Hayek and Jane Jacobs. But, integrated centralized systems can be reliable and convenient.

Closed systems, of course, appeal to empire builders such as Theodore Vail, who created the AT&T Bell System. Wu's knack for sketch biography is put to good use profiling these power-hungry moguls and the often utopian upstarts that seek to dethrone them. We meet titans, like Vail, and get a glimpse into the sometimes contradictory character traits it takes to control an information empire, for example: David Sarnoff, who ruled the Radio Corporation of America (RCA) and NBC; John Reith, founder of the BBC; Adolph Zukor, who started Paramount Pictures; and Ted Turner, creator of CNN and former head of Time Warner. We also meet hackers like early radio enthusiast Lee De Forest and suppressed inventor of FM radio Edwin Armstrong.

The capture of the Internet?

The American system attempts to carefully balance power within the government, but takes a laissez faire approach to private power. If Wu is right and we let things take their natural course, the openness that now characterizes the Internet - the "integrity of the Internet itself as a reliable, independent, and open structure"[2] - may be lost to a period of lockdown. Network effects, the power of integration and economies of scale favor the monopolist. Consumers may decide to favor consistency and convenience over openness and choice only to regret it later. If this is the case, the internet will not remain open automatically but only with concerted effort.

The remedy Wu proposes is a principle of separation akin to the separation of church and state or the separation of powers within the branches of the American government. The common carrier obligation of all infrastructure providers implies net neutrality and opposes vertical integration across layers of the network stack. Technology leaders would be expected to self-regulate based on a sense of public duty. The FCC should pursue enforcement with an eye to the special role of information technology in a democratic society. Anti-trust regulation is the back-up, when it's time to bring out the big guns.

Fight on

The Master Switch gives a deeper perspective on the great game playing out in the technology sector. After reading it, you'll recognize the historical themes threading through the open-source movement, the Apple vs Google skirmishes or 2012's battle that defeated the SOPA / PIPA acts. The fight over the future of the Internet is surely not over.

[1] NPR interview The Past And Future Of Information Empires
[2] Vanity Fair's reporting on SOPA and PIPA World War 3.0
Colbert interviews Tim Wu End of Net Neutrality
The New Yorker's review Tim Wu on Communication, Chaos, and Control
NYT book review From Hobby to Industry By David Leonhardt
One on One: Tim Wu, Author of ‘The Master Switch’
Ars technica disputes the book's premise.
The Manichean World of Tim Wu
Why the man who coined the phrase 'net neutrality' feared Apple
Speaking at the MIT Media Lab
WGBHForum at the Harvard Book Store
Speaking at the Berkman Center for Internet & Society at Harvard
Stanford Center for Internet and Society
Tim Wu's TED talk
General Assembly and Tim Wu: Net Neutrality and the Politics of Entrepreneurship
Review of The Master Switch by "High Tech History"

Wednesday, December 24, 2014

What the #@$% is a Monad?

Monads are like fight club. The first rule of monads is don't blog about monads.

Kind of a design pattern for functional programming, monads are already the subject of more than enough well-intentioned but confusing tutorials. We'll not commit the monad tutorial fallacy here. But, monads are needed for a couple of the labs from FP101x, an online class in Haskell - labs with a throw-'em-into-the-deep-end quality to them.

Here's a quick list of some of the better resources I found, while struggling to get a handle on these super-abstract objects of mystery.

Starting points

Functors, Applicatives, And Monads In Pictures, by Aditya Bhargava, is a friendly place to start.
Brian Beckman: Don't fear the Monad
FPComplete's School of Haskell explains what monads are and why to use them.

Philip Wadler

It's been said that "Monads are hard because there are so many bad monad tutorials getting in the way of finally finding Wadler’s nice paper." Find it here:

Monads for functional programming

Need more?

Those got me over the first hump, but here are some I may want to come back to later:

You Could Have Invented Monads! (And Maybe You Already Have)
Monads for the Curious Programmer by Bartosz Milewski who used to be in Black Sabbath. See also Milewski's easiest way to understand Monads.
Monads and Gonads a Google Tech-Talk by Douglas Crockford
Is There Anything Left to Say about Monads?
Monads in small bites
Monads for Dummies
What are monads and why are they useful?

To put monads in a more general context, here's a really great guide to Getting started with Haskell.

Wednesday, December 03, 2014

Lee Edlefsen on Big Data in R

Lee Edlefsen, Chief Scientist at Revolution Analytics, spoke about Big Data in R at the FHCRC a week or two back. He introduced the PEMA or parallel external memory algorithm.

“Parallel external memory algorithms (PEMA's) allow solution of both capacity and speed problems, and can deal with distributed and streaming data.”

When a problem is too big to fit in memory, external memory algorithms come into play. The data to be processed is chunked and loaded into memory a chunk at a time, and the partial results from each chunk are combined into a final result:

initialize
process chunk
update results
process results

Edlefsen made a couple of nice observations about these steps. Processing an individual chunk can often be done independently of other chunks. In this case, it's possible to parallelize. If updating results can be done as new data arrives, you get streaming.
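To make the four steps concrete, here's a minimal sketch of a chunked mean. It's written in Haskell rather than R, it is not the RevoPemaR API, and every name in it (`Stats`, `processChunk`, `update`, `finalize`, `chunkedMean`) is made up for illustration. An in-memory list split into chunks stands in for data read from disk.

```haskell
import Data.List (foldl')

-- Running state: how many values we've seen and their sum (hypothetical type).
data Stats = Stats { count :: !Int, total :: !Double } deriving (Show, Eq)

-- Step 1, initialize: the empty result.
initial :: Stats
initial = Stats 0 0

-- Step 2, process chunk: depends only on the chunk itself,
-- so different chunks could be handled by different workers.
processChunk :: [Double] -> Stats
processChunk xs = Stats (length xs) (sum xs)

-- Step 3, update results: fold one partial result into the running result.
update :: Stats -> Stats -> Stats
update (Stats n s) (Stats m t) = Stats (n + m) (s + t)

-- Step 4, process results: turn the accumulated state into the answer.
finalize :: Stats -> Maybe Double
finalize (Stats 0 _) = Nothing
finalize (Stats n s) = Just (s / fromIntegral n)

-- Split a list into chunks of at most k elements (stand-in for chunked I/O).
chunksOf :: Int -> [a] -> [[a]]
chunksOf k xs
  | k <= 0    = error "chunksOf: chunk size must be positive"
  | null xs   = []
  | otherwise = let (chunk, rest) = splitAt k xs in chunk : chunksOf k rest

-- The whole pipeline: chunk, process each chunk, merge, finalize.
chunkedMean :: Int -> [Double] -> Maybe Double
chunkedMean k = finalize . foldl' update initial . map processChunk . chunksOf k
```

Because `processChunk` never looks at the running state, the `map` is the part that could be parallelized, and because `update` only needs the previous result plus one new partial result, the fold could consume chunks as they arrive.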

Revolution has developed a framework for writing parallel external memory algorithms in R, RevoPemaR, making use of R reference classes.

I couldn't find Edlefsen's exact slides, but these decks on parallel external memory algorithms and another from UseR 2011 on Scalable data analysis in R seem to cover everything he talked about.

Saturday, November 22, 2014

Haskell class, so far

Well, I'm about 5 weeks into Introduction to Functional Programming, a.k.a. FP101x, an online class taught in Haskell by Erik Meijer. The class itself is a couple weeks ahead of that; I'm lagging a bit. So, how is it so far, you ask?

The first 4 weeks covered basic functional concepts and how to express them in Haskell, closely following chapters 1-7 of the book, Graham Hutton's Programming in Haskell:

Defining and applying functions
Haskell's type system
- parametric types
- type classes
- type signatures of curried functions
pattern matching
list comprehensions
recursion
higher-order functions

Haskell's hierarchy of type classes is elegant, but some obvious things seem to be missing. For example, you can't show a function. But, it would be really helpful to show something like a docstring, or at least the function's type signature. Also, machine-word-sized Ints don't automatically promote, so if n is an Int, n/5 produces a type error.
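For the Int case, the workaround is to say which kind of division you mean. A small sketch (the function names `fifthInt` and `fifth` are mine, just for illustration):

```haskell
-- Integer division stays within Int and truncates toward negative infinity.
fifthInt :: Int -> Int
fifthInt n = n `div` 5

-- Fractional division needs an explicit conversion out of Int first;
-- writing  n / 5  with n :: Int is rejected because Int has no Fractional instance.
fifth :: Int -> Double
fifth n = fromIntegral n / 5
```

Here `fromIntegral` does the promotion by hand that other languages do implicitly.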

Most of the concepts were familiar already from other functional languages: Scheme via SICP, OCaml via Dan Grossman's programming languages class, and Clojure via The Joy of Clojure. So, this early part was mostly a matter of learning Haskell's syntax.

Some nifty examples

a recursive definition of factorial:

```
factorial :: Integer -> Integer
factorial 0 = 1
factorial n = n * (factorial (n-1))
```

sum of the first 8 powers of 2:
```
sum (map (2^) [0..7])
```

a recursive definition of map:

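```
-- Hide the Prelude's map so this definition doesn't clash with it.
import Prelude hiding (map)

map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = f x : map f xs
```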

get all adjacent pairs of elements from a list:

```
pairs :: [a] -> [(a,a)]
pairs xs = zip xs (tail xs)
```

check if a list of elements that can be ordered is sorted by confirming that each pair of elements is ordered:
```
sorted :: Ord a => [a] -> Bool
sorted xs = and [x <= y |(x,y) <- pairs xs]
```

Haskell attains its sparse beauty by leaving a lot implied. One thing I figured out during my brief time with OCaml also seems to apply to Haskell. Although these languages lack the forest of parentheses you'll encounter in Lispy languages, it's not that the parentheses aren't there; you just can't see them. A key to reading Haskell is understanding the rules of precedence, associativity and fixity that imply the missing parentheses.

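| Precedence | Left associative | Non-associative | Right associative |
|---|---|---|---|
| 9 | `!!` | | `.` |
| 8 | | | `^`, `^^`, `**` |
| 7 | `*`, `/`, `div`, `mod`, `rem`, `quot` | | |
| 6 | `+`, `-` | | |
| 5 | | | `:`, `++` |
| 4 | | `==`, `/=`, `<`, `<=`, `>`, `>=`, `elem`, `notElem` | |
| 3 | | | `&&` |
| 2 | | | `\|\|` |
| 1 | `>>`, `>>=` | | |
| 0 | | | `$`, `$!`, `seq` |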
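To see the invisible parentheses, here are a few expressions paired with their fully parenthesized equivalents. The names are just labels I made up for this sketch; each pair holds the expression as you'd write it and the way the fixity rules read it.

```haskell
-- Higher precedence binds tighter: ^ (8) before * (7) before + (6).
arith :: (Int, Int)
arith = (2 + 3 * 4 ^ 2, 2 + (3 * (4 ^ 2)))

-- ^ is right associative.
rightAssoc :: (Int, Int)
rightAssoc = (2 ^ 3 ^ 2, 2 ^ (3 ^ 2))

-- - is left associative.
leftAssoc :: (Int, Int)
leftAssoc = (10 - 4 - 3, (10 - 4) - 3)

-- : and ++ share precedence 5 and both associate to the right.
listOps :: ([Int], [Int])
listOps = (1 : 2 : [3] ++ [4], 1 : (2 : ([3] ++ [4])))

-- . binds tightest (9) and $ loosest (0), so $ splits the line in two.
composed :: (Int, Int)
composed = (negate . abs $ 3 - 5, (negate . abs) (3 - 5))

-- Function application isn't in the table: it binds tighter than any operator.
applied :: (Int, Int)
applied = (succ 1 * 2, (succ 1) * 2)
```

Both halves of every pair evaluate to the same value, which is an easy way to check your reading of an expression in GHCi.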

Another key is reading type signatures of curried functions, as currying is the default in Haskell and is relied upon extensively in composing functions, particularly in the extra terse "point-free" style.
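A small illustration of both ideas, using made-up function names (`add`, `increment`, `sumOfSquares`): the arrow in a type signature associates to the right, so a "two-argument" function is really a function that returns a function, and point-free style just leaves the final argument unnamed.

```haskell
-- Int -> Int -> Int  reads as  Int -> (Int -> Int):
-- give it one Int and you get back a function waiting for the other.
add :: Int -> Int -> Int
add x y = x + y

-- Partial application: supplying only the first argument.
increment :: Int -> Int
increment = add 1

-- "Pointed" style names the argument...
sumOfSquares :: [Int] -> Int
sumOfSquares xs = sum (map (^ 2) xs)

-- ...while point-free style composes functions and never mentions it.
sumOfSquares' :: [Int] -> Int
sumOfSquares' = sum . map (^ 2)
```

Note that `map (^ 2)` is itself a partial application: `map` has been given its function but not yet its list.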

Currently, I'm trying to choke down Graham Hutton's Addendum on Monads. If I end up understanding that, it'll get me a code-monkey merit badge, for sure.

Tuesday, November 11, 2014

The DREAM / RECOMB Conference 2014

The RECOMB/ISCB Conference on Regulatory and Systems Genomics, with DREAM Challenges and Cytoscape Workshops is running this week in San Diego.

A bunch of us from Sage Bionetworks are here to connect with the DREAM community. In introductory remarks, Stephen Friend framed the challenges as piloting new modes of collaboration and engagement addressing multidimensional problems based on the idea that open innovation will trump closed silos.

Lincoln Stein: The Future of Genomic Databases

I first heard Lincoln Stein speak at an O'Reilly conference in 2002, on building a bioinformatics nation. The same themes of openness and integration reappeared in Stein's talk on The Future of Genomic Databases.

Stein asks, "Open Data + open source = reproducible science?" Not exactly. Stein presents some emerging solutions to the remaining obstacles: big data sets, complex workflows, unportable code and data access restrictions.

Cloud computing, specifically colocation of data and compute, enables handling big data. Containers (e.g., Docker) address the problem of code portability. The Global Alliance is working towards providing APIs both to encapsulate technical complexity and to provide a control point at which to enforce restrictions.

In case we're wondering what to do with all the machine cycles made available by Amazon and Google, bioinformatics workflows are growing in complexity. Workflow managers like Seqware and Galaxy provide a formalized description of multistep processes and manage tools and their dependencies.

Legal restrictions hinder data integration. But, donors want their samples to contribute to research. Licensure for data access combined with uniform consent could reduce the friction, resulting in a streamlined data access process. On the technical side, proposed solutions include homomorphic encryption and agent-based federated queries.

As a parting thought, Stein notes that digital infrastructure enables experiments in incentive structures and economic models, citing micropayments, ratings, and challenges.

Andrea Califano

Andrea Califano spoke on the genotype to phenotype linkage in cancer. Thinking of the cell as an integrator of signals, Califano's group traces from gene or protein expression signatures of cell states (normal, neoplastic, metastatic) back through the network to the master regulators responsible for that signature. One related paper is Identification of Causal Genetic Drivers of Human Disease through Systems-Level Analysis of Regulatory Networks.

Paul Boutros: Somatic Mutation Calling Challenge

Paul Boutros presented the Somatic Mutation Calling Challenge (SMC-DNA). He announced the intention for the SMC challenges to become a living benchmark, an objective standard against which future methods will be tested.

Paul also crowned the Broad Institute's MuTect (single nucleotide) and novoBreak (structural variants) by Ken Chen's lab at MD Anderson the winners of the synthetic tumor phase of SMC-DNA. The plan is to announce winners on real tumor data in February after experimental validation.

The Winners

The SMC challenge is a bit unique for DREAM in its level of specialization. In the other challenges, a couple of methods were highlighted: Gaussian process regression and dictionary learning for sparse representation.

But, increasingly, the main differentiator is application of biological domain knowledge, especially with respect to selecting and processing features. Li Liu of Arizona State's Biodesign Institute, for example, won part of the Acute Myeloid Leukemia challenge by weighting proteins based on their evolutionary conservation.

Another theme is that genetic features seem to have poor signal compared to more downstream features, gene expression or clinical variables. Peddinti Gopalacharyulu, a top performer in the Gene Essentiality Challenge, commented that perhaps the way to use genetics is to extract the component of gene expression that is not explained by genetic features.

DREAM 9.5

Two of the DREAM 9.5 challenges are follow-ups to the Somatic Mutation Calling challenge from the 8.5 round. The SMC empire expands into RNA and tumor heterogeneity. In the olfaction challenge, the goal is to predict, from molecular features, odor as described by human subjects. The prostate cancer challenge asks participants to classify patients according to survival using data sourced from the comparator arms of clinical trials.

For the DREAM 10 round, there's an imaging challenge in the works and a sequel to the ALS challenge from DREAM 7.

On to RECOMB

That's just the DREAM part of the meeting, or, really, the subset that fit into my brain. As an added bonus, there were several representatives from Cytoscape-related projects and some conversation about the Global Alliance for Genomics and Health.

Tuesday, October 14, 2014

Let's learn us a Haskell

Let's say you've been meaning to learn Haskell for a long time, secretly yearning for purely functional programming, laziness and a type system based on more category theory than you can shake a functor at.

Now's your chance. Erik Meijer is teaching an online class Introduction to Functional Programming on edX, about which he says, "This course will use Haskell as the medium for understanding the basic principles of functional programming."

It starts today, but I've gotten a head start by working through the first few chapters of Learn You a Haskell for Great Good! which most agree is the best place to get started with Haskell.

Anyone up for a Seattle study group?

Tutorials

Books

Other resources

Typeclassopedia
H-99: Ninety-Nine Haskell Problems
"Programming with Arrows" paper by John Hughes.
C9 Lectures: Erik Meijer - Functional Programming Fundamentals
What I Wish I Knew When Learning Haskell

It must be some pack-rat instinct that makes me compile these lists.

Thursday, July 03, 2014

Galaxy Community Conference

The 2014 Galaxy Community Conference (GCC2014) wrapped up yesterday at Johns Hopkins University in Baltimore. The best part for me was the pre-conference hackathon.

If you're not familiar with Galaxy, it's a framework for wrapping command line bioinformatics tools in a web UI. On top of that, Galaxy adds lots of sophistication around job scheduling: it can be configured to run its jobs on SGE, SLURM, and other queuing systems, and on virtual clusters with Cloudman. Galaxy users can design reproducible workflows and manage tool versioning and dependencies through the Toolshed.

The project has attracted a vibrant community supported by an active Q&A site and an IRC channel. Vendors offer Galaxy appliances, cloud deployments and consulting.

Hackathon

For the hackathon, 40 developers gathered in James Taylor's new computing lab in one of the red brick buildings interspersed with lawns and groves of trees that make up the Hopkins campus. This was a great setting for an accelerated ramp-up on the Galaxy codebase. Out of 18 proposed projects, 9 got far enough to be called finished. I learned how to write tool wrappers and how to access the Galaxy API.

Photo credit - someone on the Galaxy team

Galaxy + Synapse

Specifically, I was there to put together some integration between Synapse and Galaxy, building on what Kyle Ellrott had already started. Reproducible analysis, provenance and annotation are core concerns of both projects, so it seems like a good fit. After data exchange, which Kyle got working, exchanging provenance and metadata sounds like the logical next step.

Galaxy histories should be able to refer to Synapse entities and receive annotations along with data objects.

Serializing Galaxy histories, workflows etc. out to Synapse might allow a usage model where cloud instances of Galaxy can be spun up and shut down on demand, with all the important details preserved in Synapse in between times.

Portable provenance might be a longer term goal. It would be neat to be able to thread provenance through some arbitrary combination of data sources and analysis platforms. All of these provenance-aware platforms - Figshare, Dryad, Synapse, Arvados, Galaxy, etc - ought to be able to share provenance between themselves.

Next year

The whole event was well put together in general. Next year's Galaxy Community Conference is scheduled for 6-8 July 2015 in Norwich, England.

Anton Nekrutenko's group at Penn State
James Taylor's group which recently moved from Emory to Johns Hopkins