Monday, February 21, 2011

Using R for Introductory Statistics, Chapter 5, hypergeometric distribution

This is a little digression from Chapter 5 of Using R for Introductory Statistics that led me to the hypergeometric distribution.

Question 5.13: A sample of 100 people is drawn from a population of 600,000. If it is known that 40% of the population has a specific attribute, what is the probability that 35 or fewer in the sample have that attribute?

I'm pretty sure that you're supposed to reason that 600,000 is sufficiently large that the draws from the population are close enough to independent. The answer is then computed like so:

> pbinom(35,100,0.4)
[1] 0.1794694

Although this is close enough for practical purposes, the real way to answer this question is with the hypergeometric distribution.

The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of k draws from a finite population without replacement, just as the binomial distribution describes the number of successes for draws with replacement.

The situation is usually described in terms of balls and urns. There are N balls in an urn, m white balls and n black balls. We draw k balls without replacement. X represents the number of white balls drawn.
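Written out, the urn model says P(X = x) = choose(m, x) * choose(n, k - x) / choose(m + n, k). Here is a minimal sketch of that formula for the numbers in Question 5.13. The helper name urn_pmf is made up for illustration, and the work is done on the log scale with lchoose because the raw binomial coefficients are too large for a double at this population size. R's built-in dhyper is used only as a cross-check.

```r
# Hypergeometric probability mass from the urn model:
# m white balls, n black balls, k balls drawn without replacement.
# urn_pmf is a hypothetical helper name, not part of R.
urn_pmf <- function(x, m, n, k) {
  exp(lchoose(m, x) + lchoose(n, k - x) - lchoose(m + n, k))
}

# P(X <= 35): sum the mass over 0..35 white balls in 100 draws.
manual  <- sum(urn_pmf(0:35, 240000, 360000, 100))
builtin <- sum(dhyper(0:35, 240000, 360000, 100))
print(c(manual = manual, builtin = builtin))
```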

R gives us the function phyper(x, m, n, k, lower.tail = TRUE, log.p = FALSE), which does indeed show that our approximation was close enough.

> phyper(35,240000,360000, 100)
[1] 0.1794489
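To get a feel for how large "sufficiently large" has to be, here is a small sketch that repeats the comparison for a few population sizes. The sizes 1,000 and 10,000 are made up for illustration; only 600,000 comes from the question.

```r
# Exact (hypergeometric) vs. approximate (binomial) P(X <= 35) for a
# sample of 100 when 40% of the population has the attribute.
# Population sizes other than 600000 are illustrative assumptions.
pop_sizes    <- c(1000, 10000, 600000)
exact        <- sapply(pop_sizes, function(N) phyper(35, 0.4 * N, 0.6 * N, 100))
binom_approx <- pbinom(35, 100, 0.4)
gap          <- abs(exact - binom_approx)
print(data.frame(pop_sizes, exact, binom_approx, gap))
```

The gap should shrink as the population grows, since removing 100 members changes the remaining proportions less and less.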

Since we're down with OCD, let's explore a bit further. Our population is defined and not too huge, so let's just try it empirically. First, create our population.

> pop <- rep(c(0,1),c(360000, 240000))
> length(pop)
[1] 600000
> mean(pop)
[1] 0.4
> sd(pop)
[1] 0.4898984

Next, generate a boatload of samples and see how many of them have 35 or fewer of the special members.

> sums <- sapply(1:200000, function(x) { sum(sample(pop,100))})
> sum(sums <= 35) / 200000
[1] 0.17935
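Sampling from a 600,000-element vector 200,000 times takes a while. As an alternative sketch, rhyper draws the number of special members in each sample directly, without building the population. The seed is arbitrary and only there for reproducibility.

```r
set.seed(42)  # arbitrary seed, an assumption for reproducibility

# rhyper(nn, m, n, k): nn random counts of "white balls" when k balls
# are drawn without replacement from m white and n black.
draws    <- rhyper(200000, 240000, 360000, 100)
estimate <- mean(draws <= 35)
print(estimate)
```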

Pretty close to our computed results. I thought I might be able to compute an answer using the central limit theorem, using the distribution of sample means, which should be approximately normal.

> means <- sapply(1:2000, function(x) { mean(sample(pop,100))})
> mean(means)
[1] 0.40154
> sd(means)
[1] 0.0479998
> curve(dnorm(x, 0.4, sd(pop)/sqrt(100)), 0.2, 0.6, col='blue')
> lines(density(means), lty=2)

Shouldn't I be able to compute how many of my samples will have 35 or fewer special members? This seems to be a ways off, but I don't know why. Maybe it's just the error due to approximating a discrete distribution with a continuous one?

> pnorm(0.35, 0.4, sd(pop)/sqrt(100))
[1] 0.1537173

This fudge, shifting the cutoff by half a unit to 0.355, gets us closer, but still not as close as our initial approximation.

> pnorm(0.355, 0.4, sd(pop)/sqrt(100))
[1] 0.1791634

If anyone knows what's up with this, that's what comments are for. Help me out.
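One plausible reading, offered as a sketch rather than a settled answer: the count is an integer, so "35 or fewer" really covers everything up to 35.5 on a continuous scale. That half-unit shift is commonly called a continuity correction. The sketch below works on the count scale (mean 40) instead of the proportion scale, and also tries a finite population correction for sampling without replacement. The variable names are my own.

```r
k <- 100        # sample size
p <- 0.4        # proportion of the population with the attribute
N <- 600000     # population size

mu    <- k * p                    # expected count in the sample
sigma <- sqrt(k * p * (1 - p))    # binomial standard deviation of the count

exact       <- phyper(35, p * N, (1 - p) * N, k)  # without replacement
uncorrected <- pnorm(35,   mu, sigma)             # plain normal approximation
corrected   <- pnorm(35.5, mu, sigma)             # with continuity correction

# Finite population correction: shrinks sigma slightly because the draws
# are made without replacement. Nearly 1 for a population this large.
fpc           <- sqrt((N - k) / (N - 1))
corrected_fpc <- pnorm(35.5, mu, sigma * fpc)

print(c(exact = exact, uncorrected = uncorrected,
        corrected = corrected, corrected_fpc = corrected_fpc))
```

Whatever gap remains after the correction would then be the normal approximation's own error, since a binomial with p = 0.4 and n = 100 is still slightly skewed.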

Notes on Using R for Introductory Statistics

Chapters 1 and 2

Chapter 3

Chapter 4

Chapter 5

Sunday, February 13, 2011

The Tiger Mom and A Clockwork Orange

True discipline is doing what you want.

A wise friend once told me that. Amy Chua, better known as the Tiger Mother, wrote about discipline (from a different point of view) in Why Chinese Mothers Are Superior.

What Chinese parents understand is that nothing is fun until you're good at it. To get good at anything you have to work, and children on their own never want to work, which is why it is crucial to override their preferences. [...] Tenacious practice, practice, practice is crucial for excellence; rote repetition is underrated in America. Once a child starts to excel at something -- whether it's math, piano, pitching or ballet -- he or she gets praise, admiration and satisfaction. This builds confidence and makes the once not-fun activity fun. This in turn makes it easier for the parent to get the child to work even more.

For what it's worth, Chua's book is apparently less strident and more nuanced than the WSJ article. Anyway, like her methods or not, I have a lot of sympathy for a parent trying to teach her kids about delayed gratification, that you can do difficult things if you try, and that hard work pays off.

If it's true that mastering a complex skill takes 10,000 hours of practice, then the persistence to push through those hours is a fairly important lesson to learn early. Recent research has caused a reappraisal of how much talent arises from innate genius versus how much is the product of effort, practice and persistence.

What brought to mind my old friend's remark about discipline was Paul Buchheit's take: motivation can be either intrinsic or extrinsic. Amy Chua is teaching her kids to be extrinsically motivated, to respond to the praise and admiration of others. You do it because you are told to. You put your energy into chasing the approval of external authorities. In contrast, he describes intrinsic motivation like this:

To the greatest extent possible, do whatever is most fun, interesting, and personally rewarding (and not evil).

Follow your heart, as hippy moms tell their children. Buchheit says, "I'm kind of lazy, or maybe I lack will power or discipline or something. Either way, it's very difficult for me to do anything that I don't feel like doing." Sounds familiar. "The intrinsic path to success is to focus on being the person that you are, and put all of your energy and drive into being the best possible version of yourself."

The difference between intrinsic and extrinsic motivation is easily recognized in the moral dimension. In A Clockwork Orange, Anthony Burgess imagines the transfer of aesthetic sense from creation to violence. Deprived of outlet, creativity turns destructive. The main thrust of the story is an examination of attempts to impose an external morality by force versus growing an internal morality.

Paul Buchheit was the software developer who originated Google's Gmail. For myself, and I'm sure lots of others, a key attraction to programming was the ability to create in a powerful medium without asking anyone's permission. The creative freedom, the feeling that the authorities hadn't (yet) figured out how to lock things down was incredibly inspiring.

It's easy to see how that attraction to technology meshes with Daniel Pink's elements of motivation: autonomy, mastery, and purpose. If you've got root and a compiler, you've got autonomy. And it's all about a pissing contest of mastery. (This might partially explain the gender ratio in the field.) And technology is rife with appeals to higher purpose, from the open-source movement to the digital media that helped fuel the revolutions in Tunisia and Egypt.

The Tiger Mom demands mastery before autonomy, leaving purpose firmly in the hands of the parent. Buchheit puts autonomy first, trusting in a natural sense of direction to lead to mastery and purpose. In terms of Maslow's hierarchy of needs, Amy Chua has the "esteem" level covered, but stops short of the top level - intrinsic self-directed creativity.

Some suggest that American society erects a border fence at the entry to the highest level. Well, nobody ever mentions why it's always drawn as a pyramid, but there's probably a reason. Not everyone gets to be at the top. Usually, that's the realm only of the elite.

David Cain of Raptitude writes, under the title How to Make Trillions of Dollars, that the fundamentals of being a self-directed person are these:

Creativity. Curiosity. Resilience to distraction. Patience with others.

Cain defines self-reliance as “an unswerving willingness to take responsibility for your life, regardless of who had a hand in making it the way it is”.

Again, we're basically talking about the top levels of the pyramid. But, I like the addition of resilience to distraction. Discipline is a hard sell in a culture that promotes immediate gratification and nonstop indulgence. It's hard to hear your own voice over the clamor of consumer culture and expectations from family, boss and everyone else. Hearing it is nearly impossible while drowning in distractions like Twitter and Facebook. And here's something else to remember about online amusements:

If you're not paying, you're not the customer; you're the product.

At a recent data mining conference I saw rooms full of marketers ready to slice, dice and mash up your personal data to more precisely target advertising. Resisting this attack is an essential skill of modern life. A healthy cynicism is a necessary defense mechanism. Hearing yourself think is only going to get harder.

The flawed idea implicit in consumer culture is that what you consume is what you are. But, valuing consuming over creating or doing is inevitably a dead end. The reason I'm not so hot on the iPad is that it's a device for consuming. The old Macs were (marketed as) tools for programmers, graphic artists, musicians and film-makers -- in other words, doers, builders and creators.

Purpose has to come from values. The recent travails of the financial sector show what happens when motivations or at least incentives become disconnected from morals and values.

Of course, lots of technology has a purpose no higher than selling golf clubs on the internet. And technology, itself, can be a distraction. It's easy to get caught up in a rat race of the latest whizzy buzzword-laden language, tool or application framework du jour. For years, I've had a half-joking theory that the true purpose of the internet is to absorb the excess productivity of mankind.

Creative, conceptual work driven by autonomy, mastery and purpose pursued with uninterrupted concentration. That begins to answer the question: how do we get some motivation, apply it to something good and inspire the same in those around us, especially our ungrateful screaming offspring?

You can argue one way or another about whether a 13 year old has the foresight to be intrinsically motivated. I certainly didn't have the wherewithal at that age to set a long term goal. And pushing intrinsic motivation on your kids sounds to me something like imposing democracy by force. It doesn't make that much sense.

It takes determination to undertake exhausting, frustrating efforts whose payoff is distant and uncertain. Most things that are worth doing are hard. The drive and courage to try anyway is what makes real progress possible. That doesn't come easily, and neither does the judgement necessary to gauge what is worthwhile against the scale of your own values.

From my current position, I don't particularly want to lecture anyone on how to succeed in life. I respond negatively to coercion and can be a champion slacker, both of which were much to the detriment of my academic career. Still, here it is:

  • Value creation over consumption.
  • Surround yourself with creators.
  • Do what you love, do it a lot, and do it hard.

Tuesday, February 08, 2011

Using R for Introductory Statistics, Chapter 5, Probability Distributions

In Chapter 5 of Using R for Introductory Statistics we get a brief introduction to probability and, as part of that, a few common probability distributions. Specifically, the normal, binomial, exponential and lognormal distributions make an appearance.

For each distribution, R provides four functions whose names start with the letters d, p, q or r followed by the family name of the distribution. For example, rnorm produces random numbers drawn from a normal distribution. The letters stand for:

  • d: density/mass function
  • p: probability (cumulative distribution function), P(X <= x)
  • q: quantile; given a probability q, the smallest x such that P(X <= x) >= q
  • r: random number generation
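As a quick illustration of the naming scheme, here are the four functions for the normal family. The particular arguments and the seed are arbitrary choices for the sketch.

```r
dnorm(0)        # density of the standard normal at x = 0
pnorm(1.96)     # cumulative probability P(X <= 1.96)
qnorm(0.975)    # quantile: the x with P(X <= x) = 0.975

set.seed(1)     # arbitrary seed, only for reproducibility
rnorm(3)        # three random draws from the standard normal

# p and q are inverses of each other for a continuous distribution.
roundtrip <- qnorm(pnorm(1.5))
print(roundtrip)
```

The same prefixes work for the other families, for example dbinom, pbinom, qbinom and rbinom.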

Normal

The Gaussian or normal distribution has a prominent place due to the central limit theorem. It is widely used to model natural phenomena like variations in height or weight as well as noise and error. The 68-95-99.7 rule says:

  • 68% of the data falls within 1 standard deviation of the mean
  • 95% of the data falls within 2 standard deviations of the mean
  • 99.7% of the data falls within 3 standard deviations of the mean
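The rule can be checked against pnorm. This is a small sketch for the standard normal; the variable name coverage is my own.

```r
# Probability mass within 1, 2 and 3 standard deviations of the mean
coverage <- sapply(1:3, function(z) pnorm(z) - pnorm(-z))
names(coverage) <- paste(1:3, "sd")
print(round(coverage, 4))
```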

To plot a normal distribution, define some points x, and use dnorm to generate the density at those points.

x <- seq(-3,3,0.1)
plot(x=x, y=dnorm(x, mean=0, sd=1), type='l')

Binomial

A Bernoulli trial is an experiment which can have one of two possible outcomes. Independent repeated Bernoulli trials give rise to the Binomial distribution, which is the probability distribution of the number of successes in n independent Bernoulli trials. Although the binomial distribution is discrete, in the limit as n gets larger, it approaches the normal distribution.

x <- seq(0,20,1)
plot(x=x, y=dbinom(x,20,0.5))
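To see the approach to the normal distribution, one option is to overlay a normal curve with the same mean, n*p, and standard deviation, sqrt(n*p*(1-p)), on the binomial probabilities. This is a sketch using the same n = 20 and p = 0.5 as above; the variable names are my own.

```r
n <- 20
p <- 0.5
x <- 0:n

binom_mass     <- dbinom(x, n, p)
normal_density <- dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Binomial probabilities as vertical bars, normal density as a dashed line
plot(x, binom_mass, type = "h", ylab = "probability")
lines(x, normal_density, lty = 2)

# Largest pointwise difference between the two
max_gap <- max(abs(binom_mass - normal_density))
print(max_gap)
```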

Uniform

A Uniform distribution just says that all allowable values are equally likely, which comes up in dice or cards. Uniform distributions come in either continuous or discrete flavors.
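A minimal sketch of both flavors: sample handles the discrete case (a fair die here) and runif the continuous one. The sample sizes and seed are arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Discrete uniform: 6000 rolls of a fair six-sided die
rolls <- sample(1:6, 6000, replace = TRUE)
print(table(rolls) / length(rolls))   # each face should be near 1/6

# Continuous uniform on [0, 1]
u <- runif(1000, min = 0, max = 1)
print(punif(0.25))                    # P(U <= 0.25)
```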

Log-normal

The log-normal distribution is a probability distribution of a random variable whose logarithm is normally distributed. If X is a random variable with a normal distribution, then Y = exp(X) has a log-normal distribution. Analogously to the central limit theorem, the product of many independent positive random variables tends toward a log-normal distribution. It can be used to model continuous random quantities whose distribution is skewed and non-negative, for example income or survival.

# Draw 100 log-normal samples
samples <- rlnorm(100, meanlog=0, sdlog=1)
# Lower panel: horizontal boxplot
par(fig=c(0,1,0,0.35))
boxplot(samples, horizontal=TRUE, bty="n", xlab="log-normal distribution")
# Upper panel: histogram with empirical and theoretical densities
par(fig=c(0,1,0.25,1), new=TRUE)
s <- seq(0,max(samples),0.1)
d <- dlnorm(s, meanlog=0, sdlog=1)
hist(samples, prob=TRUE, main="", col=gray(0.9), ylim=c(0,max(d)))
lines(density(samples), lty=2)
curve(dlnorm(x, meanlog=0, sdlog=1), lwd=2, add=TRUE)
rug(samples)
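The Y = exp(X) definition can also be checked numerically. In this sketch, normal draws are exponentiated and their quartiles compared with qlnorm. The sample size and seed are arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

x <- rnorm(10000, mean = 0, sd = 1)
y <- exp(x)   # log-normal by construction

probs       <- c(0.25, 0.5, 0.75)
empirical   <- quantile(y, probs)
theoretical <- qlnorm(probs, meanlog = 0, sdlog = 1)
print(rbind(empirical, theoretical))
```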

Exponential

The exponential distribution is used to model the time interval between successive random events such as time between failures arising from constant failure rates. The following plot is generated by essentially the same code as above.
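The plotting code for the exponential case isn't shown, so here is a guess at what "essentially the same code" might look like, with the boxplot panel left out. The rate of 1, the sample size of 100 and the seed are assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# rate = 1 and n = 100 are assumptions, not values from the original plot
samples <- rexp(100, rate = 1)

hist(samples, prob = TRUE, main = "", col = gray(0.9),
     ylim = c(0, 1), xlab = "exponential distribution")
lines(density(samples), lty = 2)                # empirical density
curve(dexp(x, rate = 1), lwd = 2, add = TRUE)   # theoretical density
rug(samples)
```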


Tuesday, February 01, 2011

Annotated source code

We programmers are told that reading code is a good idea. It may be good for you, but it's hard work. Jeremy Ashkenas has come up with a simple tool that makes it easier: docco. Ashkenas is also behind Underscore.js and CoffeeScript, a dialect of JavaScript in which docco is written.

Interesting ways to mix prose and code have appealed to me ever since I first discovered Mathematica's live notebook, which lets you author documents that combine executable source code, typeset text and interactive graphics. For those who remember the early '90s chiefly for their potty training, running Mathematica on the NeXT pizza boxes was like a trip to the future. Combining the quick cycles of a read-eval-print loop with complete word processing and mathematical typesetting encourages you to keep lovely notes on your thinking and trials and errors.

Along the same lines, there's Sweave for R and Sage for Python.

Likewise, one of the great innovations of Java was Javadoc. Javadoc doesn't get nearly enough credit for the success of Java as a language. It made powerful APIs like the collections classes a snap and even helped navigate the byzantine complexities of Swing and AWT.

These days, automated documentation is expected for any language. Nice examples are: RubyDoc, scaladoc, Haddock (for Haskell). Doxygen works with a number of languages. Python has pydoc, but in practice seems to rely more on the library reference. Anyway, there are a bunch, and if your favorite language doesn't have one, start coding now.

The grand-daddy of these ideas is Donald Knuth's literate programming.

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: "Literate Programming."

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.

Indeed, Ashkenas references Knuth, calling docco "quick-and-dirty, hundred-line-long, literate-programming".

This goodness needs to come to more languages. There's a Ruby port called rocco by Ryan Tomayko. And for Clojure there's marginalia.

I love the quick-and-dirty aspect, and that will be the key to encouraging programmers to do more documentation that looks like this. I hope they build docco, or something like it, into GitHub. Maybe one day there will be a Norton Anthology of annotated source code.
