Showing posts with label machine learning. Show all posts
Showing posts with label machine learning. Show all posts

Tuesday, May 03, 2016

Topic Modeling with LDA

Rob McDaniel gave a nice presentation on the flaming-hot topic of topic analysis yesterday evening hosted by Seattle metastartup Pitchbook. Grab slides and code from the github repo.

Rob is interested in using NLP to discern the level of objectivity or bias in text. As an example, he took the transcripts of the debates of this year's presidential campaign. Here's part of what he did with them:

For more, have a look at the post on Semantic analysis of GOP debates.

Interesting tidbits:

  • Wikipedia is a source of documents labeled as not objective.
  • Movie reviews are a source of documents labeled by rating, number of stars.
  • Topic cohesion measures how well a given document stays "on-topic" or even "on-message".
  • KL Divergence is entropy based measure of relatedness of topics.

There was an interesting side discussion of the orthogonality of topic modeling and word embedding (word2vec).

Some of the sources Rob mentioned were Tethne and one of it's tutorials, also a pair of papers Introduction to Probabilistic Topic Models and Probabilistic Topic Models both by David Blei.

Tuesday, July 21, 2015

Machine learning on music data

The 3rd lab from Scalable Machine Learning with Spark has you predict the year a song was published based on features from the Million Song Dataset. How much farther could you take machine analysis of music? Music has so much structure that's so apparent to our ears. Wouldn't it be cool to be able to parse out that structure algorithmically? Turns out, you can.

Apparently The International Society for Music Information Retrieval (ISMIR) is the place to go for this sort of thing. A few papers, based on minutes of rigorous research (aka random googling):

In addition to inferring a song's internal structure, you might want to relate it's acoustic features to styles, moods or time periods (as we did in the lab). For that, you'll want music metadata from sources like:

There's a paper on The Million Song Dataset paper by two researchers at Columbia's EE department and two more at the Echo Nest.

Even Google is interested in the topic: Sharing Learned Latent Representations For Music Audio Classification And Similarity.

Tangentially related, a group out of Cambridge and Stanford say Musical Preferences are Linked to Cognitive Styles. I fear what my musical tastes would reveal about my warped cognitive style.

Thursday, February 20, 2014

Regression with multiple predictors

Now that I'm ridiculously behind in the Stanford Online Statistical Learning class, I thought it would be fun to try to reproduce the figure on page 36 of the slides from chapter 3 or page 81 of the book. The result is a curvaceous surface that slices neatly through the data set.

I'm not sure what incurable chemical imbalance explains why such a thing sounds like fun to me. But let's go get us some data, specifically the Advertising data set from the book's website:

Advertising <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv", 
    row.names = 1)

Previously, we fooled around with linear models with one predictor. We now extend regression to the case with multiple predictors.

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 +···+ \hat{\beta}_p x_p \]

Or, in matrix notation:

\[ \hat{Y} = X \cdot \hat{\beta} + \hat{\beta_0} \]

Where the dimensions of X are n by p, and \( \beta \) is p by 1 resulting in a Y of size n by 1, a prediction for each of the n samples.

Find a fit for Sales as a function of TV and Radio advertising plus the interaction between the two.

fit2 = lm(Sales ~ TV * Radio, data = Advertising)

Before making the plot, define a helper function to evenly divide the ranges. This probably exists somewhere in R already, but I couldn't find it.

evenly_divide <- function(series, r = range(series), n = 10) {
    c(r[1], 1:n/n * (r[2] - r[1]) + r[1])
}

Generate a 2D grid over which we'll predict Sales.

x = evenly_divide(Advertising$TV, n = 16)
y = evenly_divide(Advertising$Radio, n = 16)

Using the coefficients of the fit, create a function f that computes the predicted response. It would be nice to use predict for this purpose. From a functional programming point of view, it seems like there should be a function that takes a fit and returns a function equivalent to f. If such a thing exists, I'd like to find it.

beta = coef(fit2)
f <- function(x, y) beta[1] + beta[2] * x + beta[3] * y + beta[4] * x * y
z = outer(x, y, f)

I copied the coloring of the regression surface from the examples for persp. I'm guessing the important part is create a color vector of the right size to cycle through the facets correctly.

nrz <- nrow(z)
ncz <- ncol(z)
nbcol <- 100

# Create color palette
palette <- colorRampPalette(c("blue", "cyan", "green"))
color <- palette(nbcol)

# Compute the z-value at the facet centres
zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]

# Recode facet z-values into color indices
facetcol <- cut(zfacet, nbcol)

OK, finally we get to the plotting. A call to persp sets up the coordinate system and renders the regression surface.

With the coordinate system we got in the return value, we plot the line segments representing the residuals, using a higher transparency for points under the surface. We do the same for the actual points, so they look a bit like they're underneath the surface.

# Draw the perspective plot
res <- persp(x, y, z, theta = 30, phi = 20, col = color[facetcol], xlab = "TV", 
    ylab = "Radio", zlab = "Sales")

# Draw the residual line segments
xy.true = trans3d(Advertising$TV, Advertising$Radio, Advertising$Sales, pmat = res)
xy.fit = trans3d(Advertising$TV, Advertising$Radio, predict(fit2, Advertising), 
    pmat = res)
colors = rep("#00000080", nrow(Advertising))
colors[residuals(fit2) < 0] = "#00000030"
segments(xy.true$x, xy.true$y, xy.fit$x, xy.fit$y, col = colors)

# Draw the original data points
colors = rep("#cc0000", nrow(Advertising))
colors[residuals(fit2) < 0] = "#cc000030"
points(trans3d(Advertising$TV, Advertising$Radio, Advertising$Sales, pmat = res), 
    col = colors, pch = 16)

Fit surface

To be honest, this doesn't look much like the figure in the book and there's reason to doubt the information conveying ability of 3D graphics, but whatever - it looks pretty cool!