Sunday, February 24, 2013

Mapping phenotype to cell type

Last month, our journal club read "Chromatin marks identify critical cell types for fine mapping complex trait variants," which features authors representing several of the anchor stores in Boston's shopping mall of biotech (the Broad, Dana Farber, and Brigham and Women's).

I gather that the term chromatin marks means modifications to the histones and other DNA packaging molecules. I'm not sure if the term is also meant to include DNA modifications like methylation or the arrangement of DNA around the histones. In any case, these modifications form part of the eukaryotic regulatory system and data for them is becoming available from ENCODE and other sources.

The cool thing is, at heart, the paper is about a data transformation, specifically a join, a mapping across biological data types. GWAS data links variants at particular loci on the genome to a phenotype, typically a disease with a genetic component. Chromatin marks are thought to play a major role in differentiation, and marks at specific loci are associated with cell type. So, joining on locus, we get a mapping from disease-associated variants to cell type. To complete the loop, many diseases are already known to play out in specific cell types.
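The join can be sketched with toy data (the loci and cell types below are made-up placeholders, not results from the paper):

```python
# GWAS hits map loci to phenotypes; chromatin-mark data maps loci to
# cell types; joining on locus links phenotype to cell type.
gwas_hits = [                       # (locus, phenotype) -- hypothetical
    ("chr6:32600000", "arthritis"),
    ("chr10:114750000", "type-2 diabetes"),
]
mark_peaks = {                      # locus -> cell types with a mark there
    "chr6:32600000": ["CD4+ T cell", "B cell"],
    "chr10:114750000": ["pancreatic islet"],
}

# The join itself: for each hit, look up cell types sharing its locus.
phenotype_to_cell_type = {}
for locus, phenotype in gwas_hits:
    for cell_type in mark_peaks.get(locus, []):
        phenotype_to_cell_type.setdefault(phenotype, set()).add(cell_type)
```

After the loop, `phenotype_to_cell_type` maps each phenotype to the cell types whose marks coincide with its associated loci.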

The circle could be navigated in either direction, using one type of information to corroborate another. Known affected cell types for a disease may be used to select weak GWAS hits that are likely to be causal or informative about the mechanisms at work. The same might be applied to cancer, maybe helping to guide the search for causal variants implicated in cancers of particular cell types.

The paper categorized GWAS hits associated with LDL cholesterol, hyperlipidemia, arthritis, psychiatric disorders, and type-2 diabetes based on one chromatin mark, trimethylation of histone H3 at lysine 4 (H3K4me3), and found a nice enrichment for the expected cell types.

In the paper's figure, cell types with immune function are in bloody red, brain tissue is in blue, yellow groups together adipose, musculoskeletal and endocrine cells, and GI cell types are in bilious green.

The Broad does a nice job of writing up the paper in their news section: "Chromatin marks the spot in search for disease pathways."


Wednesday, February 06, 2013

Data analysis class

I've been writing software to help others do data analysis for a number of years and at the same time trying to work up my nerve to try my own analysis. Why let other people have all the fun? So, when I saw that Jeffrey Leek, biostatistician at Johns Hopkins and coauthor of Simply Statistics, was teaching an online course in data analysis, I signed up.

The class starts off with an overview of the landscape of data analysis. Echoing the data-science Venn diagram, Leek posits that data analysis is at the intersection of hacking, statistics, and domain knowledge.

What follows are my crib notes from Jeff's slides and from supplementary material. To get started in a cautious frame of mind, we get some wisdom from John Tukey:

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data...”

To that advice, Leek adds: no matter how big the data are.

A cautious data analyst pursues the question at hand with the appropriate type of analysis and will avoid going further than the available data allows.

Types of data analysis

  • Descriptive - Summarize and highlight, leaving generalization, interpretation and modeling for later.
  • Exploratory - Discover new relationships and define future studies, requires confirmation.
  • Inferential - Estimate values for a large population based on a small sample, quantifying uncertainty.
  • Predictive - Use data to estimate unmeasured values. If X predicts Y, X does not necessarily cause Y, which is just another way of saying correlation does not imply causation.
  • Causal - Find effect on one variable of changes in another. Randomized studies are usually required.
  • Mechanistic - Typically, deterministic equations are known, but the parameters must be inferred. Think physics.

On process, Leek outlines a series of steps similar to those articulated by Hadley Wickham (Engineering with data analysis) and Jeffrey Heer.

Steps in a data analysis

  1. Define question
  2. Define ideal data set
  3. Determine what data you can access
  4. Obtain data
  5. Clean data
  6. Exploratory analysis
  7. Statistical prediction/modeling
  8. Interpret results
  9. Challenge results
  10. Synthesize/write-up results
  11. Create reproducible code

The class is taught in R. Early lectures cover basics like how R's type system represents continuous and categorical data. Next come basic data munging operations like binning with cut, subset, sort, merge and reshape.
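R's cut assigns each continuous value to a labeled bin. As a rough sketch of the same operation in Python (a hand-rolled analogue, not the R function itself), using half-open intervals like cut's default:

```python
import bisect

def cut(values, breaks, labels):
    """Rough Python analogue of R's cut(): assign each value to the
    interval (breaks[i], breaks[i+1]] and return that interval's label."""
    assert len(labels) == len(breaks) - 1
    return [labels[bisect.bisect_left(breaks, v) - 1] for v in values]

ages = [3, 17, 25, 64, 80]
groups = cut(ages, breaks=[0, 12, 19, 65, 120],
             labels=["child", "teen", "adult", "senior"])
# groups -> ['child', 'teen', 'adult', 'adult', 'senior']
```

Binning like this turns a continuous variable into a categorical one, which is exactly the kind of type distinction the early lectures dwell on.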

The goal of data munging is to produce clean data, data that is amenable to analysis. Hadley Wickham's paper on tidy data defines a set of properties, closely related to database normalization, oriented towards getting data ready for further manipulation, visualization and modeling. This is part of what my colleague, Brig, calls data activation.

Properties of Tidy data

  • One variable per column
  • One observation per row
  • Tables hold elements of only one kind
  • Column names are easy to use and informative
  • Row names are easy to use and informative
  • Obvious mistakes in the data have been removed
  • Variable values are internally consistent
  • Appropriate transformed variables have been added

Luckily for us, data is the philosophy of the day. The unreasonable effectiveness of data is widely appreciated, and there is more data than analysis talent available. There are loads of resources for helping students of data analysis grow into data scientists.

Data sources