Wednesday, January 30, 2013

JavaScript-style objects in Python

As I've gotten older and crankier, one of the things I've gotten cranky about is Object Oriented Programming. Sometimes, it seems like rigid class hierarchies just get in the way. Sure, it's cool that the compiler can prove type-safety properties about your program, but whether the boilerplate is worth the benefit is situation-dependent. When it just has to work once, coding for maintenance is a poor investment; quick-n-dirty is the way to go.

I've found class hierarchies in dynamic scripting languages to be especially unnecessary. Why use a language where you're not supposed to worry about typing to build elaborate user-defined type hierarchies?

In those quick-n-dirty scenarios, what I want is a big stinking bag of properties. Ideally, I don't care whether the values are data or functions. In other words, just like it's done in JavaScript.

Here's one way to get something like that, in Python:
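A minimal sketch of the idea, assuming a dict subclass is an acceptable bag of properties. The class name `Bag` and the `point` example are made up for illustration and are not necessarily what the post originally showed:

```python
class Bag(dict):
    """A dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so dict methods like keys() still work as usual.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        # Every attribute assignment becomes a dict entry.
        self[name] = value

    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name)


# Values can be data or functions; the bag doesn't care.
point = Bag(x=3, y=4)
point.norm = lambda: (point.x ** 2 + point.y ** 2) ** 0.5
```

One difference from JavaScript: a function stored this way is not bound to the object, so there is no implicit `this` or `self`. The lambda above closes over `point` instead.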

Monday, January 14, 2013

VCF: Variant Call Format

VCF, for Variant Call Format, is a tab-delimited file format for storing genetic sequence variations, developed in the context of the 1000 Genomes Project. There's an open source suite of tools, VCFtools, and a pleasantly brief paper.

What follows serves as a dumping ground for knowledge and links about VCF.

VCF is a highly flexible format. There are 9 standard columns plus an additional column for each sample. The contents of each sample column are a colon-separated list of values. The FORMAT column specifies the layout of the sample columns, so that putting the two together yields a matching list of key/value pairs for each variant in each sample. A header section specifies metadata about the sample columns and the key/value properties.
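To make the FORMAT/sample pairing concrete, here is a rough sketch in Python. The data line and the helper name `parse_sample` are invented for illustration, and all values are left as strings:

```python
def parse_sample(format_field, sample_field):
    """Pair the keys named in FORMAT with the values in one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


# A made-up data line with the 9 standard columns and two samples.
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\tGT:GQ:DP\t0|0:48:1\t1|0:48:8"

fields = line.split("\t")
format_field = fields[8]   # the 9th standard column
samples = fields[9:]       # one column per sample

calls = [parse_sample(format_field, s) for s in samples]
# calls[0] == {"GT": "0|0", "GQ": "48", "DP": "1"}
```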

A line of VCF is essentially a precomputed two-level join associating variants with samples and samples with key/value properties. On top of that, the INFO column associates key/value properties with the variant, so I suppose it's a 4-way join. Nice!
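The INFO column can be unpacked in a similar way. This sketch assumes the usual layout of semicolon-separated `key=value` entries, with bare keys acting as flags and `.` meaning no data. The function name `parse_info` is hypothetical:

```python
def parse_info(info_field):
    """Turn an INFO string into a dict; flags without a value map to True."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


info = parse_info("NS=3;DP=14;AF=0.5;DB;H2")
# info == {"NS": "3", "DP": "14", "AF": "0.5", "DB": True, "H2": True}
```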

Terminology

Genotype: VCF handles genotypes of different ploidy. The genotype can contain just one allele (haploid), two alleles separated by one of '\', '|', '/' (diploid) or potentially more. Human VCF files will likely be mixed haploid and diploid due to the X and Y chromosomes in males and mitochondrial DNA.
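A small sketch of pulling a genotype string apart, assuming the separators listed above and treating `|` as the marker for a phased call. The helper name `parse_genotype` is made up:

```python
import re


def parse_genotype(gt):
    """Split a GT value into its alleles and report whether it is phased."""
    alleles = re.split(r"[/|\\]", gt)
    phased = "|" in gt
    return alleles, phased


haploid = parse_genotype("1")    # (["1"], False)
diploid = parse_genotype("0|1")  # (["0", "1"], True)
```

The ploidy of a call is then just `len(alleles)`.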

Phasing: If the reference sequence assembly is reasonably complete, it may be possible to map the variant calls to a consistent strand. The term for this is phasing. A VCF file may be completely unphased, or each call can be marked individually as phased or unphased. The degree of completeness depends on how well and how unambiguously the calls map to the reference.

Filters: FILTER is one of the standard VCF columns. Valid values for the FILTER column are "PASS" or a semicolon-separated list of codes for the filters that the variant call fails.
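Reading the FILTER column might look like the sketch below. It assumes multiple failed filter codes are joined with semicolons and that `.` means no filters were applied. The name `failed_filters` and the codes `q10` and `s50` are just examples:

```python
def failed_filters(filter_field):
    """Return the list of filter codes a call failed; empty if it passed."""
    if filter_field in ("PASS", "."):
        return []
    return filter_field.split(";")


passed = failed_filters("PASS")     # []
failed = failed_filters("q10;s50")  # ["q10", "s50"]
```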

Tools

PyVCF is a nice Python library for reading and writing VCF. It comes with a handy little utility, vcf_filter, which does what it says. The reading part seems to be the most solid, with writing and filtering still a little green. But the code is nicely readable, and its proprietors are friendly to contributions.

SnpEff enhances VCF files by annotating variants and calculating the effects they produce on known genes, for example, amino acid changes, frame shifts and splice site changes.

VCF resources