Monday, January 14, 2013

VCF: Variant Call Format

VCF, for Variant Call Format, is a tab-delimited file format for storing genetic sequence variations, developed in the context of the 1000 Genomes Project. There's a suite of open source tools (vcftools) and a pleasantly brief paper.

What follows serves as a dumping ground for knowledge and links about VCF.

VCF is a highly flexible format. There are 9 standard columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO and FORMAT) plus an additional column for each sample. The content of each sample column is a colon-separated list of values. The FORMAT column specifies the layout of the sample columns, so that putting the two together yields a matching list of key/value pairs for each variant in each sample. A header section specifies metadata about the sample columns and the key/value properties.
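To make the FORMAT/sample pairing concrete, here is a minimal sketch in plain Python. The data line, the sample names and the helper name parse_samples are made up for illustration, and real files deserve a proper parser, but the zip-the-keys-with-the-values idea is the whole trick.

```python
# A made-up VCF data line with two sample columns (illustrative values only).
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\tGT:GQ:DP\t0|0:48:1\t1|0:48:8"

def parse_samples(line, sample_names):
    """Pair the FORMAT keys with each sample column's colon-separated values.

    parse_samples is a hypothetical helper, not part of any VCF library.
    """
    fields = line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")      # 9th column is FORMAT
    calls = {}
    for name, column in zip(sample_names, fields[9:]):
        calls[name] = dict(zip(format_keys, column.split(":")))
    return calls

# Sample names would normally come from the #CHROM header line.
calls = parse_samples(line, ["NA00001", "NA00002"])
```

All values come back as strings here; the header's metadata is what tells you which ones are really integers or floats.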

A line of VCF is essentially a precomputed two-level join associating variants with samples and samples with key/value properties. On top of that, the INFO column associates key/value properties with the variant, so I suppose it's a 4-way join. Nice!
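The INFO side of that join can be sketched the same way. This assumes the usual INFO conventions: entries separated by semicolons, key=value pairs, bare keys acting as flags, and "." meaning no data. The function name parse_info is my own, not from any library.

```python
def parse_info(info):
    """Turn an INFO string into a dict; bare flags map to True.

    parse_info is a hypothetical helper for illustration.
    """
    if info == ".":
        return {}
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # No "=" means the entry is a flag, e.g. DB (dbSNP membership).
        result[key] = value if sep else True
    return result

info = parse_info("NS=3;DP=14;AF=0.5;DB")
```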

Terminology

Genotype: VCF handles genotypes of different ploidy. The genotype can contain just one allele (haploid), two alleles separated by '/' (unphased) or '|' (phased), which are the separators defined in VCF 4.x (diploid), or potentially more. Human VCF files will likely be mixed haploid and diploid due to the X and Y chromosomes in males and mitochondrial DNA.
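A rough sketch of pulling ploidy and phasing out of a GT value, assuming only the VCF 4.x separators '/' and '|'. The helper parse_genotype is hypothetical, and it treats a genotype as phased if any '|' appears, which is a simplification.

```python
import re

def parse_genotype(gt):
    """Split a GT string into alleles and report ploidy and phasing.

    Alleles stay as strings, so a missing allele remains ".".
    """
    alleles = re.split(r"[/|]", gt)
    phased = "|" in gt
    return alleles, len(alleles), phased

diploid = parse_genotype("0|1")    # phased diploid call
haploid = parse_genotype("1")      # e.g. a male X or Y call
triploid = parse_genotype("0/1/2") # higher ploidy works the same way
```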

Phasing: Phasing means working out which alleles at different variant sites sit together on the same copy of a chromosome, that is, on the same haplotype. A VCF file may be completely unphased, or each call can indicate whether it is phased or unphased. In a phased genotype such as 0|1 the order of the alleles is meaningful; in an unphased genotype such as 0/1 it is not.

Filters: FILTER is one of the standard VCF columns. Valid values for the FILTER column are "PASS", a semicolon-separated list of codes for the filters that the variant call fails, or "." if no filters have been applied.
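Reading that column is a one-liner or so. A sketch, with a made-up helper name and made-up filter codes, returning None when the field is "." (filters not applied) so that case stays distinct from a clean pass:

```python
def failed_filters(filter_field):
    """Return the list of failed filter codes for one FILTER value.

    [] means the call passed; None means filters were not applied (".").
    failed_filters is a hypothetical helper for illustration.
    """
    if filter_field == "PASS":
        return []
    if filter_field == ".":
        return None
    return filter_field.split(";")

passed = failed_filters("PASS")
failed = failed_filters("q10;s50")   # illustrative filter codes
```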

Tools

PyVCF is a nice Python library for reading and writing VCF. It comes with a handy little utility, vcf_filter, which does what it says. The reading part seems to be the most solid, with writing and filtering still a little green. But the code is nicely readable, and its maintainers are friendly to contributions.

SnpEff enhances VCF files by annotating variants and calculating the effects they produce on known genes, for example, amino acid changes, frame shifts and splice site changes.

VCF resources