Here's another quick R vignette, in case I pick this up later and need to remind myself where I got stuck. I was trying to use R for a bit of basic sequence analysis, with mixed results.
First, install the BSgenome package, which is part of Bioconductor. Get GeneR while you're at it.
> source("http://bioconductor.org/biocLite.R") > biocLite("BSgenome") > biocLite("GeneR")
Follow the instructions in the document How to forge a BSgenome data package. You'll need to get fasta files from somewhere such as NCBI's Entrez Genome. Another nice data source is Regulatory Sequence Analysis Tools.
I created a BSgenome package for our favorite model organism Halobacterium salinarum NRC-1, which I named halo for short. Now, I can ask what sequences make up the halo genome and find out how long they are.
> library(BSgenome.halo.NCBI.1) > seqnames(halo) [1] "chr" "pNRC200" "pNRC100" > seqlengths(halo) chr pNRC200 pNRC100 2014239 365425 191346 > length(halo$chr) [1] 2014239
There are a few things I wanted to do next. First, I wanted to load a list of genes with their coordinates. That should allow me to quickly get the sequence for each gene, or get sequence of upstream regions for regulatory motif finding. Second, if I'm going to find any new protein coding regions, I'd like to have a function that could take a stretch of DNA and find ORFs (open reading frames). As far as I can tell, all there is to ORF finding is searching each reading frame for long stretches that start with a methionine (AUG) and end with a stop codon (UAG, UGA, and UAA ). Maybe there's more to it than that.
This is where I left off. GeneR seems to use an entirely different way of encoding sequence based on buffers. I have to admit to being a little disappointed. I hope it's just my cluelessness and there's really a reasonable way to do this kind of thing in R and Bioconductor.