Friday, March 20, 2009

More Hacking NCBI

Writing scripts to interface with NCBI's web site has its challenges. Getting data from the UCSC genome browser is simpler.

If you need a list of complete genomes, you can get one from the NCBI Genome database. One such list comes from the genlist.cgi script. The type parameter seems to be a flag that limits the list to chromosomes, plasmids, or organelle-specific sequences. The name parameter seems to be there only for looks. So far, I haven't figured out how to make genlist spit out either XML or text.

Two other scripts, lproks and leuks, can produce text output.

Both can be scripted with parameters like these: view=1 dump=selected p3=11:|12:Green Algae. This information is also available by FTP from ftp://ftp.ncbi.nih.gov/genomes/genomeprj/. There are three lproks.txt files, which look to correspond to the three tabs: Organism info, Complete genomes, and Genomes in progress. lproks_1.txt is the one we want. There's a lot of good information in the FTP directories to plunder.
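As a rough illustration of passing those parameters from a script, here is a minimal Python sketch that percent-encodes them into a query string. The base URL is a placeholder, since the full address of lproks.cgi isn't spelled out above, and build_query_url is just a name made up for this example. Whether the script wants the p3 value encoded exactly this way is also an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder address (an assumption): substitute the real location of lproks.cgi.
BASE_URL = "https://example.org/lproks.cgi"

# The parameters mentioned above, kept as plain strings.
PARAMS = {
    "view": "1",
    "dump": "selected",
    "p3": "11:|12:Green Algae",
}


def build_query_url(base_url, params):
    """Append params to base_url as a percent-encoded query string."""
    return base_url + "?" + urlencode(params)


url = build_query_url(BASE_URL, PARAMS)
print(url)
```

The colons, the pipe, and the space in the p3 value are not safe to paste into a URL as-is, which is why urlencode is used rather than joining the strings by hand. The resulting URL could then be handed to urllib.request.urlopen or to curl.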

There seems to be yet a third script: GenomesGroup.cgi. This one is linked from the Virus genomes page.

If I really wanted to suffer, I'd look into NCBI's source. Does anyone know where the source of lproks.cgi or genlist.cgi is? Is it part of the NCBI C++ Toolkit (which is on MacPorts here)? Maybe it's buried in NCBI's FTP site? Maybe I should ask the NCBI Information Engineering Branch? Maybe I need to start doing something more productive!