Monday, January 18, 2010

Split with escaped delimiters

I have a lot of little scripts that read tab-delimited text files, 'cause that's the way things get done in the creaky world of bioinformatics. Due to the limitations of flat files, you often see an attributes field, which holds key-value pairs in some more-or-less parsable way, typically something like this:

key1=value1;key2=value2;

Parsing this with regular expressions is pretty easy. Take this fragment of Ruby:

 attributes = {}
 File.foreach('attributes.txt') do |line|
   line.chomp!()
   line.scan(/([^;=]+)=([^;]+)/) do |key, value|
     attributes[key] = value
   end
 end

 attributes.each_pair do |key, value|
   puts("#{key}=#{value}")
 end
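
To try the pattern without a file on disk, here is a minimal sketch that wraps the same scan in a method and feeds it a string. The method name `parse_attributes` is my own invention for illustration, not something from a library.

```ruby
# Hypothetical helper: the same scan as above, applied to a single string.
def parse_attributes(line)
  attributes = {}
  line.chomp.scan(/([^;=]+)=([^;]+)/) do |key, value|
    attributes[key] = value
  end
  attributes
end

sample = parse_attributes("key1=value1;key2=value2;\n")
sample.each_pair do |key, value|
  puts("#{key}=#{value}")
end
```

For the sample line this should give a hash that maps key1 to value1 and key2 to value2. The trailing semicolon is harmless because neither capture group can match it.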

Because I'm a knucklehead, I decided to make it hard. What if we wanted to allow escaped delimiter characters? RegexGuru tells you that Split() is Not Always The Best Way to Split a String. Scanning for the fields instead of splitting on the delimiters will work:

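 # Undo the escapes in a single pass. Chaining gsub calls goes wrong on an
 # escaped backslash followed by a delimiter: \\= would collapse to a bare =.
 def unescape(string)
   string.gsub(/\\([\\=;])/) { $1 }
 end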

 attributes = {}
 File.foreach('attributes.txt') do |line|
   line.chomp!()
   line.scan(/((?:\\\\|\\=|\\;|[^;=])+)=((?:\\\\|\\;|[^;])+)/) do |key, value|
     attributes[unescape(key)] = unescape(value)
   end
 end

 attributes.each_pair do |key, value|
   puts("#{key}=#{value}")
 end
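
Here is a rough, self-contained check of the escaping rules on in-memory strings. It reuses the scan pattern from above. The names `unescape_single_pass` and `parse_escaped` are hypothetical, and the unescaping is done in one pass so that an escaped backslash is never re-read as the start of another escape.

```ruby
# Hypothetical helper: replace \\, \= and \; with the character they escape,
# consuming each escape sequence exactly once.
def unescape_single_pass(string)
  string.gsub(/\\([\\=;])/) { $1 }
end

# Hypothetical helper: the escaped-delimiter scan applied to a single string.
def parse_escaped(line)
  attributes = {}
  line.chomp.scan(/((?:\\\\|\\=|\\;|[^;=])+)=((?:\\\\|\\;|[^;])+)/) do |key, value|
    attributes[unescape_single_pass(key)] = unescape_single_pass(value)
  end
  attributes
end

# Single-quoted Ruby strings keep \= and \; as literal backslash sequences.
escaped = parse_escaped('a\=b=c\;d;')
escaped.each_pair do |key, value|
  puts("#{key} -> #{value}")   # prints "a=b -> c;d"
end
```

The escaped delimiters end up inside the key and the value rather than splitting them, which is the whole point of the exercise.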

Take-home message: This is way more trouble than it's worth. Also, don't forget that split treats empty fields differently from one language to another.
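
As a small illustration of the empty-field gotcha, here is how Ruby's own String#split behaves on a tab-delimited line with empty columns. The sample line is made up.

```ruby
# Made-up tab-delimited record with an empty second column and two empty
# trailing columns.
line = "gene1\t\t42\t\t"

# By default, Ruby drops trailing empty fields but keeps interior ones.
default_fields = line.split("\t")
p default_fields   # => ["gene1", "", "42"]

# A negative limit keeps every field, including the trailing empties.
all_fields = line.split("\t", -1)
p all_fields       # => ["gene1", "", "42", "", ""]
```

If your column count matters, pass the negative limit, and check what the equivalent call does in whatever other language you port the script to.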