I have a lot of little scripts that read tab-delimited text files, 'cause that's the way things get done in the creaky world of bioinformatics. Due to the limitations of flat files, you often see an attributes field, which holds key-value pairs in some more-or-less parsable way, typically something like this:
key1=value1;key2=value2;
Parsing this with regular expressions is pretty easy. Take this fragment of Ruby:
attributes = {}
File.foreach('attributes.txt') do |line|
  line.chomp!
  line.scan(/([^;=]+)=([^;]+)/) do |key, value|
    attributes[key] = value
  end
end
attributes.each_pair do |key, value|
  puts("#{key}=#{value}")
end
Because I'm a knucklehead, I decided to make it hard. What if we wanted to allow escaped delimiter characters, say \= and \; (plus \\ for a literal backslash)? RegexGuru tells you that Split() is Not Always The Best Way to Split a String. This will work:
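# Strip one level of backslash escaping in a single pass. Chaining three
# gsub calls would unescape twice on input like a\\=b.
def unescape(string)
  string.gsub(/\\([\\=;])/) { $1 }
end

attributes = {}
File.foreach('attributes.txt') do |line|
  line.chomp!
  # Keys may contain \\, \= and \;. Values may contain \\ and \; (a bare = is fine in a value).
  line.scan(/((?:\\\\|\\=|\\;|[^;=])+)=((?:\\\\|\\;|[^;])+)/) do |key, value|
    attributes[unescape(key)] = unescape(value)
  end
end
attributes.each_pair do |key, value|
  puts("#{key}=#{value}")
end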
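To see the escape handling without needing a file on disk, here is a rough sketch that wraps the same regular expression in a method and runs it on a string. The method name parse_attributes and the sample input are made up for illustration, and unescape is written as a single-pass substitution so that an escaped backslash followed by a delimiter is not unescaped twice.

```ruby
# Strip one level of backslash escaping in a single pass.
def unescape(string)
  string.gsub(/\\([\\=;])/) { $1 }
end

# Hypothetical helper: parse "key=value;" pairs from one line,
# honoring the \\, \= and \; escapes.
def parse_attributes(line)
  attributes = {}
  line.scan(/((?:\\\\|\\=|\\;|[^;=])+)=((?:\\\\|\\;|[^;])+)/) do |key, value|
    attributes[unescape(key)] = unescape(value)
  end
  attributes
end

# In a single-quoted Ruby literal, \\ is one backslash and \; or \= is
# left alone, so the text being parsed is:  name=a\;b;x\=y=1;path=c:\\tmp;
sample = 'name=a\;b;x\=y=1;path=c:\\\\tmp;'

parse_attributes(sample).each_pair do |key, value|
  puts("#{key} => #{value}")
end
```

The three pairs come out with the escapes removed: name maps to a;b, the key x=y maps to 1, and path maps to c:\tmp.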
Take-home message: this is way more trouble than it's worth. Also, don't forget that split treats empty fields differently from one language to another.
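A quick illustration of the empty-field gotcha, in Ruby only. The sample line is made up. Ruby's String#split drops trailing empty fields unless you pass a negative limit, whereas Python's str.split with an explicit separator keeps them, so the same tab-delimited row can come back with a different number of columns.

```ruby
line = "a\tb\t\t"  # four tab-delimited fields, the last two empty

# By default, String#split drops trailing empty fields.
default_fields = line.split("\t")

# A negative limit tells split to keep them.
all_fields = line.split("\t", -1)

puts(default_fields.length)  # 2
puts(all_fields.length)      # 4
```

If your columns are positional, passing -1 is the safer habit.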