Thursday, March 12, 2009

Split split

Apparently, there's some disagreement about what it means to split a string into substrings. Biological data frequently comes in good old-fashioned tab-delimited text files. That's OK 'cause they're easily parsed in the language and platform of your choice. Most languages with any pretension of string processing offer a split function. So, you read the files line-by-line and split each line on the tab character to get an array of fields.

The disagreement comes about when there are empty fields. Since we're talking text files, there's no saying, "NOT NULL", so it's my presumption that empty fields are possible. Consider the following JUnit test.

import org.apache.log4j.Logger;
import static org.junit.Assert.*;
import org.junit.Test;

public class TestSplit {
  private static final Logger log = Logger.getLogger("unit-test");

  @Test
  public void test1() {
    String[] fields = "foo\t\t\t\t\t\t\tbar".split("\t");
    log.info("fields.length = " + fields.length);
    assertEquals(8, fields.length);
  }

  @Test
  public void test2() {
    // 7 tabs
    String[] fields = "\t\t\t\t\t\t\t".split("\t");
    log.info("fields.length = " + fields.length);
    assertEquals(8, fields.length);
  }
}

The first test works. You end up with 8 fields, of which the middle 6 are empty. The second test fails. You get an empty array, because Java's split throws away trailing empty strings and here every field is empty. I expected this to return an array of 8 empty strings. Java's mutant cousin, JavaScript, gets this right, as does Python.

Rhino 1.6 release 5 2006 11 18
js> a = "\t\t\t\t\t\t\t";
js> fields = a.split("\t")
,,,,,,,
js> fields.length
8

Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
>>> a = "\t\t\t\t\t\t\t"
>>> fields = a.split("\t")
>>> fields
['', '', '', '', '', '', '', '']
>>> len(fields)
8

Oddly enough, Ruby agrees with Java, as does Perl.

>> str = "\t\t\t\t\t\t\t"
=> "\t\t\t\t\t\t\t"
>> fields = str.split("\t")
=> []
>> fields.length
=> 0

My Perl is way rusty, so sue me, but I think this is more or less it:

$str = "\t\t\t\t\t\t\t";
@fields = split(/\t/, $str);
print("fields = [@fields]\n");
$len = @fields;
print("length = $len\n");

Which yields:

fields = []
length = 0

How totally annoying!
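
For what it's worth, Java does have a way to get the answer I expected. The two-argument form of String.split takes a limit, and a negative limit keeps trailing empty strings instead of dropping them. Here's a minimal sketch; the class name SplitKeepEmpty and the helper splitFields are made up for illustration, and it assumes every tab really is a field separator.

```java
import java.util.Arrays;

public class SplitKeepEmpty {
  // Hypothetical helper: split a tab-delimited line, keeping every empty field.
  public static String[] splitFields(String line) {
    // A negative limit tells String.split not to discard trailing empty strings.
    return line.split("\t", -1);
  }

  public static void main(String[] args) {
    String sevenTabs = "\t\t\t\t\t\t\t";

    // Default behavior: trailing empty strings are removed, so nothing is left.
    System.out.println(sevenTabs.split("\t").length);   // 0

    // With the negative limit, all 8 empty fields survive.
    System.out.println(splitFields(sevenTabs).length);  // 8

    // A trailing empty field after "bar" is kept too.
    System.out.println(Arrays.toString(splitFields("foo\t\tbar\t")));  // [foo, , bar, ]
  }
}
```

Ruby's split and Perl's split accept a negative limit with the same effect, so the same trick should carry over there.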