Test Driven Design

So these days, I’ve started to put together a platform for some work at work. Something involving timing and scheduling. (Yay for woefully vague descriptions of what I’m doing.) Now of course, the interesting tidbit about this is not the software itself, as code is code, and in the end, there are many end products. The cool part is the design process and architecture. And for me, this means a first venture into TDD (Test Driven Design).

On my own, without any outside suggestion to do so, I am experimenting with TDD for this project. Just making it clear that this is fully my idea and my experiment. But why would I want to do such a thing to myself? Why would I willingly cast myself into the world of unit tests? It turns out that this is supposed to be a platform-type piece, so in the back of my head, when I started drawing out a design on paper, my thoughts traveled to how hard this is going to be to maintain down the line. After all, I am just one code-monkey-type person, and one person can only do so much if all he does is sit around and fix bugs in his prior projects. This more or less led me down the thought-path of: I should probably make some sort of attempt to validate what I’m doing. Make sure I’m doing it right. And make sure that when I change something, it doesn’t break everything.

And this is reinforced by my choice of language. The first draft of the first class was done in Scala. But as one of my co-workers put it, I more or less got fed up with Scala’s functional nature, its desire to hate mutable objects. So I rebelled. Hard. Into the most mutable language I could find. So this quickly became a JavaScript/Node.js project. The only downside to Node, in my opinion, is the fact that Node has no constants. So the only way to have personal guarantees of functionality is to write unit tests for them. Ruby, another dynamic language, has a similar philosophy of development, to my knowledge. For a testing platform, I’m using Mocha and Chai.
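To make that concrete, here’s a minimal sketch of what I mean (the config module and its TICK_INTERVAL_MS value are made-up stand-ins, not my actual code): since Node won’t enforce a constant for you, a test can at least yell when one of your “constants” drifts.

var expect = require('chai').expect;

// Hypothetical config.js, exporting what we *treat* as constants:
// module.exports = { TICK_INTERVAL_MS: 250 };
var config = require('./config');

describe('config "constants"', function () {
    it('keeps TICK_INTERVAL_MS at its agreed-upon value', function () {
        expect(config.TICK_INTERVAL_MS).to.equal(250);
    });
});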

So we set out to write code, and to write unit tests for that code. Of course, I come at this from a… code-off-the-top-of-your-head mentality, I suppose. I actually have no idea what you’d call my development style, but I attempted to incorporate some unit tests into it.

To start with, I wrote a function, then I wrote some tests for that function. It was good. It made me feel a hair more certain that my function actually worked. Then I thought up what I wanted to do next, wrote some skeleton code, wrote some tests, and then went back into make-all-the-unit-tests-pass mode.
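As an illustration of that rhythm, here’s roughly what a function-plus-tests pair looks like with Mocha and Chai (the clamp function is an invented example, not from the actual project):

// lib/math.js (hypothetical module under test):
// module.exports.clamp = function (value, min, max) {
//     return Math.min(Math.max(value, min), max);
// };
var expect = require('chai').expect;
var clamp = require('./lib/math').clamp;

describe('clamp', function () {
    it('leaves in-range values alone', function () {
        expect(clamp(5, 0, 10)).to.equal(5);
    });
    it('pins out-of-range values to the nearest bound', function () {
        expect(clamp(-3, 0, 10)).to.equal(0);
        expect(clamp(42, 0, 10)).to.equal(10);
    });
});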

Today, I hit a new mark: I coded the unit tests first. I’ve never done that before. So let’s see where this monstrosity is headed…
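In other words, the test file now exists before the code it describes (the Scheduler API below is hypothetical, just to show the shape of test-first): the tests fail until a real implementation catches up.

var expect = require('chai').expect;
var Scheduler = require('./scheduler'); // scheduler.js hasn't been written yet

describe('Scheduler', function () {
    it('fires a job after the requested delay', function (done) {
        var s = new Scheduler();
        // schedule(delayMs, callback): Mocha fails this test until an
        // implementation actually calls us back.
        s.schedule(10, function () {
            done();
        });
    });

    it('knows how many jobs are pending', function () {
        var s = new Scheduler();
        s.schedule(1000, function () {});
        expect(s.pendingCount()).to.equal(1);
    });
});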

Kill The Persistence

One of the less fun things about playing with a lot of data is that you never seem to have enough memory to hide it all in. Even in the magical land of the Hadoops, these issues can show up if you forget about the scale of your problem. Though, honestly, you should ignore the scale of the problem on truly big problems, as thinking about it too much only causes headaches.

Instead, just remember that when you play with a lot of data, everything must DIE! And the quicker you, the glorious general-programmer, perform the deed, the better. After all, every element we leave alive is another element that exists somewhere, eating our precious resources.

It’s also funny to think that on this scale, even Java programmers have to worry about memory management. (GASP! Java programmers worrying about memory management? Isn’t that why we ran away from the (arguably) better (promised) lands of C/C++? 😛 )

Though of course, examples are needed for anything… So, a simple example: a Hadoop reducer.

protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    JsonParser parser = new JsonParser();
    LinkedList<JsonObject> objects = new LinkedList<JsonObject>();
    JsonObject user = null;
    // First pass: buffer every non-user record so we can join it to the user later.
    for (Iterator<Text> itt = values.iterator(); itt.hasNext();) {
        JsonObject object = parser.parse(itt.next().toString()).getAsJsonObject();
        if (object.has("userId")) {
            user = object;
        } else {
            objects.add(object);
        }
    }
    // Second pass: emit every buffered record, tagged with the user we found.
    for (Iterator<JsonObject> itt = objects.iterator(); itt.hasNext();) {
        JsonObject object = itt.next();
        object.add("userId", user);
        context.write(key, new Text(object.toString()));
    }
}

Very few lines of code, a grand total of two loops. And do you know what’s really special about this function? It does this when used on a real dataset:

com.google.gson.JsonParseException: Failed parsing JSON source: JsonReader at line 1 column 1621 to Json
    at com.google.gson.JsonParser.parse(JsonParser.java:88)
    at com.google.gson.JsonParser.parse(JsonParser.java:59)
    at com.google.gson.JsonParser.parse(JsonParser.java:45)
    at hadoop.sampling.JoinReducer.reduce(JoinReducer.java:22)
    at hadoop.sampling.JoinReducer.reduce(JoinReducer.java:1)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1178)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.gson.stream.JsonReader.nextQuotedValue(JsonReader.java:977)
    at com.google.gson.stream.JsonReader.nextName(JsonReader.java:783)
    at com.google.gson.internal.bind.TypeAdapters$25.read(TypeAdapters.java:667)
    at com.google.gson.internal.bind.TypeAdapters$25.read(TypeAdapters.java:642)
    at com.google.gson.internal.Streams.parse(Streams.java:44)
    at com.google.gson.JsonParser.parse(JsonParser.java:84)
    … 12 more

So, truly productive behavior. The real question, though, is why? Why would seemingly innocent and normal code blow up on us like that? (Yes, I know some of you may be facepalming already after looking at the function above. But if there weren’t a problem with it, what would there be to write about?)

The big issue is that we are attempting to have this do work on a massive dataset. Think tens of terabytes of data. That scale of data. So any data that we expect to persist needs to be destroyed. After all, with such a dataset, even if it were only 1 million entries, a single “Java boolean” per entry would consume 1,000,000 bytes. (Java is silly and uses 8-bit booleans, yes…) (And OK, maybe that number is deceiving… as it is tiny… honestly, about a single megabyte. But just imagine the bloat on something bigger than a byte.)

But our goal is to kill all the factors that can eat memory, because we are going to be “memory-aware” Java programmers. So we need to kill the persistence, in regards to data. Now, as far as persistent, growing things go, the likely suspects are anything that Java would consider a collection. Of course, in our little snippet, the cause of all our troubles was the LinkedList. It grew and grew until the GC died. And unlike in C, when the GC gets desperate, it just starts to eat all your CPU time in a futile attempt to salvage the situation.

Task attempt failed to report status for 617 seconds. Killing!

Poor task. It died a horrible, self-digestive death. Killed by its own garbage collector. (Let’s see that happen to a C program. No! A C program would just start overwriting itself. A way better solution. :P)

How do we fix it? Get rid of unnecessary collections. After all, how many times do we use a collection when a bit of elegance would have done? Or make a second collection that is near-identical to one we already have?

protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    JsonParser parser = new JsonParser();
    // LinkedList<JsonObject> objects = new LinkedList<JsonObject>(); // Memory issues!!!!!!!!!! KILL ALL THE PERSISTENCE!
    JsonObject user = null;
    // First pass: find the user record and nothing else. No buffering.
    for (Iterator<Text> itt = values.iterator(); itt.hasNext();) {
        JsonObject object = parser.parse(itt.next().toString()).getAsJsonObject();
        if (object.has("userId")) {
            user = object;
            break;
        }
    }
    // Second pass: tag and emit each record as we see it, holding on to nothing.
    for (Iterator<Text> itt = values.iterator(); itt.hasNext();) {
        JsonObject object = parser.parse(itt.next().toString()).getAsJsonObject();
        if (!object.has("userId")) {
            object.add("userId", user);
            context.write(key, new Text(object.toString()));
        }
    }
}

And it’s arguably sexier code? Maybe? But alas, remember: Kill the Persistence.

(Tangent: I’m not saying “Big Data”… Big Data has soooo many issues… Even if the data in “lots of data” is larger than what some “Big Data” professionals get to play with… Don’t give in to the hype, kids. It isn’t cool.)