Kill The Persistence

One of the less fun things about playing with a lot of data is that you never seem to have enough memory to hide it all in. Even in the magical land of the Hadoops, these issues can show up if you forget about the scale of your problem. Though, honestly, you should ignore the scale of the problem on truly big problems, as thinking about it too much only causes headaches.

Instead, just remember that when you play with a lot of data, everything must DIE! And the quicker you, the glorious general-programmer, perform the deed, the better. After all, every element we leave alive is another element that exists somewhere, eating our precious resources.

It's also funny to think that on this scale, even Java programmers have to worry about memory management. (GASP! Java programmers worrying about memory management? Isn't that why we ran away from the (arguably) better (promised) lands of C/C++? 😛)

Though of course, examples are needed for anything…

So, a simple example. A Hadoop reducer.

protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    JsonParser parser = new JsonParser();
    LinkedList<JsonObject> objects = new LinkedList<JsonObject>();
    JsonObject user = null;
    // First pass: find the user record, stash everything else in the list.
    for (Iterator<Text> itt = values.iterator(); itt.hasNext();) {
        JsonObject object = parser.parse(itt.next().toString()).getAsJsonObject();
        if (object.has("userId")) {
            user = object;
        } else {
            objects.add(object);
        }
    }
    // Second pass: tag every stashed record with the user and write it out.
    for (Iterator<JsonObject> itt = objects.iterator(); itt.hasNext();) {
        JsonObject object = itt.next();
        object.add("userId", user);
        context.write(key, new Text(object.toString()));
    }
}

Very few lines of code, a grand total of 2 loops. And do you know what’s really special about this function? It does this when used on a real dataset:

com.google.gson.JsonParseException: Failed parsing JSON source: JsonReader at line 1 column 1621 to Json
    at com.google.gson.JsonParser.parse(JsonParser.java:88)
    at com.google.gson.JsonParser.parse(JsonParser.java:59)
    at com.google.gson.JsonParser.parse(JsonParser.java:45)
    at hadoop.sampling.JoinReducer.reduce(JoinReducer.java:22)
    at hadoop.sampling.JoinReducer.reduce(JoinReducer.java:1)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1178)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.gson.stream.JsonReader.nextQuotedValue(JsonReader.java:977)
    at com.google.gson.stream.JsonReader.nextName(JsonReader.java:783)
    at com.google.gson.internal.bind.TypeAdapters$25.read(TypeAdapters.java:667)
    at com.google.gson.internal.bind.TypeAdapters$25.read(TypeAdapters.java:642)
    at com.google.gson.internal.Streams.parse(Streams.java:44)
    at com.google.gson.JsonParser.parse(JsonParser.java:84)
    … 12 more

So, truly productive behavior. The real question, though, is why? Why would seemingly innocent and normal code blow up on us like that? (Yes, I know some of you may be facepalming just by looking at the function above. But if there wasn’t a problem with it, what would there be to write about?)

The big issue is that we are asking this code to work on a massive dataset. Think tens of terabytes of data. That scale of data. So any data that we expect to persist needs to be destroyed. After all, with such a dataset, even if it were only 1 million entries, a single “Java boolean” per entry would add up to 1,000,000 bytes. (Java is silly and uses 8-bit booleans, yes…) (And OK, maybe that number is deceiving… as it is tiny… honestly. About a single megabyte. But just imagine the bloat on something bigger than a byte.)

But our goal is to kill all the factors that can eat memory, because we are going to be “memory-aware” Java programmers. So we need to kill the persistence, as far as data is concerned. Now, when it comes to persistent, growing things, the likely suspects are anything that Java would consider a collection. Of course, in our little snippet, the cause of all our troubles was the LinkedList. It grew and grew until the GC died. And unlike in C, when the GC gets desperate, it just starts to eat all your CPU time in a futile attempt to salvage the situation.
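To put some rough numbers on it, here is a back-of-envelope sketch. Everything in it is an assumption for illustration (the record count per key, the per-object and per-node byte estimates), not a measurement of Gson or LinkedList internals:

// Hypothetical napkin math for what the LinkedList costs on one hot key.
// All sizes below are illustrative assumptions, not measured values.
public class HeapNapkinMath {
    public static void main(String[] args) {
        long valuesPerKey = 10_000_000L;   // assume one hot key sees 10 million records
        long bytesPerJsonObject = 500L;    // assume ~500 bytes per parsed JsonObject
        long bytesPerListNode = 48L;       // assume ~48 bytes of node + reference overhead

        long totalBytes = valuesPerKey * (bytesPerJsonObject + bytesPerListNode);
        System.out.printf("One greedy key: ~%.1f GB just sitting in the list%n",
                totalBytes / (1024.0 * 1024.0 * 1024.0));
        // Roughly 5 GB held live on a task heap that is usually 1-2 GB. Oops.
    }
}

None of that memory can be reclaimed, because the list is still holding a reference to every single object, so the GC just spins. And once the task stops making progress, Hadoop eventually loses patience: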

Task attempt failed to report status for 617 seconds. Killing!

Poor task. It died a horrible, self-digestive death, killed by its own garbage collector. (Let’s see that happen to a C program. No! A C program would just start overwriting itself. A way better solution. :P)

How do we fix it? Get rid of unnecessary collections. After all, how many times do we use a collection when a bit of elegance would have done, or build a second collection that is nearly identical to one we already have? Here, a second pass over the values does the trick:

protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    JsonParser parser = new JsonParser();
    // LinkedList<JsonObject> objects = new LinkedList<JsonObject>(); // Memory issues!!!!!!!!!! KILL ALL THE PERSISTENCE!
    JsonObject user = null;
    // First pass: find the user record, keep nothing else around.
    for (Iterator<Text> itt = values.iterator(); itt.hasNext();) {
        JsonObject object = parser.parse(itt.next().toString()).getAsJsonObject();
        if (object.has("userId")) {
            user = object;
            break;
        }
    }
    // Second pass: tag each record with the user and write it straight out.
    for (Iterator<Text> itt = values.iterator(); itt.hasNext();) {
        JsonObject object = parser.parse(itt.next().toString()).getAsJsonObject();
        if (!object.has("userId")) {
            object.add("userId", user);
            context.write(key, new Text(object.toString()));
        }
    }
}

And it’s arguably sexier code? Maybe?

But alas, remember: Kill the Persistence.

(Tangent: notice I’m not saying “Big Data”… Big Data has soooo many issues… even if the data in “lots of data” is larger than what some “Big Data” professionals get to play with… Don’t give in to the hype, kids. It isn’t cool.)