Wednesday, October 21, 2009

Super Debug Fun Time

Today I spent virtually my entire day on a single bug. The bug itself was reported as a timeout error by the client accessing a web service, but I suspect that's because of a client programming error. In fact when I tried to reproduce the error I got a response from the server detailing the error, kind of. It told me what exceptions had been thrown (a runtime exception caused by a transform exception caused by a null pointer exception) but not the actual stack traces. Attempting to reproduce this by running the web service on my own machine didn't get the same result though, it worked.

This was somewhat frustrating because in my experience, bugs reported against this particular web service are usually client errors and not actual bugs. That it was working on my machine but not on the test machine indicated otherwise. That still left the second most likely cause which would not involve a new release, i.e. data bugs. The machine where the failure was observed was pointing to a different database than I was in my local testing, so I switch the DB target on my laptop and tried again. Still no failure. Now we're getting into the 'crap, I have to do actual work' phase of debugging.

This led to about a half hour of fighting with remote debugging. Not because its hard or anything, but because the machine where the error was occuring only seems to have one unfirewalled port and that's the one the server is running on, leaving me no way to connect my debugger to the running instance of tomcat and actually test a the same time. Finally I realized I didn't have to open a port, since I had SSH access to the machine I could just use port forwarding. Someday I'll write another blog entry entitled "Port Forwarding, or why IT has no chance of actually isolating me from the production environment, ever".

Remote debugging allowed me to isolate the null pointer access fairly quickly, but since it turned out to be deep in some JDK XML formatting code, this brought me up short. Surely this isn't a JDK bug. I mean I've encountered JVM/JDK bugs before, but its extremely rare, and when you do find them, once you've isolated them to a given class, it usually only takes about 10 seconds of googling to find the actual bug report. Since I didn't find any such report this would imply an unlogged JDK bug, which is so unlikely as to be pretty much impossible. Especially since the code worked on my local machine and therefore if it was a JDK bug it would almost certainly have to be one fixed between build 12 and 16 of JDK 6, that being the difference between the two machines.

Going back to debugging, I decided to walk the code through the error and watch the values of the variables to try to find out when the null value that was being accessed was actually created. Stepping through the code line by line is fairly tedious, especially when you come out the other end and the error doesn't occur. Now that's weird. I tried this again, first running the code without stepping through it, and then running it again line by line. Same result: if I just ran the code it threw an exception, but if I stepped through the code line by line, it disappeared. This is known as a Heisenbug after Werner Heisenberg, a pioneer in quantum physics. It refers to a bug that only appears when you aren't looking too closely at it.


Stepping through the code a few more times I notice that sometimes the bug does occur. Eventually I'm able to divine that the bug only shows up if I don't actually examine the objects that are being manipulated in the debugger view. Realizing this it doesn't take too long to isolate exactly the object in question and the lines where the critical code is. In this case the critical error (the insertion of null values where there should be none) is happening in a different location from the actual exception being thrown, which is thrown when something actually tries to manipulate said null values.

The sequence of events is this
  • Document a gets created with a single parent node
  • Document b gets created which a bunch of child nodes under a single parent node
  • Document b's parent node is adopted by document a
  • Document a is rendered, throwing an exception
If you look at document b's elements before the adoption, they all look fine. If you inspect document b in the debugger and before its adopted by document a, everything runs fine. If you don't inspect document b and then try to render document a after it adopts the nodes from b, you get the error. I notice that the class name for the nodes I'm manipulating all begin with 'Deferred', as in DeferredTextNSImpl for a Text node. Further I notice that when I examine document b in the debugger its internal state changes, as if deferred actions are being taken in order to do the rendering, and such actions are causing the rest of the code to work fine. Light #1 comes on. Not inspecting the document b in the debugger means those deferred actions don't take place prior to the import and thus the import happens incorrectly. I still have trouble believing this because that would definitely be a JDK bug.

All of the functions involving document manipulation are part of the core JDK, and on my local machine I can debug right into them just fine, but via the remote debugger I can't. Chalking this up to the JDK difference I install the older JDK being used on my desktop. However, even after this I'm unable to duplicate the bug on my local machine. I decide to go back to remote debugging, being careful what I inspect in the debugger window, and deciding to trace into the JDK calls, hoping that having the exact same version will allow me to step into the functions. No luck though. Now I start fighting with the IDE trying to find out why it won't show me the source code for the document objects or let me trace into them. I discover oddly, that it refuses to show me the class that is being used. This is actually unprecedented. All the classes I use in the target environment should be on my classpath and even if eclipse doesn't have the source code for some of them it should always show me the exact class being used, and what jar it belongs to. Now I'm getting suspicious.

I take a look at the environment where the error is occuring and look at the available libraries. They all look fine... except I suddenly notice one file 'xercesImpl-2.6.2.jar' on the target machine. Xerces is the Apache Project's XML library and in fact its the one integrated into JDK 5 and up. There shouldn't be any 'implementation' jar in the classpath at all, certainly not some file from 2005. Looking at my own debugger I notice that on my machine it isn't listed. Clearly we've found the culprit. Now to figure out the cause.

We use Maven to manage dependencies and maven will tell you exactly how everything you're using got included, i.e. either directly or transitively, and from what parent dependency. Running this on my machine I see no listing for the xerces jar. I go to the machine where the project was actually built and do the same thing and there it is. We depend on XFire, which depends on Jaxen, which depends on XercesImpl, but on my machine Jaxen doesn't trigger any dependencies at all. Realizing exactly where the problem lies, I do some testing and discover we're not using the Jaxen functionality at all so I exclude it manually from the XFire dependency and do a new build, closing the bug.

It looks very much like my issue is related to this but I still can't determine why exactly my machine doesn't pick up the additional dependencies. They're listed in the POM on my local machine and both systems are running the same version of Maven. The sad thing is that in the space of several years using Maven, I've now experienced two issues related to weird dependency resolution in the space of two weeks, and almost none previously.

No comments:

Post a Comment