The analysis code for organizing my music collection was fun to write. I'm using my basic environment of Java 7, Guava, and SLF4J. For serialization I'm using Jackson, and for stopwatch functionality I'm using Spring, which has to be the stupidest possible reason to use Spring, but since I'm using Maven for dependency management, it amounts to adding a line in a config file that says 'use Spring'.
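For the curious, the stopwatch class is `org.springframework.util.StopWatch`, which lives in spring-core, and the Maven 'line in a config file' is just a dependency block like this (the version shown is an example from the Java 7 era, not necessarily the one I'm using):

```xml
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>4.3.9.RELEASE</version> <!-- example version -->
</dependency>
```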
Going through 17k music files is prone to being slow, so I'm using multiple threads. I'm pretty sure I'll actually be bottlenecked on disk IO, since I'm hosting all these files on physical platters, not having a spare >80GB SSD to park them on. As I winnow the collection down I may move it to the SSD so future operations are faster. Regardless, I still want the multiple threads because I'm going to be running a lot of hash functions on each file, so I want to saturate my CPUs where I can at least. I have 8 cores (well, 4 if you don't count hyper-threading), so I figured 6 threads wouldn't impact the responsiveness of the OS while I was running this, and it didn't.
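The shape of the worker setup is nothing exotic; a minimal sketch with a fixed pool of 6 threads hashing byte arrays looks something like this (the class and method names here are mine, not from my actual code, and I've picked SHA-1 arbitrarily as the digest):

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HashPool {
    // Hex-encode a digest of the given bytes; SHA-1 is an arbitrary choice,
    // any MessageDigest algorithm plugs in the same way.
    static String hash(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // 6 worker threads, leaving a couple of cores' worth of headroom for the OS
        ExecutorService pool = Executors.newFixedThreadPool(6);
        List<Future<String>> results = new ArrayList<Future<String>>();
        for (final byte[] fileBytes : new byte[][] { "a".getBytes(), "b".getBytes() }) {
            results.add(pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    return hash(fileBytes); // in the real thing, fileBytes is a whole MP3
                }
            }));
        }
        for (Future<String> f : results) System.out.println(f.get());
        pool.shutdown();
    }
}
```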
However, using multiple threads is prone to race conditions, especially since I've got code that tries to report progress through the data every 100 items processed or so.
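The usual fix for that particular race is an atomic counter, so that no two threads can claim the same count. A minimal sketch (again, my naming, not the actual code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Progress {
    private static final AtomicInteger processed = new AtomicInteger();

    // Called from each worker thread when an item finishes. incrementAndGet()
    // is atomic, so exactly one thread sees each multiple of 100.
    static void recordOne() {
        int n = processed.incrementAndGet();
        if (n % 100 == 0) {
            System.out.println(n + " files processed");
        }
    }

    static int count() { return processed.get(); }

    public static void main(String[] args) {
        for (int i = 0; i < 250; i++) recordOne(); // prints at 100 and 200
    }
}
```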
Finally, Mp3agic, the MP3 parsing library I'm using, is a) not 100% bug free and b) further customized on my local machine from the distribution on GitHub. My initial work with Mp3agic was to submit some code to fix the handling of UTF-16 text data in ID3 tags. Some IO refactoring had been done, and there was an issue with code that was supposed to be counting characters and was instead counting bytes. This isn't normally an issue, since mostly you just encounter UTF-8 with no multibyte characters. I mean, seriously, who *cough* Amazon *cough* would put UTF-16 into the comment field of an ID3 tag if they didn't have to? So I fixed that. However, the Mp3agic code is written to work against File objects, and internally does a lot of seeking with a RandomAccessFile object when it parses the file. This strikes me as a waste because I've already turned the file into a byte array so that I can hand it to the hashing function for the whole-file hash, so I rewrote the MP3 parsing code to work directly against a byte array. But that's a non-trivial change, and I can't run the unit tests on the library because I've broken some bits of the interface that I don't feel like fixing and that aren't of any use to me.
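The chars-versus-bytes confusion is easy to reproduce: UTF-16 uses two bytes per character for ordinary text, plus a two-byte byte-order mark, so a parser that counts bytes where it should count characters reads roughly twice as far as it should. A quick illustration (the comment string is made up, but it's the kind of thing you find in those tags):

```java
import java.io.UnsupportedEncodingException;

public class CharsVsBytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String comment = "Amazon.com Song ID: 123";
        // 23 characters, but 48 bytes in UTF-16 (2 per char + 2-byte BOM)
        System.out.println(comment.length());                  // character count: 23
        System.out.println(comment.getBytes("UTF-16").length); // byte count: 48
    }
}
```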
So, between the multiple threads and the MP3 library instability, I want to make sure that my code persists all the data it has so far every so often, and in addition, if I restart the program, it will load this checkpoint data and only work on stuff it hasn't already done. I actually didn't initially do this, but discovered a small bug in my Mp3agic changes after processing about 500 files on my first run, so I went ahead and implemented it.
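Since Jackson is already on the classpath for serialization, the checkpoint boils down to dumping a map to JSON and reading it back on startup. A minimal sketch of the pattern, not my actual checkpoint format; here I'm assuming a map of file path to hash, where any path present in the map is already done:

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class Checkpoint {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // On startup: load prior results, or start empty on the first run.
    static Map<String, String> load(File f) throws Exception {
        if (!f.exists()) return new HashMap<String, String>();
        return MAPPER.readValue(f, new TypeReference<Map<String, String>>() {});
    }

    // Every so often: rewrite the whole checkpoint file with results so far.
    static void save(File f, Map<String, String> results) throws Exception {
        MAPPER.writeValue(f, results);
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("checkpoint", ".json");
        Map<String, String> results = new HashMap<String, String>();
        results.put("track.mp3", "deadbeef"); // hypothetical path and hash
        save(f, results);
        System.out.println(load(f)); // same map, round-tripped through JSON
    }
}
```

In the processing loop, each file is then skipped with a simple `results.containsKey(path)` check before any parsing or hashing happens.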
Of course, having done that, it went all the way through the files on the next pass. The final result is about 500 files that aren't parsable, and 5393 unique audio hashes. Of those, only 71 have just one file, and therefore aren't duped. So maybe I'll be able to move processing to the SSD pretty soon. On the other hand, I'm not sure I want to arbitrarily delete all but one of each of these files. Ideally I'd like to make sure I keep the file with the most complete and accurate metadata, but that's not really going to be easy to determine heuristically.
The low hanging fruit is there, though: fully duplicated files. There are 10944 file hashes, with 5204 of them having more than one file. So right off the bat I should be able to get rid of about 5k files.
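That cleanup is just a group-by on the whole-file hash, keeping one file per group. A sketch of the idea (the method and names are illustrative, and "first file seen wins" is the arbitrary keep rule here):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Dedup {
    // Given path -> whole-file hash, return the files safe to delete:
    // everything after the first file seen in each hash group.
    static List<String> deletable(Map<String, String> hashByPath) {
        Map<String, String> keeper = new HashMap<String, String>(); // hash -> kept path
        List<String> doomed = new ArrayList<String>();
        for (Map.Entry<String, String> e : hashByPath.entrySet()) {
            if (keeper.containsKey(e.getValue())) {
                doomed.add(e.getKey()); // identical content already kept once
            } else {
                keeper.put(e.getValue(), e.getKey());
            }
        }
        return doomed;
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<String, String>();
        m.put("a.mp3", "h1"); // hypothetical paths and hashes
        m.put("b.mp3", "h1");
        m.put("c.mp3", "h2");
        System.out.println(deletable(m)); // one of a.mp3/b.mp3 is deletable
    }
}
```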