Dumptruck Full of Bits: Fixing my music... part 3

So I now officially know more about ID3 tags than I ever really wanted to know. Actually I think I may have passed that mark when I did the original work on the Mp3agic library to debug why it wouldn't parse my tags. Regardless, I now know even more.

Many of the files I'm trying to dedupe turn out to have exactly the same size. My best guess is that MP3 frames have some minimum size, so if two files differ only in a few frames that fall under that minimum, the total file size comes out the same.

I've spent the bulk of my evening trying to figure out how to meaningfully compare the ID3 tag information from two files, and it ended up involving a lot of extensions to Mp3agic. The library itself seems to focus on preserving data exactly as it appears in the source file. For instance, every text field can be written with one of four possible encodings. If the same data is in two different files with two different encodings, the library preserves that information, so if you compare the two frames, they show up as different, even though the difference is purely academic. They represent the exact same information, but happen to have been written out by two different pieces of software that each had their own idea about the best encoding to use. I have my own opinion on the best encoding to use. It's called "UTF-8 or GTFO".
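To make the encoding point concrete, here is a minimal sketch in Python (not Mp3agic code, and not how the library does it). It assumes you already have the raw body of a text frame, meaning the one-byte encoding marker followed by the text, and it decodes that body to a plain string so two frames can be compared on content alone. The names `decode_text_frame` and `same_text` are made up for this example.

```python
# Encoding marker values that ID3v2.4 defines for text frames.
ID3_TEXT_CODECS = {
    0: "latin-1",    # ISO-8859-1
    1: "utf-16",     # UTF-16 with a byte order mark
    2: "utf-16-be",  # UTF-16 big-endian, no byte order mark
    3: "utf-8",
}


def decode_text_frame(body: bytes) -> str:
    """Decode a raw text frame body (encoding byte + text) to a str."""
    if not body:
        return ""
    codec = ID3_TEXT_CODECS[body[0]]
    text = body[1:].decode(codec)
    # Some writers append a null terminator, others don't; ignore it.
    return text.rstrip("\x00")


def same_text(a: bytes, b: bytes) -> bool:
    """True if two frame bodies carry the same text, whatever the encoding."""
    return decode_text_frame(a) == decode_text_frame(b)


latin1 = b"\x00Bj\xf6rk"
utf8 = b"\x03" + "Björk".encode("utf-8") + b"\x00"
print(latin1 == utf8)           # the raw bytes differ
print(same_text(latin1, utf8))  # the decoded text does not
```

Comparing the decoded strings rather than the raw bytes is the whole trick; a real implementation would probably also want Unicode normalization, which this sketch skips.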

Additionally, a number of the differences I'm seeing are attributable to the variations between minor versions of ID3v2. For instance, ID3v2.4 supports the TSOP frame type, which is for the artist name as it should be used for sorting, i.e. 'Beatles' instead of 'The Beatles'. ID3v2.3 didn't support this frame, but some programs apparently started populating a frame called XSOP, which serves the same purpose. 2.3 also had a number of different frames for parts of the recording date (TYER, TDAT and TIME among them), which have collectively been replaced by a single TDRC frame that stores the full date. Mp3agic could smooth this over by supporting a normalized representation of the data which doesn't care about encoding and does its best to migrate data into a canonical internal format, but it doesn't.
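As a rough illustration of what that kind of normalization could look like, here is a small Python sketch. It is not Mp3agic's API; it assumes the frames have already been decoded into a plain dict of frame ID to string, and the function name `canonical_frames` is hypothetical. It folds XSOP into TSOP and builds a TDRC value from the 2.3 date frames, relying on the 2.3 formats of TYER as YYYY, TDAT as DDMM and TIME as HHMM.

```python
def canonical_frames(frames: dict) -> dict:
    """Return a copy of frames with ID3v2.3-style IDs mapped to 2.4 ones."""
    out = dict(frames)

    # XSOP is the unofficial 2.3 stand-in for TSOP; prefer a real TSOP.
    if "XSOP" in out:
        xsop = out.pop("XSOP")
        out.setdefault("TSOP", xsop)

    # Build TDRC from the legacy date frames, but only if we have a year
    # and there isn't already a TDRC to trust.
    year = out.get("TYER")
    if year and "TDRC" not in out:
        stamp = year
        date = out.get("TDAT")
        time = out.get("TIME")
        if date and len(date) == 4:
            stamp += f"-{date[2:]}-{date[:2]}"  # DDMM -> -MM-DD
            if time and len(time) == 4:
                stamp += f"T{time[:2]}:{time[2:]}"  # HHMM -> THH:MM
        out["TDRC"] = stamp
        for legacy in ("TYER", "TDAT", "TIME"):
            out.pop(legacy, None)

    return out


v23 = {"TIT2": "Help!", "XSOP": "Beatles", "TYER": "1965", "TDAT": "0608"}
v24 = {"TIT2": "Help!", "TSOP": "Beatles", "TDRC": "1965-08-06"}
print(canonical_frames(v23) == canonical_frames(v24))
```

With both files run through the same function, the two tags above compare equal even though they were written against different versions of the spec.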

Monday, November 12, 2012
