Sunday, November 11, 2012

Fixing my music... part 1

I've been meaning to spend some time organizing my home media.  My ripped movies are already pretty well organized, and I'm happy to rely on XBMC to maintain the metadata for them.  However, my music files are a total disaster.  There's an order of magnitude more of them, and they carry their own metadata inside each file.  There's a ton of duplication, with the same music showing up in various states of file name format, metadata completeness, upgraded audio and so on.

I've finally decided to tackle it.  I have almost 17k MP3 files copied to my desktop machine, basically by taking every collection I could find in various places and dumping them all together.  There's easily going to be 66% duplication in there, because I copied wholesale both my main storage and the results of previous attempts at organization.  Those include a big pass with MusicBrainz Picard that only got half done, as well as the results of pushing all my music to the Amazon cloud when they said it was going to be free, and then pulling it all back down when they changed their minds.  It may go back up, but only after I've curated the hell out of it.  The plan:
  1. Analysis
  2. Eliminate duplication
  3. Eliminate bad files
  4. Bring all files up to a minimum standard regarding tagging
  5. Move to some form of master storage  
  6. Create a standard mechanism for preventing re-duplication
Today is step 1, and hopefully some work on steps 2 and 3.  Right now I'm gathering the data I'll need to dedupe the files.  Because of my earlier efforts to improve the tagging, I can't rely on duplicates actually being byte-for-byte identical files.  To deal with that, my analysis goes through all 17k files and produces two hashes for each one.  The first is an MD5 sum of the whole file, which catches the low-hanging fruit, duplicate-wise.  The second comes from parsing the file with a modified version of mp3agic so that I can identify the actual MP3 audio frames and produce an MD5 hash of those specifically.  I'm also looking at the MP3 ID3v2 comment field, where apparently some songs have an Amazon ID stored (presumably if I've purchased them from Amazon, but possibly if I've simply stored them there and they've been upgraded).
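
To make the two-hash idea concrete, here's a rough sketch in Python.  It is not my actual analysis code (that one uses a modified mp3agic), and it cheats: rather than walking the MP3 frames, it just skips a leading ID3v2 tag and a trailing ID3v1 tag and hashes whatever is left.  The function names and the `file_md5` / `audio_md5` keys are made up for this illustration.

```python
import hashlib


def synchsafe_to_int(b: bytes) -> int:
    # ID3v2 tag sizes use only the low 7 bits of each of the 4 size bytes.
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def audio_bytes(data: bytes) -> bytes:
    """Return the file contents with ID3v2 (front) and ID3v1 (back) tags removed.

    Simplification: this does not parse individual MP3 frames, so other
    kinds of tags or junk around the audio are not handled.
    """
    start, end = 0, len(data)
    # Leading ID3v2 tag: 10-byte header followed by the declared tag size.
    if len(data) >= 10 and data[:3] == b"ID3":
        start = 10 + synchsafe_to_int(data[6:10])
        if data[5] & 0x10:  # ID3v2.4 "footer present" flag adds 10 more bytes
            start += 10
    # Trailing ID3v1 tag: the last 128 bytes, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def fingerprints(data: bytes) -> dict:
    # Whole-file hash catches exact copies; audio-only hash catches retagged copies.
    return {
        "file_md5": hashlib.md5(data).hexdigest(),
        "audio_md5": hashlib.md5(audio_bytes(data)).hexdigest(),
    }
```

Two copies of a song that differ only in their tags should come out with different `file_md5` values but the same `audio_md5`, which is exactly the case the whole-file hash misses.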

Well, the analysis step just finished and has produced a 5 MB JSON file with the salient details.  I'll start working on identifying the files I can junk, and the files I need to curate.
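
For the next step, the basic move is to group files by audio hash and look at any group with more than one member.  A minimal sketch, assuming each record in the JSON looks something like `{"path": ..., "audio_md5": ...}` (the field names and the `duplicate_groups` helper are hypothetical, and the real file has more in it than this):

```python
from collections import defaultdict


def duplicate_groups(records, key="audio_md5"):
    """Map each hash value to the paths sharing it, keeping only real duplicates."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["path"])
    # A hash seen only once is not a duplicate, so drop those groups.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Running it once with the whole-file hash as the key and once with the audio hash would separate the exact copies I can junk outright from the retagged copies I need to look at before choosing one to keep.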
