This is basically an information dump to help me think about how to do this. Feel free to comment or suggest
My previous post regarding the skipping of folders is related. That snippet of code was for my mede8er utility that I wrote. You can read what it does on its download page.
It’s nifty in that you can at least update the jukebox databases outside of using the mede8er itself - very handy if you’re storing your media on a NAS-type system.
Since I don’t have a lot of media to process, it’s no skin off my back to let the utility run its course from scratch every time I add/remove media from the storage, however a fellow that has been testing my utility (and pointed out some issues that I resolved for him), has a MASSIVE amount of media - he sent me a copy of just his xml data that I could test my utility against, and it extracted to more than 128mb. He tells me it takes around 20 minutes for my utility to generate his databases. Ow!
My next planned step for my utility is to add the ability to update existing databases rather than creating them from scratch every time, I’m hoping this will be much faster than creating it from scratch. My stumbling block right now is how I would go about achieving this. Let me explain:
The time consuming bit (I think) is not the actual creation of the databases, it is the processing of the XML files.
##What it does now
As you may have seen with my previous post, I scan a specified path for XML files and add them to a list. I then process the list of XML files for the data that is required for the database, which is kind of bizarre if you ask me.
The database contains a single table with the following fields:
HashCode
,Genre
,Year
,Rating
,Folder
, Timestamp
Hashcode
is an MD5 hash of Folder
.
Genre
contains at most 3 genres pulled from the XML.
Year
contains the movie year pulled from the XML
Rating
contains a calculated rating with source from XML
Folder
is the relative path from the database to where the media is located
Timestamp
is the unix timestamp of the XML file’s last write time.
I have a function called populateRow
that does the “heavy” work of calculating the above field data per file.
Prior to my recent update to handle the folder ignores, I would directly add the populateRow
data to list of string arrays I would subsequently use to populate the actual SQLite database, however I added some additional error checking to the XML.load function within my populateRow function, and moved that to the first task in the function so that it could error out on invalid XML and skip the file. So now I’ve created a 6 value array called ‘value’ that receives the data, checks if there is an error value and then skips the file, or adds the value to my mede8erdb
list.
value = populateRow(xmlfile, JukeboxPath);
Once the mede8erdb list has been fully populated, I then go on to create the actual database, using the list of string arrays. In that function, I open the Mede8er.db file, drop the table and re-create it, if it exists (freshest data!), and then populate the database.
You may be wondering why I populateRow
to a list of arrays rather than straight in to the database – I figured that it might be faster to get that data in to memory first and then write it a bit later, rather than have the read and write IO all happening at the same time, at the cost of memory, which I’m sure for a huge media collection, may be large. I’m not sure what the best approach would be. Might have to change it to writing directly as it will make more sense with updating instead of creating from scratch.
##Potential Changes
What I’ll probably need to do is to change the process - once I have the list of XML files, I’ll have to obtain the unix timestamp, and the relative folder. Do a select from the database using the relative folder and the timestamp. If there’s a hit, discard and go to next. If no hit, then process and insert. If a hit on name, but timestamp differs, process and update.
My only concern is that this process ends up being slower than creating the database from scratch, but I guess I won’t know until I’ve actually branched the project, made the changes and tested it.