General Logic (Updating a database instead of creating from scratch)

This is basically an information dump to help me think about how to do this. Feel free to comment or suggest

My previous post regarding the skipping of folders is related. That snippet of code was for a mede8er utility I wrote. You can read what it does on its download page.

It’s nifty in that you can at least update the jukebox databases outside of using the mede8er itself - very handy if you’re storing your media on a NAS-type system.

Since I don't have a lot of media to process, it's no skin off my back to let the utility run its course from scratch every time I add/remove media from the storage. However, a fellow who has been testing my utility (and pointed out some issues that I resolved for him) has a MASSIVE amount of media. He sent me a copy of just his XML data so I could test my utility against it, and it extracted to more than 128 MB. He tells me it takes around 20 minutes for my utility to generate his databases. Ow!

My next planned step for my utility is to add the ability to update existing databases rather than creating them from scratch every time; I'm hoping this will be much faster. My stumbling block right now is how I would go about achieving this. Let me explain:
The time-consuming bit (I think) is not the actual creation of the databases, it is the processing of the XML files.

## What it does now
As you may have seen in my previous post, I scan a specified path for XML files and add them to a list. I then process the list of XML files for the data that is required for the database, which is kind of bizarre if you ask me.

The database contains a single table with the following fields:
HashCode, Genre, Year, Rating, Folder, Timestamp

HashCode is an MD5 hash of Folder.
Genre contains at most three genres pulled from the XML.
Year contains the movie year pulled from the XML.
Rating contains a rating calculated from the XML source.
Folder is the relative path from the database to where the media is located.
Timestamp is the Unix timestamp of the XML file's last write time.
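
For reference, a minimal sketch of how such a table could be created through System.Data.SQLite (the table name, column types and connection handling are illustrative assumptions, not necessarily what the utility does):

private static void CreateTable(SQLiteConnection conn)
{
    // Assumes: using System.Data.SQLite;
    // "Mede8er" and the TEXT/INTEGER column types are illustrative only.
    using (SQLiteCommand cmd = conn.CreateCommand())
    {
        cmd.CommandText =
            @"CREATE TABLE IF NOT EXISTS Mede8er (
                  HashCode  TEXT,
                  Genre     TEXT,
                  Year      TEXT,
                  Rating    TEXT,
                  Folder    TEXT,
                  Timestamp INTEGER)";
        cmd.ExecuteNonQuery();
    }
}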

I have a function called populateRow that does the “heavy” work of calculating the above field data per file.

Prior to my recent update to handle the folder ignores, I would add the populateRow data directly to a list of string arrays that I would subsequently use to populate the actual SQLite database. However, I added some additional error checking around the XML load inside my populateRow function and moved that to the first task in the function, so that it could error out on invalid XML and skip the file. So now I've created a six-element array called 'value' that receives the data; I check whether it contains an error value and either skip the file or add the value to my mede8erdb list.

 value = populateRow(xmlfile, JukeboxPath);
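
In other words, the loop now looks roughly like this (xmlFiles, mede8erdb and the "ERROR" sentinel are illustrative names, not the exact implementation):

foreach (string xmlfile in xmlFiles)
{
    // populateRow is assumed to return a 6-element string array and to put a
    // sentinel value (here "ERROR") in the first slot when the XML fails to load.
    string[] value = populateRow(xmlfile, JukeboxPath);
    if (value[0] == "ERROR")
        continue;               // invalid XML, skip this file

    mede8erdb.Add(value);       // buffer the row for the database pass
}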

Once the mede8erdb list has been fully populated, I then go on to create the actual database using the list of string arrays. In that function, I open the Mede8er.db file, drop and re-create the table if it exists (freshest data!), and then populate the database.

You may be wondering why I populateRow to a list of arrays rather than straight into the database. I figured it might be faster to get that data into memory first and write it out a bit later, rather than have the read and write I/O all happening at the same time, at the cost of memory, which for a huge media collection may be significant. I'm not sure what the best approach would be. I might have to change it to write directly, as that will make more sense when updating instead of creating from scratch.
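
Either way, wrapping the inserts in a single transaction should keep the write cost down. A sketch of the buffered variant, again assuming System.Data.SQLite and the illustrative names used above:

using (SQLiteTransaction tx = conn.BeginTransaction())
using (SQLiteCommand cmd = conn.CreateCommand())
{
    cmd.CommandText =
        "INSERT INTO Mede8er (HashCode, Genre, Year, Rating, Folder, Timestamp) " +
        "VALUES (@hash, @genre, @year, @rating, @folder, @ts)";

    foreach (string[] row in mede8erdb)
    {
        cmd.Parameters.Clear();
        cmd.Parameters.AddWithValue("@hash",   row[0]);
        cmd.Parameters.AddWithValue("@genre",  row[1]);
        cmd.Parameters.AddWithValue("@year",   row[2]);
        cmd.Parameters.AddWithValue("@rating", row[3]);
        cmd.Parameters.AddWithValue("@folder", row[4]);
        cmd.Parameters.AddWithValue("@ts",     row[5]);
        cmd.ExecuteNonQuery();
    }

    tx.Commit();    // one commit, so SQLite flushes its pages once
}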

## Potential Changes
What I'll probably need to do is change the process: once I have the list of XML files, I'll obtain the Unix timestamp and the relative folder for each, then do a select from the database using the relative folder and the timestamp. If there's a hit, discard and move on to the next file. If there's no hit, process and insert. If there's a hit on the folder but the timestamp differs, process and update.
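
A rough sketch of that per-file decision (GetUnixTimestamp and GetRelativeFolder are hypothetical helpers standing in for whatever the utility already does; the SQL is illustrative):

// Decide per XML file: skip, insert, or update.
long ts = GetUnixTimestamp(File.GetLastWriteTimeUtc(xmlfile));   // hypothetical helper
string folder = GetRelativeFolder(xmlfile, JukeboxPath);         // hypothetical helper

using (SQLiteCommand cmd = conn.CreateCommand())
{
    cmd.CommandText = "SELECT Timestamp FROM Mede8er WHERE Folder = @folder";
    cmd.Parameters.AddWithValue("@folder", folder);
    object existing = cmd.ExecuteScalar();

    if (existing == null)
    {
        // no hit: process the XML and INSERT a new row
    }
    else if (Convert.ToInt64(existing) != ts)
    {
        // hit, but the timestamp differs: process the XML and UPDATE the row
    }
    // else: hit with a matching timestamp, skip the file entirely
}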

My only concern is that this process ends up being slower than creating the database from scratch, but I guess I won’t know until I’ve actually branched the project, made the changes and tested it.

At least typing this out has managed to straighten my head out… I know what I need to do. If anybody has any suggestions though, I’m open for it!

The biggest speed impact, IF you are doing a lot of string manipulation, is to use StringBuilder rather than String. Strings are immutable, so "changing" one always creates a new one in memory… that might make a difference, but it depends on how much you are doing…

There is a StringBuilder used in the MD5 hash generation function. I stole this function from somewhere… probably Stack Overflow.

private string CalculateMD5Hash(string input)
{
    // requires: using System.Security.Cryptography; using System.Text;

    // step 1, calculate MD5 hash from input
    MD5 md5 = System.Security.Cryptography.MD5.Create();
    byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
    byte[] hash = md5.ComputeHash(inputBytes);

    // step 2, convert byte array to hex string
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hash.Length; i++)
    {
        sb.Append(hash[i].ToString("x2"));
    }
    return sb.ToString();
}

If I change the function to not use StringBuilder, it still works fine. I don't get why it was used in the first place.
(I wrote a quick console app to compare the output of the two functions; they produced identical results.)

private static string CalculateMD5Hash(string input)
{
    // step 1, calculate MD5 hash from input
    MD5 md5 = System.Security.Cryptography.MD5.Create();
    byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
    byte[] hash = md5.ComputeHash(inputBytes);

    // step 2, convert byte array to hex string
    string sb = "";

    for (int i = 0; i < hash.Length; i++)
    {
        sb += hash[i].ToString("x2");
    }
    return sb;
}

I see the difference between String and StringBuilder is that String is immutable. Any modification to it creates a new instance in memory, whereas StringBuilder modifies the existing buffer. I think the memory difference here is not an issue; it's only working with tiny strings.

Does skipping the directories you have already scanned increase the speed that much? You're still accessing the directory to see if you have processed it already, and what if a new file is in the directory?

Vorty made a really cool multithreaded find-in-files to replace the useless Windows file search. It works really fast on a whole hard drive. He used the same technique to generate a pie chart of drive and folder sizes; it's also very fast. I'm sure he would be happy to share his secrets.

A couple of opinions:

  1. Do not use an MD5 hash; just use a normal hash with "var intValue = input.GetHashCode();", then use "var hexValue = intValue.ToString("X");" to get the string. Unless you have a good reason for the hash to be irreversible, there is no point in doing such a costly crypto hash. Even better, use the integer as your hash key and store it in the DB like that, rather than converting it to a string. Databases love integer keys for indexing performance. The CPU can compare keys in a single CPU op, rather than having to do relatively expensive string comparisons. (See the sketch after this list.)

  2. For small strings, there is no real performance difference between Strings and StringBuilder. However, see comment (1) to completely avoid the need for it.

  3. Parsing a 128 MB XML string is just evil. With XML, the parser must constantly find the matching end tags, which are super-expensive string operations. Doing that over such a big string is not a good idea. It is not clear to me from your description whether you have a way of avoiding the big XML file, but if you can avoid it, I would recommend it.

  4. Working directly with the DB possibly means more I/O, but the operating system can often buffer that for you, so there is a good chance it will not have a significant performance impact. I think you should at least try the approach of doing database delta updates based on file timestamps. You could also store the begin timestamp of your last run somewhere, so that you know which files you do not even have to look up in the database, since they are older than the start of your previous run. In general, performance comes from completely avoiding work.

  5. SQLite uses B-trees for the database, so there is no benefit to writing everything in one go, as there would have been for, say, writing a sequential (flat) file.
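
For (1), a minimal sketch of the integer-key idea. One caveat that is mine, not the original poster's: String.GetHashCode is not guaranteed to be stable across .NET versions, 32/64-bit processes or hash-randomization settings, so persisting the value in the database assumes the same runtime configuration on every run.

int intValue = input.GetHashCode();           // cheap, non-cryptographic hash
string hexValue = intValue.ToString("X");     // only if a string form is still wanted

// Better still, per the post: store intValue in an INTEGER column and query on it,
// e.g. "SELECT Timestamp FROM Mede8er WHERE HashCode = @key", so lookups compare
// single integers instead of strings.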

No, but it will avoid the overhead of loading and reading through the XML file and making calculations when I don't have to. Also, I don't care if any files are added to the folder; all I care about is the XML file and whether THAT has changed. The chances of the actual XML file changing are slim to none. Most likely new folders/XML files will be added. Bugger. It occurs to me that I have not considered folders being removed. ARGH!

So, once I have listed all the XML files, I'll need to grab the folder names & timestamps, then somehow figure out what exists in the database but not in the file list. Bleh.
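
One way to handle the removals, sketched under the assumption that the scan produces the same relative folder strings that are stored in the Folder column (GetRelativeFolder is the same hypothetical helper as above):

// Requires: using System.Collections.Generic; using System.Linq;
var liveFolders = new HashSet<string>(
    scannedXmlFiles.Select(f => GetRelativeFolder(f, JukeboxPath)));

// Collect every folder in the database that no longer exists on disk...
var toDelete = new List<string>();
using (SQLiteCommand cmd = conn.CreateCommand())
{
    cmd.CommandText = "SELECT Folder FROM Mede8er";
    using (SQLiteDataReader reader = cmd.ExecuteReader())
        while (reader.Read())
            if (!liveFolders.Contains(reader.GetString(0)))
                toDelete.Add(reader.GetString(0));
}

// ...then delete those rows.
foreach (string folder in toDelete)
{
    using (SQLiteCommand cmd = conn.CreateCommand())
    {
        cmd.CommandText = "DELETE FROM Mede8er WHERE Folder = @folder";
        cmd.Parameters.AddWithValue("@folder", folder);
        cmd.ExecuteNonQuery();
    }
}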

http://msdn.microsoft.com/en-us/library/ff477033(v=vs.110).aspx

  • I did not design the database structure. The data as it is, is required by the Mede8er device.
  • Thanks, I have already removed it. See my earlier post where I compared the output of both functions with and without stringbuilder. :wink:
  • You misunderstand - there's a whole bunch of XML files, one per media file. There are so many that they work out to more than 128 MB… it's not a 128 MB XML string. Ouch.
  • Yes, I've branched my code and I'm changing it to do on-the-fly updates rather than buffering. Will see how it works out. I just have to figure out how to handle the removals, which didn't occur to me earlier. Doh.

You should be able to use GetLastWriteTime to check if a directory was changed and only look for deleted files/folders in the database for directories that were modified since your last sweep.
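
Something along these lines, assuming the utility stores the UTC time of its previous run somewhere (lastRunUtc below is an assumed variable):

// Requires: using System.IO;
// Only directories modified since the previous run need to be checked
// for deleted XML files; lastRunUtc is assumed to be persisted by the utility.
foreach (string dir in Directory.EnumerateDirectories(
             JukeboxPath, "*", SearchOption.AllDirectories))
{
    if (Directory.GetLastWriteTimeUtc(dir) > lastRunUtc)
    {
        // reconcile this directory's contents against the database rows
    }
}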

I completely re-wrote how my utility does things, and there has been a significant performance boost.

For a full run from scratch:
It scanned 6,601 files and 3,111 folders consisting of 189 MB of XML data in total. By properly processing the ignore files that are in place, the utility retrieved 1,663 legitimate XML files to process. The file scan took 19 seconds.
Processing the 1,663 XML files and building the database took 39 seconds.

Me happy! :heart_eyes_cat:


Are you saying you went from 20 odd minutes to 1 minute?

I didn't benchmark the old version against the same data. It is apparently the same data set that took the user 20 minutes with the old version, but I don't know that for sure. I also think he is running it against a Samba share, which may slow things down considerably.

I’ve only just uploaded my new utility, it may take a few days before (if) I hear from the bloke again.