Who had the bright idea to release the freedb.org flat file database in a directory structure that has about 10 directories to contain almost 1,000,000 (that's one million) files?
Honestly, what filesystems are up to managing those kinds of structures? NTFS? No way. HFS (OS X)? Nope. Ext3? Maybe. It worked better than the others. But none of the filesystems I work with are adept at listing or deleting those kinds of directories. It was silly, especially when I screwed things up and needed to delete everything and extract a clean database.
I've been working on a tool that parses the freedb data into a relational database. Up until yesterday I was working with the files, around 2 GB of data, directly at the file level. Finally, I got fed up.
Freedb.org distributes the data as bzip2 and RAR archives (compressed versions of a mammoth tarball). A few minutes of googling led me to a free bzip2 stream reader put out by Aftex SW. They derived it from code that came from the Apache Excalibur project. I looked around Apache.org for the original, but as is the case for most Apache projects, the project had become too complex for the kind of poking that I was doing. No problem though. I assumed that Aftex left the Apache license intact, which is fine with me.
The next thing I had to do was figure out how to read data out of a tarball. This ended up being a lot easier than I first thought. The process works something like this:
while (more bytes available)
{
    read 512 bytes into buf; this is the header.
    if (buf is all nulls)
        break; // end-of-archive marker.
    extract file name out of header.
    extract size of data out of header.
    int blocks = size of data / 512; // number of full blocks.
    int tail = size of data % 512;   // bytes used in the last, partial block.
    for (int i = 0; i < blocks; i++)
    {
        read 512 bytes into buf.
        append buf to data.
    }
    if (tail > 0)
    {
        // last (partial) block: tail bytes of data padded with nulls.
        read 512 bytes into buf.
        append bytes 0 through tail-1 of buf to data.
    }
    // go on to the next entry.
}
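For anyone who wants something closer to real code, here is a minimal Java sketch of that loop. It is not the code from my tool: the class and method names (TarWalk, readAll, field) are made up for illustration, it assumes the stream is already decompressed, and it assumes every entry is small enough to hold in memory. It relies on the standard tar header layout, where the name is the first 100 bytes and the size is 12 bytes of octal text at offset 124, and it uses the size from the header to trim the padding instead of scanning for nulls.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class TarWalk {
    static final int BLOCK = 512;

    // Reads every entry of an uncompressed tar stream into a name -> data map.
    static Map<String, byte[]> readAll(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        Map<String, byte[]> entries = new LinkedHashMap<>();
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without the usual trailing zero blocks
            }
            if (header[0] == 0) {
                break; // an empty name field is treated as the end-of-archive marker
            }
            String name = field(header, 0, 100);
            String sizeText = field(header, 124, 12).trim();
            int size = sizeText.isEmpty() ? 0 : Integer.parseInt(sizeText, 8);

            byte[] data = new byte[size];
            in.readFully(data);

            // Skip the null padding that rounds the data up to a full block.
            int padding = (BLOCK - size % BLOCK) % BLOCK;
            in.readFully(new byte[padding]);

            entries.put(name, data);
        }
        return entries;
    }

    // Returns the ASCII text of a header field, up to its first null byte.
    static String field(byte[] buf, int offset, int length) {
        int end = offset;
        while (end < offset + length && buf[end] != 0) {
            end++;
        }
        return new String(buf, offset, end - offset, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        // Hand-built archive: one header block, one data block, two zero blocks.
        byte[] tar = new byte[BLOCK * 4];
        byte[] name = "rock/abcd1234".getBytes(StandardCharsets.US_ASCII);
        byte[] size = "00000000005".getBytes(StandardCharsets.US_ASCII);
        byte[] body = "hello".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(name, 0, tar, 0, name.length);
        System.arraycopy(size, 0, tar, 124, size.length);
        System.arraycopy(body, 0, tar, BLOCK, body.length);

        for (Map.Entry<String, byte[]> e : readAll(new ByteArrayInputStream(tar)).entrySet()) {
            System.out.println(e.getKey() + " -> "
                    + new String(e.getValue(), StandardCharsets.US_ASCII));
        }
    }
}
```

Trusting the size field also means an empty entry (a directory, for example) does not swallow the header that follows it, and data that happens to contain a null byte is not cut short.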
It wasn't that bad. Of course, I was in a hurry, and I am sure there is something I missed. The hard part was figuring out how to compute the size, which the header stores as a string of ASCII octal digits. I've never parsed that before.
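The octal parsing turns out to be a short loop once you see it: multiply the running total by 8 and add the next digit, stopping at the null or space that ends the field. The sketch below is an illustration rather than my actual code, and the names (TarOctal, parseOctal) are invented for it. Java's Long.parseLong(text, 8) does the same job on a String; the loop just spells out what happens on the raw bytes.

```java
import java.nio.charset.StandardCharsets;

class TarOctal {
    // Parses an octal number stored as ASCII digits in buf[offset .. offset+length).
    // Leading spaces are skipped; parsing stops at the first null or space.
    static long parseOctal(byte[] buf, int offset, int length) {
        int i = offset;
        int end = offset + length;
        while (i < end && buf[i] == ' ') {
            i++; // skip leading padding
        }
        long value = 0;
        for (; i < end; i++) {
            byte b = buf[i];
            if (b == 0 || b == ' ') {
                break; // field terminator
            }
            if (b < '0' || b > '7') {
                throw new IllegalArgumentException("not an octal digit: " + (char) b);
            }
            value = value * 8 + (b - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        // A 12-byte size field: eleven octal digits followed by a null.
        byte[] field = "00000001750\0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parseOctal(field, 0, field.length)); // prints 1000
    }
}
```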
So now I have a reader that will read freedb records straight out of the bzip2 file (about 300 MB). There is a memory leak to hunt down. I think it is in the Aftex code, but I could be wrong. At this point, I can shove records into the database at a rate of about 16 per second. The bottleneck is not the parser, but the database insertion code (key constraint enforcement). I am working on that.