Wednesday, August 01, 2007

How To: Simple Query-able Compression (No need to decompress to read file)

Do you want to compress your data without having to decompress it every time you read it? Most people think of RAR or ZIP compression when they need to save space on the hard drive. That saves storage and I/O load, but the side effect is that you must decompress the file every time you need to access it. The following techniques compress your data without requiring decompression to read it.

Normalization: Compression is a natural byproduct of normalizing your data (please see the articles below on normalization and modeling). Normalizing removes redundant data, so it is an effective, non-destructive way to compress your data into a query-enabled format.
Binary Conversion: Converting a string-formatted file into a binary-formatted file is another natural way to compress your data. Storing the 10-character string value “1002000032” as a 4-byte integer saves 6 bytes. A strongly typed binary file can also be trusted and read by other business processes without any string conversion.
Hashing Long String Values: Hashing long string values into a binary hash value, and placing each string and its corresponding hash value into a lookup table, is another natural way to compress your data. URL links are common storage hogs. Reducing a 255-byte URL string to a 64-bit hash can save a lot of space if that URL string occurs multiple times within the file, because the full string is stored only once in the lookup table. (NOTE: Make sure you select an appropriate hashing algorithm and hash bit length to reduce your odds of collisions.)
Roll Ups (Aggregation): By recording each unique row only once, with an aggregate count of how many times it occurred within a unit of time, you can reduce the amount of data being recorded. (Example: John Doe hit your website home page 3 times in 1 hour. In the log there would be one record with an aggregate count value of 3.) This is destructive to your data set, because you lose the ability to retrace a user’s event path. But this may be a minimal and acceptable loss of data depending on your business.
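
The techniques above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not a definitive implementation: the sample rows, the record layout, and the helper names (`url_hash64`, `compress`, `query_user`) are hypothetical, and truncating SHA-256 to 64 bits is just one possible choice of hash.

```python
import hashlib
import struct
from collections import Counter

# Hypothetical raw log rows, all strings: (user id, URL, hour bucket YYYYMMDDHH).
RAW_LOG = [
    ("1002000032", "http://example.com/home", "2007080110"),
    ("1002000032", "http://example.com/home", "2007080110"),
    ("1002000032", "http://example.com/home", "2007080110"),
    ("1002000047", "http://example.com/products/list?page=2", "2007080110"),
]

# Assumed fixed-width layout: user id (uint32), URL hash (uint64),
# hour bucket (uint32), hit count (uint16). "<" means little-endian, no padding.
RECORD = struct.Struct("<IQIH")


def url_hash64(url: str) -> int:
    """64-bit hash of a URL: the first 8 bytes of its SHA-256 digest (an assumption)."""
    return int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest()[:8], "big")


def compress(rows):
    """Return (packed binary records, hash -> URL lookup table)."""
    url_lookup = {}   # normalization: each long URL string is stored once
    hits = Counter()  # roll-up: (user, url hash, hour) -> number of hits
    for user, url, hour in rows:
        h = url_hash64(url)
        # setdefault stores the URL on first sight, then returns the stored value.
        if url_lookup.setdefault(h, url) != url:
            raise ValueError("hash collision between two different URLs")
        hits[(int(user), h, int(hour))] += 1  # binary conversion: text -> int
    records = b"".join(
        RECORD.pack(user, h, hour, count) for (user, h, hour), count in hits.items()
    )
    return records, url_lookup


def query_user(records: bytes, user_id: int):
    """Scan the packed records for one user; nothing is decompressed."""
    return [rec for rec in RECORD.iter_unpack(records) if rec[0] == user_id]


if __name__ == "__main__":
    records, url_lookup = compress(RAW_LOG)
    for user, h, hour, count in query_user(records, 1002000032):
        print(user, url_lookup[h], hour, count)
```

With this layout every record is exactly 18 bytes, so the four sample rows collapse into two records plus a two-entry URL lookup table. Because the records are fixed-width, a reader can scan or seek through them directly, which is what keeps the data query-able.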

Or Get Up To 36x Compression
Using one or more of the above suggestions typically yields a 2x-6x compression ratio. The suggestions are extremely valuable even if you don’t care about having query-enabled compression. If you then add RAR on top of it all, you can gain roughly another 6x, which gives you between 12x and 36x compression overall, though the RAR file has to be decompressed again before you can read it. Not bad for saving space, eh?
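
The 12x-36x figure comes from multiplying the two stages together. A minimal sketch of that arithmetic, using made-up byte counts chosen only to reproduce the ratios quoted above (the function name `overall_ratio` is hypothetical):

```python
def overall_ratio(raw_size, queryable_size, archived_size):
    """Return (stage 1 ratio, stage 2 ratio, combined ratio) for three file sizes."""
    first = raw_size / queryable_size        # normalization, binary, hashing, roll-ups
    second = queryable_size / archived_size  # general-purpose archiver on top
    return first, second, first * second


# Hypothetical sizes in bytes, picked to match the article's low and high ends.
print(overall_ratio(1200, 600, 100))  # (2.0, 6.0, 12.0)
print(overall_ratio(3600, 600, 100))  # (6.0, 6.0, 36.0)
```

The combined ratio is simply the raw size divided by the archived size, so the two stages always multiply rather than add.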
