LZIP LZMA performance benchmarks with Tar archives

Lzip is a top choice for reliable, robust long-term archiving of data. For geoscientists such as myself, that robustness matters, and I recommend Lzip to HamSci citizen (and professional) scientists.

With any compression algorithm, the defaults are often not the best choice for very large datasets like those encountered in radio science, or geoscience in general.

Lzip options for large datasets

Here are the Lzip options I currently use for my terabyte-class datasets (with file extension .bin in this example).

  1. create a file MANIFEST with a list of files to archive. You can do this manually or with a find command, like:

     find . -name '*.bin' > MANIFEST
    
  2. create a checksum of the files

     nice md5sum $(< MANIFEST) > MD5SUM
    
  3. Zip up the files into filename.tar.lz

     tar cvf - $(< MANIFEST) | plzip -0 > filename.tar.lz
    

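To restore and verify such an archive later, a sketch like the following works (assuming GNU tar and coreutils, and filenames without spaces, as already implied by the $(< MANIFEST) expansion above):

     plzip -d -c filename.tar.lz | tar -xvf -
     md5sum -c MD5SUM
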
NOTE: if I have only a single huge file, I skip tar and compress it directly, which creates huge.bin.lz. Note the -k option, without which lzip DELETES the original file!!

     plzip -k -0 huge.bin

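The reverse operation is analogous: -d decompresses, and -k again keeps the input file.

     plzip -d -k huge.bin.lz
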
LZIP options

plzip is the multithreaded version of lzip. By default it uses all the virtual cores of your CPU, so compression runs roughly N times faster, where N is the number of physical CPU cores in your PC.

The -0 setting compresses these files down to 30-50 % of their original size while being as fast as possible. The benchmarks below show that spending greatly increased CPU time doesn't compress them much further.
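
If you want to leave some cores free for other work, plzip also takes a thread-count option, -n. A minimal sketch, with an arbitrary count of four threads:

     plzip -0 -n 4 -k huge.bin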

tar -I didn’t work for lzip

For some reason, on my PC, the -I 'lzip -0' option of tar doesn't have any effect: tar uses the -9 option of lzip regardless.
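
For reference, that invocation (GNU tar's -I / --use-compress-program option) looks something like this:

     tar -I 'lzip -0' -cvf filename.tar.lz $(< MANIFEST)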

Lzip benchmarks

For a 106.9 MByte 16-bit software-defined radio dataset (a short test file), I obtained the table below. It's immediately evident that for large, high-entropy files (noisy natural geoscience data), very low compression settings are appropriate. I found similar results for LZMA compression options on large datasets of geoscience auroral video during my Ph.D. thesis work.

It may be possible to squeeze out further improvements by tuning the dictionary size and match length options, if someone has an extremely large noisy-dataset compression problem (e.g. CERN).
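
I haven't benchmarked those knobs myself; lzip and plzip expose them as -m (match length) and -s (dictionary size in bytes). A hypothetical starting point, with arbitrary example values, might be:

     plzip -k -m 36 -s 67108864 huge.bin    # match length 36, 64 MiB dictionary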

Lzip level    Compression ratio    Time (seconds)
0             0.471                5.6
1             0.448                18.7
2             0.447                30.8
6             0.407                95.2
9             0.400                116.2

Compression of very noisy datasets

Why do high compression settings often give so little advantage on noisy geoscience datasets? At the most basic level, lossless compression is about finding redundancies in the files: self-similarities, autocorrelation, and the like. Nature is an incredibly powerful random number generator, the opposite of what compression algorithms need. In contrast to the high-SNR image and text data used by most of the populace, scientists, and geoscientists in particular, have instruments that use a very large dynamic range with high sensitivity. For the radio science and scientific camera domains (two areas of my expertise), this typically means 16-bit high-speed ADCs where, most of the time, several bits are uniformly zero and the rest of the bits are highly random, with a slowly changing bias value.

In practical terms, a trivial lossless compression algorithm eliminates those high bits that are so often zero, but even a very advanced lossless algorithm will struggle to gain much further compression per CPU cycle on typical remote sensing datasets.
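
A quick way to see the effect is to compare incompressible random bytes with trivially redundant zeros (this sketch assumes GNU coreutils and about 200 MB of free disk space):

     # random data: the .lz output stays close to 100 MB
     head -c 100M /dev/urandom > noisy.bin
     plzip -k -0 noisy.bin

     # constant data: the .lz output shrinks to a tiny fraction of that
     head -c 100M /dev/zero > zeros.bin
     plzip -k -0 zeros.bin

     ls -lh noisy.bin.lz zeros.bin.lz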
