Lzip is a top choice for reliable, robust long-term archiving of data. As a geoscientist, I recommend it to HamSci citizen (and professional) scientists.
With any compression algorithm, the defaults are often not the best choice for very large datasets such as those encountered in radio science or geoscience in general.
Lzip options for large datasets
Here are the lzip options I currently use for my terabyte-class datasets (here with file extension .bin).
Create a file MANIFEST with a list of files to archive. You can do this manually or with a find command:
find . -type f -name '*.bin' > MANIFEST
Create a checksum of the files:
nice md5sum $(< MANIFEST) > MD5SUM
Zip up the files into filename.tar.lz:
tar cvf - $(< MANIFEST) | plzip -0 > filename.tar.lz
NOTE: if I have only a single huge file, I use the command below. Note the -k option, without which lzip DELETES the original file!
plzip -k -0 huge.big
plzip is the multithreaded version of lzip. It uses all the virtual cores of your CPU, so it can run up to roughly N times faster, where N is the number of physical CPU cores in your PC.
This compresses the files down to 30-50 % of their original size while being as fast as possible. The benchmark observations below show that greatly increased CPU time doesn’t help compress much more.
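The steps above stop at creating the archive. As a rough sketch of the reverse direction, the function below tests the archive, unpacks it, and re-checks the checksums. The name restore_archive is hypothetical (it is not part of lzip), and the sketch assumes the MD5SUM file from the checksum step sits in the directory where you unpack. It relies on the -t (test), -d (decompress) and -c (write to stdout) options of lzip and plzip.

```shell
# Hypothetical helper, not part of lzip.
# Usage: restore_archive ARCHIVE [COMPRESSOR]
# COMPRESSOR defaults to plzip; lzip takes the same -t, -d and -c options.
restore_archive() {
    archive=$1
    cmd=${2:-plzip}

    # 1. Test the integrity of the compressed stream; nothing is written.
    "$cmd" -t "$archive" || return 1

    # 2. Decompress to stdout and let tar unpack into the current directory.
    "$cmd" -dc "$archive" | tar xf - || return 1

    # 3. Assumption: MD5SUM (from the checksum step) is in this directory.
    #    md5sum -c re-hashes each listed file and fails on any mismatch.
    if [ -f MD5SUM ]; then
        md5sum -c MD5SUM
    fi
}
```

For example, restore_archive filename.tar.lz would be run from the directory where MANIFEST and MD5SUM were created. Piping through tar explicitly, rather than using a tar option to select the compressor, keeps it unambiguous which program and options are in use.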
tar -I didn’t work for lzip
For some reason, on my PC, the -I 'lzip -0' option of tar doesn’t have any effect: it uses the -9 option of lzip.
For a 106.9 MByte 16-bit software defined radio dataset (a short test file) I found the results in the table below. It’s immediately evident that for large, high-entropy data (noisy natural geoscience data), very low compression settings are appropriate. I found similar results for LZMA compression options on large datasets of geoscience auroral video during my Ph.D. thesis work.
It may be possible to eke out further improvements by tuning the dictionary size and match length options, if someone has an extremely large noisy dataset compression problem (e.g. CERN).
| Lzip level | Compression ratio | Time (seconds) |
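To build a table like this for your own data, a loop along the following lines may help. It is only a sketch: bench_levels is a hypothetical name, the ratio is taken to mean compressed size divided by original size (an assumption about how the column above was computed), and timing with date +%s only resolves whole seconds of wall-clock time, which is adequate for large files but not for small ones.

```shell
# Hypothetical helper, not part of lzip.
# Usage: bench_levels FILE [COMPRESSOR] [LEVEL...]
# COMPRESSOR defaults to lzip; LEVEL defaults to 0 through 9.
bench_levels() {
    file=$1
    cmd=${2:-lzip}
    if [ "$#" -gt 2 ]; then
        shift 2                        # remaining arguments are the levels
    else
        set -- 0 1 2 3 4 5 6 7 8 9     # lzip accepts levels -0 to -9
    fi

    orig_size=$(wc -c < "$file")
    printf 'level\tratio\tseconds\n'
    for level in "$@"; do
        start=$(date +%s)
        # -c writes to stdout, so the input file is left untouched.
        comp_size=$("$cmd" "-$level" -c "$file" | wc -c)
        end=$(date +%s)
        # Assumption: ratio = compressed size / original size.
        awk -v l="$level" -v c="$comp_size" -v o="$orig_size" \
            -v t="$((end - start))" \
            'BEGIN { printf "%s\t%.3f\t%d\n", l, c / o, t }'
    done
}
```

Running bench_levels huge.big lzip 0 3 6 9 on a representative slice of the dataset is usually enough to see where the extra CPU time stops paying off.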
Compression of very noisy datasets
Why is there often little advantage to high compression settings for noisy geoscience datasets? At the most basic level, lossless compression is about finding redundancies in the files: self-similarities, autocorrelation, and the like. Nature is an incredibly powerful random number generator, the opposite of what compression algorithms need. In contrast to the high-SNR image and text data used by most of the populace, scientists, and geoscientists in particular, have instruments with very large dynamic range and high sensitivity. For the radio science and scientific camera domains (two areas of my expertise), this typically means 16-bit high-speed ADCs where, most of the time, several of the high bits are uniformly zero and the rest of the bits are highly random, with a slowly changing bias value.
In practical terms, a trivial lossless compression algorithm eliminates those high bits that are so often zero, but even a very advanced lossless algorithm will have trouble extracting much further compression for the extra CPU cycles on typical remote sensing datasets.
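This can be tried on synthetic data. The sketch below compresses three inputs at a low and a high level: all zeros, fully random bytes, and random bytes whose top four bits are always zero. The last is a crude 8-bit stand-in for the 16-bit ADC samples described above, not a model of real instrument data. The names compressed_size and noise_demo are hypothetical, and the compressor is a parameter so that any tool taking -LEVEL and -c in the same way can be substituted.

```shell
# Hypothetical helpers, not part of lzip.
# Print the compressed size in bytes of stdin, using COMPRESSOR at -LEVEL.
compressed_size() {
    "$1" "-$2" -c | wc -c
}

# Usage: noise_demo [COMPRESSOR] [LOW_LEVEL] [HIGH_LEVEL] [SIZE_BYTES]
noise_demo() {
    cmd=${1:-lzip}
    low=${2:-0}
    high=${3:-9}
    size=${4:-1048576}
    for kind in zeros random low4bits; do
        for level in "$low" "$high"; do
            case $kind in
                zeros)    head -c "$size" /dev/zero ;;
                random)   head -c "$size" /dev/urandom ;;
                # Keep only bytes 0..15: top four bits zero, low four random.
                low4bits) LC_ALL=C tr -dc '\000-\017' < /dev/urandom |
                              head -c "$size" ;;
            esac | compressed_size "$cmd" "$level" |
                awk -v k="$kind" -v l="$level" \
                    '{ printf "%s\tlevel %s\t%d bytes\n", k, l, $1 }'
        done
    done
}
```

Calling noise_demo lzip 0 9 prints six lines. One would expect the zeros to shrink to almost nothing, the random bytes to stay close to their original size at either level, and the low4bits case to land somewhere near half the original size, since only four of every eight bits carry information. The exact numbers depend on the compressor and are worth checking on your own machine.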