A Discussion of Compression
We have probably all come across PKZIP, and possibly other compression programs, but how does compression actually work? The science of compression is complex, sophisticated and rather exciting to study, but for this article we will stick to the basic principles. Adding decompression to various MM/PC handlers and routines has contributed significantly to the following discussion.
The idea of compression is to store a given amount of data using fewer bytes. There are many ways that this can be done, and often there is a trade-off between speed, efficiency and complexity. The basic concept behind all compression is to replace repeating characters or strings with fewer characters. A simple example is to replace multiple spaces with codes that are not otherwise used within a document. This is quick and quite useful for a text document, but it does not work for files that can contain all 256 possible byte values, because no code is left unused. The next development is to replace a run of any repeating character. This method usually requires a specific code to indicate that a run is about to start, followed by the actual character and the number of times it occurs. Thus any run of a character from 4 to over 250 can be represented by just three bytes. This is very simple and fast but is only useful for certain types of data.
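The three-byte scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine used in any particular product: the marker value 0x90, the minimum run of 4 and the names rle_encode and rle_decode are all choices made for this example. A marker byte that occurs in the data itself is always written as a run, so the decoder never confuses it with a real marker.

```python
ESC = 0x90      # assumed marker byte; any value works if both sides agree
MIN_RUN = 4     # runs shorter than this are cheaper left as plain bytes
MAX_RUN = 255   # the repeat count has to fit in a single byte


def rle_encode(data: bytes) -> bytes:
    """Replace each long run with three bytes: marker, value, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Measure how far the current byte repeats, up to MAX_RUN.
        while i + run < len(data) and data[i + run] == value and run < MAX_RUN:
            run += 1
        if run >= MIN_RUN or value == ESC:
            # Long runs, and every literal marker byte, use the 3-byte form.
            out += bytes([ESC, value, run])
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)


def rle_decode(packed: bytes) -> bytes:
    """Reverse rle_encode by expanding every marker triple."""
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] == ESC:
            value, run = packed[i + 1], packed[i + 2]
            out += bytes([value]) * run
            i += 3
        else:
            out.append(packed[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    sample = b"Name:" + b" " * 40 + b"Total"
    packed = rle_encode(sample)
    print(len(sample), "bytes in,", len(packed), "bytes out")
```

The cost of the scheme shows up when the marker byte appears on its own in the data: it then takes three bytes instead of one, which is one reason this method only suits certain types of data.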
More effective programs all work in a similar way by analyzing patterns of data and seeing when they repeat. If a pattern is repeated, then a reference to the previous pattern is made rather than repeating the text. The reference is normally the location of the first occurrence and its length. To make this slightly clearer, consider the following line.
The rain in Spain falls mainly on the plain.
You will see that the string 'ain' occurs a few times, and that the 'he ' of 'The ' appears again in 'the '. Thus the line could be replaced by:
The rain in Sp(repeat) falls m(repeat)ly on t(repeat)pl(repeat).
As long as each repeat reference takes fewer than three characters to store, the line will be compressed. As the text gets longer, repeated strings become more common and the repeats themselves become longer.
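The repeat references in the example above can be sketched as follows. This is a deliberately naive illustration, not how PKZIP or any MM/PC routine is implemented: it searches the whole of the earlier text for every position, which is far too slow for real use. The token format (a single character, or a tuple of distance back and length) and the names compress and decompress are assumptions made for this sketch.

```python
MIN_MATCH = 3  # assumed threshold: shorter repeats are not worth a reference


def compress(text: str) -> list:
    """Turn text into literal characters and (distance, length) references."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Try every earlier starting point and keep the longest match.
        for start in range(i):
            length = 0
            while (i + length < len(text)
                   and start + length < i  # stay inside text already seen
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= MIN_MATCH:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def decompress(tokens: list) -> str:
    """Rebuild the text by copying each referenced span from the output so far."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            start = len(out) - dist
            out.extend(out[start:start + length])
        else:
            out.append(token)
    return "".join(out)


if __name__ == "__main__":
    line = "The rain in Spain falls mainly on the plain."
    tokens = compress(line)
    print(tokens)
    print(decompress(tokens) == line)
```

Practical routines replace the exhaustive search with some form of lookup table over recently seen strings, and pack each reference into as few bits as they can.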
There are two families of compression routines: one that is based on characters, and one that is based on bit patterns. Bit-pattern routines such as PKZIP generally offer compression that is not very content-dependent. By contrast, one character-based routine in a recent MM/PC development update could compress 4000 characters into 4, but it is very rare to get two identical strings of 4000 characters. A discussion of the ways that routines optimize for speed and efficiency is beyond the scope of this article, but it often involves fast-access history tables that keep track of the most commonly used strings or patterns.
Most modern tape drives contain hardware compression. This compression is invisible to the user, but will often allow 2 to 5 times the native capacity to be stored, depending on the type of data. Generally these routines are designed for speed rather than maximum compression, so it is unlikely that any tape compression will compress to the same level as PKZIP. It must be noted that a file that has already been compressed by an efficient routine cannot usefully be compressed further. Thus enabling hardware compression to back up zipped JPEG files will not increase the capacity of the tape. For text files, the capacity can easily increase fivefold, and for program files, a 50% increase is often possible.
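The point about already-compressed files is easy to check. The sketch below uses Python's zlib module as a software stand-in, since it implements the same DEFLATE method used in ZIP files; it says nothing about how any particular tape drive compresses. The word list, the sample_text helper and the sizes are invented for the example.

```python
import random
import zlib

# A small invented vocabulary, used only to generate compressible text.
WORDS = ["rain", "Spain", "falls", "mainly", "plain", "tape", "drive", "data",
         "backup", "file", "byte", "string", "pattern", "repeat", "code", "run"]


def sample_text(word_count: int = 20000, seed: int = 1) -> bytes:
    """Build repeatable pseudo-random text from the vocabulary above."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(word_count)).encode("ascii")


def ratio(original: bytes, packed: bytes) -> float:
    """Compression ratio: original size divided by packed size."""
    return len(original) / len(packed)


if __name__ == "__main__":
    text = sample_text()
    once = zlib.compress(text, 9)    # first pass over plain text
    twice = zlib.compress(once, 9)   # second pass over the compressed result
    print("text to compressed:       %.2f to 1" % ratio(text, once))
    print("compressed to compressed: %.2f to 1" % ratio(once, twice))
```

The first pass should shrink the text considerably, while the second pass should leave the size roughly unchanged, which is the behavior to expect when a drive's hardware compression is handed zipped files.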
Will compression speed up a tape backup? This is a complex question, and the answer depends heavily on the type of data being backed up and on the drive and hardware used. If a fast PC is being used with an older DAT drive, say DDS-1 or DDS-2, compression will help with both speed and capacity. However, on a modern fast drive such as DLT, compression will at times mean that the SCSI interface or file system cannot keep up with the amount of data the drive can accept. In this case the drive will start shoe-shining (repeatedly stopping, repositioning and restarting the tape) and may end up transferring data at a slower speed than it would uncompressed. The only real way to tell is careful timing with the data being used. Either way, the benefit of extra capacity will be had. As usual, the use of compression can be good as well as bad.
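The reasoning above comes down to simple arithmetic: a drive that writes compressed data at its native rate consumes uncompressed data from the host at roughly the native rate multiplied by the compression ratio. The sketch below is a simplification that ignores drive buffers and variable tape speeds, and every figure in it is hypothetical rather than taken from any drive's specification.

```python
def required_host_rate(native_mb_s: float, compression_ratio: float) -> float:
    """Uncompressed MB/s the host must supply to keep the tape moving."""
    return native_mb_s * compression_ratio


def can_stream(host_mb_s: float, native_mb_s: float,
               compression_ratio: float) -> bool:
    """True if the host can feed the drive fast enough to avoid stop-start."""
    return host_mb_s >= required_host_rate(native_mb_s, compression_ratio)


if __name__ == "__main__":
    HOST_MB_S = 4.0   # hypothetical sustained rate of the PC and interface
    RATIO = 2.0       # hypothetical 2 to 1 compression on this data
    # Native rates below are made-up round numbers for illustration.
    for name, native in (("older drive", 0.5), ("faster drive", 5.0)):
        needed = required_host_rate(native, RATIO)
        verdict = "streams" if can_stream(HOST_MB_S, native, RATIO) else "may shoe-shine"
        print("%s: needs %.1f MB/s from the host, %s" % (name, needed, verdict))
```

With these made-up numbers the slower drive is easily kept busy, while the faster one asks for more than the host can deliver. Timing a real backup remains the only reliable test.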
As a postscript, this article was written at the end of a 9-hour transatlantic flight. It is a shame that airlines have not managed to create a painless compression routine.
This article may be re-published as long as the following resource box is included at the end of the article and as long as you link to the email address and the URL mentioned in the resource box:
Article by eMag Solutions. For more articles on eDiscovery and Data Restoration, subscribe to our e-mail Newsletter by sending a blank email to newsletter@emaglink.com or by going to http://www.emaglink.com.