From comp.compression.research Mon Jul 31 14:15:35 1995 Path: news.uwa.edu.au!classic.iinet.com.au!news.uoregon.edu!vixen.cso.uiuc.edu!howland.reston.ans.net!news.starnet.net!wupost!waikato!auckland.ac.nz!news From: gar@clark.net (George A. Resnick) Newsgroups: comp.compression.research Subject: LZRW1 without hashing. Date: 29 Jul 1995 03:53:53 GMT Organization: Clark Internet Services, Inc., Ellicott City, MD USA Lines: 1088 Sender: compression@cs.aukuni.ac.nz (comp.compression.research moderator) Approved: compression@cs.aukuni.ac.nz Message-ID: <3vcbch$6ur@net.auckland.ac.nz> Reply-To: gar@clark.net (George A. Resnick) NNTP-Posting-Host: cs26.cs.auckland.ac.nz Summary: described a new approach of finding longest matches for LZ compr. Keywords: compression, LZRW1, hashing, radix search X-Newsreader: NN version 6.5.0 #3 (NOV) Rev 0.1, 07-25-95 LZRW1 WITHOUT HASHING -=-=-=-=-=-=-=-=-=-=-=-=-=-=- (c) 1995 George A. Resnick ! N ! 1. The algorithm described in this article, source code of ! ! the LZRW1-# program, and all the ideas, that can be found ! O ! herein I wish to be in the public domain, and used freely, ! ! without restriction of any kind. ! T ! 2. In the event if the algorithm described here is not unique, ! ! or violates any existing, or filed before July 1, 1995 patents, ! E ! please let me know. ! ! 3. Any comments and/or questions concerning this message should ! S ! be sent to: George A. Resnick Internet: gar@clark.net, 5172-2 Brook Way gar@ua.net, gar@jam.kiev.ua Columbia, MD 21044 tel/fax: (410) 992-9824 0. Introduction. ~~~~~~~~~~~~~~~~~~ In April 1991, Ross Williams published a pretty simple and efficient data compression algorithm called LZRW1 [1]. The key feature of the LZRW1 is a fast scheme of finding longest matches, that utilizes hashing technique. This scheme uses only 4K words of memory, and allows to achieve reasonably good compression for text files. All these factors make LZRW1 an extremely attractive for various applications, especially for real-time systems. The only thing that prevents the use of LZRW1-style compressors now, is that LZRW1 and the whole family of LZ-compression algorithms with hashing appears to be patented. Not being able to describe here all the existing patents on data compression algorithms, validity of them, and any other related issues, I shall quote only two examples: 1) Gibson and Graybill patent [2], granted in Sep 1991, describes a data compression method based on hashing; 2) almost the same system invented in UK by J.Waterworth (1985), and patented in US in 1987 [3]. In this article I will briefly describe several existing methods of finding longest matches, and select an approach, suitable for real-time compression. The system I am presenting is (at my opinion) patent free and provides almost the same performance as LZRW1 (based on the Calgary Text Compression Corpus files [4]). Attached test program (file lzrw1-#.c) is implemented as series of #if-#else-#endif patches to the original lzrw1.c code so that it is very easy to understand changes in the algorithm. To make this article accessible to anybody who is interested in understanding of the real-time data compression technology, I added some introductory material in sections 1 and 2. For those, who already are familiar with LZ77 family of compressors and, in particular, with LZRW1 algorithm, I may suggest to start reading from the section 3. 1. LZ77 coding. ~~~~~~~~~~~~~~~~~~~ A typical LZ77 compressor maintains a sliding window of size N bytes and a lookahead buffer the contents of which it tries to find a match for in the window: |<--------------- Sliding window -----------------X-Lookahead buf.>| | | matching current position | | | | string in source buffer | match pattern | --------v--/---------------------------------------\-|---/---------------- xxx####[yabbaabb]######### TEXT ####################[yabbaabb]ooooooooo... --------|-------|------------------------------------|-------|------------ |<--------------- distance ------------------Xlength>| || || \===============\ /========/ ||| ----------------------------------------------\^/------------------ @@@@@@@@@@@@@@@@@@@@@@@@@@ CODE @@@@@@@@@@@@@[d,l]................. ----------------------------------------------|-\------------------ current position in | matching string destination buffer | codeword Fig. 1. The LZ77 coder. If the search is successful (this means, that the length of the matching string in window is larger than certain fixed value, usually 2 or 3), a code of the string, consisting of its "distance" and "length" parameters, is sent to the output buffer. Otherwise, a character from the current position in the input buffer is sent to the output buffer. This process should be repeated for the next characters. To separate codes of matching strings and codes of matching parameters additional bits are sent to the output buffer. 2. Parsing strategies and data structures for LZ77 coding. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The first problem that arised when we are trying to understand the logic of the string matcher for the LZ77 coding, is that there is more than one way to choose which strings in the input text shall be replaced by values. The most practical approach is called "greedy parsing", where at each step the compressor searches for the longest string that matches the next characters in text, and once it is found -- replaces it. Unfortunately, the greedy parsing is not optimal. A little bit better compression may be achieved by using so-called "lazy evaluation" technique. It was used in many file archivers like 'freeze' by L.Broukhis, 'comic' by Jan-Mark Wams, pkzip 1.1+ by P.Katz, etc. Lazy encoder uses two searches for longest matches: one to find string that matches characters starting from the current position in text; and second one -- to match characters starting from the current position + 1. Then, if the first match is longer or equal to the second one, lazy encoder replaces the first string; otherwise -- encodes a literal character and the second string. Lazy evaluation is, in fact, only an example of using of the heuristic rule called LFF (Longest Fragment First) to arrange the pseudo-optimal parsing. Although there are better pseudo-optimal and optimal parsing methods, they can improve the compression just by a few percents, but work two or three times slower than greedy encoding [5]. The second problem is that even if compressor uses greedy parsing, the operation of finding longest matches is still the most time-consuming task in the algorithm. Here we will look at some data structures that can perform the find operation efficiently. 2.1. Linear search. ~~~~~~~~~~~~~~~~~~~~~~ A trivial way to find the longest match is to perform a linear search of the text in the window. The advantage of this method is that it does not require any additional memory. The disadvantage is that it is very slow since the window of N characters requires at least N comparisons. Unfortunately, since we do not know the length of the matching string we are looking for, we cannot use fast string searching methods like Boyer-Moore algorithm, which invert the order of character comparisons. Other algorithms, like KMP (Knuth, Morris, Pratt) method can speed up searching only when the matching pattern is considerably long (>= 6 characters) and LZ compressible itself. A better way to speed up searching is to maintain a suitable index for the window using one of the following more sophisticated data structures. 2.2. Binary tree. ~~~~~~~~~~~~~~~~~~~ The use of binary trees for LZ parser was proposed by Tim Bell in 1986 [6]. Binary tree search was used, for example, in file archivers LARC and LHARC; another example is an algorithm in DriveSpace (a real-time disk compressor, bundled in MS-DOS 6.22 and Windows 95 systems). To maintain a search binary tree, it is necessary to store pointers to the left and right sons, and (if we are planning to dynamically reconstruct tree) a pointer to the parent node. x [ ] dad | [t] * ----------->[e]\ left / \ right [x] > phrase in the text son x x son [t]/ Fig. 2. A binary tree node. In addition, if the number of node is not in one-to-one correspondence with the position of phrase in text, the pointer to the phrase should be stored as well. Thus, we need to allocate from 3N to 4N words to represent a tree consisting of N nodes. Theoretically, a fully balanced binary tree shall reduce the number of comparisons to log(N). This usually works for natural text files. Nevertheless, when the input contains a lot of regularities, the tree can degenerate into a linked list, and special efforts should be taken to balance the tree. However, the main problem that should be solved before using binary search tree in LZ77 encoder is how to handle strings that leave sliding window. The standard method here is to delete each leaving string from the binary tree. Such deletion is performed by searching for the node to be deleted and then adjusting the tree around it. An alternative approach is to maintain several binary trees simultaneously. Each tree covers blocks of N characters, and these blocks overlap so that the characters in any window of size N are in no more than three trees. To find the longest match in the window, the first two trees must be searched (the overlapping ensures that third tree is useful only after the first is discarded). The better of the matches produced by the two trees is chosen [4]. Unfortunately, none of the methods above is applicable to real-time algorithms. Here we have to get the maximum speed, and such operations like node deletion, or scanning multiple trees can only kill this idea. A direct way to make binary tree search fast enough was used in the DriveSpace program. At the certain point, DriveSpace "freezes" the tree, and after each search it replaces the pointer to the closest string in the node by the pointer to current position in text. To make this idea working correctly, DriveSpace put some restrictions on strings to be added to the binary tree. Thus, for instance, DriveSpace compares only 3 first characters of each string (the LZ77 threshold), i.e. tree is used only to find the first matching string (not the longest one!), tree may contain strings that are no longer in the window, and after search DriveSpace ignores them, and so on. Of course, in such implementation binary tree loses such advantages as access to variable-length strings, possibility to find the longest match, etc. 2.3. Hash table. ~~~~~~~~~~~~~~~~~~ In standard LZ77 scheme there is a minimum length that a substring must be before it will be used as a match, typically about two or three characters. In this case a hash table could be used to index all strings with the same first two or three characters, and some other search could be used to check subsequent characters. A simple example of this technique is the LZRW1 algorithm, modified text of which is included in appendix A. To increase the probability that we will find the requested string, we may use chaining to resolve collisions (chains of string addresses with the same hash-indices). This means, that if the string stored in the hash-table does not match to the current one, we may keep looking for an exact match in the chain. To prevent possible slow down of the algorithm we could simply bound the number of solvable collisions. An example of using such technique could be found in the GZip package [7]. Another way(+) to avoid any hashing collisions is to have a slot for every possible pair (or three) characters. Practically, this will require either 256^2=64K slots (two characters), or 256^3=4M slots (three characters). Of course, if we are looking for just one of 4K strings in the window, the memory for the look-up table is used here quite inefficiently. A more compact approach to represent any existing n-characters' combinations in the window is to put them into the TRIE structure. --- + Athough this method was described in several works, published in 1989-1990, including, for instance, "Text Compression" [5], it was then successfully patented in Oct 1992 [8]. 2.4. TRIE. ~~~~~~~~~~~~~~ A trie, or digital search tree, is a multiway tree with a path from the root to a unique node for each string represented in the tree. For example, substrings of text "bcbacbabc" could be represented by the following trie: ,------*----. a / b | \ c * * * b / \ c a / \ c | b * * * * * b / \ c | a * * * b / \ c * * Fig. 3. An example of trie structure. Note that only the unique prefix of each string is stored, as the suffix can be determined by looking up the string in the window. A longest match is found by following down the tree until no match is found, or the path ends at a leaf. The most practical approach to implement the trie, is to maintain a hash table with an entry for each node. To determine the location of the child of the node at location n in the table for input character c, the hash function is supplied with both n and c. This algorithm was firstly used in Unix utility compress [9] by Thomas et al. (1985). A more sophisticated implementation of the trie (or, to be more precise -- Patricia tree) is used by the LZFG (+) algorithm by Fiala and Green [10]. There are several other representations of the trie structure without hashing. The fastest approach is to create an array of pointers for each node in the trie, with one pointer for each character. Unfortunately, this can waste considerable space, particularly if some characters are rarely used. For example, if we are maintaining a trie with nodes consisting of 256 pointers, and we are parsing English text (26 characters are used), then we will waste at least 90% ((256-26)/256*100) of memory. An alternative is to use a linked list at each node, with one item for each possible branch. Thus, the trie from the previous example will look like: ,-a---. / \ b b-----. \ / \ c a c / \ / b c b \ / c a / b \ c Fig. 4. Another representation of the trie structure. This uses memory economically, but can be significantly slower because of the time taken to search the list. In one particular case both representations above produce the same data structure: when the branching factor of the search tree (the radix of alphabet) is equal to 2. Search decisions here are made on a bit-by-bit basis, and as soon as the structure of the radix 2 tree will depend on the binary representation of strings inserted in it, rather than on the order in which they are inserted, this makes it suitable for maintaining index of adaptive dictionary. The discussion on how to make this idea working in context of LZ77 encoding is presented in the rest of this article. --- + LZFG algorithm and the structure it uses to maintain sliding dictionary also has been patented [11]. 3. Using radix 2 trees for LZ77 coding. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Although that radix 2 search was used in wide range of applications, including, information-retrieval systems, text data bases, etc., I have not seen any attempt to use it in dictionary compressors yet. One possible reason is that to replace string by the values it is necessary to determine exact match of several first characters, say, at least three, which means that we have to scan at least 3*8=24 nodes in the tree, which is very slow, of course. Another reason, is that the insertion of a character in the radix 2 tree may produce up to 8 new nodes, and if we present each node by pair of pointers, this will consume 2*8=16 words of memory, which is quite inefficient. So that, at the first look, it is not so obvious how we can use radix 2 trees, and what benefits we can get from it. To clarify these things a little bit, let us show how the radix-2 tree can be represented in computer's memory. 3.1. A representation of the radix 2 tree. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The simplest way to represent a radix 2 tree is to assign to each node two fields (left and right sons), that may contain either: 1) pointer to the next node (internal node), 2) pointer to the string (leaf node), or 3) a nil value (leaf node), which means that there is no string with given bit at current position. An exact diagram of all the combination of these values is given below: rson ptr nil +------+-----+-----+ lson | + | + | ? | +------+-----+-----+ ptr | + | + | + | +------+-----+-----+ nil | ? | + | - | +------+-----+-----+ where: + a possible combination - means an empty node, this combination is not allowed ? means one way branching: for most simple algorithms, it is allowed combination. Algorithm Patricia, however, avoids one way branching by keeping a counter that shows how may bits shall be skipped at each node. To determine whether the value is node/text pointer, we could either use an additional bit, or keep node/text pointers in different ranges. For example, a lzrw1-#.c code assumes that all text pointers are > 0, 0 means nil, and all the internal nodes are indexing from the end of the array, so that their relative addresses are < 0. The second question is how to allocate nodes of the tree. The most natural solution here, is to place left and right pointers of each node in adjacent cells of memory. An example or such organization is given in Fig.5. - - - - - - -\ * / \ 2 3 0 / \ 1 => [A|2|B|C] / \ 0 1 \ / A * - - - - - - -/ 0 / \ 1 B C Fig.5. Radix 2 tree representation. Here we need to allocate 4 cells, 3 of which contain pointers to strings inserted in the tree. Resulting coefficient of memory usage is 3/4=75%. An important practical property of such organization, is that at any point in the tree, it is possible to select the next node using an arithmetic addition of the bit which is responsible for the direction (see Fig.6). next_node = [..........] tree[current_node+ / [ left son ] <- current node address string_bit(i)]; \ [ right son] [..........] Fig.6. Selecting the next node. Thus we can scan the tree simply by using a series of arithmetic operations, and table lookups -- no comparisons, conditional jumps, loops, etc. In the lzrw1-#.c program the procedure of scanning radix 2 tree is implemented as an unrolled loop, which ensures the maximum speed of execution. 3.2. Putting the LZ77 dictionary into a radix 2 tree. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now, when the method of storing of the radix 2 tree in memory, and the algorithm of scanning it are determined, the only thing which remains to to describe to use it in LZ compressor, is how and when we have to insert new nodes to the tree. There are three main approaches here(+): strings are inserted on 1) per character-, 2) per phrase-, and 3) per no more than last K suffixes of phrase- basis. The first one provides the best compression, but it is slow, the second one - is fast, but gives worse compression, and the third -- is a compromise (selecting K we can tune speed/compression tradeoff). Because the LZRW1 algorithm itself uses the second approach, and we are trying to build an algorithm of competitive speed, we will use it too. Another problem is that, if we are trying to maintain a full radix 2 tree, that indexes all the strings inserted, it will be unrestrictedly expanding. To prevent this effect, and make the algorithm working with given amount of memory, we have to limit the maximum length of strings to be inserted, and/or the maximum number of nodes in the tree. For instance, a lzrw-#.c program, being compiled with #defined R2 variable, produces an algorithm that maintains a radix 2 tree of 8K nodes, and can hold in it up to 24-bits strings. It uses the following procedure of node insertion: /* now: q - index of the leaf node in the tree: p=tree[TREESIZE+q]; * s=p_src - pointer to current position in the text * p - pointer to the matching string, if any */ if (p<=0||(offset=s-p)>4095) /* is string in the LZ77 window */ goto update; if (PS || PS || PS) { /* PS is #defined here as (*s++!=*p++) */ /* s doesn't match p; * we have to add s into the tree */ if (lastnode<0) { /* is there space for a new node? */ tree[TREESIZE+q]=lastnode; q=lastnode+(k&1); tree[TREESIZE+q^1]=0; lastnode+=2; /* insert a new node */ } update: tree[TREESIZE+q]=p_src; /* put pointer to current string */ goto literal; } tree[TREESIZE+q]=p_src; /* replace old pointer */ PS || PS || PS ... /* ... and continue comparison */ Thus, an exactly matching string will be simply replaced by pointer to its newer copy in the text. In a case when strings are different, either a new node will be inserted in the tree, or (if the tree cannot be expanded any more) -- old string will be replaced by the current one. The tricky question here is that if we are restricting growth of the tree, this may produce collisions on leaves that were formed on some intermediate levels in the tree. This will obviously decrease the compression performance, but the worst thing is that this effect could be used to claim that the algorithm above is just a variation on the hashing theme. Fortunately, working with a sliding dictionary we need to keep only N strings in the tree simultaneously, and this allows us to use a fixed-size structure to one-to-one index them. Two possible ways to arrange this we have already discussed in section 2.2 for binary search trees. However, the nature of radix 2 tree and the replacement algorithm above makes this much simpler. If we are compressing some regular text, a supposition could be done, that the radix-2 tree with proposed insertion algorithm, will grow only until some critical size, and once it is achieved, all the future insertions will just replace pointers in leaves. Some experimental results that can prove this decision are given below: +---- Maximum number of nodes in the tree ------ File Size | 2K 4K 6K 8K 10K 12K --------+--------+-------+-------+-------+-------+-------+------- BIB 111261 1.617 1.683 1.690 1.691 1.691 1.691 BOOK1 768771 1.425 1.483 1.491 1.492 1.492 1.492 BOOK2 610856 1.636 1.695 1.706 1.707 1.707 1.707 GEO 102400 1.091 1.147 1.177 1.181 1.181 1.181 NEWS 377109 1.564 1.622 1.636 1.640 1.641 1.641 OBJ1 21504 1.554 1.575 1.577 1.577 1.577 1.577 OBJ2 246814 1.702 1.825 1.861 1.920 1.935 1.939 PAPER1 53161 1.639 1.689 1.692 1.692 1.692 1.692 PAPER2 82199 1.580 1.628 1.629 1.629 1.629 1.629 PAPER3 46526 1.524 1.559 1.559 1.559 1.559 1.559 PAPER4 13286 1.508 1.509 1.509 1.509 1.509 1.509 PAPER5 11954 1.517 1.519 1.519 1.519 1.519 1.519 PAPER6 38105 1.652 1.696 1.696 1.696 1.696 1.696 PIC 513216 3.920 3.941 3.943 3.943 3.943 3.943 PROGC 39611 1.719 1.781 1.781 1.781 1.781 1.781 PROGL 71646 2.221 2.252 2.252 2.252 2.252 2.252 PROGP 49379 2.234 2.253 2.253 2.253 2.253 2.253 TRANS 93695 2.072 2.138 2.142 2.142 2.142 2.142 --------+--------+-------+-------+-------+-------+-------+------- Total: 3251493 1.726 1.790 1.803 1.808 1.809 1.809 Table 1. Relationship between the size of tree and compression. Tests were performed using the lzrw-#.c (-DR2) program on the files from the Calgary Text Compression Corpus as experimental patterns. It could be seen that to achieve the maximum compression for all the text files in the corpus, it is enough to maintain 8K-nodes tree. The only file that could be compressed a little bit better with bigger tree is file obj2, which represents a class of well compressed binary data. --- + this covers only LZ77 coding; other dictionary-based compressors may use more sophisticated strategies, a good example here -- LZJ algorithm [12]. 4. Faster than LZRW1? (using tree of variable radix). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Unfortunately, when it is necessary to scan dozens of nodes per each input character, even the best implementation of the algorithm for scanning radix-2 trees will work very slow. The only way to speed-up this process -- is to reduce the number of nodes to be scanned. To illustrate the idea of eliminating commonly used nodes in the radix tree, consider a structure from the Fig.7. - - - - - - -\ * / \ 2 3 4 5 / \ => [2|4|A|B|C|D] 0 / \ 1 0 1 \ / \ / / - - - - - - - - - -/ / * * - - - - - - - -/ 0 / \ 1 0 / \ 1 A B C D Fig.7. A complete subtree of level 2. It contains pointers to 4 strings, which can be selected by passing two internal nodes in the tree. An obvious way to improve the situation is to join internal nodes, i.e. turn a full tree into a look-up table of corresponding size. ,---*---. / / \ \ => [A|B|C|D] A B C D Fig.8. An equivalent joined node. The next question that arises here is what is the criteria of joining nodes? For instance, if we need to get the maximum speed, we could try to join all the nodes in the tree, thus turning the tree into a direct look-up table (resulting method was already described in the section 2.3). Another example, is joining of each 8-levels chains, that produces the standard character-to-character trie structure. Two more suitable to our algorithm strategies, are: 1) to join complete subtrees (i.e. N = 2^L - 1, where N is the number of (internal) nodes in the tree, L is the maximum level of tree); 2) to join subtrees, for which: N >= 2^(L-1) - 1. The last formulae means that if we extend the tree and turn it to an array (2^L elements), it will consume not more memory than binary tree itself (2*(N+1) elements). 5. Implementation. ~~~~~~~~~~~~~~~~~~~~~~ Although the idea of dynamically-reconstructing tree of variable radix looks extremely attractive (I can see, for instance, a possibility to use Knuth's "buddy system" to maintain its nodes in memory), I did not have a chance to implement it yet. The lzrw1-#.c program uses a much simpler approach, that has been chosen from some practical reasons. It does not use the optimal radix tree, but it is simple, resonably fast, and works good enough to illustrate the idea. Thus, being compiled with #defined RX variable, file lzrw1-#.c produces a compressor based on combined trees of radix 12 and 2 correspondingly. The first one is used on the first step to pass bits 0..3,8..11,16..19 of the input string, the second one -- to consequently scan the other bits. 6. Benchmarks. ~~~~~~~~~~~~~~~~~~ The tables below contain the results of testing of 1+2*3=7 algorithms that could be obtained from the lzrw-#.c code. To compile them I used Symantec C++ 6.1 with DOSX 32bit DOS extender from FlashTek. Here is a list of commands I used: LZRW1: sc -mx -5 -o -dTEST lzrw1-#.c LZRW1-#/R2: sc -mx -5 -o -dTEST -dR2 lzrw1-#.c \ to set # of nodes LZRW1-#/RX: sc -mx -5 -o -dTEST -dRX lzrw1-#.c / modify the TREESIZE param. Tests were carried out on a Pentium/75MHz PC (SysInfo CPU index: 238.2), using Calgary Text Compression Corpus files as experimental patterns. +----- LZRW1-#/R2 ------+----- LZRW1-#/RX ------ File Size | LZRW1 | 2K 4K 8K | 2K 4K 8K --------+--------+-------+-------+-------+-------+-------+-------+------- BIB 111261 1.683 1.617 1.683 1.691 1.615 1.696 1.703 BOOK1 768771 1.470 1.425 1.483 1.492 1.425 1.483 1.493 BOOK2 610856 1.694 1.636 1.695 1.707 1.643 1.704 1.714 GEO 102400 1.184 1.091 1.147 1.181 1.000 1.102 1.176 NEWS 377109 1.628 1.564 1.622 1.640 1.576 1.631 1.647 OBJ1 21504 1.622 1.554 1.575 1.577 1.552 1.609 1.610 OBJ2 246814 1.951 1.702 1.825 1.920 1.799 1.881 1.947 PAPER1 53161 1.729 1.639 1.689 1.692 1.676 1.732 1.732 PAPER2 82199 1.639 1.580 1.628 1.629 1.592 1.645 1.647 PAPER3 46526 1.586 1.524 1.559 1.559 1.537 1.585 1.585 PAPER4 13286 1.613 1.508 1.509 1.509 1.557 1.600 1.600 PAPER5 11954 1.638 1.517 1.519 1.519 1.600 1.635 1.635 PAPER6 38105 1.746 1.652 1.696 1.696 1.704 1.749 1.749 PIC 513216 3.909 3.920 3.941 3.943 3.582 3.940 3.948 PROGC 39611 1.831 1.719 1.781 1.781 1.763 1.822 1.822 PROGL 71646 2.288 2.221 2.252 2.252 2.241 2.300 2.300 PROGP 49379 2.335 2.234 2.253 2.253 2.278 2.319 2.319 TRANS 93695 2.170 2.072 2.138 2.142 2.112 2.174 2.177 --------+--------+-------+-------+-------+-------+-------+-------+------- Total: 3251493 1.803 1.726 1.790 1.808 1.722 1.800 1.819 Table 2. Compression ratio [x.yy:1]. +----- LZRW1-#/R2 ------+----- LZRW1-#/RX ------ File Size | LZRW1 | 2K 4K 8K | 2K 4K 8K --------+--------+-------+-------+-------+-------+-------+-------+------- BIB 111261 4.753 4.947 4.753 4.731 4.954 4.717 4.698 BOOK1 768771 5.442 5.614 5.394 5.362 5.614 5.394 5.358 BOOK2 610856 4.723 4.890 4.720 4.687 4.869 4.695 4.667 GEO 102400 6.757 7.333 6.975 6.774 8.000 7.260 6.803 NEWS 377109 4.914 5.115 4.932 4.878 5.076 4.905 4.857 OBJ1 21504 4.932 5.148 5.079 5.073 5.155 4.972 4.969 OBJ2 246814 4.100 4.700 4.384 4.167 4.447 4.253 4.109 PAPER1 53161 4.627 4.881 4.737 4.728 4.773 4.619 4.619 PAPER2 82199 4.881 5.063 4.914 4.911 5.025 4.863 4.857 PAPER3 46526 5.044 5.249 5.131 5.131 5.205 5.047 5.047 PAPER4 13286 4.960 5.305 5.302 5.302 5.138 5.000 5.000 PAPER5 11954 4.884 5.274 5.267 5.267 5.000 4.893 4.893 PAPER6 38105 4.582 4.843 4.717 4.717 4.695 4.574 4.574 PIC 513216 2.047 2.041 2.030 2.029 2.233 2.030 2.026 PROGC 39611 4.369 4.654 4.492 4.492 4.538 4.391 4.391 PROGL 71646 3.497 3.602 3.552 3.552 3.570 3.478 3.478 PROGP 49379 3.426 3.581 3.551 3.551 3.512 3.450 3.450 TRANS 93695 3.687 3.861 3.742 3.735 3.788 3.680 3.675 --------+--------+-------+-------+-------+-------+-------+-------+------- Total: 3251493 4.437 4.635 4.469 4.425 4.646 4.444 4.398 Table 3. Compression ratio [bits/byte]. +----- LZRW1-#/R2 ------+----- LZRW1-#/RX ------ File Size | LZRW1 | 2K 4K 8K | 2K 4K 8K --------+--------+-------+-------+-------+-------+-------+-------+------- BIB 111261 1.912 0.819 0.786 0.755 1.912 1.484 1.424 BOOK1 768771 1.396 0.735 0.706 0.680 1.756 1.320 1.145 BOOK2 610856 1.945 0.855 0.819 0.791 1.958 1.503 1.349 GEO 102400 1.366 0.537 0.514 0.507 ----- 0.784 0.723 NEWS 377109 1.816 0.802 0.761 0.718 1.859 1.458 1.257 OBJ1 21504 1.783 0.911 0.804 0.820 1.414 1.414 1.465 OBJ2 246814 2.179 1.082 1.019 0.940 2.130 1.818 1.554 PAPER1 53161 1.811 0.867 0.831 0.831 2.028 1.662 1.513 PAPER2 82199 1.889 0.812 0.788 0.788 1.889 1.493 1.479 PAPER3 46526 1.775 0.765 0.765 0.765 1.775 1.455 1.431 PAPER4 13286 1.491 0.745 0.745 0.874 2.112 1.408 1.491 PAPER5 11954 1.341 0.786 0.814 0.814 1.900 1.267 1.900 PAPER6 38105 1.864 0.876 0.817 0.817 1.864 1.615 1.615 PIC 513216 4.028 2.022 1.927 1.912 3.612 2.821 2.653 PROGC 39611 1.889 0.910 0.910 0.849 1.889 1.679 1.679 PROGL 71646 2.485 1.231 1.120 1.178 2.440 2.040 2.040 PROGP 49379 2.415 1.290 1.207 1.207 2.854 2.093 2.093 TRANS 93695 2.482 1.117 1.045 1.039 2.448 1.901 1.901 --------+--------+-------+-------+-------+-------+-------+-------+------- Total: 3251493 1.919 0.909 0.869 0.840 2.011 1.573 1.425 Table 4. Compression speed [Mbytes/sec]. +----- LZRW1-#/R2 ------+----- LZRW1-#/RX ------ File Size | LZRW1 | 2K 4K 8K | 2K 4K 8K --------+--------+-------+-------+-------+-------+-------+-------+------- BIB 111261 3.167 3.167 3.167 3.167 3.167 3.167 3.167 BOOK1 768771 2.864 2.869 2.927 2.898 2.892 2.933 2.892 BOOK2 610856 3.157 3.107 3.201 3.157 3.157 3.157 3.107 GEO 102400 2.676 2.713 2.504 2.504 ----- 2.537 2.676 NEWS 377109 3.114 3.114 3.183 3.183 3.183 3.183 3.284 OBJ1 21504 2.279 2.279 3.418 3.418 3.729 3.418 3.418 OBJ2 246814 3.411 3.387 3.411 3.387 3.411 3.411 3.540 PAPER1 53161 2.982 2.982 2.982 2.982 2.982 2.982 2.982 PAPER2 82199 2.800 3.074 3.136 3.136 3.074 3.136 3.136 PAPER3 46526 3.060 3.060 3.060 3.060 2.610 2.610 3.060 PAPER4 13286 3.620 2.304 2.112 2.112 4.224 4.224 2.112 PAPER5 11954 3.257 3.800 3.257 1.900 3.800 3.800 1.900 PAPER6 38105 3.160 3.160 3.160 3.160 3.160 3.160 3.160 PIC 513216 4.798 4.661 4.822 4.684 4.661 4.798 4.661 PROGC 39611 3.285 3.285 3.285 3.285 3.285 2.605 2.605 PROGL 71646 3.416 3.504 3.504 3.416 4.019 3.504 3.504 PROGP 49379 4.095 4.095 3.364 4.095 4.095 3.248 3.248 TRANS 93695 3.971 3.504 3.504 3.971 3.971 3.574 3.971 --------+--------+-------+-------+-------+-------+-------+-------+------- Total: 3251493 3.295 3.271 3.320 3.299 3.354 3.306 3.297 Table 5. Decompression speed [Mbytes/sec]. 7. Conclusion. ~~~~~~~~~~~~~~~~~ Thus, looking at the results above we can notice, that the series of -#* algorithms can outperform an original lzrw1 scheme by compression factor (about +0.1%). In addition, algorithms based on combined radix tree (/RX), can acheive competitive speed. On average, lzrw1-#/rx works just 20% slower. This is not too bad, but again, I have to say that the search tree I used in lzrw1-#/rx is not the optimal one, and that here remains a lot of possibilities for further improvements... 8. Acknowlegements. ~~~~~~~~~~~~~~~~~~~~~ Many thanks to Vadim Vlasov and Peter Gutmann for valuable comments and help in improving the text of this article. 9. References. ~~~~~~~~~~~~~~~~ 1. Williams, R. "An Extremely Fast Ziv-Lempel Data Compression Algorithm", IEEE Data Compression Conference, Snowbird, Utah, 8-11 April 1991. 2. U.S. patent # 5,049,881 Apparatus and method for very high data rate-compression incorporating lossless data compression and expansion utilizing a hashing technique. inventors Dean K. Gibson, Mark D. Graybill assignee Intersecting Concepts, Inc. Filed 6/18/90, granted 9/17/91 3. U.S. patent # 4,701,745 Data compression system Inventor Waterworth John R, Assignee Ferranti PLC GB, Filed 03/03/1986 (03/06/1985 in GB), granted 10/20/1987 4. Calgary Text Compression Corpus: ftp ftp.cpsc.ucalgary.ca: /pub/projects/text.compression.corpus/text.compression.corpus.tar.Z. 5. Bell, T.C, Cleary, J.G. and Witten, I.H, "Text Compression", Prentice-Hall 1989. ISBN: 0-13-911991-4. 6. Bell, T.C. Better OPM/L text compression, IEEE Trans.Communications, COM-34(12), 1176-1182, December 1986. 7. GZip 1.2.4: ftp garbo.uwasa.fi:/unix/arcers/gzip-1.2.4.tar.Z 8. U.S. patent # 5,155,484 Fast data compressor with direct lookup table indexing into history buffer, filed 9/13/1991, granted 10/13/1992 inventor Chambers, IV, Lloyd L., assignee Salient Software, Inc. 9. Thomas,S., and Orost,J. 1985. Compress (version 4.0) program and documentation 10. Fiala,E.R., and Greene,D.H. Data compression with finite windows, Comm.of the ACM, 32(4), 490-505, April 1989. 11.U.S. patent # 4,906,991 Textual substitution data compression with finite length search window, filed 4/29/1988, granted 3/6/1990 inventors Fiala,E.R., and Greene,D.H. assignee Xerox Corporation 12.Jakobsson, M. Compression of character strings by an adaptive dictionary, BIT, 25(4) 1985, 593-603. Appendix A. The LZRW1-# source code. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /* lzrw1.c - LZRW1 algorithm by Ross Williams (1991) */ #if R2 || RX /* * LZRW1-#.C - replaces the hash table by a radix-search tree. * Radix search algorithm: (C) George A. Resnick, 1995. * See file LZRW1-#.TXT for details. */ #endif /* Types: */ #define UBYTE unsigned char /* Unsigned byte (1 byte ) */ #define UWORD unsigned int /* Unsigned word (2 bytes) */ #define ULONG unsigned long /* Unsigned longword (4 bytes) */ /* Constants: */ #define TRUE 1 #define FLAG_BYTES 4 /* Number of bytes used by copy flag. */ #define FLAG_COMPRESS 0 /* Signals that compression occurred. */ #define FLAG_COPY 1 /* Signals that a copyover occurred. */ /* Fast copy routine: */ void fast_copy(p_src,p_dst,len) UBYTE *p_src,*p_dst; {while (len--) *p_dst++=*p_src++;} #if R2 || RX #include #endif /* LZRW1 compression routine: */ void lzrw1_compress(p_src_first,src_len,p_dst_first,p_dst_len) /* Input : Specify input block using p_src_first and src_len. */ /* Input : Point p_dst_first to the start of the output zone (OZ). */ /* Input : Point p_dst_len to a ULONG to receive the output length. */ /* Input : Input block and output zone must not overlap. */ /* Output : Length of output block written to *p_dst_len. */ /* Output : Output block in Mem[p_dst_first..p_dst_first+*p_dst_len-1]. */ /* Output : May write in OZ=Mem[p_dst_first..p_dst_first+src_len+256-1].*/ /* Output : Upon completion guaranteed *p_dst_len<=src_len+FLAG_BYTES. */ UBYTE *p_src_first,*p_dst_first; ULONG src_len,*p_dst_len; #define PS *p++!=*s++ /* Body of inner unrolled matching loop. */ #define ITEMMAX 16 /* Maximum number of bytes in an expanded item. */ { UBYTE *p_src=p_src_first, *p_dst=p_dst_first; UBYTE *p_src_post=p_src_first+src_len, *p_dst_post=p_dst_first+src_len; UBYTE *p_src_max1=p_src_post-ITEMMAX, *p_src_max16=p_src_post-16*ITEMMAX; UBYTE *p_control; UWORD control=0,control_bits=0; #if R2 || RX # define TREESIZE (4096*2) static UBYTE *tree[TREESIZE]; int lastnode; /* radix search tree */ assert((int)p_src>0 && (int)p_src_post>0); /* text pointers must be > 0 */ #if R2 tree[0]=tree[1]=0; lastnode=2-TREESIZE; /* initialize root */ #else /* RX */ for (lastnode=0;lastnode<4096;lastnode++) tree[lastnode]=0; lastnode-=TREESIZE; #endif #else static UBYTE *hash[4096]; /* original LZRW1 hash table */ #endif *p_dst=FLAG_COMPRESS; p_dst+=FLAG_BYTES; p_control=p_dst; p_dst+=2; /* The Main Loop: */ while (TRUE) { UWORD unroll=16,len; ULONG offset; if (p_dst>p_dst_post) goto overrun; if (p_src>p_src_max16) { unroll=1; if (p_src>p_src_max1) { if (p_src==p_src_post) break; goto literal; } } /* Begin Unrolled Loop: */ do { register UBYTE *p,*q,*s; #if R2 || RX register ULONG k=*(ULONG*)(s=p_src); #if R2 p=(UBYTE*)-TREESIZE; /* root */ # define S(n) q=p+(k&1); k>>=1; p=tree[TREESIZE+(int)q]; \ if ((int)p>=0) goto chk_offset_##n; S(0);S(0);S(0);S(0);S(0);S(0);S(0); /* scan bits 0..6 */ S(1);S(1);S(1);S(1);S(1);S(1);S(1);S(1); /* bits 7..14 */ S(2);S(2);S(2);S(2);S(2);S(2);S(2);S(2);S(3); /* 15..23 */ #else /* RX */ q=(UBYTE*)((k&0x0F0F)+((k>>12)&0x0F0)-TREESIZE); p=tree[TREESIZE+(int)q]; k>>=4; /* scan bits 0..3|8..11|16..19 */ # define S(s,n) if ((int)p>=0) goto chk_offset_##n; \ q=p+(k&1); k>>=s; p=tree[TREESIZE+(int)q]; S(1,0);S(1,0);S(1,0);S(1+4,0); /* scan bits 4..7 */ S(1,1);S(1,1);S(1,1);S(1+4,1); /* scan bits 12..15 */ S(1,2);S(1,2);S(1,2);S(1+0,2); /* scan bits 20..23 */ #endif # define C(n) chk_offset_##n: if (p<=0||(offset=s-p)>4095) goto update; \ s+=n; p+=n; switch (n) {case 0: if (PS) goto add; case 1: if (PS) goto add; \ case 2: if (PS) goto add; case 3: goto cont; } C(3);C(2);C(1);C(0); /* check offsets */ add: if (lastnode<0) { tree[TREESIZE+(int)q]=(UBYTE*)lastnode; q=(UBYTE*)lastnode+(k&1); tree[TREESIZE+(int)q^1]=0; lastnode+=2; } update: tree[TREESIZE+(int)q]=p_src; goto literal; cont: tree[TREESIZE+(int)q]=p_src; #else UWORD index=((40543*((((p_src[0]<<4)^p_src[1])<<4)^p_src[2]))>>4) & 0xFFF; p=hash[index]; hash[index]=s=p_src; offset=s-p; if (offset>4095 || p>4)+(len-1); *p_dst++=offset&0xFF; p_src+=len; control=(control>>1)|0x8000; control_bits++; continue; literal: *p_dst++=*p_src++; control>>=1; control_bits++; } while (--unroll); /* Flush Control Bits */ if (control_bits==16) { *p_control=control&0xFF; *(p_control+1)=control>>8; p_control=p_dst; p_dst+=2; control=control_bits=0; } } control>>=16-control_bits; *p_control++=control&0xFF; *p_control++=control>>8; if (p_control==p_dst) p_dst-=2; *p_dst_len=p_dst-p_dst_first; return; overrun: fast_copy(p_src_first,p_dst_first+FLAG_BYTES,src_len); *p_dst_first=FLAG_COPY; *p_dst_len=src_len+FLAG_BYTES; } /* LZRW1 decompression routine: */ void lzrw1_decompress(p_src_first,src_len,p_dst_first,p_dst_len) /* Input : Specify input block using p_src_first and src_len. */ /* Input : Point p_dst_first to the start of the output zone. */ /* Input : Point p_dst_len to a ULONG to receive the output length. */ /* Input : Input block and output zone must not overlap. User knows */ /* Input : upperbound on output block length from earlier compression. */ /* Input : In any case, maximum expansion possible is eight times. */ /* Output : Length of output block written to *p_dst_len. */ /* Output : Output block in Mem[p_dst_first..p_dst_first+*p_dst_len-1]. */ /* Output : Writes only in Mem[p_dst_first..p_dst_first+*p_dst_len-1]. */ UBYTE *p_src_first, *p_dst_first; ULONG src_len, *p_dst_len; {UWORD controlbits=0, control; UBYTE *p_src=p_src_first+FLAG_BYTES, *p_dst=p_dst_first, *p_src_post=p_src_first+src_len; if (*p_src_first==FLAG_COPY) {fast_copy(p_src_first+FLAG_BYTES,p_dst_first,src_len-FLAG_BYTES); *p_dst_len=src_len-FLAG_BYTES; return;} while (p_src!=p_src_post) {if (controlbits==0) {control=*p_src++; control|=(*p_src++)<<8; controlbits=16;} if (control&1) {UWORD offset,len; UBYTE *p; offset=(*p_src&0xF0)<<4; len=1+(*p_src++&0xF); offset+=*p_src++&0xFF; p=p_dst-offset; while (len--) *p_dst++=*p++;} else *p_dst++=*p_src++; control>>=1; controlbits--; } *p_dst_len=p_dst-p_dst_first; } /* end of lzwr1.c */ #if TEST || TESTW /* * Here is a simple test program that uses the LZRW1 algorithm * to compress/expand files specified in the command line. */ #include #include #include #include #include #include #include #include static void error (const char* msg, ...) { vprintf(msg,(va_list)((char *)&(msg) + sizeof(msg))); exit (EXIT_FAILURE); } int main (int argc, char *argv[]) { /* block size */ # define MAXTEXTSIZE (256 * 1024) /* 256K */ int block_size = MAXTEXTSIZE; /* source/destination buffers */ int src_file, dest_file; long src_len, dest_len, up_len; static long tot_src_len = 0, tot_dest_len = 0, tot_up_len = 0; static unsigned char src_buf[MAXTEXTSIZE], dest_buf[MAXTEXTSIZE+256]; /* time counters */ unsigned long rd_time, wr_time, pk_time, up_time, start_time; static unsigned long tot_pk_time = 0, tot_up_time = 0; # define REP 20 /* use REP=10..20 for fine time measurement */ /* other variables */ char fname[127], *fn, *p; if (!--argc) error ("usage:\n\tLZRW1 [/B=block_size] files\n"); if (*(p = *++argv) == '/') { p ++; if (*p == 'B' || *p == 'b') { block_size = atoi (p+2); if (block_size <= 0 || block_size > MAXTEXTSIZE /1024) error ("Invalid block size.\n"); block_size *= 1024; } else error ("Invalid switch: %s\n", --p); if (!--argc) error ("File specification expected.\n"); argv++; } /* the main loop: */ do { long i, j, k, l; /* open source file */ fn = p = * argv++; if ((src_file = open (p, O_RDONLY | O_BINARY)) == -1) { _open_file_error: error ("Can't open file %s\n", p); } /* form destination file name */ p = strrchr (strcpy (fname, p), '.'); if (!p) p = fname + strlen(fname); strcpy (p, ".-#-"); /* add compressed file name extension */ p = fname; #if TESTW /* create destination file */ if ((dest_file = creat (p, S_IREAD | S_IWRITE)) == -1) goto _open_file_error; #endif src_len = dest_len = up_len = 0; rd_time = wr_time = pk_time = up_time = 0L; /* printf ("Compressing [/B=%d] [/M=%d] %s ... ", block_size/SECTORSIZE, method+1, fn); */ /* read/compress/expand blocks */ do { /* reading ... */ start_time = clock (); i = read (src_file, src_buf, block_size); rd_time += clock () - start_time; if (i == -1) error ("read error.\n"); if (i < 256L) break; src_len += i; /* compressing ... */ start_time = clock (); l = REP; do lzrw1_compress (src_buf, (ULONG)i, dest_buf, (ULONG*) &j); while (--l); pk_time += clock () - start_time; if (j < i) { /* expanding ... */ static unsigned char tmp_buf[MAXTEXTSIZE]; memset (tmp_buf, 0, sizeof src_buf); start_time = clock (); l = REP; do lzrw1_decompress (dest_buf, j, tmp_buf, (ULONG*)&k); while (--l); up_time += clock () - start_time; if (k != i || memcmp(src_buf,tmp_buf,i)) printf ("???\n"); up_len += i; } #if TESTW /* writting ... */ start_time = clock (); k = write (dest_file, dest_buf, j); wr_time += clock () - start_time; if (k == -1) error ("write error.\n"); if (k < j) { close (dest_file); unlink (p); error ("insufficient disk space.\n"); } #else k = j; #endif dest_len += k; } while (i == block_size); close (src_file); #if TESTW close (dest_file); #endif if (src_len && up_len) { # define MBS ((double)CLOCKS_PER_SEC/(1024.*1024.)) printf ("%-14s %8ld %7.03f %7.03f %7.03f\n", fn, src_len, (double)src_len/(double)dest_len, MBS * (double)(src_len) * REP / (double)(pk_time+1), MBS * (double)(up_len) * REP / (double)(up_time+1) ); } tot_src_len += src_len; tot_dest_len += dest_len; tot_up_len += up_len; tot_pk_time += pk_time; tot_up_time += up_time; } while (--argc); printf ("Total: %8ld %7.03f %7.03f %7.03f\n", tot_src_len, (double)tot_src_len/(double)tot_dest_len, MBS * (double)(tot_src_len) * REP / (double)(tot_pk_time+1), MBS * (double)(tot_up_len) * REP / (double)(tot_up_time+1) ); return EXIT_SUCCESS; } #endif