What is bzip2? ============== bzip2 compresses files using the Burrows-Wheeler block- sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors. Further Reading =============== bzip2 is not research work, in the sense that it doesn't present any new ideas. Rather, it's an engineering exercise based on existing ideas. Four documents describe essentially all the ideas behind bzip2: Michael Burrows and D. J. Wheeler: "A block-sorting lossless data compression algorithm" 10th May 1994. Digital SRC Research Report 124. ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz If you have trouble finding it, try searching at the New Zealand Digital Library, http://www.nzdl.org. Daniel S. Hirschberg and Debra A. LeLewer "Efficient Decoding of Prefix Codes" Communications of the ACM, April 1990, Vol 33, Number 4. You might be able to get an electronic copy of this from the ACM Digital Library. David J. Wheeler Program bred3.c and accompanying document bred3.ps. This contains the idea behind the multi-table Huffman coding scheme. ftp://ftp.cl.cam.ac.uk/pub/user/djw3/ Jon L. Bentley and Robert Sedgewick "Fast Algorithms for Sorting and Searching Strings" Available from Sedgewick's web page, www.cs.princeton.edu/~rs The following paper gives valuable additional insights into the algorithm, but is not immediately the basis of any code used in bzip2. Peter Fenwick: Block Sorting Text Compression Proceedings of the 19th Australasian Computer Science Conference, Melbourne, Australia. Jan 31 - Feb 2, 1996. ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps Kunihiko Sadakane's sorting algorithm, mentioned above, is available from: http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz The Manber-Myers suffix array construction algorithm is described in a paper available from: http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps