LZW COMPRESSION AND DECOMPRESSION Copyright (C) 1993 Homer Wilson Smith Redistribution rights granted for non commercial purposes. ((This is long and thorough but not as clear as I would have liked, mostly due to confusion of words. A CODE is a number from 0 to 255 representing an ascii or ebcdic letter or character. ASCII stands for American Standard Code for Information Interchange. EBCDIC stands for Extended Binary Coded Decimal Interchange Code. The INPUT STREAM is a series of characters or INPUT CODES that will be compressed. An INPUT CODE is the input character that one wants to compress. INPUT CODES are 8 bit integers stored in a 16 bit integer. Thus the input stream is actually a stream of 16 bit integers and not 8 bit integers, even though each 16 bit integer holds a single 8 bit integer. An OUTPUT STREAM is a stream of OUTPUT CODES which is series of characters representing the compression of the INPUT STREAM. The OUTPUT CODE is the compressed version of the INPUT CODE. Output CODES are 12 bits each packed into a 16 bit integer output stream. Thus four 12 bit codes can be packed into three 16 bit integers. A TABLE is a 4096 row x 2 column matrix of 16 bit integers. A TABLE POSITION is any row in that matrix, such as row 235. Each TABLE ROW consists of two 16 bit integers of the form (INDEX,CODE) stored as column 1 and column 2. A TABLE INDEX is a pointer to a prior table row. The TABLE CODE is taken from the uncompressed input stream. So table ROW 234 might contain (100,128). The table index = 100 refers back to TABLE ROW 100 which will contain ANOTHER EARLIER (INDEX,CODE) pair. Thus each table row contains a LINK called a table INDEX which points back to a prior row. And each table row contains a CODE taken from the input stream. The series of TABLE INDEXES starting at the beginning of the table, forms the output code compressed stream, which is explained below.)) LZW COMPRESSION ALGORITHM The LZW compression algorithm is probably one of the most mind boggling algorithms to come along, but it is also one of the most brilliant, so it is worth understanding well. It's really quite mind opening. Have you ever programed something you didn't fully understand, but because it worked fine you let it go, and then months later found yourself still wondering why your code worked? Well that's what happened to me with LZW compression. My programs worked, for a very long time they worked, until one day I went to send a large tiff to a color separator, and bang nothing would read it. Corel Draw wouldn't read it, CSHOW wouldn't read it, my own damn tiff program wouldn't read it, and that REALLY pissed me off. I had written my tiff programs with barely passable understanding of the algorithm, following along the pseudo code provided by others, with little personal comprehension myself. I had read a number of descriptions trying to make LZW compression and decompression clear to me, but basically all I ended up with was a partial understanding of the encoding process and no understanding at all of the decoding process. I kind of programmed it blind, and hoped it would work. It took me many hours to find the bug, it was one of those 'it crops up every hundred million years' kind of bug, and I didn't know what I was doing anyhow, so it wasn't easy. But from all bad things comes something good (I think) and this paper is the result. I finally understood both sides of the compression and decompression dichotomy, and I am reasonably sure my programs will work because I understand them and not because I followed someone else's code. The proof of course is in the pudding, and the pudding is whether or not I can explain this to others so that they get it the first time. The algorithm is really quite brilliant, and worth knowing just for the fun of it. This paper is not meant to replace everything else you might read on the subject of LZW encoding or TIFF and GIF files, but it is intended to make all that other stuff clear. This is what I wish someone had written for me when I first approached the subject. WHY DATA COMPRESSION The problem with data is it takes up so much damn space, and with the cost of hard disk space and the slowness of telecommunications, it behooves us to find a way to make data smaller, to make it take up less room, so that we can then use up the space we free up to fill it with more data! It actually takes less time to compress data than it does to transmit it, so its a good cost saving to compress the data first, then transmit it and uncompress it at the other end. RUN LENGTH ENCODING (RLE) Various encoding schemes have been devised to take care of this need, one of the earliest most obvious ones being RUN LENGTH ENCODING. Run length encoding takes advantage of the fact that often many pixels right next to each other are the same color pixel, so rather than sending them all one by one, they are counted up into a RUN and two bytes are sent instead, the first byte telling how many pixels there were of one color number, and the next byte the number it was. Thus a run of pixels that looked like this 5 5 5 3 3 3 9 9 9 9 9 1 1 1 1 would be run length encoded as 3 5 3 3 5 9 4 1 an obvious saving of space. The longer the runs are the more space is saved as each color run is represented by only two bytes, a count and a color. If your counts can go only up to 255, then if the run is longer you have to start a new run, but if your counts can go up to 32767, then you can handle longer runs before you have to start a new run. The problem comes in when the data does not consist of large expanses of identical pixels, who wants to look at a single patch of red anyhow. A worst case scenario of course is when every pixel is different such as, 2 3 4 5 6 7 8 9 etc. Encoding this as RLE would produce, 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 which is an obvious waste of space. One solution to this is to insert an INTRODUCER byte before each sequence telling whether it is a horizontal run, or a stretch of one byte per pixel. Thus runs may start with positive counts, and stretches of one byte per pixel may start with a negative count. Thus the sequence 3 3 3 3 3 1 2 3 4 5 6 would be compressed as 5 3 -6 1 2 3 4 5 6 giving some compression, but its obviously not an ideal solution. There is also tremendous complexity involved in programming the ideal compression for data that has a lot of 2 byte runs in it. For example, 3 3 4 5 3 3 4 5 3 3 4 5 is better put out as a single stretch of one byte per pixel -12 3 3 4 5 3 3 4 5 3 3 4 5 rather than as a mixture 2 3 -2 4 5 2 3 -2 4 5 2 3 -2 4 5 because of the the one byte overhead in switching from horizontal runs to one byte per pixel runs. One might make a simple rule that all two byte runs should be put out as one byte per pixel, except that this is only true if you have already opened a one byte per pixel stretch, otherwise 2 byte runs should be put out as a horizontal run (count,byte). In any case run length encoding (RLE) only produces good compression when the data is even and unchanging. Fractals of course are the antithesis of unchanging data, so RLE does not work well on fractals or any other natural data like video frames or even written work. In other words if you have a patch of red to compress, use RLE. Otherwise keep reading. The LZW Algorithm There are four concepts that need to be understood. They are, 1.) INPUT CODE STREAM 2.) TABLE INDEX 3.) TABLE CODE 4.) OUTPUT CODE STREAM The purpose of LZW compression is to turn the input code stream into an output code stream of lesser length, and then later to reconstruct the input code stream from the output code stream. This accomplishes the purpose of compression and decompression. The above four concepts can be visualized as follows: INPUT CODE TABLE OUTPUT CODE STREAM INDEX CODE STREAM ------------------------------------------------------------------------ CODE, CODE, CODE... (INDEX, CODE) CODE, CODE.... (INDEX, CODE) (INDEX, CODE) . . CODE. A code is a byte wide entity consisting of 8 bits and ranging in decimal value from 0 to 255. They can be considered as characters, or as integer numbers, it doesn't matter. They are NOT signed integers however, not plus or minus. INPUT CODES, TABLE CODES and OUTPUT CODES are all of this format. INDEX. An index is a two byte wide SIGNED entity consisting of 16 bits and ranging in decimal value from -32768 to 32767. It is called an index because it refers to an earlier entry in the table. For example, if table row position number 400 consists of the (index,code) pair (312,5), then 312 is called a table index and points BACKWARDS to table row 312 which was generated in the PAST by the algorithm, and 5 is the table code generated in the PRESENT process of compression, and is stored in table row 400 as (312,5). Table row 312 will itself have an (index,code) pair that will point to an even earlier table row, which will point to an earlier table row ad infinitum until you come back to the first table entries before which there are none. These first entries in the table are called the ROOT ENTRIES or the ROOT AREA. If a table row has -1 for an INDEX, that means that that table row does NOT refer to any earlier table row. -1's appear as table indexes either: 1.) when it resides in the ROOT AREA as described below. 2.) when a table row is still unused. Table rows that haven't been used yet don't point to earlier table rows because they haven't been used yet! In either case the table index will be -1. 0 by the way is a valid table index which points to the very first table row in the root area, table row number 0. INPUT CODE STREAM. The input code stream is the sequence of bytes that you wish to compress into a hopefully shorter output code stream. OUTPUT CODE STREAM. The output code stream is the sequence of bytes that is the final product of compression, and is hopefully shorter than the input code stream and RECONSTRUCTABLE BACK into the input code stream. It's no good compressing something if you can't uncompress it! TABLE. The table is a 4096 row x 2 column table of integer numbers of the following form, it has 4096 rows numbered 0 to 4095. TABLE ROW (INDEX,CODE) ------------------------------------------------------------------------ ----- 0 (-1 , 000) | 1 (-1 , 001) | 2 (-1 , 002) | . . ROOT AREA . . | 255 (-1 , 254) | 255 (-1 , 255) | 256 (-1 , 256) Clear Code (CC) ----- 257 (-1 , 257) End of Information Code (EOIC) ----- 258 (INDEX,CODE) | 259 (INDEX,CODE) ADDED . ENTRY AREA . | 278 (INDEX,CODE) | 279 (INDEX,CODE) <---TABLE POINTER to latest row | 270 (-1 , 0) As yet unused rows | 271 (-1 , 0) " | . . ----- 4095 (-1 , 0) End of table. The table consists of only the two columns of numbers, the table row numbers are NOT in the table and are only shown for clarity. The table consists of 2 areas, the ROOT area and the ADDED ENTRY AREA. ROOT AREA. The root area consists of the first 258 rows in the table numbered from 0 to 257. The rows from 0 to 255 are initialized at program startup to the numbers shown and do not change during the duration of the program. Rows numbered 256 and 257 are also set during program start up and are used to control the process of compression and decompression. Their CODE values are otherwise arbitrarily chosen. Because a few of their values are greater than 255, table codes must use 16 bit wide integers, even though input codes themselves have only 8 significant bits from 0 to 255. The table indexes in the root area are set to -1's because they do not point to any earlier table row. The table codes in the root area are set to the basic alphabet of possible characters in the input code stream, in this case the integers from 0 to 255. The Codes for CLEARCODE and END OF INFORMATION CODE are arbitrarily set to 256 and 257. This leaves table row 258 as the first empty row to ADD new rows into the table when the compression process actually starts. ADDED ENTRY AREA. At program startup rows 0 through 257 are initialized as described above and a pointer is set to the last entered row in the table, in this case 257. The remaining table rows are initialized to (-1,0) as shown and usually there are a total of 4096 rows in the table numbered 0 to 4095, including the root area. During the process of compression the program scans the input code stream and fills in new rows to the table according to the LZW compression algorithm which we will discuss at length below. Each new table row consists of an (INDEX,CODE) pair which is determined by the input code stream and the LZW algorithm. As each new (INDEX,CODE) pair is found the table pointer is incremented by one and the new (INDEX,CODE) pair is inserted into the table at the row pointed to by the table pointer. In the following discussion we wish to prove the following points. 1.) The table is constructed from and totally determined by the input code stream. The table is usually shorter than the input code stream and so compression has taken place in the construction of the table. This is because each single table row can represent more than one character from the input code stream. 2.) From the table alone the input code stream can be reconstructed. Thus if you have the complete table you can reconstruct the input code stream. The right hand table codes are NOT the compressed input code stream, but are a subset of characters found in the input code stream. 3.) The right hand column of table codes can be reconstructed from the left hand column of table indexes alone. Thus if you know the left hand table indexes you can recompute from that information alone the right hand table codes and thus the input code stream. 4.) Thus the table indexes can act as the final output compressed data stream, or in other words, the output codes ARE the table indexes from the added row area. Do not get table CODES confused with the output codes. The table INDEXES become the output codes. The table CODES are taken from the input code stream, but are a specifically selected subset of the input code stream, and in the end can be thrown away. Because the table indexes can go up to 4095, they need 12 bits to represent them. Although they are computed as 16 bit wide integers, when the indexes are finally output to the output data stream, they are packed at 12 bits each. Thus an input data stream of 8 bit integers gets compressed and output as an output data stream of 12 bit integers. Since each table index can represent a unique string of varying length, overall compression takes place. For example if a single 12 bit wide table index represents the 3 character string of 'ABC' of 8 bits each or 24 bits, that is a compression factor of 2:1. Three characters of 8 bits each or 24 bits are compressed into one character of 12 bits. 5.) The process of uncompressing the output code stream then works as follows. First you must rebuild the table indexes from the output code stream. The ROOT AREA of the table is trivial as it is always the same. The ADDED ENTRY AREA is almost as trivial. The table INDEX column of the added row area IS the sequence of output codes, as the output codes ARE the table indexes. The table CODE column of the added row area is then recomputed from the table INDEX column by means that will become obvious when the full algorithm is discussed, and that completes the rebuilding of the table from the output codes. From the completed table one can then reconstruct the original input code stream. SO HOW DOES LZW WORK? The idea behind LZW is that not only do some pixels repeat themselves, but many SEQUENCES of pixels repeat themselves. For example take the input code stream (in bytes), ABCDE ABCDE ABCDE ABCDE ABCDE ignore spaces Such a sequence would not lend itself well to run length encoding because no single pixel forms a run. However the sequence of 5 pixels ABCDE certainly form a run as they appear one right after the other 5 times. One might even suggest a super run length encoding that could seek out repeating runs of SEQUENCES and perhaps encode a string like ABCDE ABCDE ABCDE ABCDE ABCDE FGH FGH FGH FGH (without spaces) stream as something like, 5(ABCDE) 4(FGH) However LZW goes one step further and recognizes that the string ABCDE can appear anywhere in the input code stream, not just in consecutive runs, so why not just assign the string ABCDE a unique CODE, say 234, and then every time the string ABCDE appears replace it with the single byte 234. So if 234 is assigned to ABCDE and 235 is assigned to FGH then the above 20 byte string could be compressed into the 9 bytes 234 234 234 234 234 235 235 235 235 Of course the number 236 could be assigned to the string ABCDEABCDE and 237 assigned to FGHFGHFGH, so now the same 20 byte input code stream could be output as 236 236 234 237 235 To make this easier to see: 236 ABCDEABCDE 236 ABCDEABCDE 234 ABCDE 237 FGHFGHFHG 235 FGH Obviously you stand to create a lot of compression this way, if you can find all the repeating strings in a text. Just as obviously the whole scheme is utterly preposterous, as you could just assign one number, say 234, to the whole damn text and then output only that one number! The problem then is clearly keeping track of what string each number stands for. That is where a string table comes in. STRING TABLES A string table is merely a table of strings, as they are found, of varying length and the code numbers that have been assigned to them. In this case the string is inserted into the table at the table row number corresponding to the code number assigned to that string. The code numbers do not also have to stored because the row number is the code number. The row number is the same as the code number for that string. Do not confuse strings tables from the TABLE talked about above. For example let's say we have collected the following table of strings. TABLE TABLE ROW STRING ENTRY ------------------------------------------------------------------------ 232 SDF 233 JHIKLNN 234 ABCDE 235 FGH 236 ABCDEABCDE 237 FGHFGHFGH With such a table in hand it would be easy enough to convert the output code stream, 236 236 234 237 235 back into the original input stream of ABCDEABCDE ABCDEABCDE ABCDE FGHFGHFGH FGH (Without the spaces of course.) Which is what we want. The problem of course with this method is that you have to have the string table in order to know what the code numbers stand for, and the space you save by compressing your input code stream into the smaller sequence of bytes is mostly lost in having to carry the whole string table along with you. This becomes obvious in the extreme example of assigning the whole text to one number and putting the whole text in the table as one row. You are actually up a byte from the original data, namely the damn code number. Not a good compression ratio. The key question is, can the string table be reconstructed FROM the output code stream? If not then you have to carry the string table with you to decode the output code stream, and that wastes space. If you CAN reconstruct the string table from the output code stream then you can throw the table away as you produce the output, and recover them later when you want to go in the other direction. LZW takes this one step further in that the output code stream IS the left hand column of original table indexes, and the right hand column of table codes can be easily computed from the table indexes which we shall see in a moment. Once you have both you can recompute the input code stream. So how does it work? THE LZW ALGORITHM The algorithm depends on the idea that any given input string either is or is not in the string table. If it is in the string table it is not put in again. If the string is not in the string table it is immediately put in. But let's look at how the algorithm gets its strings. Assume we want to compress the input code stream ABABABAB Remember the characters A and B are bytes to the computer and so are really integer numbers somewhere between 0 and 255. In this case A is 61 and B is 62. The algorithm starts with the first character in the input code stream and continues to grab characters from the input stream building a a multi character string until it builds a string that is NOT in the string table. The string table is first initialized with all possible single character strings, this forms the root area of the table. Let's simplify this and just assume for the moment that there are only 4 different single character strings rather than 255, and let's call them A, B, C, D. Let's also forget the clear code and the end of information code, so that our root area has just 4 rows in it, numbered 0 through 3, and the first added row will go in row 4. Then the table will look like this after initialization. INDEX CODE REPRESENTED STRING -1 0 A -1 1 B -1 2 C -1 3 D Since ALL possible single character strings ARE in the table in the root area the very first character that is taken from the input code stream will definitely be in the table. In this case that is the letter A. So nothing is added to the table and the next character from the input code stream is grabbed. This creates the string AB. AB is NOT in the table so we add it to the table at row 4. INDEX CODE STRING TABLE -1 0 A -1 1 B -1 2 C -1 3 D 0 4 AB We now do something very strange, something that will probably earn LZ&W their Nobel Prize in mathematics: we take this last table row string at row 4 (AB) and discard everything but the last letter of the string and start anew. So now we have the string B which is definitely in the table in the root area, so we go on and grab the next character in the input code stream which is another A. The string BA is NOT in the table so we add it to row 5. INDEX CODE STRING TABLE -1 0 A -1 1 B -1 2 C -1 3 D 0 4 AB 1 5 BA We discard all but the last letter of BA leaving us with A again which is in the table in the root area. We grab the next character in the input code stream forming AB. Now this time AB is already in the table from the last time we came across it, so we continue on to the next character. This gives us ABA. ABA is NOT in the table so we enter it in row 6. INDEX CODE STRING TABLE -1 0 A -1 1 B -1 2 C -1 3 D 0 4 AB 1 5 BA 6 ABA Keeping the last letter A, we grab the next character in the input code stream forming AB. This is in the table, so we get the next character forming ABA which is ALSO now in the table. Adding the next character we get ABAB which is NOT in the table so we add it at row 7. INDEX CODE STRING TABLE -1 0 A -1 1 B -1 2 C -1 3 D 0 4 AB 1 5 BA 6 ABA 7 ABAB At this point you can clearly see how to build the rest of the table from the input code stream. You do this until you run out of input code stream, or table room at which point you clear the table and start over again with a new table. Let's assume we ran out of input codes, so as usual we discard all but the last character of the last string and that leaves B. Since there are no further codes to add to B, it becomes the last string. INDEX CODE STRING TABLE -1 0 A -1 1 B -1 2 C -1 3 D 0 4 AB 1 5 BA 4 6 ABA 6 7 ABAB 1 8 B And there we have our completed table. Now here is where the brilliance comes in. Each string is added to the table because it was not already in the table. But each string is built up one character at a time so each one of its predecessor strings HAD TO BE IN THE TABLE for the last string to get as long as it did before it got put in the table. In other words its that LAST CHARACTER at the end of each string in the table that made that string NOT BE in the table. This means that if you take any string in the table and remove its last character you must have a string that was already in the table earlier. Thus each string can be split up into its last character and an index into an earlier row in the table. And voila you have the (INDEX,CODE) form of the string table. TABLE TABLE IN TABLE IN ROW STRING FORMAT (INDEX,CODE) FORMAT 0 A (-1 , A) 1 B (-1 , B) ROOT AREA 2 C (-1 , C) 3 D (-1 , D) 4 A,B ( 0 , B) | 5 B,A ( 1 , A) ADDED ROW AREA 6 AB,A ( 4 , A) | 7 ABA,B ( 6 , B) | 8 B,x ( 1 , x) x means doesn't matter. So the first thing we need to demonstrate is that the input code stream can be reconstructed from the table, particularly from the table indexes in the added row area, 0 1 4 6 1. The first thing we must realize is that each table index refers to an earlier complete string already in the table. For example in row 7 there is a (6,B). The 6 refers to row 6 where there is a (4,A). The 4 refers to row 4 where there is a (0,B) and the 0 refers to row 0 where there is a (-1,A) The -1 means you are in the root area and the backwards search ends. So the (6,B) in row 7 means that row 7 contains a string which is a B preceded by the string in row 6. But the string in row 6 is an A preceded by the string in row 4. The string in row 4 is a B preceded by the string in row 0. And the string in row 0 is an A preceded by nothing. So let's take the INDEX CODES 0 1 4 6 1 and figure out what they mean when added together. This time on the right we will rebuild the original string back up. TABLE TABLE TABLE ROW STRING (INDEX,CODE) 0 A (-1 , A) 1 B (-1 , B) 2 C (-1 , C) 3 D (-1 , D) 4 A,B ( 0 , B) A plus a B 5 B,A ( 1 , A) B plus an A 6 AB,A ( 4 , A) AB plus an A 7 ABA,B ( 6 , B) ABA plus a B 8 B,x ( 1 , x) B plus nothing That's A B AB ABA B which is ABABABA which is what we started with. Thus 0 1 4 6 1 = A B AB ABA B which is what we started with, minus the spaces which I added for clarity. Notice the TABLE CODES are the last letter of each string added to the table. Notice from the TABLE INDEXES (the OUTPUT CODES) and the TABLE CODES we were able to recreate the INPUT CODES. But we need to be able to do this without the TABLE CODES, only the TABLE INDEXES to get real compression. So what we are going to do is get the TABLE CODES FROM the TABLE INDEXES, and then using both TABLE CODES AND TABLE INDEXES get the original INPUT CODES. Then we are done with the decompression. GETTING THE TABLE CODES FROM THE TABLE INDEXES. So what happens if you don't have the table codes, but only the table indexes? Well you remember that when a string was built that was NOT in the table, it was then put in the table at the next available row. Then all characters before the last character were discarded and the last character was kept as the first character of the next string to be built. More accurately once a string is found that is NOT in the table, say ABAB, this is represented by an (index,code) pair such as (6,B). The part of the string corresponding to the index is then thrown away and the remaining single table code becomes the first character of the next string. Once a string is built starting with that first character it too is added to the table and the last character only is kept to become the first character of the next table row. Therefore it is obvious that the last character of each table row is the first character of the next table row! Thus the table is always of the form, TABLE TABLE IN POSITION STRING FORMAT 0 A 1 B (ROOT AREA) 2 C 3 D 4 A...D I have used random characters here for 5 D...C demonstration purposes only. The last 6 C...B character of the string in row 4 is 7 B...D the first character of the string in 8 D row 5, etc. The ... means an arbitrary number of 0 or more intervening characters. So how does this help us find the codes from the indexes alone? It should be clear from an earlier discussion that from the indexes AND codes the original string represented by any (index,code) pair can be reconstructed by following the indexes back to root and accumulating the codes of each row touched upon along the way. But if you don't have the codes how can you accumulate them?! Well the magic is although you may have no codes to accumulate, you always have the FIRST code of any string in the table because the first code of a string is in the root area whose codes are ALWAYS known. Thus from the INDEXES alone you can always find the FIRST character of any string represented by any index. Here's our complete table again without the codes. Remember root codes are always known. TABLE TABLE IN TABLE IN POSITION STRING FORMAT INDEX,CODE FORMAT 0 A -1 A 1 B -1 B ROOT AREA 2 C -1 C 3 D -1 D 4 AB 0 ? | 5 BA 1 ? ADDED ENTRY 6 ABA 4 ? | 7 ABAB 6 ? 8 B 1 x x means doesn't matter. For example, in table row 7 you have (6,x). Index 6 refers to row 6 which contains (4,?), which refers to row 4 which contains (0,?), which refers to row 0 which contains (-1,A). So the string in row 7 MUST start with an A! But this first character of ROW 7 is the LAST character of the immediate preceding string in the table, in this case the string in row 6. By definition the last character of a string is that string's table CODE. So the first character of the string in row 7 is the last character of the string in row 6 and is thus also 6's table CODE. Thus by finding the first character of any index string, you can find the table code of the immediately previous string. Going back to our previous example, let's assume we have only the indexes and no codes, except the root codes which are known. The missing codes are represented by ?'s in the below table. TABLE TABLE IN TABLE IN POSITION STRING FORMAT INDEX,CODE FORMAT 0 A -1 A 1 B -1 B ROOT AREA 2 C -1 C 3 D -1 D 4 AB 0 ? | 5 BA 1 ? ADDED ENTRY 6 ABA 4 ? | 7 ABAB 6 ? 8 B 1 x x means doesn't matter. In row 8 the index 1 points directly to row 1 in the root area which has a known code of B. Thus B is the first character of the string in row 8 and therefore the last character of the string in row 7. Thus we know the code for row 7, its a B! In row 7, the index 6 points to row 6 whose index of 4 points to row 4 whose index of 0 points to row 0 which being in the root area has a known code of A. Thus A is the first character of the string in row 7 and so also the last character of the string in row 6. Thus we have the code for 6, its an A. Visually any index can be followed back to root (6,?) ROW 7 just like we did before, the only difference being | that there are no known codes to accumulate until (4,?) ROW 6 we get to root where there will be just one known | code. That code is the first character of the (0,?) ROW 4 index we just traced back and the last character | of the index just before it in the table. (-1,A) ROW 0 Thus by starting with the last added index in the table, and tracing it back to its first character in the root area, we can find the table CODE of the immediately prior index. Repeating the process for that index, we can then fill in the entire table. Once the table is filled in completely, we can then start a second pass and accumulate the complete strings for each index using the codes we just found, and then concatenate those strings together to form the original input code stream. Totally amazing, isn't it? Homer