LZW COMPRESSION AND DECOMPRESSION
 
                 Copyright (C) 1993 Homer Wilson Smith
       Redistribution rights granted for non commercial purposes.

     ((This is long and thorough but not as clear as I would have liked,
mostly due to confusion of words.

     A CODE is a number from 0 to 255 representing an ascii or ebcdic
letter or character.

     ASCII stands for American Standard Code for Information Interchange.

     EBCDIC stands for Extended Binary Coded Decimal Interchange Code.

     The INPUT STREAM is a series of characters or INPUT CODES that will
be compressed.

     An INPUT CODE is the input character that one wants to compress.
INPUT CODES are 8 bit integers stored in a 16 bit integer.  Thus the
input stream is actually a stream of 16 bit integers and not 8 bit
integers, even though each 16 bit integer holds a single 8 bit integer.

     An OUTPUT STREAM is a stream of OUTPUT CODES which is series of
characters representing the compression of the INPUT STREAM.

     The OUTPUT CODE is the compressed version of the INPUT CODE.
Output CODES are 12 bits each packed into a 16 bit integer output
stream.  Thus four 12 bit codes can be packed into three 16 bit
integers.
 
     A TABLE is a 4096 row x 2 column matrix of 16 bit integers.

     A TABLE POSITION is any row in that matrix, such as row 235.

     Each TABLE ROW consists of two 16 bit integers of the form
(INDEX,CODE) stored as column 1 and column 2.

     A TABLE INDEX is a pointer to a prior table row.
 
     The TABLE CODE is taken from the uncompressed input stream.

     So table ROW 234 might contain (100,128).  The table index = 100
refers back to TABLE ROW 100 which will contain ANOTHER EARLIER
(INDEX,CODE) pair.

     Thus each table row contains a LINK called a table INDEX which
points back to a prior row.

     And each table row contains a CODE taken from the input stream.

     The series of TABLE INDEXES starting at the beginning of the table,
forms the output code compressed stream, which is explained below.))

 
     LZW COMPRESSION ALGORITHM

     The LZW compression algorithm is probably one of the most mind
boggling algorithms to come along, but it is also one of the most
brilliant, so it is worth understanding well.
 
     It's really quite mind opening.
 
     Have you ever programed something you didn't fully understand, but
because it worked fine you let it go, and then months later found
yourself still wondering why your code worked?
 
     Well that's what happened to me with LZW compression.
 
     My programs worked, for a very long time they worked, until one day
I went to send a large tiff to a color separator, and bang nothing would
read it.
 
     Corel Draw wouldn't read it, CSHOW wouldn't read it, my own damn
tiff program wouldn't read it, and that REALLY pissed me off.
 
     I had written my tiff programs with barely passable understanding
of the algorithm, following along the pseudo code provided by others,
with little personal comprehension myself.  I had read a number of
descriptions trying to make LZW compression and decompression clear to
me, but basically all I ended up with was a partial understanding of the
encoding process and no understanding at all of the decoding process.  I
kind of programmed it blind, and hoped it would work.
 
     It took me many hours to find the bug, it was one of those 'it
crops up every hundred million years' kind of bug, and I didn't know
what I was doing anyhow, so it wasn't easy.
 
     But from all bad things comes something good (I think) and this
paper is the result.  I finally understood both sides of the compression
and decompression dichotomy, and I am reasonably sure my programs will
work because I understand them and not because I followed someone else's
code.
 
     The proof of course is in the pudding, and the pudding is whether
or not I can explain this to others so that they get it the first time.
The algorithm is really quite brilliant, and worth knowing just for the
fun of it.
 
     This paper is not meant to replace everything else you might read
on the subject of LZW encoding or TIFF and GIF files, but it is intended
to make all that other stuff clear.  This is what I wish someone had
written for me when I first approached the subject.
 
     WHY DATA COMPRESSION
 
     The problem with data is it takes up so much damn space, and with
the cost of hard disk space and the slowness of telecommunications, it
behooves us to find a way to make data smaller, to make it take up less
room, so that we can then use up the space we free up to fill it with
more data!
 
     It actually takes less time to compress data than it does to
transmit it, so its a good cost saving to compress the data first, then
transmit it and uncompress it at the other end.
 
     RUN LENGTH ENCODING (RLE)
 
     Various encoding schemes have been devised to take care of this
need, one of the earliest most obvious ones being RUN LENGTH ENCODING.
Run length encoding takes advantage of the fact that often many pixels
right next to each other are the same color pixel, so rather than
sending them all one by one, they are counted up into a RUN and two
bytes are sent instead, the first byte telling how many pixels there
were of one color number, and the next byte the number it was.
 
     Thus a run of pixels that looked like this
 
     5 5 5 3 3 3 9 9 9 9 9 1 1 1 1
 
     would be run length encoded as
 
     3 5 3 3 5 9 4 1
 
     an obvious saving of space.  The longer the runs are the more space
is saved as each color run is represented by only two bytes, a count and
a color.  If your counts can go only up to 255, then if the run is
longer you have to start a new run, but if your counts can go up to
32767, then you can handle longer runs before you have to start a new
run.
 
     The problem comes in when the data does not consist of large
expanses of identical pixels, who wants to look at a single patch of red
anyhow.  A worst case scenario of course is when every pixel is
different such as,
 
     2 3 4 5 6 7 8 9   etc.
 
     Encoding this as RLE would produce,
 
     1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9
 
which is an obvious waste of space.
 
     One solution to this is to insert an INTRODUCER byte before each
sequence telling whether it is a horizontal run, or a stretch of one
byte per pixel.  Thus runs may start with positive counts, and stretches
of one byte per pixel may start with a negative count.
 
     Thus the sequence
 
     3 3 3 3 3 1 2 3 4 5 6
 
would be compressed as
 
     5 3 -6 1 2 3 4 5 6
 
giving some compression, but its obviously not an ideal solution.
 
     There is also tremendous complexity involved in programming the
ideal compression for data that has a lot of 2 byte runs in it.
 
     For example,
 
     3 3 4 5 3 3 4 5 3 3 4 5
 
is better put out as a single stretch of one byte per pixel
 
     -12 3 3 4 5 3 3 4 5 3 3 4 5
 
rather than as a mixture
 
     2 3 -2 4 5 2 3 -2 4 5 2 3 -2 4 5
 
because of the the one byte overhead in switching from horizontal runs
to one byte per pixel runs.
 

     One might make a simple rule that all two byte runs should be put
out as one byte per pixel, except that this is only true if you have
already opened a one byte per pixel stretch, otherwise 2 byte runs
should be put out as a horizontal run (count,byte).

 
     In any case run length encoding (RLE) only produces good
compression when the data is even and unchanging.  Fractals of course
are the antithesis of unchanging data, so RLE does not work well on
fractals or any other natural data like video frames or even written
work.
 
     In other words if you have a patch of red to compress, use RLE.
Otherwise keep reading.
 
                           The LZW Algorithm
 
     There are four concepts that need to be understood.  They are,
 
     1.) INPUT CODE STREAM
     2.) TABLE INDEX
     3.) TABLE CODE
     4.) OUTPUT CODE STREAM
 
     The purpose of LZW compression is to turn the input code stream
into an output code stream of lesser length, and then later to
reconstruct the input code stream from the output code stream.  This
accomplishes the purpose of compression and decompression.
 
     The above four concepts can be visualized as follows:
 
     INPUT CODE             TABLE                 OUTPUT CODE
     STREAM              INDEX  CODE              STREAM
------------------------------------------------------------------------
CODE, CODE, CODE...     (INDEX, CODE)             CODE, CODE....
                        (INDEX, CODE)
                        (INDEX, CODE)
                              .
                              .
 
     CODE.  A code is a byte wide entity consisting of 8 bits and
ranging in decimal value from 0 to 255.  They can be considered as
characters, or as integer numbers, it doesn't matter.  They are NOT
signed integers however, not plus or minus.  INPUT CODES, TABLE CODES
and OUTPUT CODES are all of this format.
 
     INDEX.  An index is a two byte wide SIGNED entity consisting of 16
bits and ranging in decimal value from -32768 to 32767.  It is called an
index because it refers to an earlier entry in the table.  For example,
if table row position number 400 consists of the (index,code) pair
(312,5), then 312 is called a table index and points BACKWARDS to table
row 312 which was generated in the PAST by the algorithm, and 5 is the
table code generated in the PRESENT process of compression, and is
stored in table row 400 as (312,5).
 
     Table row 312 will itself have an (index,code) pair that will point
to an even earlier table row, which will point to an earlier table row
ad infinitum until you come back to the first table entries before which
there are none.  These first entries in the table are called the ROOT
ENTRIES or the ROOT AREA.
 
     If a table row has -1 for an INDEX, that means that that table row
does NOT refer to any earlier table row.  -1's appear as table indexes
either:
 
     1.) when it resides in the ROOT AREA as described below.
 
     2.) when a table row is still unused.
 
     Table rows that haven't been used yet don't point to earlier
table rows because they haven't been used yet!
 
     In either case the table index will be -1.
 
     0 by the way is a valid table index which points to the very first
table row in the root area, table row number 0.
 
     INPUT CODE STREAM.  The input code stream is the sequence of bytes
that you wish to compress into a hopefully shorter output code stream.
 
     OUTPUT CODE STREAM.  The output code stream is the sequence of
bytes that is the final product of compression, and is hopefully shorter
than the input code stream and RECONSTRUCTABLE BACK into the input code
stream.  It's no good compressing something if you can't uncompress it!
 
     TABLE.  The table is a 4096 row x 2 column table of integer numbers
of the following form, it has 4096 rows numbered 0 to 4095.
 
               TABLE
                ROW    (INDEX,CODE)
------------------------------------------------------------------------
         -----    0    (-1   , 000)
           |      1    (-1   , 001)
           |      2    (-1   , 002)
           |      .       .
    ROOT AREA     .       .
           |    255    (-1   , 254)
           |    255    (-1   , 255)
           |    256    (-1   , 256)  Clear Code (CC)
         -----  257    (-1   , 257)  End of Information Code (EOIC)
         -----  258    (INDEX,CODE)
           |    259    (INDEX,CODE)
        ADDED             .
    ENTRY AREA            .
           |    278    (INDEX,CODE)
           |    279    (INDEX,CODE)  <---TABLE POINTER to latest row
           |    270    (-1   ,   0)  As yet unused rows
           |    271    (-1   ,   0)          "
           |     .           .
        -----  4095    (-1   ,   0)  End of table.
 
     The table consists of only the two columns of numbers, the table
row numbers are NOT in the table and are only shown for clarity.
 
     The table consists of 2 areas, the ROOT area and the ADDED ENTRY
AREA.
 
     ROOT AREA.  The root area consists of the first 258 rows in the
table numbered from 0 to 257.  The rows from 0 to 255 are initialized at
program startup to the numbers shown and do not change during the
duration of the program.  Rows numbered 256 and 257 are also set during
program start up and are used to control the process of compression and
decompression.  Their CODE values are otherwise arbitrarily chosen.
 
     Because a few of their values are greater than 255, table codes
must use 16 bit wide integers, even though input codes themselves have
only 8 significant bits from 0 to 255.
 
     The table indexes in the root area are set to -1's because they do
not point to any earlier table row.  The table codes in the root area
are set to the basic alphabet of possible characters in the input code
stream, in this case the integers from 0 to 255.  The Codes for
CLEARCODE and END OF INFORMATION CODE are arbitrarily set to 256 and
257.
 
     This leaves table row 258 as the first empty row to ADD new rows
into the table when the compression process actually starts.
 
     ADDED ENTRY AREA.  At program startup rows 0 through 257 are
initialized as described above and a pointer is set to the last entered
row in the table, in this case 257.  The remaining table rows are
initialized to (-1,0) as shown and usually there are a total of 4096
rows in the table numbered 0 to 4095, including the root area.
 
     During the process of compression the program scans the input code
stream and fills in new rows to the table according to the LZW
compression algorithm which we will discuss at length below.  Each new
table row consists of an (INDEX,CODE) pair which is determined by the
input code stream and the LZW algorithm.  As each new (INDEX,CODE) pair
is found the table pointer is incremented by one and the new
(INDEX,CODE) pair is inserted into the table at the row pointed to by
the table pointer.
 
     In the following discussion we wish to prove the following points.
 
     1.) The table is constructed from and totally determined by the
input code stream.  The table is usually shorter than the input code
stream and so compression has taken place in the construction of the
table.  This is because each single table row can represent more than
one character from the input code stream.
 
     2.) From the table alone the input code stream can be
reconstructed.  Thus if you have the complete table you can reconstruct
the input code stream.  The right hand table codes are NOT the
compressed input code stream, but are a subset of characters found in
the input code stream.
 
     3.) The right hand column of table codes can be reconstructed from
the left hand column of table indexes alone.  Thus if you know the left
hand table indexes you can recompute from that information alone the
right hand table codes and thus the input code stream.
 
     4.) Thus the table indexes can act as the final output compressed
data stream, or in other words, the output codes ARE the table indexes
from the added row area.
 
     Do not get table CODES confused with the output codes.  The table
INDEXES become the output codes.  The table CODES are taken from the
input code stream, but are a specifically selected subset of the input
code stream, and in the end can be thrown away.
 
     Because the table indexes can go up to 4095, they need 12 bits to
represent them.  Although they are computed as 16 bit wide integers,
when the indexes are finally output to the output data stream, they are
packed at 12 bits each.
 
     Thus an input data stream of 8 bit integers gets compressed and
output as an output data stream of 12 bit integers.
 
     Since each table index can represent a unique string of varying
length, overall compression takes place.
 
     For example if a single 12 bit wide table index represents the 3
character string of 'ABC' of 8 bits each or 24 bits, that is a
compression factor of 2:1.  Three characters of 8 bits each or 24 bits
are compressed into one character of 12 bits.
 
     5.) The process of uncompressing the output code stream then works
as follows.
 
     First you must rebuild the table indexes from the output code
stream.  The ROOT AREA of the table is trivial as it is always the same.
 
     The ADDED ENTRY AREA is almost as trivial.  The table INDEX column
of the added row area IS the sequence of output codes, as the output
codes ARE the table indexes.
 
     The table CODE column of the added row area is then recomputed from
the table INDEX column by means that will become obvious when the full
algorithm is discussed, and that completes the rebuilding of the table
from the output codes.  From the completed table one can then
reconstruct the original input code stream.
 
     SO HOW DOES LZW WORK?
 
     The idea behind LZW is that not only do some pixels repeat
themselves, but many SEQUENCES of pixels repeat themselves.
 
     For example take the input code stream (in bytes),
 
     ABCDE ABCDE ABCDE ABCDE ABCDE ignore spaces
 
     Such a sequence would not lend itself well to run length encoding
because no single pixel forms a run.  However the sequence of 5 pixels
ABCDE certainly form a run as they appear one right after the other 5
times.
 
     One might even suggest a super run length encoding that could seek
out repeating runs of SEQUENCES and perhaps encode a string like
 
     ABCDE ABCDE ABCDE ABCDE ABCDE FGH FGH FGH FGH (without spaces)
 
stream as something like,
 
     5(ABCDE) 4(FGH)
 
     However LZW goes one step further and recognizes that the string
ABCDE can appear anywhere in the input code stream, not just in
consecutive runs, so why not just assign the string ABCDE a unique CODE,
say 234, and then every time the string ABCDE appears replace it with
the single byte 234.
 
     So if 234 is assigned to ABCDE and 235 is assigned to FGH then the
above 20 byte string could be compressed into the 9 bytes
 
     234 234 234 234 234 235 235 235 235
 
     Of course the number 236 could be assigned to the string ABCDEABCDE
and 237 assigned to FGHFGHFGH, so now the same 20 byte input code stream
could be output as
 
     236 236 234 237 235

     To make this easier to see:

     236 ABCDEABCDE
     236 ABCDEABCDE
     234 ABCDE
     237 FGHFGHFHG
     235 FGH
 
     Obviously you stand to create a lot of compression this way, if you
can find all the repeating strings in a text.
 
     Just as obviously the whole scheme is utterly preposterous, as you
could just assign one number, say 234, to the whole damn text and then
output only that one number!
 
     The problem then is clearly keeping track of what string each
number stands for.  That is where a string table comes in.
 
     STRING TABLES
 
     A string table is merely a table of strings, as they are found, of
varying length and the code numbers that have been assigned to them.  In
this case the string is inserted into the table at the table row number
corresponding to the code number assigned to that string.
 
     The code numbers do not also have to stored because the row number
is the code number.
 
     The row number is the same as the code number for that string.

     Do not confuse strings tables from the TABLE talked about above.

     For example let's say we have collected the following table of
strings.
 
      TABLE      TABLE
      ROW        STRING ENTRY
------------------------------------------------------------------------
      232        SDF
      233        JHIKLNN
      234        ABCDE
      235        FGH
      236        ABCDEABCDE
      237        FGHFGHFGH
 
      With such a table in hand it would be easy enough to convert the
output code stream,
 
     236 236 234 237 235
 
back into the original input stream of
 
     ABCDEABCDE ABCDEABCDE ABCDE FGHFGHFGH FGH
 
     (Without the spaces of course.)
 
     Which is what we want.
 
     The problem of course with this method is that you have to have the
string table in order to know what the code numbers stand for, and the
space you save by compressing your input code stream into the smaller
sequence of bytes is mostly lost in having to carry the whole string
table along with you.
 
     This becomes obvious in the extreme example of assigning the whole
text to one number and putting the whole text in the table as one row.
You are actually up a byte from the original data, namely the damn code
number.  Not a good compression ratio.
 
     The key question is, can the string table be reconstructed FROM the
output code stream?  If not then you have to carry the string table with
you to decode the output code stream, and that wastes space.
 
     If you CAN reconstruct the string table from the output code stream
then you can throw the table away as you produce the output, and recover
them later when you want to go in the other direction.
 
     LZW takes this one step further in that the output code stream IS
the left hand column of original table indexes, and the right hand
column of table codes can be easily computed from the table indexes
which we shall see in a moment.  Once you have both you can recompute
the input code stream.
 
     So how does it work?
 
     THE LZW ALGORITHM
 
     The algorithm depends on the idea that any given input string
either is or is not in the string table.  If it is in the string table
it is not put in again.  If the string is not in the string table it is
immediately put in.
 
     But let's look at how the algorithm gets its strings.
 
     Assume we want to compress the input code stream
 
     ABABABAB
 
     Remember the characters A and B are bytes to the computer and so
are really integer numbers somewhere between 0 and 255.  In this case A
is 61 and B is 62.
 
     The algorithm starts with the first character in the input code
stream and continues to grab characters from the input stream building a
a multi character string until it builds a string that is NOT in the
string table.
 
     The string table is first initialized with all possible single
character strings, this forms the root area of the table.  Let's
simplify this and just assume for the moment that there are only 4
different single character strings rather than 255, and let's call them
A, B, C, D.  Let's also forget the clear code and the end of information
code, so that our root area has just 4 rows in it, numbered 0 through 3,
and the first added row will go in row 4.
 
     Then the table will look like this after initialization.
 
     INDEX    CODE REPRESENTED STRING
     -1       0    A
     -1       1    B
     -1       2    C
     -1       3    D
 
     Since ALL possible single character strings ARE in the table in the
root area the very first character that is taken from the input code
stream will definitely be in the table.  In this case that is the letter
A.
 
     So nothing is added to the table and the next character from the
input code stream is grabbed.  This creates the string AB.  AB is NOT in
the table so we add it to the table at row 4.
 
  INDEX CODE STRING TABLE
  -1    0    A
  -1    1    B
  -1    2    C
  -1    3    D
   0    4    AB
 
     We now do something very strange, something that will probably earn
LZ&W their Nobel Prize in mathematics: we take this last table row
string at row 4 (AB) and discard everything but the last letter of
the string and start anew.
 
     So now we have the string B which is definitely in the table in the
root area, so we go on and grab the next character in the input code
stream which is another A.  The string BA is NOT in the table so we add
it to row 5.
 
  INDEX CODE STRING TABLE
  -1    0    A
  -1    1    B
  -1    2    C
  -1    3    D
   0    4    AB
   1    5    BA
 
     We discard all but the last letter of BA leaving us with A again
which is in the table in the root area.  We grab the next character in
the input code stream forming AB.  Now this time AB is already in the
table from the last time we came across it, so we continue on to the
next character.  This gives us ABA.  ABA is NOT in the table so we enter
it in row 6.
 
  INDEX CODE STRING TABLE
  -1    0    A
  -1    1    B
  -1    2    C
  -1    3    D
   0    4    AB
   1    5    BA
        6    ABA
 
    Keeping the last letter A, we grab the next character in the input
code stream forming AB.  This is in the table, so we get the next
character forming ABA which is ALSO now in the table.  Adding the next
character we get ABAB which is NOT in the table so we add it at row
7.
 
  INDEX CODE STRING TABLE
  -1    0    A
  -1    1    B
  -1    2    C
  -1    3    D
   0    4    AB
   1    5    BA
        6    ABA
        7    ABAB
 
     At this point you can clearly see how to build the rest of the
table from the input code stream.  You do this until you run out of
input code stream, or table room at which point you clear the table and
start over again with a new table.
 
     Let's assume we ran out of input codes, so as usual we discard all
but the last character of the last string and that leaves B.  Since
there are no further codes to add to B, it becomes the last string.
 
  INDEX CODE STRING TABLE
  -1    0    A
  -1    1    B
  -1    2    C
  -1    3    D
   0    4    AB
   1    5    BA
   4    6    ABA
   6    7    ABAB
   1    8    B
 
     And there we have our completed table.
 
     Now here is where the brilliance comes in.  Each string is added to
the table because it was not already in the table.  But each string is
built up one character at a time so each one of its predecessor strings
HAD TO BE IN THE TABLE for the last string to get as long as it did
before it got put in the table.
 
     In other words its that LAST CHARACTER at the end of each string in
the table that made that string NOT BE in the table.  This means that if
you take any string in the table and remove its last character you must
have a string that was already in the table earlier.
 
     Thus each string can be split up into its last character and an
index into an earlier row in the table.  And voila you have the
(INDEX,CODE) form of the string table.
 
   TABLE     TABLE IN        TABLE IN
   ROW       STRING FORMAT  (INDEX,CODE) FORMAT
     0       A                (-1 , A)
     1       B                (-1 , B)     ROOT AREA
     2       C                (-1 , C)
     3       D                (-1 , D)
     4       A,B              ( 0 , B)         |
     5       B,A              ( 1 , A)     ADDED ROW AREA
     6       AB,A             ( 4 , A)         |
     7       ABA,B            ( 6 , B)         |
     8       B,x              ( 1 , x)     x means doesn't matter.
 
     So the first thing we need to demonstrate is that the input code
stream can be reconstructed from the table, particularly from the table
indexes in the added row area, 0 1 4 6 1.
 
     The first thing we must realize is that each table index refers to
an earlier complete string already in the table.
 
     For example in row 7 there is a (6,B).  The 6 refers to
row 6 where there is a (4,A).  The 4 refers to row 4 where
there is a (0,B) and the 0 refers to row 0 where there is a (-1,A)
The -1 means you are in the root area and the backwards search ends.
 
     So the (6,B) in row 7 means that row 7 contains a string
which is a B preceded by the string in row 6.  But the string in
row 6 is an A preceded by the string in row 4.  The string in
row 4 is a B preceded by the string in row 0.  And the string
in row 0 is an A preceded by nothing.
 
     So let's take the INDEX CODES 0 1 4 6 1 and figure out what they
mean when added together.
     
     This time on the right we will rebuild the original string back up.

   TABLE     TABLE           TABLE
   ROW       STRING         (INDEX,CODE)
     0       A                (-1 , A)
     1       B                (-1 , B)
     2       C                (-1 , C)
     3       D                (-1 , D)
     4       A,B              ( 0 , B)     A    plus a  B
     5       B,A              ( 1 , A)     B    plus an A
     6       AB,A             ( 4 , A)     AB   plus an A
     7       ABA,B            ( 6 , B)     ABA  plus a B
     8       B,x              ( 1 , x)     B    plus nothing

     That's A B AB ABA B which is ABABABA which is what we
started with.
 
     Thus 0 1 4 6 1 = A B AB ABA B which is what we started with, minus
the spaces which I added for clarity.

     Notice the TABLE CODES are the last letter of each string
added to the table.

     Notice from the TABLE INDEXES (the OUTPUT CODES) and the
TABLE CODES we were able to recreate the INPUT CODES.

     But we need to be able to do this without the TABLE CODES,
only the TABLE INDEXES to get real compression.

     So what we are going to do is get the TABLE CODES
FROM the TABLE INDEXES, and then using both TABLE CODES AND
TABLE INDEXES get the original INPUT CODES.

     Then we are done with the decompression.
 
     GETTING THE TABLE CODES FROM THE TABLE INDEXES.
 
     So what happens if you don't have the table codes, but only the
table indexes?  Well you remember that when a string was built that was
NOT in the table, it was then put in the table at the next available
row.  Then all characters before the last character were discarded and
the last character was kept as the first character of the next string to
be built.
 
     More accurately once a string is found that is NOT in the table,
say ABAB, this is represented by an (index,code) pair such as (6,B).
The part of the string corresponding to the index is then thrown away
and the remaining single table code becomes the first character of the
next string.
 
     Once a string is built starting with that first character it too is
added to the table and the last character only is kept to become the
first character of the next table row.
 
     Therefore it is obvious that the last character of each table row
is the first character of the next table row!  Thus the table is
always of the form,
 
   TABLE     TABLE IN
  POSITION   STRING FORMAT
     0       A
     1       B     (ROOT AREA)
     2       C
     3       D
     4       A...D            I have used random characters here for
     5       D...C            demonstration purposes only.  The last
     6       C...B            character of the string in row 4 is
     7       B...D            the first character of the string in
     8       D                row 5, etc.  The ... means an
                              arbitrary number of 0 or more
                              intervening characters.
 
     So how does this help us find the codes from the indexes alone?  It
should be clear from an earlier discussion that from the indexes AND
codes the original string represented by any (index,code) pair can be
reconstructed by following the indexes back to root and accumulating the
codes of each row touched upon along the way.  But if you don't have
the codes how can you accumulate them?!
 
     Well the magic is although you may have no codes to accumulate, you
always have the FIRST code of any string in the table because the first
code of a string is in the root area whose codes are ALWAYS known.  Thus
from the INDEXES alone you can always find the FIRST character of any
string represented by any index.
 
     Here's our complete table again without the codes.  Remember root
codes are always known.
 
   TABLE     TABLE IN        TABLE IN
  POSITION   STRING FORMAT   INDEX,CODE FORMAT
     0       A                 -1   A
     1       B                 -1   B     ROOT AREA
     2       C                 -1   C
     3       D                 -1   D
     4       AB                 0   ?         |
     5       BA                 1   ?     ADDED ENTRY
     6       ABA                4   ?         |
     7       ABAB               6   ?
     8       B                  1   x     x means doesn't matter.
 
     For example, in table row 7 you have (6,x).
 
     Index 6 refers to row 6 which contains (4,?), which refers to
row 4 which contains (0,?), which refers to row 0 which
contains (-1,A).
 
     So the string in row 7 MUST start with an A!
 
     But this first character of ROW 7 is the LAST character of the
immediate preceding string in the table, in this case the string in row
6.
 
     By definition the last character of a string is that string's table
CODE.  So the first character of the string in row 7 is the last
character of the string in row 6 and is thus also 6's table CODE.
 
     Thus by finding the first character of any index string, you can
find the table code of the immediately previous string.
 
     Going back to our previous example, let's assume we have only the
indexes and no codes, except the root codes which are known.  The
missing codes are represented by ?'s in the below table.
 
   TABLE     TABLE IN        TABLE IN
  POSITION   STRING FORMAT   INDEX,CODE FORMAT
     0       A                 -1   A
     1       B                 -1   B     ROOT AREA
     2       C                 -1   C
     3       D                 -1   D
     4       AB                 0   ?         |
     5       BA                 1   ?     ADDED ENTRY
     6       ABA                4   ?         |
     7       ABAB               6   ?
     8       B                  1   x     x means doesn't matter.
 
     In row 8 the index 1 points directly to row 1 in the
root area which has a known code of B.  Thus B is the first character of
the string in row 8 and therefore the last character of the string
in row 7.  Thus we know the code for row 7, its a B!
 
     In row 7, the index 6 points to row 6 whose index of 4
points to row 4 whose index of 0 points to row 0 which being
in the root area has a known code of A.  Thus A is the first character
of the string in row 7 and so also the last character of the string
in row 6.  Thus we have the code for 6, its an A.
 
     Visually any index can be followed back to root   (6,?)   ROW 7
just like we did before, the only difference being      |
that there are no known codes to accumulate until    (4,?)     ROW 6
we get to root where there will be just one known     |
code.  That code is the first character of the     (0,?)       ROW 4
index we just traced back and the last character    |
of the index just before it in the table.       (-1,A)         ROW 0
 
     Thus by starting with the last added index in the table, and
tracing it back to its first character in the root area, we can find the
table CODE of the immediately prior index.  Repeating the process for
that index, we can then fill in the entire table.
 
     Once the table is filled in completely, we can then start a second
pass and accumulate the complete strings for each index using the codes
we just found, and then concatenate those strings together to form the
original input code stream.
 
     Totally amazing, isn't it?
 
     Homer