A method for compressing data
Field of Invention
The present invention relates to a method for compressing data according to the preamble of Claim 1. The invention also relates to a device according to the preamble of Claim 9.
The most powerful data compression systems today, such as the algorithms of the Ziv-Lempel type, PPM [4] and Block-Sorting [5], are adaptive and require long data strings to be efficient. However, there are important data compression problems where the files to be compressed are short. Examples are email messages and letters, where the file length varies from tens to thousands of characters. Since encoding and decoding of such files must be done separately (see Figure 1), they cannot in general be concatenated to form long strings, which would allow compression with adaptive algorithms.
Actually, the early data compression systems before the advent of adaptive algorithms were non-adaptive, where data were modelled by a fixed Markov-type finite state machine. This was constructed by gathering the statistics of the symbol occurrences at each state from training data, which was thought to be representative of the files to be compressed. An example of such a data compression system for black and white images is presented in Langdon, G.G., Jr. and Rissanen, J.J., 'Compression of Black-White Images with Arithmetic Coding', IEEE Trans. Communication, Vol. Com-29, No. 6, pp 858 - 867, June 1981. There is a problem with such systems, however, especially for large alphabets such as in natural languages, and it is the selection of the structure and size of the finite state machine to be used. Up to a point the compression achieved improves with the number of states given to the machine. However, this creates the problem that the number of parameters grows exponentially with the order of the Markov machine the data is intended to fit. For instance, a first order Markov machine for an alphabet of size 256 will have 256 states and 256² parameters, while a second order machine will perform better but requires 256³ parameters, and so on. Clearly, many of the states will never occur in any training data, which creates the problem of how to select the most relevant states from among those that do occur, and, in order to keep the machine of manageable size, how to select the states of any desired number that give the best compression.
Summary of Invention
The purpose of the present invention is to provide an improved method and a device to losslessly compress short data. The invention is based on the idea that an algorithm is used to examine the nodes of the Markov machine to determine the leaves of the nodes that can be eliminated. This algorithm then produces a simpler machine for compressing data. This compression machine is also called a tree machine in this description. More precisely, the method according to the invention is characterized by what is presented in the characterising part of Claim 1. The device according to the invention is characterized by what is presented in the characterising part of Claim 9.
By not taking all the states in a Markov machine of some order we actually end up fitting Tree Machines (TM) to data [3]. In this invention it is shown how to design a data compression system aimed at compressing short files with a fixed Tree Machine of a desired size. This will be a sub Tree Machine of a maximal one which is optimal among all sub Tree Machines of the same size. The maximal Tree Machine can be determined in a straightforward way from a large collection of short files, called the training set, which are representative of the files to be compressed, as will be described later in this description. Once this is done, a sub Tree Machine with a desired number of nodes is constructed which is optimal for the training data in that it permits their compression with the shortest ideal code length. Such a machine, then, will incorporate the statistical features of the files in the collection better than any Tree Machine of the same size. At the second stage, an arithmetic code is designed for encoding each short file symbol for symbol using predesigned (non-adaptive) addend (codeword) tables at the so-called encoding nodes. An encoding node
is the deepest node in the tree where the symbol occurring in the training data has positive count. The symbols that do not occur will be encoded with a suitably selected escape mechanism to be described later in this disclosure.
The invention gives several advantages over prior art. The compression of especially short texts is performed much more efficiently than with prior art methods. This improved compression efficiency reduces the amount of memory space required for storing such compressed data, the transmission capacity needed, and the time needed for transferring the data.
In the following, the invention will be described in more detail with reference to the appended drawings, in which
Fig. 1a is a schematic view showing the steps for compressing a short data file according to a method of an advantageous embodiment of the present invention,
Fig. 1b is a simplified block diagram of an electronic device in which the method of an advantageous embodiment of the present invention can be applied,
Fig. 2 describes the construction of a maximal tree machine from the training data and the construction of a sub tree machine by pruning the maximal tree with an algorithm according to an advantageous embodiment of the present invention,
Fig. 3 describes the compression of a data block using an arithmetic coder or a Huffman code and the probabilities from the pruned TM,
Fig. 4 describes the decompression of the compressed data block,
Fig. 5 shows simulation results of compression with an advantageous method of the invention and with some prior art methods,
Fig. 6 illustrates an example of TM built from an example training data,
Fig. 7 illustrates testing of some nodes of the context tree presented in Figure 6 for pruning,
Fig. 8 illustrates an example of pruning a node,
Fig. 9 illustrates an example of checking a node for pruning,
Fig. 10 shows an example of a context tree,
Fig. 11 shows the recursive implementation of an advantageous algorithm of the method according to the invention and an example of a context tree, and
Fig. 12 shows a block diagram of the implementation of another advantageous algorithm of the method according to the invention.
Detailed Description
Maximal Tree Machine
A Tree Machine for an alphabet A of size d will now be described, the preferred embodiment being d = 256. A TM is a tree with at most d branches, or son-nodes, emanating from each node s, where the nodes have counts n_{i|s} indicating the number of times symbol i of the alphabet 'occurs' at 'context' s. Roughly speaking, a 'context' s plays the role of a state in a Markov type finite state machine. It will be explained how the training data, formed by a large set of files of the general nature written one after another and forming a long string x^n = x_1, ..., x_n of symbols x_t of the alphabet A, determine a tree machine of maximal depth K.
Each string s = i, j, ..., m of length k, k ≤ K, that occurs as a substring s = x_{t−k+1}, ..., x_t of x^n, for some t, defines a path s = m, ..., j, i = x_t, x_{t−1}, ..., x_{t−k+1} from the root to node s, defined by reading the symbols of s in reverse order. Let the string s occur n_s number of times. In these occurrences, the string s will be immediately followed by symbols of A unless it is the very last string in the training data. Let A_s be the set of distinct symbols immediately following the various occurrences of s, and let the symbol i occur n_{i|s} times, so that Σ_i n_{i|s} = n_s; here, it is assumed that s does not consist of the very last symbols of x^n so that some symbol can follow it. The TM has the counts {n_{i|s}} stored at the node s. An important fact should be noticed: if the string x_{t−k}, ..., x_t occurs in x^n, so does the shorter string defined by s = x_{t−k+1}, ..., x_t; or, whatever symbol x_{t+1} occurs at a son-node x_t, ..., x_{t−k+1}, x_{t−k}, i.e. immediately follows x_t, it surely occurs also at the shorter father node s = x_t, ..., x_{t−k+1}. This implies that if s' = s x_{t−k} = x_t, ..., x_{t−k+1}, x_{t−k} denotes its son-node, then n_{i|s'} ≤ n_{i|s} for all i ∈ A_{s'}, while the union of the sets A_{s'} over the son-nodes is a subset of A_s. In other words, the tree is incomplete: some son-nodes are missing in the tree and not all symbols in the alphabet need occur in every node of the tree.
Figure 6 illustrates a TM built from training data "VESIHIISI SIHISI HISSISSA". Such a TM and, in fact, any of its sub TM's rooted at the root of the maximal one, can be used to encode any string of the symbols in the alphabet given. The encoding process will be described later, but first the so-called ideal code length or the empirical entropy each TM assigns to the training string is described.
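For illustration only, the gathering of the counts n_{i|s} can be sketched in Python as follows. The dictionary-of-dictionaries representation, the function name build_maximal_tm and the depth K = 3 are assumptions made for this sketch and are not the implementation of the invention; the training string is the one of Figure 6.

    from collections import defaultdict

    def build_maximal_tm(training, max_depth):
        # counts[s][i] is n_{i|s}: how many times symbol i followed context s, where s
        # consists of the (at most max_depth) symbols immediately preceding i.
        counts = defaultdict(lambda: defaultdict(int))
        for t, symbol in enumerate(training):
            for k in range(0, min(max_depth, t) + 1):
                context = training[t - k:t]      # the k preceding symbols; "" is the root
                counts[context][symbol] += 1
        return counts

    tm = build_maximal_tm("VESIHIISI SIHISI HISSISSA", max_depth=3)
    # For example, tm["I"] now holds the counts of the symbols that followed "I", and the
    # father of a node such as "IHI" is its one-symbol-shorter suffix "HI".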
Ideal Code Length for Training Data
Consider any subtree W of the maximal tree defined by the nodes λ, s_1, ..., s_w, where λ denotes the root node. After having scanned through the entire x^n, each node s has the counts {n_{i|s}}. The ideal code length of symbol i that occurs at node s is defined as log(n_s / n_{i|s}). As said above, the same symbol also occurs at the father node. Let s(t) denote the deepest node in the tree, call it a leaf node, where symbol x_t of the training data occurs. This node is found by climbing the tree, reading the symbols of the past string x^{t−1} from right to left. The sum of the ideal code lengths of all the symbols x_t of the training data is the ideal code length assigned to the training string x^n by the Tree Machine W, or

L_W(x^n) = Σ_t log( n_{s(t)} / n_{x_t|s(t)} ).     (1)
If the first few symbols are ignored, no more than K, a more convenient formula for the ideal code length is obtained,
L_W(x^n) = Σ_{(i,s) ∈ E} n_{i|s} log( n_s / n_{i|s} ),     (2)

where E denotes the set of pairs (i, s) such that s is the deepest node in the tree at which the symbol i occurs. These nodes clearly specify the size and the shape of the maximal tree.
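For illustration, the ideal code length of equation (1) may be evaluated with the following Python sketch, which reuses the dictionary of counts produced by the sketch above; base-2 logarithms are assumed here, as in the pruning example given later in this description.

    import math

    def ideal_code_length(training, counts, max_depth):
        # Equation (1): the sum over the training symbols of log2(n_{s(t)} / n_{x_t|s(t)}),
        # where s(t) is the deepest context in the tree at which symbol x_t occurs.
        # As in equation (2), the first max_depth symbols are ignored.
        total = 0.0
        for t in range(max_depth, len(training)):
            symbol = training[t]
            for k in range(max_depth, -1, -1):      # try the deepest context first
                s = training[t - k:t]
                if s in counts and counts[s].get(symbol, 0) > 0:
                    n_s = sum(counts[s].values())
                    total += math.log2(n_s / counts[s][symbol])
                    break
        return total

    # ideal_code_length("VESIHIISI SIHISI HISSISSA", tm, max_depth=3) gives the code length
    # that the maximal tree of the previous sketch assigns to its own training data.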
Coding
A Tree Machine is used to encode strings much as a finite state machine: The string to be compressed, say y^n = y_1, y_2, ..., y_n, is read advantageously from left to right, and for each symbol y_{t+1} the tree is climbed by reading the preceding symbols backwards, y_t, y_{t−1}, ... = s, until a node s serving as the context of the symbol y_{t+1} is found, after which the symbol is encoded with the conditional probability P(y_{t+1} | s) defined by the counts at this node. If all d symbols in the alphabet occurred at all the nodes, the nodes s could be taken as the deepest nodes in the tree along the path y_t, y_{t−1}, .... However, for large alphabets not all the symbols occur at every node, which creates the problem of how to encode symbols for which the counts n_{i|s} are zero.
There are a number of ways to solve the problem. Consider a context that has appeared 20 times. If "A" is the only character to have followed the context in all these occurrences, the probability of the next character being different should be low. However, if e.g. 10 different characters had followed, it would seem that the text being compressed is quite variable and that some new character is likely to appear next time. If q symbols have appeared in some context s, then the probability q/(n_s + q) is assigned to the event "the next symbol has not previously been seen in this context". This way, the conditional probability of any previously unseen symbol is

P_{i|s} = q / ( (n_s + q)(d − q) ),     (3)

where d − q is the number of all unseen symbols in the context. The conditional probabilities of the previously seen symbols are

P_{i|s} = n_{i|s} / (n_s + q).     (4)
With these probabilities, an arithmetic code or a Huffman code at each deepest coding node can be designed.
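The escape mechanism of equations (3) and (4) can be illustrated with the following Python sketch; the helper name is an assumption of the sketch, and d denotes the alphabet size, which is 8 for the small example alphabet {V, E, S, I, H, space, A, end} used later in this description.

    def symbol_probability(node_counts, symbol, d):
        # node_counts maps each symbol seen at the node s to its count n_{i|s}.
        n_s = sum(node_counts.values())     # total count at the node
        q = len(node_counts)                # number of distinct symbols seen at the node
        if symbol in node_counts:
            return node_counts[symbol] / (n_s + q)       # equation (4), previously seen
        return q / ((n_s + q) * (d - q))                 # equation (3), previously unseen

    # The node "I" of the example tree has the counts {S: 4, H: 2, space: 2, I: 1}, so
    # symbol_probability({"S": 4, "H": 2, " ": 2, "I": 1}, "S", d=8) returns 4/13, and
    # the same call with an unseen symbol such as "A" returns 1/13.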
Algorithms for Optimal Subtree of Desired Size
The proper subtree of the maximal tree with M nodes that has the smallest ideal code length for x^n will now be sought. Such an optimal subtree is the maximum likelihood tree with M nodes, because the ideal code length is the negative logarithm of the maximized likelihood for x^n within the class of the subtrees of the maximal tree with conditional probabilities P_{i|s} for the symbols occurring at each node s. Before describing the algorithm to prune the maximal tree into the optimal subtree of the desired size, a simpler algorithm is described, which is optimal with the restriction that either all the sons of a node are pruned or none is. This clearly reduces the sets E at each father node examined, not necessarily by unity but by the number of son nodes. This algorithm is appropriate for small alphabets.
Algorithm A
1. Initialize: For every node s of the tree, calculate the code length L(s) = Σ_i n_{i|s} (−log₂ P_{i|s}) using the counts in s and the equation (4) for P_{i|s}.
2. Recursively, starting at the leaves, compute at each parent node s: l(s) = Σ_j L(sj). If L(s) ≤ l(s) + ε, prune all the children sj and their descendants; else leave all sj intact. Set L(s) = min{l(s), L(s)}.
3. Continue until the root is reached.
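For illustration only, Algorithm A can be read as the following Python sketch. The Node class, the helper names and the recursion are assumptions of the sketch, symbol_probability is the helper of the previous sketch, and the pruning test is written as "prune the sons when the father alone is within ε of their total code length", in line with the role of ε explained below.

    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        counts: dict = field(default_factory=dict)     # n_{i|s} for the symbols seen at s
        children: dict = field(default_factory=dict)   # son nodes, keyed by the added symbol

    def code_length(symbol_counts, prob_counts, d):
        # Sum of n_i * (-log2 P_i), where the probabilities P_i are taken from the node
        # whose counts are prob_counts, using equations (3) and (4).
        return sum(n * -math.log2(symbol_probability(prob_counts, i, d))
                   for i, n in symbol_counts.items())

    def prune_all_or_nothing(node, eps, d):
        # Algorithm A: either all the sons of a node are pruned or none is.
        # Returns L(node), the best code length for the subtree rooted at node.
        L = code_length(node.counts, node.counts, d)           # step 1
        if not node.children:
            return L
        l = sum(prune_all_or_nothing(child, eps, d)            # step 2: l(s) = sum of L(sj)
                for child in node.children.values())
        if L <= l + eps:                 # the father alone is good enough
            node.children = {}           # prune all the sons and their descendants
            return L
        return min(L, l)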
Here is the second pruning algorithm, which examines one son at a time, rather than all of them at once:
Algorithm B
At the beginning of the execution of Algorithm B, the value ε is defined (block 1 in the block diagram of Fig. 11). After that, the pruning is started from the root node.
1. Initialize: For every leaf node s of the tree, calculate the code length L(s) = Σ_i n_{i|s} (−log₂ P_{i|s}) using the counts in s and the equations (3) and (4) for P_{i|s}. For every internal node s, set L(s) = 0. This initialising phase is mainly described by blocks 2 to 6 of Fig. 11.
2. Recursively, starting at the leaves, compute at each node sj: l(sj) = Σ_i n_{i|sj} (−log₂ P_{i|s}); that is, using the counts of the child node sj and the probabilities of the parent node s. Increase the parent node's code length L(s) by min{l(sj), L(sj)}. If l(sj) ≤ L(sj) + ε, prune the child sj and all its children, else leave sj intact. This phase is mainly described by blocks 7 to 10 of Fig. 11.
3. Continue until the root is reached.
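Continuing the sketch above, and reusing its Node class, code_length and symbol_probability helpers, Algorithm B might be read as follows; this is an illustrative sketch only, not the implementation of Fig. 11.

    def prune_one_son_at_a_time(node, eps, d):
        # Returns L(node). A leaf's L is its own code length (step 1); an internal node
        # accumulates min{l(sj), L(sj)} over its sons sj (step 2).
        if not node.children:
            return code_length(node.counts, node.counts, d)
        L = 0.0
        for name, child in list(node.children.items()):
            L_child = prune_one_son_at_a_time(child, eps, d)
            # l(sj): the counts of the child encoded with the parent's probabilities
            l_child = code_length(child.counts, node.counts, d)
            L += min(l_child, L_child)
            if l_child <= L_child + eps:
                del node.children[name]      # prune the child and all its descendants
        return L

    # With eps = 0 this should behave as the example below: node "IHI" is pruned while
    # nodes "_HI" and "HI" remain in the tree.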
The size of the resulting tree depends on the value ε. When ε = 0, the pruned tree will compress at least as well as the unpruned tree, but is usually smaller. When ε > 0, the pruned tree will be smaller than needed for the maximal compression, but is still the best one among all trees of its size. For ε big enough, the pruned tree size will be 1, consisting of only the root node.
An easy way to determine the correct value of ε to form the subtree of M nodes is by binary search:
Algorithm C
1. Set ε_high = max{l(sj) − L(sj)} over all nodes sj of the tree, and set ε_low = 0 (block 11 in the block diagram of Fig. 12).
2. Prune the tree with ε = (ε_high + ε_low)/2 and count the nodes of the resulting tree (block 12). If the count did not change (block 13) or the count equals the set number M (block 14), exit. If the count is less than the count we want (block 15), set ε_high = ε (block 16). If the count is greater than the count we want, set ε_low = ε (block 17). Restore the original tree and return to step 2 (block 12).
3. Continue until the last count is no different from the previous one.
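An illustrative Python reading of Algorithm C, reusing the helpers of the previous sketches, is given below; count_nodes and max_eps are hypothetical helpers introduced only for this sketch.

    import copy

    def count_nodes(node):
        return 1 + sum(count_nodes(child) for child in node.children.values())

    def max_eps(node, d):
        # Largest l(sj) - L(sj) over all father-son pairs, computed without pruning;
        # pruning with this value collapses the whole tree to the root.
        # Returns the pair (largest difference in the subtree, L(node)).
        if not node.children:
            return 0.0, code_length(node.counts, node.counts, d)
        best, L = 0.0, 0.0
        for child in node.children.values():
            child_best, L_child = max_eps(child, d)
            l_child = code_length(child.counts, node.counts, d)
            best = max(best, child_best, l_child - L_child)
            L += min(l_child, L_child)
        return best, L

    def prune_to_size(tree, M, d):
        eps_high, _ = max_eps(tree, d)                      # step 1
        eps_low = 0.0
        prev_count = None
        while True:                                         # steps 2 and 3
            eps = (eps_high + eps_low) / 2
            candidate = copy.deepcopy(tree)                 # restore the original tree
            prune_one_son_at_a_time(candidate, eps, d)
            count = count_nodes(candidate)
            if count == M or count == prev_count:
                return candidate
            if count < M:
                eps_high = eps       # pruned too much: lower the upper bound
            else:
                eps_low = eps        # pruned too little: raise the lower bound
            prev_count = count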
For an example of the pruning method B, the context tree shown in Figure 6 is now examined. Starting at the leaves, the two children of the node "HI" - that is, nodes "_HI" and "IHI" (see Figure 7) are taken into consideration. The code lengths L of these leaf nodes are set in the initialisation step, and are calculated using the equation (4):
L("_HI") = Σ_i n_{i|"_HI"} (−log₂ P_{i|"_HI"}),
L("IHI") = Σ_i n_{i|"IHI"} (−log₂ P_{i|"IHI"}).
Again, q_s stands for the total number of different symbols seen in the node s, and n_{i|s} is the count of symbol i in the node s. The sum of the counts in the node s is denoted as n_s.
In the pruning step, new code lengths / are calculated using the same formula with the counts of the child and the probabilities of the parent node, "HI":
l("_HI") = Σ_i n_{i|"_HI"} (−log₂ P_{i|"HI"}),
l("IHI") = Σ_i n_{i|"IHI"} (−log₂ P_{i|"HI"}).
L("HI") is the sum of min{L("_HI"), l("_HI")} and min{L("IHI"), l("IHI")}, which equals L("_HI") + l("IHI") = 4.64. The pruning occurs whenever l(s) ≤ L(s) + ε. At this point it is assumed that ε = 0. Thus, node "_HI" remains intact and node "IHI" is pruned away (see Figure 8).
In the next step, the L("HI") which was just calculated is compared to l("HI"):
l("HI") = Σ_i n_{i|"HI"} (−log₂ P_{i|"I"}) = 7.10
l("HI") is much bigger than L("HI") + ε, and the node "HI" is not pruned (see Figure 9). Should L("HI") + ε be greater than or equal to l("HI"), the node "HI" and all its children would be pruned out.
The pruning procedure is continued recursively until the root node is reached, which cannot be pruned (since it has no parent node and cannot be compared to anything).
In the Patricia tree structure, which has been used in this example, there may exist nodes that represent several contexts in one placeholder, like the node in the top right corner of the tree for context "SSA". The node "SSA" and its counts could (and should, for proper pruning) be written as three separate nodes: "A", "SA" and "SSA", all having the same seen symbols and counts. However, since these three nodes have equal counts, l("SSA") = L("SA") = l("SA"). Since this node is a leaf, L("SSA") is not received from the node's children, and is also equal to l("SSA"). Thus, l("SSA") ≤ L("SSA") + ε and l("SA") ≤ L("SA") + ε (ε is always greater than or equal to zero) and these nodes would be pruned. Effectively, every leaf node sharing several contexts can be converted to the shortest context it contains (here, node "SSA", containing nodes "SSA", "SA" and "A", becomes node "A"). Every internal node sharing several contexts can be treated as an ordinary node while pruning, and can be unrolled into separate nodes after pruning is completed, making it easier to handle the tree.
After pruning and unrolling is completed, the context tree is finished. Figure 10 shows an example of such a context tree.
The recursive implementation of Algorithm B is shown in Figure 11 as a block diagram. Figure 12 shows as a block diagram the implementation of Algorithm C.
An Example of Compression and Decompression
By equation (4), the probabilities of the symbols in the root node are:
(λ stands for the root context, an empty string.) By equation (3), the probabilities of the symbols E and V not seen in the root node are:
The probabilities of the eight symbols should add up to 1. This can be checked:
The probabilities of the symbols in the node "I" are, by equation (4), P_{S|"I"} = 4/13, P_{H|"I"} = 2/13, P_{_|"I"} = 2/13 and P_{I|"I"} = 1/13, and the probabilities of the symbols E, V, A and end not seen in the context "I" are, by equation (3), 1/13 each.
They sum up to 1 , too:
2 2 1 4 . 1 13 ,
— + — + — + — + 4 — = — = 1.
13 13 13 13 13 13
Likewise, the probabilities of the symbols in any node can be calculated. The results in some of the nodes are listed in the following table:
Table 1. The probabilities of the symbols in some of the tree nodes.
By using these probabilities, we can design a Huffman code for each of these contexts. The results are shown in the following table:
Table 2. The Huffman codes for the symbols from Table 1.
Next, a message "VIISI" is compressed using these code words. For every symbol of the message to be compressed, the longest match of the preceding symbols found in the model tree is to be searched for. For example, the context of the symbol S is the longest of the contexts "VII", "II", "I" and λ that exists in the model tree. For V, the match found is empty (λ). For I, I, S, I and end, the found matches are λ, "I", "II", "IIS" and "SI", respectively.
The codeword for V in the context λ is 010. The codeword for I in the context λ is 10. The codeword for I in the context "I" is 110 and so on. The compressed string is 010 10 110 0 0 1111.
Decompression is done similarly. The model to be used is the same as in the compression, so the Huffman code table is already known. For the first symbol, it is known that its context is empty. The first codeword (010) is read and the symbol V is the result. For the next symbol, the context is the longest of the contexts "V" and λ that exists in the model tree. The next codeword is now read and the result is the symbol I. The context of the next symbol is the longest of the contexts "VI", "I" and λ that exists in the model tree. Again, the next codeword is read and the result is again symbol I and so on. The compressed string 010 10 110 0 0 1111 translates to the original message "VIISI".
For faster implementation, the model tree is converted into a state machine. Each node (or context) of the tree becomes a state. Each symbol to be compressed causes a state transition to the state for the context of the next symbol (which is known from the symbols read this far). Thus, the tree (which now is a state machine) need not be searched every time a symbol is read, and finding the context becomes an O(1) operation.
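For illustration, the conversion of the model tree into a state machine could be sketched in Python as follows. The set-of-context-strings representation and the function name are assumptions of the sketch; the tree is assumed to be unrolled so that the father of every context, and in particular the empty root context, is also present.

    def build_state_machine(contexts, alphabet):
        # contexts: the context strings of the model tree, including "" (the root).
        # The state reached after reading symbol a in state s is the longest context in
        # the tree that is a suffix of s + a, so each symbol costs one table lookup.
        next_state = {}
        for s in contexts:
            next_state[s] = {}
            for a in alphabet:
                t = s + a
                while t not in contexts:    # drop the oldest symbol until a context matches
                    t = t[1:]               # terminates at "" at the latest
                next_state[s][a] = t
        return next_state

    # For a hypothetical small model containing the contexts used in the "VIISI" example:
    # fsm = build_state_machine({"", "I", "II", "IIS", "IS", "S", "SI"}, "VESIH A")
    # fsm["I"]["I"] is then "II", fsm["II"]["S"] is "IIS", and so on.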
The compression method described above and the following commonly used compression programs have been simulated:
• COMPRESS, based on the Ziv-Lempel LZ78 (LZW) family of algorithms [2].
• PKZIP v2.06, based on the Ziv-Lempel LZ77 family of algorithms [2].
• ACE v2.0 by Marcel Lemke.
• BZIP2, based on Burrows' and Wheeler's block sorting algorithm [5].
• PPMZ v9.1 by Charles Bloom, based on the PPM family of algorithms.
It should be noted that the method according to the invention works best when the training data is in the same language as the data to be compressed. This means that for compressing texts in different languages, a different tree machine should be used for each language.
The tree machine was built by using English literature text as training data. All the files were converted into lower case to decrease the alphabet size. The results of compressing the files with the given five programs and with the advantageous method of the present invention described above (denoted here as CORESSION) are shown in Figure 5. The algorithm described, CORESSION, compresses small files much better than the other five algorithms.
The approximated average compression and decompression speeds of the six algorithms used in the previous test are as follows:
• COMPRESS: 600 kilobytes per second.
• PKZIP: 800 kilobytes per second.
• ACE: 100 kilobytes per second.
• BZIP2: 200 kilobytes per second.
• PPMZ: 20 kilobytes per second.
• CORESSION: 400 kilobytes per second with arithmetic coding, or 900 kilobytes per second with Huffman codes.
In an advantageous embodiment of the invention, the data structures used are fixed. That is, the data structures do not need to be modified during compression or decompression, as opposed to any adaptive compression program. Thus, the method of the present invention can be made at least as fast as any adaptive algorithm having at least as complex context and count structures and coding scheme as the method of the present invention.
The approximated memory usages of the six algorithms are as follows:
• COMPRESS: a few hundred kilobytes.
• PKZIP: a few hundred kilobytes.
• ACE: up to 36 megabytes, typically several megabytes.
• BZIP2: two to six times the size of the file to be compressed.
• PPMZ: 60 to 90 times the size of the file to be compressed.
• CORESSION: from a few hundred kilobytes to several megabytes, depending on the tree size.
The memory usage of a typical adaptive algorithm increases during compression and decompression, and it depends on the size of the file to be compressed as well as on the desired compression ratio. With the method of the present invention, the memory usage is fixed and independent of the file size, making the algorithm very well suited for environments with limited memory. Furthermore, almost all the needed memory can be read-only. The device applying the method of the present invention needs just a few hundred bytes of random access memory for both compression and decompression of files of any size. All the memory used by adaptive algorithms must be random access.
In Fig. 1b there is shown a simplified block diagram of an electronic device 18 in which the method of an advantageous embodiment of the present invention can be applied. The electronic device comprises at least a first input/output block 19 for inputting the training data from e.g. a database 23. The controlling unit 20 is provided for controlling the electronic device 18 and for performing the steps of the method of the present invention. The controlling unit 20 may comprise one or more processors, such as a microprocessor and/or a digital signal processor. The memory means 21 are provided for storing necessary program codes for the operation of the controlling unit, for storing temporary data, for storing the data structures, for storing the data to be compressed, etc. The data to be compressed can be read e.g. from the memory means 21, from the database 23, from the transmission channel 24, and/or from a keyboard (not shown) of the electronic device 18. The compressed data can be saved into the memory means 21, into the database 23, and/or it can be transmitted to e.g. a data transmission channel 24 for transmission to a receiving device (not shown) where the compressed data can be decompressed. The receiving device can also be similar to the electronic device presented in Fig. 1b. The decompression can be performed by using data structures (Tree Machine, Finite State Machine) similar to those used in the compression.
The pruned tree machine can be transformed to a finite state machine, which is known as such. The compression can normally be performed faster by the finite state machine than by the pruned tree machine.
The present invention can be applied in many applications. For example, in a mobile telecommunication environment short messages can be compressed to reduce the required transmission capacity before they are sent over a mobile network. The present invention can also be applied in computers for compressing text files, which can then be saved onto a storage medium.
It is obvious to a man skilled in the art that the invention is not limited solely to the examples presented above, but the embodiments of the invention can vary within the scope of the claims below.
References
[1] Langdon, G.G., Jr. and Rissanen, J.J., 'Compression of Black- White Images with Arithmetic Coding', IEEE Trans. Communication, Vol. Com-29, No. 6, pp 858 - 867, June 1981
[2] Lempel, A. and Ziv, J., 'Compression of Individual Sequences via Variable Rate Coding', IEEE Trans. Information Theory, Vol. IT-24, No. 5, pp 530 - 536, September 1978
[3] Weinberger, M.J., Rissanen, J., and Feder, M., 'A Universal Finite Memory Source', IEEE Trans. on Information Theory, Vol. IT-41, No. 3, pp 643 - 652, May 1995
[4] Bell, T.C., Cleary, J.G. and Witten, I.H., 'Text Compression', Prentice Hall, NJ, pp 140 - 153, 1990.
[5] Burrows, M. & Wheeler, D., 'A Block-Sorting Lossless Data Compression Algorithm', Digital Equipment Corporation, SRC Research Report 124, 1994.