CN1031680C

CN1031680C - Chinese file compression processing method and device

Info

Publication number: CN1031680C
Application number: CN 92109706
Authority: CN
Inventors: 张俊盛; 刘显仲; 柯淑津
Original assignee: Shengbao Co ltd
Current assignee: Shengbao Co ltd
Priority date: 1992-08-22
Filing date: 1992-08-22
Publication date: 1996-04-24
Anticipated expiration: 2007-08-22
Also published as: CN1083638A

Abstract

By utilizing the characteristics of Chinese, Chinese words are used as coding units, the Chinese file is cut into units with indefinite length which are most suitable for compression according to the occurrence probability of the Chinese words in the Chinese file, and the processed file can be coded into various frequency codes including Huffman codes and the like. The Chinese file processed by the device of the invention can obtain about 35-55% compression ratio after being converted into Huffman code. The invention also provides a method for determining the Chinese processing unit.

Description

Chinese file compression processing method and device

The invention relates to a method and device for compression processing of Chinese files, and in particular to a method and device that use Chinese words as the processing unit and, in combination with file compression techniques, greatly reduce the storage capacity occupied by Chinese files.

As Chinese computerization has become widespread, large numbers of Chinese computer files are being produced. For example, a daily newspaper that publishes 28 pages a day produces, once computerized, text files of about 2,500,000 Chinese characters every day; the Chinese text files produced daily by publishing houses, government bodies, companies and the like are even harder to count. Under this data explosion, compressing Chinese files is a practical necessity, whether to reduce wasted storage space or to raise data transmission speed.

Among the techniques now commonly used to compress Chinese files is the ARC program, which takes the byte as its unit: it forms Huffman codes for the bytes that represent Chinese characters according to how frequently they occur in the Chinese file, and compresses the file by the known Huffman method. Because this approach compresses Chinese files in the same way as English data and makes no use of the characteristics of Chinese, its compression is not high: it reaches a compression ratio of only about 75%.

Common Chinese character codes, whether the BIG-5 code or the TCA code, encode each Chinese character in two bytes. For practical reasons and for compatibility with the English ASCII code, these encodings do not fully use the available code space. Another possible file compression method is therefore to recode the Chinese characters with fewer bits per code, which easily yields some compression.

For example, a set of more than 10,000 Chinese characters can be recoded with only 14 bits per character (note: 2 to the 14th power is 16384), giving a compression ratio of 14/16 = 87.5%.

Both of the above approaches take the character as the coding unit. The logical unit of Chinese, however, is actually the word rather than the character, so coding by word should in theory compress better. For example, there are about 60,000 commonly used Chinese words. If each word is given a 16-bit code (note: 2 to the 16th power is 65536), the code space is sufficient for 60,000 words. Given that the average Chinese word is about 1.6 characters long, the average code length per character is about 16/1.6 = 10 bits, and the compression ratio is therefore about 10/16 = 62.5%, far more efficient than coding by character.
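The arithmetic of the two fixed-length schemes above can be checked with a short sketch. The function name `fixed_length_ratio` and its parameters are illustrative assumptions, not part of the original disclosure, and 10,001 stands in for "more than 10,000" characters.

```python
def fixed_length_ratio(symbol_count, chars_per_symbol=1.0, original_bits_per_char=16):
    """Return (bits per symbol, compression ratio) for a fixed-length recoding.

    symbol_count: number of distinct coding units (characters or words).
    chars_per_symbol: average number of Chinese characters per coding unit.
    """
    # Smallest number of bits that can distinguish symbol_count units.
    bits_per_symbol = (symbol_count - 1).bit_length()
    bits_per_char = bits_per_symbol / chars_per_symbol
    return bits_per_symbol, bits_per_char / original_bits_per_char

# Character-based recoding: more than 10,000 characters need 14 bits each.
char_bits, char_ratio = fixed_length_ratio(10_001)

# Word-based recoding: about 60,000 words, 1.6 characters per word on average.
word_bits, word_ratio = fixed_length_ratio(60_000, chars_per_symbol=1.6)
```

Under these assumptions the character scheme gives 14 bits and a ratio of 0.875, and the word scheme gives 16 bits and a ratio of about 0.625, matching the figures in the text.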

In studying the compression unit, the above methods consider only the completeness (or incompleteness) of the code: they exploit the incompleteness of the BIG-5 and TCA codes and recode characters or words so as to use the code space fully, and all of the codes they handle are of equal length. Traditional data compression methods instead emphasize the occurrence probability of the coding units, assigning shorter codes to units that occur with higher probability in order to achieve compression; the compression efficiency of such methods usually depends on the choice of compression unit. However, none of the known probability-based coding methods makes use of the characteristics of Chinese to provide a compression large enough to meet users' needs.

An object of the present invention is to provide a Chinese file compression processing method and device that use the characteristics of Chinese to compress Chinese files.

Another object of the present invention is to provide a Chinese file compression method and device that reduce the storage occupied by Chinese files more efficiently.

A further object of the present invention is to provide a method for determining the Chinese compression processing unit, so as to give a more effective Chinese compression.

The inventors have found that, in past techniques that achieve compression through coding, the key lies in the entropy of the file, a number calculated from the occurrence probabilities of all compression units. Although the meaning and calculation of entropy are outside the scope of the present discussion, the idea of using entropy to guide file compression can be summarized as follows: in the file to be compressed, the more uniform the occurrence probabilities of the individual compression units (low predictability, high entropy), the worse the compression; conversely, the more uneven the occurrence probabilities of the individual compression units (high predictability, low entropy), the better the compression.
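The relationship between the evenness of the probabilities and the entropy can be illustrated with the standard Shannon entropy formula. The two four-unit distributions below are made-up examples, not data from the patent.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits per unit, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical distributions over four compression units.
uniform = [0.25, 0.25, 0.25, 0.25]   # all units equally likely
skewed = [0.7, 0.1, 0.1, 0.1]        # one unit dominates

uniform_entropy = entropy_bits(uniform)
skewed_entropy = entropy_bits(skewed)
```

The uniform case needs 2 bits per unit on average, while the skewed case needs fewer, which is why a unit set with uneven probabilities compresses better under frequency coding.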

Every file has a maximum possible compression ratio, and the preliminary step of file compression is to find a set of coding units that lowers the entropy of the file as far as possible. Therefore, although the present invention need not be bound by this theory, the inventors believe that the set of compression units that brings the entropy of a Chinese file down to the theoretical minimum should be the one closest to the natural definition of Chinese words. In other words, Huffman codes built on compression units that resemble Chinese words can compress a Chinese file close to the theoretical limit.

Summing up the above observations and analysis, the inventors believe that if a set of compression units closest to Chinese words is found and Chinese text is coded with this set as the compression units, a first stage of compression is obtained from the completeness and logicality of the coding, and a further stage is obtained by building Huffman codes on the predictability of each compression unit's usage probability.

The method and device of the present invention are described below with reference to the drawings.

Fig. 1 is a system diagram of an embodiment of the Chinese file compression method and device of the present invention.

Fig. 2 is a flow chart of the operation of the compression processing unit cutting device in an embodiment of the Chinese file compression method and device of the present invention.

In linguistic theory, how to find an objectively optimal set of compression units (or "coding dictionary") and compress a file to the theoretical minimum is a problem well worth studying. In practice, however, it is enough to find a coding dictionary that is satisfactory and easy to process (obtain) in order to achieve effective compression. The following explains how the method and device of the present invention use such a coding dictionary to cut a Chinese file effectively into a low-entropy code string for compression processing.

(1) Compression of Chinese files

Fig. 1 is a system diagram of an embodiment of the present invention. The compression system proposed by the invention first comprises a compression processing unit cutting device (100) and a coding dictionary (101). The coding dictionary (101) records the usage probability of each compression processing unit. According to the usage probabilities of the compression units recorded in the coding dictionary (101), the compression processing unit cutting device (100) cuts a file into the compression unit code string with the lowest entropy.

A code comparison table generating device (102) derives a Huffman tree by a suitable method from the usage probabilities of the compression units recorded in the coding dictionary (101) and produces a code comparison table (103), which records the correspondence between each coding unit and its Huffman code. In practice, this code comparison table (103) is usually prepared in advance and stored in memory in a suitable form. A memory (104) stores the processed files.
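A minimal sketch of how a code comparison table could be derived from usage probabilities with the textbook Huffman construction follows. The patent only says "a suitable method", so the function `build_code_table` is an assumption. The four units are taken from the partial dictionary of Table 1, written here as reconstructed Chinese strings (also an assumption), and their values are used as relative weights.

```python
import heapq
from itertools import count

def build_code_table(usage):
    """Return {unit: bit string} for a Huffman code over the given weights."""
    tiebreak = count()  # unique counter so dicts are never compared
    heap = [(p, next(tiebreak), {unit: ""}) for unit, p in usage.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single unit still needs one bit
        return {unit: "0" for unit in heap[0][2]}
    while len(heap) > 1:
        p0, _, t0 = heapq.heappop(heap)  # two least probable subtrees
        p1, _, t1 = heapq.heappop(heap)
        merged = {u: "0" + c for u, c in t0.items()}
        merged.update({u: "1" + c for u, c in t1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# Usage probabilities of four units from Table 1 (unit strings are assumed).
USAGE = {"的": 0.054438, "所": 0.002450, "研究": 0.000496, "实": 0.000179}
code_table = build_code_table(USAGE)
```

With these weights the most probable unit receives a 1-bit code and the two least probable units receive 3-bit codes, which is the "higher probability, shorter code" property the table relies on.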

When the Chinese file compression processing device of the present invention is used to compress a Chinese file, the system first reads in the original Chinese file and places it in the memory (104) to await processing. During compression, the compression processing unit cutting device (100), according to the usage probabilities of the compression units recorded in the coding dictionary (101) and following the method of the present invention, cuts the file into the compression unit code string with the lowest entropy, which is stored back in the memory (104). The compression device (105) then refers to the code comparison table (103) and converts the compression units making up the code string in the cut file, one by one, into Huffman codes to obtain the compressed file, which is stored back in the memory (104), completing the compression processing.

During decompression, the decompression processing device (106) takes the compressed file from the memory (104), refers to the code comparison table (103), and converts the Huffman codes of the compressed file back into the original compression units one by one to recover the original file.
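The table lookup performed by the compression device (105) and its reversal by the decompression device (106) can be sketched as below. The code table is a hypothetical prefix-free table written by hand for illustration, not one taken from the patent, and the unit strings are assumed reconstructions of the example words.

```python
# Hypothetical prefix-free code comparison table (unit -> bit string).
CODE_TABLE = {"研究所": "110", "的": "0", "确实": "111", "行动": "10"}

def compress(units, table):
    """Convert a cut sequence of compression units into one bit string."""
    return "".join(table[unit] for unit in units)

def decompress(bits, table):
    """Convert a bit string back into the original compression units."""
    reverse = {code: unit for unit, code in table.items()}
    units, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # prefix-free, so the first match is the unit
            units.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return units

cut_units = ["研究所", "的", "确实", "行动"]
bits = compress(cut_units, CODE_TABLE)
restored = decompress(bits, CODE_TABLE)
```

Because no code is a prefix of another, the decoder can read the bit string from left to right without separators and recover exactly the units that were encoded.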

(2) Cutting into compression processing units

As described above, compression begins with the compression processing unit cutting device (100) cutting the file into the lowest-entropy compression unit code string according to the usage probabilities of the compression units recorded in the coding dictionary (101). The quality of this cutting is therefore a major factor in the quality of the overall result.

A known cutting method is the longest-compression-unit method: within a Chinese segment (usually a Chinese sentence), it searches from a known cut point toward the end of the sentence for the longest compression unit. Once one is found, the end of that compression unit becomes the next cut point, and the procedure is repeated to cut the segment.

For example, this method cuts the Chinese segment "研究所的确实行动" into:

"研究所 | 的确 | 实行 | 动" (research institute | really | carry out | moving),

which is not an ideal cut. A cut with lower entropy would be:

"研究所 | 的 | 确实 | 行动" (research institute | of | certain | action),

so the above method is not suitable as a means of cutting Chinese files.
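The known longest-unit method can be sketched as a greedy scan. The function name and the fallback for characters with no dictionary entry are assumptions for illustration, and the dictionary holds the coding units of Table 1 written as reconstructed Chinese strings.

```python
# Coding units of the partial dictionary in Table 1 (assumed reconstruction).
UNITS = {"研", "研究", "研究所", "所", "的", "的确",
         "确", "确实", "实", "实行", "行动", "动"}

def longest_match_cut(sentence, dictionary):
    """Greedy cutting: always take the longest dictionary unit at the cut point."""
    units, start = [], 0
    while start < len(sentence):
        for end in range(len(sentence), start, -1):  # longest candidate first
            if sentence[start:end] in dictionary:
                units.append(sentence[start:end])
                start = end
                break
        else:
            # Assumed fallback: no unit starts here, emit the single character.
            units.append(sentence[start])
            start += 1
    return units

greedy = longest_match_cut("研究所的确实行动", UNITS)
# -> ['研究所', '的确', '实行', '动']
```

The greedy choice of 的确 at the fourth character forces the later units 实行 and 动, which is the undesirable cut described above.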

For cutting, the present invention proposes a "probability competition" method. By this means a Chinese file can be cut into a low-entropy string of coding units; once this code string is converted into Huffman codes, the original Chinese file is greatly compressed. The probability competition method can be applied to any coding dictionary and gives the lowest-entropy cut available under the dictionary referred to; the closer the coding dictionary is to optimal, the better the result. Although the theoretically optimal coding dictionary is not the focus of the present invention, a general Chinese dictionary used together with the present invention achieves very good compression.

The operation of the Chinese file compression processing unit cutting device (100) of the present invention is described below with reference to Fig. 2.

After starting, the present invention first, in action (200), obtains a Chinese segment from the Chinese file in the memory (104) and prepares to process it. This Chinese segment is usually a sentence.

In action (201) the system checks whether the strings of different lengths that begin with the first character of the Chinese segment are listed in the coding dictionary (101). For each coding unit found, a virtual path is established and the position of the end of the compression unit is marked as a cut point; the index is then moved to the second character. A virtual path here is simply a description of cut point positions.

Checkpoint (204) checks whether more than one path has a cut point at the current index position. If so, action (202) is carried out; otherwise, action (203) is carried out.

Action (202) compares, for all paths having a cut point at the current index position, the combined probability of the cutting from the first character up to the current index position. The combined probability is computed by multiplying together the occurrence probabilities of all the connected compression units. After the comparison, the path with the highest probability is kept and the other paths are deleted. This comparison guarantees that the final result is the path with the highest probability. After action (202) finishes, it is necessary to check whether the index has reached the end of the Chinese segment. If so, the cutting of this Chinese segment is finished; if not, action (203) is carried out next.

Action (203) first moves the index forward one position to point to the next Chinese character. It then does the following for every path whose cut point coincides with the current index position: it checks whether the strings of different lengths beginning with the character at the current index position are listed in the coding dictionary (101), appends each coding unit found to the original path to form one or more new paths, and marks the end of each new path as a cut point. If none of the strings is listed in the dictionary, the path is deleted, because the probability of such a cut is zero. After this action finishes, the flow returns to checkpoint (204), and the above actions are repeated until the end of the sentence. Finally, the resulting cut is marked in the file and stored in the memory, completing the cutting process.
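The flow described above can be sketched as a dynamic program that keeps only the most probable path for each cut point. This is an assumed reformulation that yields the same surviving path, not a literal transcription of actions (200) to (204). The dictionary is the partial dictionary of Table 1 with the coding units written as reconstructed Chinese strings (an assumption), and exact fractions are used so the products are not affected by floating-point rounding.

```python
from fractions import Fraction

# Table 1 probabilities in millionths; unit strings are an assumed reconstruction.
MILLIONTHS = {"研": 9, "研究": 496, "研究所": 18, "所": 2450, "的": 54438,
              "的确": 70, "确": 77, "确实": 67, "实": 179, "实行": 84,
              "行动": 167, "动": 175}
DICTIONARY = {unit: Fraction(n, 10**6) for unit, n in MILLIONTHS.items()}

def probability_cut(sentence, dictionary):
    """Return (probability, units) of the most probable cut, or None."""
    max_len = max(len(unit) for unit in dictionary)
    # best[i]: best (probability, units) among paths with a cut point at i.
    best = {0: (Fraction(1), [])}
    for start in range(len(sentence)):
        if start not in best:
            continue  # every path through this position was deleted
        prob, units = best[start]
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last + 1):
            unit = sentence[start:end]
            if unit not in dictionary:
                continue
            # Combined probability: product of the connected units.
            candidate = (prob * dictionary[unit], units + [unit])
            if end not in best or candidate[0] > best[end][0]:
                best[end] = candidate  # competition at the same cut point
    return best.get(len(sentence))

probability, units = probability_cut("研究所的确实行动", DICTIONARY)
# units -> ['研究所', '的', '确实', '行动']
```

Multiplying the resulting probability by 10^12 gives about 0.010964, the value attached to the surviving path at the end of the worked example.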

The following example illustrates the processing flow of the compression processing unit cutting device shown in Fig. 2.

Suppose that part of the content stored in the coding dictionary (101) is:

Table 1

Coding unit	Occurrence probability
研 (grind)	0.000009
研究 (research)	0.000496
研究所 (research institute)	0.000018
所 (institute)	0.002450
的 (particle "of")	0.054438
的确 (really)	0.000070
确 (true)	0.000077
确实 (certain)	0.000067
实 (real)	0.000179
实行 (carry out)	0.000084
行动 (action)	0.000167
动 (moving)	0.000175

The probabilities annotated on the paths in the steps below are one trillion (ten to the twelfth power) times their actual values.

Step 1:

Action (200) obtains a Chinese segment from the Chinese file in the memory (104) and prepares to process it.

The Chinese segment obtained is:

"研究所的确实行动"

Step 2:

In action (201) the system checks whether the strings of different lengths beginning with the first character (研) of the Chinese segment are listed in the coding dictionary (101). For each coding unit found, a virtual path is established and the position of the end of the coding unit is marked as a cut point.

The index points to the character 研. Three initial paths are established.

Path: 研 | 9000000

研究 | 496000000

研究所 | 18000000

Step 3:

Checkpoint (204) checks whether more than one path has a cut point at the current index position.

In this example, checkpoint (204) finds that no more than one path has a cut point at the current index position (研). The index moves to the character 究.

Action (203) then checks, for every path whose cut point is at the current index position (究), whether the strings of different lengths beginning with the character at the current index position are listed in the coding dictionary (101). None of the strings beginning with 究, such as 究 and 究所, is in the coding dictionary, so the path "研 |" is deleted, leaving:

Path: 研究 | 496000000

研究所 | 18000000

Step 4:

The index moves to the third character. Action (203) is repeated, doing the following for every path whose cut point coincides with the current index position: it checks whether the strings of different lengths beginning with the character at the current index position (所) are listed in the coding dictionary (101). If so, each coding unit found is appended to the original path to form one or more new paths, and the end of each new path is marked as a cut point.

In this example, the character 所 is in the coding dictionary (101), so 所 is appended to the path "研究 |", which becomes "研究 | 所 |".

Path: 研究 | 所 | 1215200

研究所 | 18000000

Step 5:

Checkpoint (204) again checks whether more than one path has a cut point at the current index position. If so, action (202) is carried out: for each path having a cut point at the current index position, it calculates the combined probability of the cutting from the first character to the current index position. The combined probability is the product of the occurrence probabilities of all the connected compression units. The probabilities are compared, the path with the highest probability is kept, and the other paths are deleted.

The check finds that more than one path has a cut point at the current index position, so their probabilities are compared. The path "研究 | 所 |" is deleted because its probability (1215200) is lower than that of the path "研究所 |" (18000000). This leaves:

Path: 研究所 | 18000000
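The comparison in Step 5 can be reproduced with exact arithmetic. The sketch below multiplies the Table 1 probabilities along each path and applies the scale factor of 10^12 that the figures annotated on the paths imply. The helper name `scaled_path_probability` and the Chinese unit strings are assumptions for illustration.

```python
from fractions import Fraction
from math import prod

SCALE = 10**12  # the annotated path values are 10^12 times the actual values

# The three Table 1 entries involved in Step 5 (unit strings are assumed).
TABLE = {
    "研究": Fraction(496, 10**6),
    "所": Fraction(2450, 10**6),
    "研究所": Fraction(18, 10**6),
}

def scaled_path_probability(units):
    """Product of the unit probabilities along a path, times SCALE."""
    return prod(TABLE[unit] for unit in units) * SCALE

two_unit_path = scaled_path_probability(["研究", "所"])   # 1215200
one_unit_path = scaled_path_probability(["研究所"])        # 18000000
survivor = max([["研究", "所"], ["研究所"]], key=scaled_path_probability)
```

The single-unit path wins the competition at this cut point, as in the text, even though 研究 alone is far more frequent than 研究所.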

Step 6:

The index moves to the fourth character. Action (203) is repeated. Two possible compression units, 的 and 的确, are each appended to "研究所 |", each forming a new path.

Path: 研究所 | 的 | 979884

研究所 | 的确 | 1260

Step 7:

The index moves to the fifth character. Action (203) is repeated. Two further possible compression units, 确实 and 确, are each appended to "研究所 | 的 |", each forming a new path.

Path: 研究所 | 的 | 确实 | 65.652228

研究所 | 的 | 确 | 75.451068

研究所 | 的确 | 1260.000000

Step 8:

Following the decision at (204), action (202) is repeated: more than one path is found to have a cut point at the current index position, and their probabilities are compared.

Path: 研究所 | 的 | 确实 | 65.652228

研究所 | 的确 | 1260.000000

Step 9:

The index moves to the sixth character. Following the decision at (204), action (203) is repeated. Two further possible compression units, 实 and 实行, are each appended to "研究所 | 的确 |", each forming a path, namely:

Path: 研究所 | 的 | 确实 | 65.652228

研究所 | 的确 | 实 | 0.225540

研究所 | 的确 | 实行 | 0.105840

Step 10:

Following the decision at (204), action (202) is repeated: more than one path is found to have a cut point at the current index position, and their probabilities are compared.

Path: 研究所 | 的 | 确实 | 65.652228

研究所 | 的确 | 实行 | 0.105840

Step 11:

The index moves to the seventh character. Checkpoint (204) finds that no more than one path has a cut point at the current index position (行). With the index at the character 行:

Path: 研究所 | 的 | 确实 | 行动 | 0.010964

研究所 | 的确 | 实行 | 0.105840

Step 12:

The index moves to the eighth character:

Path: 研究所 | 的 | 确实 | 行动 | 0.010964

研究所 | 的确 | 实行 | 动 | 0.000019

Step 13:

Path: 研究所 | 的 | 确实 | 行动 | 0.010964

Step 14:

The end of the sentence is reached, and the cutting into compression processing units is complete.

(3) Effect of the invention

As the above description shows, the Chinese file compression processing method and device of the present invention obtain a correct cut and achieve good compression.

Table 2 shows the results of processing three Chinese files with the Chinese file compression processing method and device of the present invention.

Table 2

File name	Original file length (bytes)	Original file length (bits)	Compressed file length (bits) ⁶	Coding table length (bits) ⁷	Compression ratio I (%) ¹	Compression ratio II (%) ²
SOLVER ³	49592	396736	141671	36250	35.709	44.846
S370 ⁴	589238	4713904	1909176	125091	40.501	43.155
LOTUS ⁵	243212	1945696	744952	70754	38.287	41.924

¹ Compression ratio I = compressed file length / original file length.
² Compression ratio II = (compressed file length + coding table length) / original file length.
³ Data source: Lotus 1-2-3 for Windows applied problem solutions.
⁴ Data source: IBM System/370 service manual.
⁵ Data source: Lotus 1-2-3 for Windows Chinese service manual.
⁶ Punctuation and English text in the original files are not processed.
⁷ The coding table is made up of the Chinese words contained in the file.
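The two ratios defined in the footnotes can be recomputed from the lengths in Table 2. The sketch below is only a consistency check of the published figures; the names `ROWS` and `ratios` are illustrative.

```python
# (original length in bytes, compressed length in bits, coding table length in bits)
ROWS = {
    "SOLVER": (49592, 141671, 36250),
    "S370": (589238, 1909176, 125091),
    "LOTUS": (243212, 744952, 70754),
}

def ratios(original_bytes, compressed_bits, table_bits):
    """Return compression ratios I and II as percentages."""
    original_bits = original_bytes * 8
    ratio_1 = 100 * compressed_bits / original_bits
    ratio_2 = 100 * (compressed_bits + table_bits) / original_bits
    return ratio_1, ratio_2

computed = {name: ratios(*row) for name, row in ROWS.items()}
```

Rounded to three decimals, the computed values agree with the last two columns of Table 2, and the byte and bit columns of the original length differ by exactly a factor of eight.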

As the table shows, the present invention obtains a compression ratio of about 35-55% when compressing Chinese files, a real improvement. Moreover, the invention is applicable to all kinds of data compression or other processing that uses frequency coding, and so has a wide range of uses.

The above describes embodiments of the present invention. Those skilled in the art will readily understand the spirit of the invention from the above description and may make various changes and extensions accordingly; all of these nevertheless fall within the scope of the present invention.

Claims

1. A compression processing device for Chinese files, comprising:

an input device for inputting the Chinese file to be processed;

a memory for storing the processed file;

a coding dictionary recording the usage probability of each compression processing unit;

a compression processing unit cutting device, which cuts the file into a low-entropy code string of compression processing units;

a code comparison table recording the compression code that represents each compression processing unit, the table being coded according to the average occurrence probability of each compression processing unit in Chinese files, a unit with a higher occurrence probability being given a shorter code and a unit with a lower occurrence probability a longer code;

a compression device, which, for the Chinese file cut by the compression processing unit cutting device, refers to the code comparison table and converts the compression processing units in the file one by one into compression codes to obtain a compressed file;

wherein the compression processing unit cutting device comprises:

a temporary storage device, which takes out a Chinese segment from the Chinese file to be processed;

a first checking device, which checks whether the strings of different lengths beginning with a specific character of the Chinese segment are listed in the coding dictionary, marks the end of each string found as a cut point, and establishes a virtual path for each sequence of connected strings;

a second checking device, which checks whether more than one path has a cut point at the same position and, for all paths whose cut point is at that position, checks whether the strings of different lengths beginning with the character at that position are listed in the coding dictionary, deleting a path if no such string can be found in the dictionary;

a calculation device, which refers to the usage probability of each string recorded in the coding dictionary to calculate the probability score of each path;

a comparison device, which compares the probability scores of the paths and takes the path with the highest score as the cutting result;

characterized in that the compression processing unit cutting device uses the usage probability of each compression processing unit recorded in the coding dictionary to calculate the probability scores of the various possible cutting results, and supplies the cutting result with the highest probability score to the compression device for processing.

2. The device as claimed in claim 1, further comprising a decompression processing device, which refers to the code comparison table and converts the compression codes of the compressed file in the memory back into the original compression units one by one, so as to convert the compressed file in the memory into the original file.

3. The device according to claim 1, wherein the temporary storage device of the compression processing unit cutting device takes out a Chinese sentence from the Chinese file to be processed as the segment to be processed.

4. The device according to claim 1, wherein the compression processing units recorded in the coding dictionary of the compression processing unit cutting device are Chinese words.

5. The device as claimed in claim 1, wherein the calculation device of the compression processing unit cutting device multiplies together the usage probabilities, recorded in the coding dictionary, of the compression processing units contained in each path to calculate the probability score of that path.

6. A compression processing method for Chinese files, comprising:

an input step of inputting the Chinese file to be processed and storing it in a memory;

a cutting step of cutting the file into a low-entropy code string of compression processing units;

a compression step of, for the Chinese file cut in the cutting step, referring to a code comparison table that records the compression code of each compression processing unit and converting the compression processing units in the file one by one into compression codes to obtain a compressed file, wherein the code comparison table is coded according to the average occurrence probability of each compression processing unit in Chinese files, a unit with a higher occurrence probability being given a shorter code and a unit with a lower occurrence probability a longer code;

wherein the cutting step comprises:

a temporary storage step of taking out a Chinese segment from the Chinese file to be processed;

a first checking step of placing an index on the first character of the Chinese segment, checking whether the strings of different lengths beginning with that character are listed in the coding dictionary, establishing a virtual path for each string found, and marking the end of the string as a cut point;

a second checking step of checking whether more than one path has a cut point at the current index position and, if not, moving the index forward by one position;

a third checking step of, for all paths whose cut point coincides with the current index position, checking whether the strings of different lengths beginning with the character at the current index position are listed in the coding dictionary, appending each string found to the original path to form a new path, marking the end of each new path as a cut point, and deleting the path if no such string can be found in the coding dictionary;

a calculation step of, when the second checking step finds more than one path having a cut point at the current index position, referring to the usage probability of each string recorded in the coding dictionary to calculate the probability score of each such path from the first character to the current index position;

a comparison step of comparing the probability scores of the paths and using the path with the highest score for compression processing;

a fourth checking step of checking whether the last character of the Chinese segment has been processed, repeating the above processing if not and ending the cutting if so;

characterized in that the cutting step uses the usage probability of each compression unit recorded in a coding dictionary to calculate the probability scores of the various possible cutting results, and the cutting result with the highest probability score is provided to the compression step for processing.

7. The method according to claim 6, further comprising a decompression step of referring to the code comparison table and converting the compression codes of the compressed file back into the original compression units one by one, so as to convert the compressed file in the memory into the original file.

8. The method according to claim 6, wherein the temporary storage step of the cutting step takes out a Chinese sentence from the Chinese file to be processed as the segment to be processed.

9. The method according to claim 6, wherein the compression processing unit recorded in the coding dictionary used in the cutting step is a Chinese word.

10. The method according to claim 6, wherein the calculation step of the cutting step is to multiply the usage probabilities of the compression processing units contained in each path recorded in the coding dictionary to calculate the probability score of each path.