CN1031680C - Chinese file compression processing method and device - Google Patents
Chinese file compression processing method and device Download PDFInfo
- Publication number
- CN1031680C CN1031680C CN 92109706 CN92109706A CN1031680C CN 1031680 C CN1031680 C CN 1031680C CN 92109706 CN92109706 CN 92109706 CN 92109706 A CN92109706 A CN 92109706A CN 1031680 C CN1031680 C CN 1031680C
- Authority
- CN
- China
- Prior art keywords
- chinese
- file
- compression
- path
- cutting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000007906 compression Methods 0.000 title claims abstract description 133
- 230000006835 compression Effects 0.000 title claims abstract description 129
- 238000003672 processing method Methods 0.000 title claims description 7
- 238000000034 method Methods 0.000 claims abstract description 32
- 238000005520 cutting process Methods 0.000 claims description 43
- 239000006185 dispersion Substances 0.000 claims 1
- 230000011218 segmentation Effects 0.000 claims 1
- 239000012634 fragment Substances 0.000 description 10
- 230000000694 effects Effects 0.000 description 9
- 238000012217 deletion Methods 0.000 description 4
- 230000037430 deletion Effects 0.000 description 4
- 238000007689 inspection Methods 0.000 description 4
- 235000006508 Nelumbo nucifera Nutrition 0.000 description 3
- 240000002853 Nelumbo nucifera Species 0.000 description 3
- 235000006510 Nelumbo pentapetala Nutrition 0.000 description 3
- 230000003203 everyday effect Effects 0.000 description 3
- 238000000059 patterning Methods 0.000 description 3
- 238000013144 data compression Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 206010038743 Restlessness Diseases 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- NNKKTZOEKDFTBU-YBEGLDIGSA-N cinidon ethyl Chemical compound C1=C(Cl)C(/C=C(\Cl)C(=O)OCC)=CC(N2C(C3=C(CCCC3)C2=O)=O)=C1 NNKKTZOEKDFTBU-YBEGLDIGSA-N 0.000 description 1
- 238000000205 computational method Methods 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000003252 repetitive effect Effects 0.000 description 1
- 239000002699 waste material Substances 0.000 description 1
Images
Landscapes
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Document Processing Apparatus (AREA)
Abstract
By utilizing the characteristics of Chinese, Chinese words are used as coding units, the Chinese file is cut into units with indefinite length which are most suitable for compression according to the occurrence probability of the Chinese words in the Chinese file, and the processed file can be coded into various frequency codes including Huffman codes and the like. The Chinese file processed by the device of the invention can obtain about 35-55% compression ratio after being converted into Huffman code. The invention also provides a method for determining the Chinese processing unit.
Description
The invention relates to a kind of Chinese file shelves compression processing method and device, particularly about a kind of Chinese word that utilizes as processing unit, with the archives compress technique, reach the method and the device of the shared memory capacity of a large amount of compression Chinese file shelves.
After Chinese Computerization is popularized gradually, produce a large amount of computer Chinese files, for example publish every day 28 spaces of a whole page the day newspaper office after computerization, promptly produce the literal shelves of about 2,500,000 middle literal every day, other is as publishing house, government bodies, the middle literal shelves that produced every day such as company are difficult to counting especially.Under the situation of this data blast, or be the waste that reduces the storage area, or be raising data transmitting speed, the necessity with the Chinese file compression is arranged in fact.
Being commonly used at present compress in the technology of Chinese file, a kind of ARC program is arranged, is to be the unit with the byte, that is utilize the bit group of literal in the representative, according to its frequency that in Chinese file, occurs, form Huffman code, utilize known Hoffmannshen Method, the compression Chinese file.Because this kind mode is that Chinese file is compressed in the mode of compressing English data, do not use the characteristic of Chinese, so compression factor is not high, can reach about compression ratio about in the of 75%.
General Chinese sign indicating number no matter be BIG-5 yard or TCA sign indicating number, is compiled sign indicating number with two bytes to literal in each.This kind coded system because practicality reaches and the consideration of English ASCII character compatibility, is not used operable code space fully.Therefore, another kind of possible File Compress method is with middle literal recompile, reduces the figure place of each yard, can obtain some compressions very easily.
For example,, only need sign indicating number (annotate: 14 times of 2 is 16384) recompile in addition, can obtain the compression ratio of 14/16=87.5% with 14 positions of every word to literal in more than 10,000.
Above-mentioned dual mode all is to be coding unit with " word ".But the logical block that Chinese uses is actually " speech " rather than word, and Chinese word coding should obtain preferable compression effectiveness in theory.For example.The number of Chinese word commonly used is greatly about about 60,000, if with Chinese word coding, the sign indicating number (annotate: 16 times of 2 is 65536) with 16 positions can be compiled in each speech, can apply the needs of 60,000 speech enough, this mode, average number of words with reference to each Chinese word is about 1.6, and the code length that can approximately calculate average each word is 16/1.6=10 position.Therefore compression ratio then is approximately 10/16=62.5%.More efficient far beyond the coding that with the word is the unit.
Said method all only relates to the completeness (or imperfection incompleteness) of coding to the research of compression unit, use the BIC-5 yard imperfection with the TCA sign indicating number, again the sign indicating number that word or speech are compiled to make full use of code space reaches compression effectiveness, and handled all be the sign indicating number of equal length.In traditional data compression method, emphasize to utilize the appearance probability of coding unit, to the high unit of probability occurring, reach the purpose of compression by this to compile with long sign indicating number, the compression efficiency of these class methods is decided by choosing of compression unit usually.Yet known analogy probability Methods for Coding all can not be utilized Chinese characteristic, provides bigger compression factor, to conform with user's needs.
Purpose of the present invention is to provide a kind of characteristic of utilizing Chinese, the Chinese file shelves compression processing method and the device of compression Chinese file shelves.
Another object of the present invention is also at compression method that a kind of Chinese file shelves are provided and device, so that the storing memory amount of Chinese file shelves is done more efficient compression.
Another object of the present invention is also in the method that a kind of definite Chinese compression processing unit is provided, so that more effective Chinese compression effect to be provided.
Find that through the inventor past, encoded method reached the technology of compression effectiveness, its key is in randomness (Entropy) value of file---a number that promptly calculates according to the appearance probability of all compression units.Though the meaning of unrest degree value and computational methods are not within discussion scope of the present invention, but assist the technological thought of File Compress to be summarized as follows: in the archives of institute's desire compression with the notion of random degree value, other probability occurs all compression units more consistent (predictability is low, the randomness height), compression effectiveness is poorer; Otherwise if probability more inconsistent (predictability height, randomness is low) appears individually in compression unit, then compression effectiveness is better.
Each file all has its max. Possible compression ratio rate, and the preposition means of File Compress are promptly being found out the coding unit that one group of randomness that makes file reduces as far as possible.Therefore, though the present invention needn't be subject to its theory, but the inventor thinks, can make the Chinese file randomness drop to the compression unit group of theoretical minimum, should be the most approaching Chinese vocabulary of definition naturally, in other words, be the Huffman code that compression unit is weaved into similar Chinese vocabulary, Chinese file can be compressed near the theoretic limit.
Comprehensive above observation and analysis, the inventor thinks, if find a compression unit group of approaching Chinese words and phrases, is the compression unit coding with Chinese with this compression unit group, the at first compression that can obtain the first step because of the completeness and the logicality of its coding, add because of each compression unit uses the predictability of probability and weave into Huffman code, further more compressed.
Following with reference to description of drawings method and apparatus of the present invention.
The 1st figure represents the compression method of Chinese file shelves of the present invention and installs the system diagram of an embodiment.
The 2nd figure represents the compression method of Chinese file shelves of the present invention and installs the moving system figure of the compression processing unit cutter sweep of an embodiment.
In theory philological, how to seek out an objectively best compression unit group (or title " coding dictionary "), and the file shelves are compressed to theoretical minimum, be a very good problem to study.As long as can provide effective compression effectiveness but on using, find out one " coding dictionary " desirable, that be easy to processing (acquisition).Below how explanation method and apparatus of the present invention utilizes this one " coding dictionary " that the sign indicating number that Chinese file effectively cuts into a low randomness is gone here and there, and uses and compresses processing.
(1) compression of Chinese file shelves
The 1st figure represents the system diagram of one embodiment of the invention.Compression proposed by the invention at first comprises a compression processing unit cutter sweep (100) and one " coding dictionary " (101).Be marked with the use probability of each compression processing unit in the coding dictionary (101).Compression processing unit cutter sweep (100) can cut into the minimum compression unit sign indicating number string of randomness with file according to the use probability of each compression unit of record in the coding dictionary (101).
Coding table of comparisons generation device (102) can be according to the use probability of each compression unit of putting down in writing in the coding dictionary (101), derive Hofman tree and produce a coding table of comparisons (103) with the method that is suitable for, put down in writing the corresponding relation of each coding unit and Huffman code.Usually, this coding table of comparisons (103) is a table of making in advance on using, and is stored in the memory in a suitable manner.(104) be a memory, for storing treated file shelves.
When utilizing Chinese file shelves compression treatment device compression of the present invention to handle the Chinese file shelves, at first by system Chinese original document is read in, it is medium pending to be placed on memory (104).When doing the compression processing, by the use probability of compression processing unit cutter sweep (100) according to each compression unit of record in the coding dictionary (101), file according to method of the present invention, is cut into the minimum compression unit sign indicating number string of randomness, be stored in once again in the memory (104).The Chinese file shelves that compression set (105) is crossed encoding process, the reference encoder table of comparisons (103), the compression unit that constitutes a row sign indicating number string in the file shelves is converted to Huffman code one by one and obtains the compression shelves, be stored in once again in the memory (104), finish compression and handle.
Contract when handling carrying out back-pressure, back-pressure contracts processing unit (106) earlier with the compression shelves in the memory (104), and the reference encoder table of comparisons (103) is inverted to the original compression unit one by one to the Huffman code of representative compression shelves, and obtains original file shelves.
(2) cutting of compression processing unit
As previously mentioned, be the use probability of complying with each compression unit of record in the coding dictionary (101) earlier with compression processing unit cutter sweep (100) doing when compression is handled, file is cut into the minimum compression unit sign indicating number string of randomness.At this moment, the quality of cutting effect will become a treatment effect big key how.
Known cutting method is to utilize the means of long compression unit, with a Chinese fragment (normally Chinese sentence), searches the longest compression unit by known cut point toward sentence is terminal.Seek after, be cut point on next with this compression unit tail end, use these means then repeatedly, cut.
For example Chinese fragment " research institute carries out moving really " will be cut into the method:
" research institute | really | carry out | moving ",
This is not a desirable patterning method.The patterning method that randomness is lower should be:
" research institute | | really | action ",
So said method also is not suitable as Chinese file cutting process means.
The present invention is the cutting means that propose " probability competition " aspect cutting process.Via means, the Chinese file shelves can be cut into the coding unit sign indicating number string of a string low randomness, convert this yard string to Huffman code after, original Chinese file shelves can obtain very big compression.The cutting means of this kind probability competition can be applicable to various coding dictionary, and it can provide the minimum cutting effect of randomness under the coding dictionary of institute's reference; The coding dictionary more is an optimization, and resulting result is better.Though theoretic forced coding component dictionary is not an emphasis of the present invention,, can reach extraordinary compression effectiveness with general Chinese dictionary and the present invention use of arranging in pairs or groups.
Following foundation the 2nd figure illustrates the motion flow of Chinese file compression processing unit cutter sweep of the present invention (100).
The present invention at first obtains a Chinese fragment after starting in the Chinese file shelves (104) in (200) from memory, prepare to handle.This Chinese fragment is sentence normally.
With headed by the lead-in of this Chinese fragment, whether the different length word string of composition lists among the coding dictionary (101) in (201) systems inspection.To each seek coding unit, set up a virtual route, and be labeled as cut point at the relative position of compression unit tail end, then index is pointed to second word.This place accent virtual route promptly is the description to the cut point position.
Whether checkpoint (204) is checked has cut point above one in present index location.As surpass, then carry out action (202), otherwise, action (203) then carried out.
Relatively all have the path of cut point, the comprehensive probability that the cutting mode from lead-in to present index location may take place in present index location in action (202).The algorithm of comprehensive probability multiplies each other for the appearance probability with all compression units that are connected.Relatively, stay the probability maximum path takes place, delete all the other paths.It must be the path that the probability maximum takes place that this comparison makes last result.Action (202) need check whether index has pointed to this Chinese fragment tail end after finishing.In this way, then finish cutting to this Chinese fragment; As not, then down carry out action (203).
Action (203) is at first moved a position forward with index, points to literal in the next one.Then all cut points are done following action with the path that index location is identical at present: whether inspection lists among the coding dictionary (101) with composition different length word string headed by the word of present index location, then each is sought coding unit, form one or several new routes after being connected on original route, and be labeled as cut point at each new route tail end.List in the dictionary if all word strings are neither.Then delete this path, because this kind patterning method generation probability is zero.After action (202) finishes, carry out flow process rebound checkpoint (204), carry out above-mentioned exercises then repeatedly and finish up to sentence.At last, the cutting result with gained is indicated in the file shelves, is stored in the container, finishes cutting process.
Following example illustrates the handling process of compression processing unit cutter sweep shown in the 2nd figure.
Suppose that the partial content that stores is in coding dictionary (101):
Table one
Coding unit generation Ji Shuai
Grind 0.000009
Research 0.000496
Research institute 0.000018
Institute 0.002450
0.054438
Really 0.000070
True 0.000077
Certain 0.000067
Real 0.000179
Carry out 0.000084
Action 0.000167
Moving 0.000175
The probability of above annotation is 1,000,000 times (ten quadratic powers of ten) of actual value.
Step 1:
Action (200) obtains a Chinese fragment from the Chinese file shelves of memory (104), prepare to handle.
The Chinese fragment that obtains is:
" research institute carries out moving really "
Step 2:
, form the different length word string and whether list among the coding dictionary (101) with headed by the lead-in (grinding) of this Chinese fragment in (201) systems inspection.To each seek coding unit, set up a virtual route, and be labeled as cut point at the relative position of coding unit tail end.
Index refers at " grinding " word.Set up three initial paths.
Path: grind | 9000000
Research | 496000000
Research institute | 18000000
Step 3:
Checkpoint (204) checks that whether surpassing a path has cut point in present index location.
In this example, checkpoint (204) check result finds that not surpassing a path has cut point in target index location (grinding).Index moves to " studying carefully " word.
Whether (203) are positioned at the path of present index location (studying carefully) to all cut points at this moment, check with the different length word string of forming headed by the word of present index location and list among the coding dictionary (101).Found that with the word string headed by " studying carefully " and to study carefully institute as " studying carefully " ", " studying carefully " etc. is neither in the coding dictionary, delete this path " grind |, remaining:
Path: research | 496000000
Research institute | 18000000
Step 4
Index moves to the 3rd word.Repetitive operation (203), all cut points are done following action with the path that index location is identical at present: whether inspection lists among the coding dictionary (101) with composition different length word string headed by the word of present index location (institute).If any.With each seek coding unit, be connected on and form one or several new routes after the original route, and be labeled as cut point at each new route tail end.
In this example, " institute " word becomes so meet " institute " after path " research | " " study | institute | " among coding dictionary (101).
Path: research | institute | 1215200
Research institute | 18000000
Step 5:
(204) check that whether surpassing a path has cut point in present index location in the checkpoint equally.As surpass, then carry out action (202), calculate the path that the present index location in place has cutting, comprehensive probability may take place in the cutting mode of the index location from lead-in to present index.The algorithm of comprehensive probability is taken advantage of for the appearance probability of all compression units that are connected is looked into.Compare its size.Relatively, stay the path that the probability maximum takes place, delete all the other paths.
Check result finds to surpass a path has cut point in present index location, so relatively its generation probability is big little.The path: " research | institute | " deletion, because the probability (18000000) that probability (1215200) is lower than path " research institute | " takes place in it.The result is remaining:
Path: research institute | 18000000
Step 6:
Index moves to the 4th word.Repeat (203) action.Obtain two possible compression units " " and " really " be connected on " research institute | " afterwards, each forms a new route.
Path: research institute | | 979884
Research institute | really | 1260
Step 7:
Index moves to the 5th word.Repeat (203) action.Obtaining two other possible compression unit " really " and " really " is connected on " research institute | | " afterwards, each forms a new route.
Path: research institute | | really | 65.652228
Research institute | | really | 75.451068
Research institute | really | 1260.000000
Step 8
The judgement of foundation (204) repeats (202) action, and finding to surpass a path has cut point in present index location, and relatively probability takes place for it.
The path " research institute | | really |, deletion lower because probability (75.451068) takes place for it than path " research institute | really | " (1260.000000).Obtain:
Path: research institute | | really | 65.652228
Research institute | really | 1265.000000
Step 9
Index moves to the 6th word.According to the judgement of (204), repeat the action of (203).Obtain two other possible compression unit " reality " and " implementation " and be connected on " research | really | " afterwards, each forms a path, that is:
Path: research institute | | really | 65.652228
Research institute | really | real | 0.225540
Research institute | really | carry out | 0.105840
Step 10:
According to the judgement of (204), repeat the action of (202), finding to surpass a path has cut point in present index location, and relatively probability takes place in it.
The path " research institute | really | real | " deletion because its take place probability (0.225540) than the path " research institute | | really | " (65.652228) low.Obtain:
Path: research institute | | really | 65.652228
Research institute | really | carry out | 0.105840
Step 11:
Index moves to the 7th word.Checkpoint (204) check result finds not surpass a path has cut point in present index location (moving).Index moves to " moving " word:
Path: research institute | | really | action | 0.01 0964
Research institute | really | carry out | 0.105840
Step 12
Index moves to the Eight characters:
Path: research institute | | really | action | 0.010964
Research institute | really | carry out | moving | 0.000019
Step 13:
According to the judgement of (204), repeat the action of (202), finding to surpass a path has cut point in present index location, and relatively probability takes place in it.
The path " research institute | really | carry out | moving | " deletion because its take place probability (0.000019) than the path " research institute | | really | action | " (0.010964) low.Obtain:
Path: research institute | | really | action | 0.010964
Step 14:
Reach the sentence tail end, the compression processing unit cutting action finishes.
(3) invention effect
As shown in the above description, utilize Chinese file shelves compression processing method of the present invention and device, can obtain correct cutting effect, and can reach the good compression effect.
Table 2 shows the effect of utilizing Chinese file shelves compression processing method of the present invention and device to handle 3 Chinese file shelves gained.
Table 2
| File name | Original shelves length (byles) | Original shelves length (bits) | The compression shelves 6Length (bits) | Coding schedule 7Length (bils) | Compression ratio I 1 | Compression ratio II 2 |
| SOLVER 3 S370 4 LOTUS 5 | 49592 589238 243212 | 396736 4713904 1945696 | 141671 1909176 744952 | 36250 125091 70754 | 35.709 40.501 38.287 | 44.846 43.155 41.924 |
By shown in this table as can be known, the present invention can obtain about compression ratio about in the of 35-55% when doing the compression of Chinese archives and handle, really have the raising effect.Moreover applicable scope of the present invention can comprise various data compression or other processing through frequency coding, also has purposes is extensively arranged.
It more than is embodiment of the invention explanation.Those skilled in the art are not difficult by understanding spirit of the present invention in the above explanation, and make different variations according to this and extend, in any case but, all belong to the present invention's category.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN 92109706 CN1031680C (en) | 1992-08-22 | 1992-08-22 | Chinese file compression processing method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN 92109706 CN1031680C (en) | 1992-08-22 | 1992-08-22 | Chinese file compression processing method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN1083638A CN1083638A (en) | 1994-03-09 |
| CN1031680C true CN1031680C (en) | 1996-04-24 |
Family
ID=4944307
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN 92109706 Expired - Fee Related CN1031680C (en) | 1992-08-22 | 1992-08-22 | Chinese file compression processing method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN1031680C (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8175164B2 (en) | 2005-09-01 | 2012-05-08 | Samsung Electronics Co., Ltd. | Devices and methods for data compression and decompression |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101751451B (en) * | 2008-12-11 | 2012-04-25 | 高德软件有限公司 | A Chinese data compression and decompression method and related equipment |
| CN101520771A (en) * | 2009-03-27 | 2009-09-02 | 广东国笔科技股份有限公司 | Method and system for code compression and decoding for word library |
-
1992
- 1992-08-22 CN CN 92109706 patent/CN1031680C/en not_active Expired - Fee Related
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8175164B2 (en) | 2005-09-01 | 2012-05-08 | Samsung Electronics Co., Ltd. | Devices and methods for data compression and decompression |
Also Published As
| Publication number | Publication date |
|---|---|
| CN1083638A (en) | 1994-03-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN1183683C (en) | Position adaptive coding method using prefix prediction | |
| CN1205574C (en) | Method of and apparatus for compressing and expanding data and data processing apparatus and network system using same | |
| KR101157693B1 (en) | Multi-stage query processing system and method for use with tokenspace repository | |
| CN1228887A (en) | Data compression and decompression system interleaved with string search for on-the-fly dictionary updates | |
| CN101079028A (en) | On-line translation model selection method of statistic machine translation | |
| CN1585968A (en) | Method for compressing dictionary data | |
| CN1388443A (en) | Method and apparatus for data compression, computer program and storage media | |
| CN1868127A (en) | Data compression system and method | |
| CN1475907A (en) | Machine translation system based on examples | |
| CN1163841C (en) | Online Handwritten Chinese Character Recognition Device | |
| KR20110007865A (en) | Data Compression Method | |
| De Moura et al. | Direct pattern matching on compressed text | |
| CN101036298A (en) | System and method for static Huffman decoding | |
| CN101751416A (en) | Method for ordering and seeking character strings | |
| CN1287320C (en) | Method of compressing digital ink | |
| CN1031680C (en) | Chinese file compression processing method and device | |
| CN1267963A (en) | Data compression equipment and data restorer | |
| CN101075261A (en) | Method and device for compressing index | |
| Wiseman et al. | Conjugation-based compression for Hebrew texts | |
| CN101055593A (en) | Tibetan web page and its code identification method | |
| CN100343851C (en) | Database compression and decompression method | |
| CN1949221A (en) | Method and system of storing element and method and system of searching element | |
| CN1916940A (en) | Template optimized character recognition method and system | |
| CN1301596C (en) | Compression and decompression method of digital image data | |
| CN1189839C (en) | Word identifying device and method, and memory medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C06 | Publication | ||
| PB01 | Publication | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| C19 | Lapse of patent right due to non-payment of the annual fee | ||
| CF01 | Termination of patent right due to non-payment of annual fee |