[go: up one dir, main page]

CN1031680C - Chinese file compression processing method and device - Google Patents

Chinese file compression processing method and device Download PDF

Info

Publication number
CN1031680C
CN1031680C CN 92109706 CN92109706A CN1031680C CN 1031680 C CN1031680 C CN 1031680C CN 92109706 CN92109706 CN 92109706 CN 92109706 A CN92109706 A CN 92109706A CN 1031680 C CN1031680 C CN 1031680C
Authority
CN
China
Prior art keywords
chinese
file
compression
path
cutting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 92109706
Other languages
Chinese (zh)
Other versions
CN1083638A (en
Inventor
张俊盛
刘显仲
柯淑津
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengbao Co ltd
Original Assignee
Shengbao Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengbao Co ltd filed Critical Shengbao Co ltd
Priority to CN 92109706 priority Critical patent/CN1031680C/en
Publication of CN1083638A publication Critical patent/CN1083638A/en
Application granted granted Critical
Publication of CN1031680C publication Critical patent/CN1031680C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Document Processing Apparatus (AREA)

Abstract

By utilizing the characteristics of Chinese, Chinese words are used as coding units, the Chinese file is cut into units with indefinite length which are most suitable for compression according to the occurrence probability of the Chinese words in the Chinese file, and the processed file can be coded into various frequency codes including Huffman codes and the like. The Chinese file processed by the device of the invention can obtain about 35-55% compression ratio after being converted into Huffman code. The invention also provides a method for determining the Chinese processing unit.

Description

Chinese file shelves compression processing method and device
The invention relates to a kind of Chinese file shelves compression processing method and device, particularly about a kind of Chinese word that utilizes as processing unit, with the archives compress technique, reach the method and the device of the shared memory capacity of a large amount of compression Chinese file shelves.
After Chinese Computerization is popularized gradually, produce a large amount of computer Chinese files, for example publish every day 28 spaces of a whole page the day newspaper office after computerization, promptly produce the literal shelves of about 2,500,000 middle literal every day, other is as publishing house, government bodies, the middle literal shelves that produced every day such as company are difficult to counting especially.Under the situation of this data blast, or be the waste that reduces the storage area, or be raising data transmitting speed, the necessity with the Chinese file compression is arranged in fact.
Being commonly used at present compress in the technology of Chinese file, a kind of ARC program is arranged, is to be the unit with the byte, that is utilize the bit group of literal in the representative, according to its frequency that in Chinese file, occurs, form Huffman code, utilize known Hoffmannshen Method, the compression Chinese file.Because this kind mode is that Chinese file is compressed in the mode of compressing English data, do not use the characteristic of Chinese, so compression factor is not high, can reach about compression ratio about in the of 75%.
General Chinese sign indicating number no matter be BIG-5 yard or TCA sign indicating number, is compiled sign indicating number with two bytes to literal in each.This kind coded system because practicality reaches and the consideration of English ASCII character compatibility, is not used operable code space fully.Therefore, another kind of possible File Compress method is with middle literal recompile, reduces the figure place of each yard, can obtain some compressions very easily.
For example,, only need sign indicating number (annotate: 14 times of 2 is 16384) recompile in addition, can obtain the compression ratio of 14/16=87.5% with 14 positions of every word to literal in more than 10,000.
Above-mentioned dual mode all is to be coding unit with " word ".But the logical block that Chinese uses is actually " speech " rather than word, and Chinese word coding should obtain preferable compression effectiveness in theory.For example.The number of Chinese word commonly used is greatly about about 60,000, if with Chinese word coding, the sign indicating number (annotate: 16 times of 2 is 65536) with 16 positions can be compiled in each speech, can apply the needs of 60,000 speech enough, this mode, average number of words with reference to each Chinese word is about 1.6, and the code length that can approximately calculate average each word is 16/1.6=10 position.Therefore compression ratio then is approximately 10/16=62.5%.More efficient far beyond the coding that with the word is the unit.
Said method all only relates to the completeness (or imperfection incompleteness) of coding to the research of compression unit, use the BIC-5 yard imperfection with the TCA sign indicating number, again the sign indicating number that word or speech are compiled to make full use of code space reaches compression effectiveness, and handled all be the sign indicating number of equal length.In traditional data compression method, emphasize to utilize the appearance probability of coding unit, to the high unit of probability occurring, reach the purpose of compression by this to compile with long sign indicating number, the compression efficiency of these class methods is decided by choosing of compression unit usually.Yet known analogy probability Methods for Coding all can not be utilized Chinese characteristic, provides bigger compression factor, to conform with user's needs.
Purpose of the present invention is to provide a kind of characteristic of utilizing Chinese, the Chinese file shelves compression processing method and the device of compression Chinese file shelves.
Another object of the present invention is also at compression method that a kind of Chinese file shelves are provided and device, so that the storing memory amount of Chinese file shelves is done more efficient compression.
Another object of the present invention is also in the method that a kind of definite Chinese compression processing unit is provided, so that more effective Chinese compression effect to be provided.
Find that through the inventor past, encoded method reached the technology of compression effectiveness, its key is in randomness (Entropy) value of file---a number that promptly calculates according to the appearance probability of all compression units.Though the meaning of unrest degree value and computational methods are not within discussion scope of the present invention, but assist the technological thought of File Compress to be summarized as follows: in the archives of institute's desire compression with the notion of random degree value, other probability occurs all compression units more consistent (predictability is low, the randomness height), compression effectiveness is poorer; Otherwise if probability more inconsistent (predictability height, randomness is low) appears individually in compression unit, then compression effectiveness is better.
Each file all has its max. Possible compression ratio rate, and the preposition means of File Compress are promptly being found out the coding unit that one group of randomness that makes file reduces as far as possible.Therefore, though the present invention needn't be subject to its theory, but the inventor thinks, can make the Chinese file randomness drop to the compression unit group of theoretical minimum, should be the most approaching Chinese vocabulary of definition naturally, in other words, be the Huffman code that compression unit is weaved into similar Chinese vocabulary, Chinese file can be compressed near the theoretic limit.
Comprehensive above observation and analysis, the inventor thinks, if find a compression unit group of approaching Chinese words and phrases, is the compression unit coding with Chinese with this compression unit group, the at first compression that can obtain the first step because of the completeness and the logicality of its coding, add because of each compression unit uses the predictability of probability and weave into Huffman code, further more compressed.
Following with reference to description of drawings method and apparatus of the present invention.
The 1st figure represents the compression method of Chinese file shelves of the present invention and installs the system diagram of an embodiment.
The 2nd figure represents the compression method of Chinese file shelves of the present invention and installs the moving system figure of the compression processing unit cutter sweep of an embodiment.
In theory philological, how to seek out an objectively best compression unit group (or title " coding dictionary "), and the file shelves are compressed to theoretical minimum, be a very good problem to study.As long as can provide effective compression effectiveness but on using, find out one " coding dictionary " desirable, that be easy to processing (acquisition).Below how explanation method and apparatus of the present invention utilizes this one " coding dictionary " that the sign indicating number that Chinese file effectively cuts into a low randomness is gone here and there, and uses and compresses processing.
(1) compression of Chinese file shelves
The 1st figure represents the system diagram of one embodiment of the invention.Compression proposed by the invention at first comprises a compression processing unit cutter sweep (100) and one " coding dictionary " (101).Be marked with the use probability of each compression processing unit in the coding dictionary (101).Compression processing unit cutter sweep (100) can cut into the minimum compression unit sign indicating number string of randomness with file according to the use probability of each compression unit of record in the coding dictionary (101).
Coding table of comparisons generation device (102) can be according to the use probability of each compression unit of putting down in writing in the coding dictionary (101), derive Hofman tree and produce a coding table of comparisons (103) with the method that is suitable for, put down in writing the corresponding relation of each coding unit and Huffman code.Usually, this coding table of comparisons (103) is a table of making in advance on using, and is stored in the memory in a suitable manner.(104) be a memory, for storing treated file shelves.
When utilizing Chinese file shelves compression treatment device compression of the present invention to handle the Chinese file shelves, at first by system Chinese original document is read in, it is medium pending to be placed on memory (104).When doing the compression processing, by the use probability of compression processing unit cutter sweep (100) according to each compression unit of record in the coding dictionary (101), file according to method of the present invention, is cut into the minimum compression unit sign indicating number string of randomness, be stored in once again in the memory (104).The Chinese file shelves that compression set (105) is crossed encoding process, the reference encoder table of comparisons (103), the compression unit that constitutes a row sign indicating number string in the file shelves is converted to Huffman code one by one and obtains the compression shelves, be stored in once again in the memory (104), finish compression and handle.
Contract when handling carrying out back-pressure, back-pressure contracts processing unit (106) earlier with the compression shelves in the memory (104), and the reference encoder table of comparisons (103) is inverted to the original compression unit one by one to the Huffman code of representative compression shelves, and obtains original file shelves.
(2) cutting of compression processing unit
As previously mentioned, be the use probability of complying with each compression unit of record in the coding dictionary (101) earlier with compression processing unit cutter sweep (100) doing when compression is handled, file is cut into the minimum compression unit sign indicating number string of randomness.At this moment, the quality of cutting effect will become a treatment effect big key how.
Known cutting method is to utilize the means of long compression unit, with a Chinese fragment (normally Chinese sentence), searches the longest compression unit by known cut point toward sentence is terminal.Seek after, be cut point on next with this compression unit tail end, use these means then repeatedly, cut.
For example Chinese fragment " research institute carries out moving really " will be cut into the method:
" research institute | really | carry out | moving ",
This is not a desirable patterning method.The patterning method that randomness is lower should be:
" research institute | | really | action ",
So said method also is not suitable as Chinese file cutting process means.
The present invention is the cutting means that propose " probability competition " aspect cutting process.Via means, the Chinese file shelves can be cut into the coding unit sign indicating number string of a string low randomness, convert this yard string to Huffman code after, original Chinese file shelves can obtain very big compression.The cutting means of this kind probability competition can be applicable to various coding dictionary, and it can provide the minimum cutting effect of randomness under the coding dictionary of institute's reference; The coding dictionary more is an optimization, and resulting result is better.Though theoretic forced coding component dictionary is not an emphasis of the present invention,, can reach extraordinary compression effectiveness with general Chinese dictionary and the present invention use of arranging in pairs or groups.
Following foundation the 2nd figure illustrates the motion flow of Chinese file compression processing unit cutter sweep of the present invention (100).
The present invention at first obtains a Chinese fragment after starting in the Chinese file shelves (104) in (200) from memory, prepare to handle.This Chinese fragment is sentence normally.
With headed by the lead-in of this Chinese fragment, whether the different length word string of composition lists among the coding dictionary (101) in (201) systems inspection.To each seek coding unit, set up a virtual route, and be labeled as cut point at the relative position of compression unit tail end, then index is pointed to second word.This place accent virtual route promptly is the description to the cut point position.
Whether checkpoint (204) is checked has cut point above one in present index location.As surpass, then carry out action (202), otherwise, action (203) then carried out.
Relatively all have the path of cut point, the comprehensive probability that the cutting mode from lead-in to present index location may take place in present index location in action (202).The algorithm of comprehensive probability multiplies each other for the appearance probability with all compression units that are connected.Relatively, stay the probability maximum path takes place, delete all the other paths.It must be the path that the probability maximum takes place that this comparison makes last result.Action (202) need check whether index has pointed to this Chinese fragment tail end after finishing.In this way, then finish cutting to this Chinese fragment; As not, then down carry out action (203).
Action (203) is at first moved a position forward with index, points to literal in the next one.Then all cut points are done following action with the path that index location is identical at present: whether inspection lists among the coding dictionary (101) with composition different length word string headed by the word of present index location, then each is sought coding unit, form one or several new routes after being connected on original route, and be labeled as cut point at each new route tail end.List in the dictionary if all word strings are neither.Then delete this path, because this kind patterning method generation probability is zero.After action (202) finishes, carry out flow process rebound checkpoint (204), carry out above-mentioned exercises then repeatedly and finish up to sentence.At last, the cutting result with gained is indicated in the file shelves, is stored in the container, finishes cutting process.
Following example illustrates the handling process of compression processing unit cutter sweep shown in the 2nd figure.
Suppose that the partial content that stores is in coding dictionary (101):
Table one
Coding unit generation Ji Shuai
Grind 0.000009
Research 0.000496
Research institute 0.000018
Institute 0.002450
0.054438
Really 0.000070
True 0.000077
Certain 0.000067
Real 0.000179
Carry out 0.000084
Action 0.000167
Moving 0.000175
The probability of above annotation is 1,000,000 times (ten quadratic powers of ten) of actual value.
Step 1:
Action (200) obtains a Chinese fragment from the Chinese file shelves of memory (104), prepare to handle.
The Chinese fragment that obtains is:
" research institute carries out moving really "
Step 2:
, form the different length word string and whether list among the coding dictionary (101) with headed by the lead-in (grinding) of this Chinese fragment in (201) systems inspection.To each seek coding unit, set up a virtual route, and be labeled as cut point at the relative position of coding unit tail end.
Index refers at " grinding " word.Set up three initial paths.
Path: grind | 9000000
Research | 496000000
Research institute | 18000000
Step 3:
Checkpoint (204) checks that whether surpassing a path has cut point in present index location.
In this example, checkpoint (204) check result finds that not surpassing a path has cut point in target index location (grinding).Index moves to " studying carefully " word.
Whether (203) are positioned at the path of present index location (studying carefully) to all cut points at this moment, check with the different length word string of forming headed by the word of present index location and list among the coding dictionary (101).Found that with the word string headed by " studying carefully " and to study carefully institute as " studying carefully " ", " studying carefully " etc. is neither in the coding dictionary, delete this path " grind |, remaining:
Path: research | 496000000
Research institute | 18000000
Step 4
Index moves to the 3rd word.Repetitive operation (203), all cut points are done following action with the path that index location is identical at present: whether inspection lists among the coding dictionary (101) with composition different length word string headed by the word of present index location (institute).If any.With each seek coding unit, be connected on and form one or several new routes after the original route, and be labeled as cut point at each new route tail end.
In this example, " institute " word becomes so meet " institute " after path " research | " " study | institute | " among coding dictionary (101).
Path: research | institute | 1215200
Research institute | 18000000
Step 5:
(204) check that whether surpassing a path has cut point in present index location in the checkpoint equally.As surpass, then carry out action (202), calculate the path that the present index location in place has cutting, comprehensive probability may take place in the cutting mode of the index location from lead-in to present index.The algorithm of comprehensive probability is taken advantage of for the appearance probability of all compression units that are connected is looked into.Compare its size.Relatively, stay the path that the probability maximum takes place, delete all the other paths.
Check result finds to surpass a path has cut point in present index location, so relatively its generation probability is big little.The path: " research | institute | " deletion, because the probability (18000000) that probability (1215200) is lower than path " research institute | " takes place in it.The result is remaining:
Path: research institute | 18000000
Step 6:
Index moves to the 4th word.Repeat (203) action.Obtain two possible compression units " " and " really " be connected on " research institute | " afterwards, each forms a new route.
Path: research institute | | 979884
Research institute | really | 1260
Step 7:
Index moves to the 5th word.Repeat (203) action.Obtaining two other possible compression unit " really " and " really " is connected on " research institute | | " afterwards, each forms a new route.
Path: research institute | | really | 65.652228
Research institute | | really | 75.451068
Research institute | really | 1260.000000
Step 8
The judgement of foundation (204) repeats (202) action, and finding to surpass a path has cut point in present index location, and relatively probability takes place for it.
The path " research institute | | really |, deletion lower because probability (75.451068) takes place for it than path " research institute | really | " (1260.000000).Obtain:
Path: research institute | | really | 65.652228
Research institute | really | 1265.000000
Step 9
Index moves to the 6th word.According to the judgement of (204), repeat the action of (203).Obtain two other possible compression unit " reality " and " implementation " and be connected on " research | really | " afterwards, each forms a path, that is:
Path: research institute | | really | 65.652228
Research institute | really | real | 0.225540
Research institute | really | carry out | 0.105840
Step 10:
According to the judgement of (204), repeat the action of (202), finding to surpass a path has cut point in present index location, and relatively probability takes place in it.
The path " research institute | really | real | " deletion because its take place probability (0.225540) than the path " research institute | | really | " (65.652228) low.Obtain:
Path: research institute | | really | 65.652228
Research institute | really | carry out | 0.105840
Step 11:
Index moves to the 7th word.Checkpoint (204) check result finds not surpass a path has cut point in present index location (moving).Index moves to " moving " word:
Path: research institute | | really | action | 0.01 0964
Research institute | really | carry out | 0.105840
Step 12
Index moves to the Eight characters:
Path: research institute | | really | action | 0.010964
Research institute | really | carry out | moving | 0.000019
Step 13:
According to the judgement of (204), repeat the action of (202), finding to surpass a path has cut point in present index location, and relatively probability takes place in it.
The path " research institute | really | carry out | moving | " deletion because its take place probability (0.000019) than the path " research institute | | really | action | " (0.010964) low.Obtain:
Path: research institute | | really | action | 0.010964
Step 14:
Reach the sentence tail end, the compression processing unit cutting action finishes.
(3) invention effect
As shown in the above description, utilize Chinese file shelves compression processing method of the present invention and device, can obtain correct cutting effect, and can reach the good compression effect.
Table 2 shows the effect of utilizing Chinese file shelves compression processing method of the present invention and device to handle 3 Chinese file shelves gained.
Table 2
File name Original shelves length (byles) Original shelves length (bits) The compression shelves 6Length (bits) Coding schedule 7Length (bils) Compression ratio I 1 Compression ratio II 2
SOLVER 3 S370 4 LOTUS 5 49592 589238 243212 396736 4713904 1945696 141671 1909176 744952 36250 125091 70754 35.709 40.501 38.287 44.846 43.155 41.924
1Compression ratio 1=compression grade long/original shelves are long. 2Compression ratio 2=(compression grade long+coding schedule is long)/original shelves are long. 3Data source: Lotus 1-2-2f or Windows applied problem solution. 4Data source: IBM System370 service manual. 5Data source: Lotus 1-2-3 for Windows Chinese service manual. 6Punctuate and English in the original shelves all do not give processing. 7Coding schedule is made up of contained Chinese word in the file shelves.
By shown in this table as can be known, the present invention can obtain about compression ratio about in the of 35-55% when doing the compression of Chinese archives and handle, really have the raising effect.Moreover applicable scope of the present invention can comprise various data compression or other processing through frequency coding, also has purposes is extensively arranged.
It more than is embodiment of the invention explanation.Those skilled in the art are not difficult by understanding spirit of the present invention in the above explanation, and make different variations according to this and extend, in any case but, all belong to the present invention's category.

Claims (10)

1.一种中文文件档的压缩处理装置,包含1. A compression processing device for Chinese file files, comprising 一输入装置,用以输入待处理的中文文件档;An input device for inputting Chinese documents to be processed; 一存贮器,供储存所处理文件档;a memory for storing the processed documents; 一编码词典,记载每一压缩处理单元的使用机率;a coding dictionary, recording the usage probability of each compression processing unit; 一压缩处理单元切割装置,可将文件档切割成低混乱度的压缩处理单元码串;A compression processing unit cutting device, which can cut the file into a low-confusion compression processing unit code string; 一编码对照表,记载代表各压缩处理单元的压缩码,该对照表系依据各压缩处理单元在中文文件档中的平均出现机率编码,给予出现机率较高者较短的码,给予出现机率较低者较长的码;A code comparison table, which records the compression codes representing each compression processing unit. The comparison table is coded according to the average occurrence probability of each compression processing unit in the Chinese file. The lower one is the longer code; 一压缩装置,将该压缩处理单元切割装置切割处理过的中文文件档,参考该编对照表,把文件档中的压缩处理单元逐一转换成压缩码而得到压缩档;A compression device, the compression processing unit cutting device cuts the processed Chinese file, refers to the compilation comparison table, converts the compression processing units in the file file into compressed codes one by one to obtain a compressed file; 其中的压缩处理单元切割装置含有:Among them, the compression processing unit cutting device contains: 一暂存装置,可从欲处理的中文文件取出一个中文片段;A temporary storage device, which can take out a Chinese segment from the Chinese file to be processed; 第一检查装置,检查以该中文片段的特定字为首组成的不同长度字串是否列于该编码词典之中,在寻得的字串尾端注明为切割点,并对所有前后连贯的字串,建立一虚拟路径;The first checking device checks whether the word strings of different lengths formed by the specific word of the Chinese segment are listed in the code dictionary, and the end of the word string found is marked as a cutting point, and all consecutive words String, establish a virtual path; 第二检查装置,检查是否超过一路径在同一位置有切割点,对所有切割点位于该位置的路径,检查以该位置的字为首组成的不同长度字串,是否列于编码词典之中,无法在词典中找到该字串,则删除此路径;The second checking device checks whether more than one path has a cutting point at the same position, and for all paths where the cutting point is located at this position, checks whether the word strings of different lengths formed with the word at this position are listed in the coding dictionary, and cannot If the string is found in the dictionary, delete the path; 一计算装置,参考该编码词典记载的每一字串的使用机率,计算各路径机率分率;A calculation device, referring to the use probability of each word string recorded in the code dictionary, to calculate the probability fraction of each path; 一比较装置,比较各路径的机率分散,以分散最高的路径,作为切割之结果。A comparison means compares the probability distribution of each path, and takes the path with the highest dispersion as the result of cutting. 其特征在于,该压缩处理单元切割装置系利用该编码词典中记载各压缩处理单元的使用机率,计算各种可能的切割结果机率分数,将机率分数最高的切割结果,供压缩装置处理。It is characterized in that the cutting device of the compression processing unit uses the use probability of each compression processing unit recorded in the code dictionary to calculate the probability scores of various possible cutting results, and supplies the cutting result with the highest probability score to the compression device for processing. 2.如权利要求1项的装置,另含一反压缩处理装置,可将该存贮器中的压缩文件档,参考该编码对照表,把经过压缩的文件档的压缩码逐一反转成原始压缩单元,而将该存贮器中的压缩文件档转换成为原来的文件档。2. The device as claimed in claim 1, further containing an anti-compression processing device, which can refer to the code comparison table for the compressed file in the memory, and reverse the compressed code of the compressed file into the original one by one. A compression unit is used to convert the compressed file in the memory into an original file. 3.如权利要求1项的装置,其中该压缩处理单元切割装置的暂存装置,可从欲处理的中文文件档中取出一个中文句子,作为处理的片段。3. The device according to claim 1, wherein the temporary storage device of the compression processing unit and the cutting device can take out a Chinese sentence from the Chinese file to be processed as a processing segment. 4.如权利要求1项的装置,其中该压缩处理单元切割装置的编码词典,记载的压缩处理单元系中文之词。4. The device according to claim 1, wherein the compression processing unit cuts out the coding dictionary of the device, and the compression processing unit recorded is a word in Chinese. 5.如权利要求1项的装置,其中该压缩处理单元切割装置的计算装置,系将各路径所含压缩处理单元在该编码词典记载使用机率相乘,以计算各路径的机率分数。5. The device as claimed in claim 1, wherein the computing device of the compression processing unit cutting device multiplies the usage probabilities recorded in the code dictionary of the compression processing units contained in each path to calculate the probability score of each path. 6.一种中文文件档的压缩处理方法,包含6. A compression processing method for Chinese files, comprising 一输入步骤,输入待处理中文文件档,储存于存贮器中;An input step, input the Chinese file to be processed, and store it in the memory; 一切割步骤,将该文件档切割成低混乱度的压缩处理单元码串:A cutting step, cutting the file into low-confusion compression processing unit code strings: 一压缩步骤,将该切割步骤切割处理过的中文文件档,参考一记载各压缩处理单元的压缩码的编码对照表,将文件档中的压缩处理单元逐一转换成压缩码得到压缩档;其中,该对照表系依据各压缩处理单元在中文文件档中的平均出现机率编码,给予出现机率较高者较短的码,给予出现机率较低者较长的码;A compression step, the cutting step cuts the processed Chinese file, refers to a code comparison table that records the compression codes of each compression processing unit, and converts the compression processing units in the file file into compressed codes one by one to obtain a compressed file; wherein, The comparison table is coded according to the average occurrence probability of each compression processing unit in the Chinese file, and the one with the higher probability of occurrence is given a shorter code, and the one with a lower probability of appearance is given a longer code; 其中的切割步骤含有:The cutting steps include: 一暂存步骤,从欲处理的中文文件档中取出一个中文片段;A temporary storage step, taking out a Chinese segment from the Chinese file to be processed; 第一检查步骤,指标置于中文片段的首字,检查以该字为首组成的不同长度字串是否列于该编码词典之中,并对每一个寻得的字串,建立一虚拟路径,在字串尾端注明为切割点;In the first checking step, the index is placed on the first character of the Chinese segment, and whether the character strings of different lengths formed with the character as the beginning are listed in the code dictionary, and for each found character string, a virtual path is established, and the The end of the string is marked as a cutting point; 第二检查步骤,检查是否超过一路径在目前指标位置有切割点,如否,指标往前移动一个位置;The second checking step is to check whether more than one path has a cutting point at the current pointer position, if not, the pointer moves forward by one position; 第三检查步骤,对所有切割点与目前指标位置相同之路径,检查以目前指标位置的字为首组成的不同长度的字串是否列于编码词典之中,将每一个寻得编码,接在原路径之后组成新路径,并在每一新路径尾端注明为切割点,若无法在该编码词典中找到该字串,则删除该路径;The third checking step is to check whether the strings of different lengths formed with the word at the current pointer position are listed in the code dictionary for all paths whose cutting points are the same as the current pointer position, and connect each found code to the original path Form a new path afterwards, and mark it as a cutting point at the end of each new path, if the word string cannot be found in the code dictionary, then delete the path; 一计算步骤,对第二检查步骤检查出超过一个在目前指标位置有切割点之路径,参考该编码词典记载的每一字串的使用机率,计算该路径,由首字至目前的指标位置的机率分数;A calculation step, for the second checking step, check out more than one path with cutting points at the current index position, refer to the use probability of each word string recorded in the code dictionary, calculate the path, from the first word to the current index position probability score; 一比较步骤,比较各路径的机率分数,将分数最高的路径,供作压缩处理;A comparison step, comparing the probability scores of each path, and using the path with the highest score for compression processing; 第四步骤,检查是否已处理至该处理中文片段的末字,如否,则重复上述处理,如是,则结束分割处理;The fourth step is to check whether the last character of the processed Chinese segment has been processed, if not, repeat the above processing, and if so, end the segmentation process; 其特征在于,该切割步骤系利用一编码词典中记载的各压缩单元的使用机率,计算各种可能的切割结果的机率分数,将机率分数最高的切割结果,供压缩步骤处理。It is characterized in that the cutting step uses the usage probability of each compression unit recorded in a coding dictionary to calculate the probability scores of various possible cutting results, and the cutting result with the highest probability score is provided to the compression step for processing. 7.如权利要求第6项的方法,进一步包含一反压缩步骤,可将该压缩文件档,参考编码对照表,把经过压缩的文件档的压缩码逐一反转成原始压缩单元,而将该存贮器中的压缩文件档转换成为原来的文件档。7. The method according to claim 6, further comprising an anti-compression step, which can refer to the code comparison table for the compressed file, and reverse the compressed code of the compressed file into the original compression unit one by one, and the The compressed file in memory is converted to the original file. 8.如权利要求第6项的方法,其中该切割步骤的暂存步骤是从欲处理的中文文件档中取出的一中文句子作处理的片段。8. The method according to claim 6, wherein the temporary storage step of the cutting step is a segment of a Chinese sentence taken out from the Chinese file to be processed for processing. 9.如权利要求第6项的方法,其中该切割步骤所使用的编码词典记载之压缩处理单元系中文词。9. The method according to claim 6, wherein the compression processing unit recorded in the coding dictionary used in the cutting step is a Chinese word. 10.如权利要求第6项的方法,其中该切割步骤的计算步骤,系将各路径所含压缩处理单元在该编码词典记载使用机率相乘,以计算各路径之机率分数。10. The method according to claim 6, wherein the calculation step of the cutting step is to multiply the usage probabilities of the compression processing units contained in each path recorded in the coding dictionary to calculate the probability score of each path.
CN 92109706 1992-08-22 1992-08-22 Chinese file compression processing method and device Expired - Fee Related CN1031680C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 92109706 CN1031680C (en) 1992-08-22 1992-08-22 Chinese file compression processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 92109706 CN1031680C (en) 1992-08-22 1992-08-22 Chinese file compression processing method and device

Publications (2)

Publication Number Publication Date
CN1083638A CN1083638A (en) 1994-03-09
CN1031680C true CN1031680C (en) 1996-04-24

Family

ID=4944307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 92109706 Expired - Fee Related CN1031680C (en) 1992-08-22 1992-08-22 Chinese file compression processing method and device

Country Status (1)

Country Link
CN (1) CN1031680C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8175164B2 (en) 2005-09-01 2012-05-08 Samsung Electronics Co., Ltd. Devices and methods for data compression and decompression

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751451B (en) * 2008-12-11 2012-04-25 高德软件有限公司 A Chinese data compression and decompression method and related equipment
CN101520771A (en) * 2009-03-27 2009-09-02 广东国笔科技股份有限公司 Method and system for code compression and decoding for word library

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8175164B2 (en) 2005-09-01 2012-05-08 Samsung Electronics Co., Ltd. Devices and methods for data compression and decompression

Also Published As

Publication number Publication date
CN1083638A (en) 1994-03-09

Similar Documents

Publication Publication Date Title
CN1183683C (en) Position adaptive coding method using prefix prediction
CN1205574C (en) Method of and apparatus for compressing and expanding data and data processing apparatus and network system using same
KR101157693B1 (en) Multi-stage query processing system and method for use with tokenspace repository
CN1228887A (en) Data compression and decompression system interleaved with string search for on-the-fly dictionary updates
CN101079028A (en) On-line translation model selection method of statistic machine translation
CN1585968A (en) Method for compressing dictionary data
CN1388443A (en) Method and apparatus for data compression, computer program and storage media
CN1868127A (en) Data compression system and method
CN1475907A (en) Machine translation system based on examples
CN1163841C (en) Online Handwritten Chinese Character Recognition Device
KR20110007865A (en) Data Compression Method
De Moura et al. Direct pattern matching on compressed text
CN101036298A (en) System and method for static Huffman decoding
CN101751416A (en) Method for ordering and seeking character strings
CN1287320C (en) Method of compressing digital ink
CN1031680C (en) Chinese file compression processing method and device
CN1267963A (en) Data compression equipment and data restorer
CN101075261A (en) Method and device for compressing index
Wiseman et al. Conjugation-based compression for Hebrew texts
CN101055593A (en) Tibetan web page and its code identification method
CN100343851C (en) Database compression and decompression method
CN1949221A (en) Method and system of storing element and method and system of searching element
CN1916940A (en) Template optimized character recognition method and system
CN1301596C (en) Compression and decompression method of digital image data
CN1189839C (en) Word identifying device and method, and memory medium

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee