
CN108170682A - Chinese word segmentation method and computing device based on specialized vocabulary - Google Patents

Chinese word segmentation method and computing device based on specialized vocabulary

Info

Publication number
CN108170682A
Authority
CN
China
Prior art keywords
participle
character
determined
entry
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810050618.7A
Other languages
Chinese (zh)
Other versions
CN108170682B (en)
Inventor
吕洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tongsheng Science & Technology Co Ltd
Original Assignee
Beijing Tongsheng Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tongsheng Science & Technology Co Ltd
Priority to CN201810050618.7A
Publication of CN108170682A
Application granted
Publication of CN108170682B
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese word segmentation method based on specialized vocabulary. The method is suitable for execution in a computing device and comprises: constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries in the dictionary that share the same first character are arranged in ascending order of Unicode code, a plurality of first arrays are established to store entries sharing the same first character, and at least one second array is established in each first array to store the entry content and a flag, the flag identifying whether the entry belongs to the specialized vocabulary; looking up one or more character strings of the sentence to be segmented in the dictionary by binary search, thereby obtaining a plurality of candidate words after preliminary segmentation; setting a segmentation weight for each candidate word according to its flag; and constructing segmentation paths from the candidate words and their segmentation weights and choosing the shortest path as the segmentation result. The invention also discloses a computing device for performing the method.

Description

Chinese word segmentation method and computing device based on specialized vocabulary
Technical field
The present invention relates to the technical field of information processing, and in particular to a Chinese word segmentation method based on specialized vocabulary and a computing device.
Background
Chinese information processing technology is widely used in computing fields such as computer networks, database technology and software engineering. Automatic Chinese word segmentation is an important foundation of Chinese information processing, and many Chinese information processing projects involve the segmentation problem, for example machine translation, automatic summarization, automatic classification, full-text retrieval of Chinese document databases, and search engines. Because Chinese text is written continuously, with no spaces between words, the first problem encountered in processing Chinese text is segmentation, and correct segmentation is a prerequisite for Chinese text processing. Moreover, Chinese word segmentation methods are not limited to Chinese; they can also be applied to English processing. In handwriting recognition, for example, the spaces between words are not always clear, and Chinese segmentation methods can help identify the boundaries of English words. Research on Chinese word segmentation technology is therefore of great significance.
Although the basic unit of expression in Modern Chinese is the "word", and two-character or multi-character words are in the majority, people's levels of understanding differ, so the boundary between words and phrases is hard to draw. For example, in "punish those who spit everywhere", whether "those who spit everywhere" is itself a word or a phrase is judged by different standards by different people. The same holds for "sea", "brewery" and so on, and even the same person may make different judgments. In addition, the dictionaries used by existing Chinese word segmentation techniques are fairly general and include no dictionary dedicated to specialized vocabulary, which may well lead to inaccurate segmentation results.
A Chinese word segmentation method that can recognize specialized vocabulary is therefore needed in order to further improve segmentation accuracy.
Summary of the invention
To this end, the present invention provides a Chinese word segmentation method based on specialized vocabulary and a computing device, in an effort to solve, or at least alleviate, the problems described above.
According to one aspect of the invention, a Chinese word segmentation method based on specialized vocabulary is provided. The method is suitable for execution in a computing device and comprises the steps of: constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries in the dictionary that share the same first character are arranged in ascending order of Unicode code, a plurality of first arrays are established to store entries sharing the same first character, and at least one second array is established in each first array to store the entry content and a flag, the flag identifying whether the entry belongs to the specialized vocabulary; looking up one or more character strings of the sentence to be segmented in the dictionary by binary search, thereby obtaining a plurality of candidate words after preliminary segmentation; setting a segmentation weight for each candidate word according to its flag; and constructing segmentation paths from the candidate words and their segmentation weights and choosing the shortest path as the segmentation result.
Optionally, in the method according to the invention, the step of setting a segmentation weight for each candidate word according to its flag comprises: if the flag of the candidate word indicates that it belongs to the specialized vocabulary, setting a first segmentation weight for it; and if the flag indicates that it does not belong to the specialized vocabulary, setting a second segmentation weight for it, wherein the first segmentation weight is less than the second segmentation weight.
Optionally, in the method according to the invention, the step of constructing segmentation paths from the candidate words and their segmentation weights and choosing the shortest path as the segmentation result comprises: taking each character of the sentence to be segmented as a node, wherein the first character of the sentence is the start node and the last character is the end node; constructing, in order, a plurality of segmentation paths between the start node and the end node according to the candidate words; calculating the length of each segmentation path using the segmentation weight of each candidate word; and choosing the segmentation path with the shortest length as the segmentation result.
Optionally, in the method according to the invention, the step of constructing the dictionary with the predetermined structure by reading in entries one by one comprises: establishing an input stream to read in entries in sequence; judging whether there exists a first array for storing entries that begin with the first character of the entry; if no such first array exists, creating, according to the first character of the entry read in, a first array for storing all entries that begin with that character; establishing a second array in the first array to store the entry content; judging whether the entry belongs to the specialized vocabulary and, if it does, assigning a first value to its flag; and, if it does not, assigning a second value to its flag.
Optionally, in the method according to the invention, before the step of looking up one or more character strings of the sentence to be segmented in the dictionary by binary search to obtain a plurality of candidate words after preliminary segmentation, the method further comprises the steps of: identifying the non-Chinese characters in the source sentence to be processed; and removing the identified non-Chinese characters from the source sentence to obtain the sentence to be segmented.
Optionally, in the method according to the invention, the non-Chinese characters include punctuation marks, numeric characters, English characters, and ignorable invisible characters.
Optionally, in the method according to the invention, the step of looking up one or more character strings of the sentence to be segmented in the dictionary by binary search to obtain a plurality of candidate words after preliminary segmentation comprises, for each character in the sentence to be segmented: locating, according to the Unicode code of the character, the first array that stores the entries beginning with that character; forming at least one character string that begins with the character, and searching for the character string among all entries of the first array by binary search; and, when an entry corresponding to the character string is found, taking the character string as a candidate word.
Optionally, in the method according to the invention, the step of forming at least one character string that begins with the character and searching for it among all entries of the first array by binary search further comprises: if the first array contains an entry consisting only of that character, judging the character to be a whole word; and taking the character as a candidate word.
According to another aspect of the invention, a computing device is provided, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.
According to a further aspect of the invention, a computer-readable storage medium storing one or more programs is provided, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any of the methods described above.
In the Chinese word segmentation scheme based on specialized vocabulary according to the invention, a flag indicating whether an entry is specialized vocabulary is added when the dictionary is built. During segmentation, a candidate word judged to be specialized vocabulary can then be given a smaller segmentation weight, the length of each segmentation path is calculated from the segmentation weights and the path, and the shortest path is chosen as the segmentation result. Introducing this scoring mechanism resolves the path-selection problem that may arise and ensures the accuracy of the segmentation result. It not only handles overlapping ambiguity better but also achieves a higher recognition rate for specialized vocabulary in professional domains, so applying the technique in different industries can yield higher segmentation accuracy.
Brief description of the drawings
To achieve the above and related purposes, certain illustrative aspects are described herein in conjunction with the following description and the accompanying drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, the same reference numerals generally refer to the same parts or elements.
Fig. 1 shows a structural diagram of a computing device 100 according to an embodiment of the invention;
Fig. 2 shows a flowchart of a Chinese word segmentation method 200 based on specialized vocabulary according to an embodiment of the invention; and
Fig. 3 shows a flow diagram of constructing the dictionary with the predetermined structure according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Fig. 1 shows a structural diagram of a computing device 100 according to an embodiment of the invention.
In a basic configuration 102, the computing device 100 typically comprises a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-1 cache 110 and a level-2 cache 112, a processor core 114 and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 may be used together with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 106 may include an operating system 120, one or more applications 122 and program data 124. In some embodiments, the applications 122 may be arranged to be executed on the operating system 120 by the one or more processors 104 using the program data 124.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144 and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to facilitate communication with various external devices such as a display or loudspeakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication, via one or more I/O ports 158, with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
The network communication link may be one example of a communication medium. Communication media may typically be embodied as computer-readable instructions, data structures or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
The computing device 100 may be implemented as a server, such as a file server, database server, application server or web server, or as part of a small-form-factor portable (or mobile) electronic device, such as a cellular phone, a personal digital assistant (PDA), a personal media player, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. The computing device 100 may also be implemented as a personal computer, including both desktop and notebook configurations.
In an implementation according to the invention, the computing device 100 is configured to perform the Chinese word segmentation method based on specialized vocabulary according to the invention. The one or more applications 122 of the computing device 100 include instructions for performing the Chinese word segmentation method 200 based on specialized vocabulary according to the invention.
Fig. 2 shows a flowchart of the Chinese word segmentation method 200 based on specialized vocabulary according to an embodiment of the invention.
The method 200 begins at step S210, in which a dictionary with a predetermined structure is constructed by reading in entries one by one.
According to one embodiment of the invention, in the constructed dictionary, entries sharing the same first character are arranged in ascending order of Java's built-in Unicode codes, that is, in the order running from "one" to "tortoise".
Because Unicode includes some rarely used characters, such as "Pie" and "Fu", loading all Unicode codes at once without screening would waste space and also increase the number of comparisons in subsequent lookups. Therefore, in the dictionary with the predetermined structure according to the invention, a plurality of first arrays are established to store entries sharing the same first character, and at least one second array is established in each first array. Each second array stores the content and the flag of one entry, and the flag identifies whether the entry belongs to the specialized vocabulary. In other words, all entries sharing the same first character form one word block (that is, a first array), and a plurality of second arrays are built within each first array. Each second array contains a string constant and an integer constant, where the string constant stores the content of the entry and the integer constant stores the flag.
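The two-level layout described above can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the mapping name `dictionary`, the flag values 1 and 0, and the sample entries are assumptions chosen for the example.

```python
# Assumed flag values; the text only requires that the two be distinguishable.
SPECIALIZED, GENERAL = 1, 0

# One "first array" per first character; each element plays the role of a
# "second array" holding (entry text, flag). Sample entries are illustrative.
dictionary = {
    "三": [("三氯甲烷", SPECIALIZED), ("三", GENERAL)],
    "甲": [("甲烷", SPECIALIZED), ("甲", GENERAL)],
}

# Keep every first array sorted by entry text so it can be binary-searched.
for first_array in dictionary.values():
    first_array.sort(key=lambda pair: pair[0])

# The first characters themselves, in ascending Unicode code-point order.
first_chars = sorted(dictionary, key=ord)
```

Only first characters that actually occur in some entry get a first array, which is how the layout avoids allocating space for every Unicode code.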
Table 1 shows one form of the dictionary structure according to an embodiment of the invention.
Table 1
An embodiment of the invention also provides a process for constructing the dictionary with the predetermined structure by reading in entries one by one, as shown in Fig. 3.
Using a file stream, first in step S310 an input stream is established to read in entries in sequence, and it is judged whether the end of the input stream has been reached. If the end has been reached, all entries have been read in; if not, the following steps continue.
Then, in step S320, for the entry read in, it is judged whether there exists a first array for storing entries that begin with the first character of that entry. For example, if the entry read in is "reason", a word whose first character is "road", it must be judged whether the current dictionary contains a first array with "road" as the first character.
In step S330, if no such first array exists, a first array for storing all entries that begin with that character is created according to the first character of the entry read in. That is, if there is no first array with "road" as the first character, such a first array is created in the dictionary to store all entries beginning with "road".
Next, in step S340, a second array is established in the first array to store the content of the entry. Of course, if it is judged that a first array with "road" as the first character already exists in the dictionary, the process goes directly to step S340 and creates a second array in that first array to store the entry "reason".
Then, in step S350, it is judged whether the current entry belongs to the specialized vocabulary. If it does, a first value is assigned to its flag; if it does not, a second value is assigned to its flag. The flag is then written into the second array. Optionally, the first value is represented by 00 and the second value by 01, or the first value by 9 and the second value by 1. Any values may be used as long as the flag clearly distinguishes specialized from non-specialized vocabulary; the embodiments of the invention place no limitation on this.
Optionally, specialized vocabulary from different professional fields may also be distinguished by assigning different values to the flag. For example, the flag is set to 9 for specialized vocabulary of the hazardous chemicals industry and to 8 for specialized vocabulary of the radio, film and television industry. The embodiments of the invention place no limitation on this.
The process then loops back to step S310 to read in the next entry and performs steps S320 to S350, until the end of the input stream is reached and the dictionary has been created.
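The loop over steps S310 to S350 might be sketched as follows. This is a hedged illustration under stated assumptions: the function name `build_dictionary`, the one-entry-per-line input format, the use of a set `specialized_terms` to decide whether an entry is specialized vocabulary, and the flag values are all hypothetical, since the text does not specify them.

```python
import io

SPECIALIZED, GENERAL = 1, 0  # assumed flag values

def build_dictionary(stream, specialized_terms):
    """Group entries (one per line) by first character, storing (text, flag)."""
    dictionary = {}
    for line in stream:                                   # S310: read until the stream ends
        entry = line.strip()
        if not entry:
            continue
        # S320/S330: reuse the first array for this first character or create it
        first_array = dictionary.setdefault(entry[0], [])
        # S350: choose the flag (membership in a given set is an assumption)
        flag = SPECIALIZED if entry in specialized_terms else GENERAL
        if (entry, flag) not in first_array:
            first_array.append((entry, flag))             # S340: store the entry
    for first_array in dictionary.values():
        first_array.sort(key=lambda pair: pair[0])        # sorted for binary search
    return dictionary

# Illustrative input; io.StringIO stands in for a file stream.
sample = io.StringIO("甲烷\n三氯甲烷\n泄漏\n三\n")
built = build_dictionary(sample, specialized_terms={"甲烷", "三氯甲烷"})
```

Sorting once after the whole stream has been read keeps the read loop simple; inserting each entry at its sorted position would be an equally valid choice.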
Then, in step S220, one or more character strings of the sentence to be segmented are looked up in the dictionary by binary search, yielding a plurality of candidate words after preliminary segmentation.
According to one implementation of the invention, for a source sentence to be processed, the non-Chinese characters in it are first identified and then removed from the source sentence to obtain the sentence to be segmented. Optionally, the non-Chinese characters include punctuation marks, numeric characters, English characters and ignorable invisible characters such as line feeds, carriage returns and horizontal tabs. This provides basic linguistic information for the subsequent algorithm and improves processing efficiency.
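As a rough illustration of this preprocessing step, the sketch below drops everything outside one Unicode range. The range U+4E00 to U+9FFF (the CJK Unified Ideographs block) is an assumption, since the text does not say which code points count as Chinese characters, and the name `to_segmentable` is hypothetical.

```python
import re

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

def to_segmentable(source: str) -> str:
    """Remove punctuation, digits, Latin letters and whitespace in one pass."""
    return NON_CHINESE.sub("", source)

cleaned = to_segmentable("意外伤害：指ABC 123\n疾病。")
```

A whitelist of this kind removes full-width punctuation and control characters without enumerating them, at the cost of also discarding any ideographs encoded outside the chosen block.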
Specifically, step S220 may be performed as follows: for each character in the sentence to be segmented, the first array storing the entries that begin with that character is located according to the character's Unicode code; at least one character string beginning with the character is formed, and the string is searched for among all entries of the first array by binary search; when an entry corresponding to the string is found, the string is taken as a candidate word.
For example, suppose the source sentence to be processed is:
"Group personal accident injury insurance material benefit plan
Accidental injury: an objective event that is external, sudden, unintended and not due to disease, and that causes harm to the body."
Removing the non-Chinese characters from it gives the sentence to be segmented:
"Group personal accident injury insurance material benefit plan accidental injury refers to an objective event that is external sudden unintended and not due to disease and that causes harm to the body"
Then, taking the first character of the sentence to be segmented, "group", as an example, the first array storing entries that begin with "group" is located in the dictionary. Binary search is then used within that first array to check whether strings beginning with this character (the single character, the character together with the one that follows it, and so on) exist as entries. The search finds the entry "group" in the first array, so the character string "group" is taken as a candidate word. The same lookup is carried out for every other character up to the last one, giving the candidate words after preliminary segmentation.
According to another embodiment of the invention, if the first array contains an entry consisting only of the character, the character is judged to be a whole word (a single character that can form a word by itself is generally called a whole word), and the character is then taken as a candidate word.
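The lookup of step S220 could be sketched as below, using Python's standard `bisect` module for the binary search. The function name `find_candidates`, the `max_len` cap on string length, the `(start, end, flag)` result shape, and the demo entries are assumptions made for illustration.

```python
from bisect import bisect_left

def find_candidates(sentence, dictionary, max_len=8):
    """Return (start, end, flag) for every dictionary entry found in sentence."""
    candidates = []
    for start, ch in enumerate(sentence):
        first_array = dictionary.get(ch)        # first array for this first character
        if not first_array:
            continue
        words = [w for w, _ in first_array]     # entry texts, already sorted
        limit = min(len(sentence), start + max_len)
        for end in range(start + 1, limit + 1):
            text = sentence[start:end]
            i = bisect_left(words, text)        # binary search within the first array
            if i < len(words) and words[i] == text:
                candidates.append((start, end, first_array[i][1]))
    return candidates

# Illustrative dictionary; flag 1 marks specialized vocabulary (an assumption).
demo = {
    "三": [("三", 0), ("三氯甲烷", 1)],
    "氯": [("氯", 0)],
    "甲": [("甲", 0), ("甲烷", 1)],
    "烷": [("烷", 0)],
}
found = find_candidates("三氯甲烷", demo)
```

A single-character entry such as "三" is found by the same search when `end == start + 1`, which corresponds to the whole-word case described above.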
After step S220, the source sentence above yields the following candidate words:
"group, person, accident, wound, injury, insurance, material benefit, plan, accident, wound, injury, refers to, by, external, sudden, non-, original intent, non-, disease, causing, body, injury, objective, event"
The flag of each candidate word is obtained from the second array that stores it. Then, in step S230, a segmentation weight is set for each candidate word according to its flag.
According to one implementation, if the flag of a candidate word indicates that it belongs to the specialized vocabulary, a first segmentation weight is set for it; if the flag indicates that it does not, a second segmentation weight is set for it, and the first segmentation weight is less than the second. Optionally, the first segmentation weight is set to 0.5 and the second to 1.
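The mapping from flag to weight in step S230 is small enough to show in a few lines. The weights 0.5 and 1 are the optional values given in the text; the function name and the convention that a flag of 1 marks specialized vocabulary are assumptions.

```python
SPECIALIZED_WEIGHT = 0.5   # first weight (optional value given in the text)
GENERAL_WEIGHT = 1.0       # second weight; must exceed the first

def segmentation_weight(flag, specialized_flag=1):
    """Return the smaller weight when the flag marks specialized vocabulary."""
    return SPECIALIZED_WEIGHT if flag == specialized_flag else GENERAL_WEIGHT
```

If several professional fields were encoded with different flag values, as the dictionary-construction description allows, `specialized_flag` could be replaced by a set of values or a per-field weight table.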
Then, in step S240, segmentation paths are constructed from the candidate words and their segmentation weights, and the shortest path is chosen as the segmentation result.
According to one implementation, words are segmented using a shortest-path segmentation algorithm. According to one embodiment of the invention, the process of constructing the segmentation paths is as follows:
1) Each character of the sentence to be segmented is taken as a node, with the first character of the sentence as the start node and the last character as the end node.
2) A plurality of segmentation paths between the start node and the end node are constructed in order from the candidate words obtained in step S230.
3) The length of each segmentation path is calculated using the segmentation weight of each candidate word; the length of a path is obtained by summing the scores of the edges corresponding to the words segmented along that path.
If segmentation weights are not considered, the edge corresponding to each word scores 1 point. However, if a character tends to combine with other characters to form a word (that is, it does not readily stand alone as a word), 1 additional point is added to its edge (that is, it scores 2 points); examples are "people" and "reality". On this basis, if the edge corresponding to a word scores x points and its segmentation weight is y, then once the segmentation weight is taken into account the score of that edge is x*y.
4) The segmentation path with the shortest length is selected as the segmentation result.
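Steps 1) to 4) can be sketched as a shortest-path search over a directed acyclic graph. The sketch makes several assumptions: nodes are placed at the boundaries between characters (positions 0 to n) rather than on the characters themselves, which is a common equivalent formulation; each edge is given as `(start, end, cost)` with `cost` already equal to x*y; and the names and sample edges are illustrative.

```python
def shortest_segmentation(sentence, edges):
    """edges: (start, end, cost) triples over boundary positions 0..len(sentence)."""
    n = len(sentence)
    best = [float("inf")] * (n + 1)   # best[i]: cheapest cost of reaching position i
    back = [None] * (n + 1)           # back[i]: where the last word on that path starts
    best[0] = 0.0
    # Every edge points forward, so visiting edges in order of their start
    # position guarantees best[start] is final before it is used.
    for start, end, cost in sorted(edges):
        if best[start] + cost < best[end]:
            best[end] = best[start] + cost
            back[end] = start
    if n > 0 and back[n] is None:
        raise ValueError("no segmentation path reaches the end of the sentence")
    words, pos = [], n
    while pos > 0:                    # walk the back-pointers to recover the words
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1], best[n]

# Illustrative edges: base score 2 for characters that rarely stand alone,
# weight 0.5 for specialized terms (both choices are assumptions).
demo_edges = [
    (0, 1, 1 * 1.0), (1, 2, 2 * 1.0), (2, 3, 1 * 1.0), (3, 4, 2 * 1.0),
    (2, 4, 1 * 0.5),   # specialized two-character term
    (0, 4, 1 * 0.5),   # specialized four-character term
]
result, length = shortest_segmentation("三氯甲烷", demo_edges)
```

Because the search keeps only the best cost per position, it does not need to enumerate every segmentation path explicitly, although it selects the same minimum-length path that full enumeration would.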
Suppose the sentence to be segmented is: "Jiangxi Ji'an one tank truck chloroform leakage accident".
Taking each character as a node, the segmentation paths constructed between the start node (the first character, "river") and the end node (the last character, "event") are:
1. Jiangxi / Ji'an / one / tank truck / tri- / chlorine / first / alkane / leakage / accident
2. Jiangxi / Ji'an / one / tank truck / tri- / chlorine / methane / leakage / accident
3. Jiangxi / Ji'an / one / tank truck / chloroform / leakage / accident
Here, methane and chloroform belong to the specialized vocabulary and correspond to the first segmentation weight (e.g., 0.5); the other words correspond to the second segmentation weight (e.g., 1).
The lengths of these three segmentation paths are, respectively:
1. 1+1+1+1+1+2+1+2+1+1=12;
2. 1+1+1+1+1+2+1*0.5+1+1=9.5;
3. 1+1+1+1+1*0.5+1+1=6.5.
In summary, the segmentation corresponding to path 3, which has the shortest length, is chosen as the segmentation result.
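The three lengths above can be checked mechanically. In the sketch below each word on a path is written as a pair (x, y) of base score and segmentation weight, taken directly from the sums given in the example; the helper name `path_length` is hypothetical.

```python
def path_length(path):
    """Sum x * y over the (base score, segmentation weight) pairs of a path."""
    return sum(x * y for x, y in path)

# Path 1: ten words; the 6th and 8th carry a base score of 2.
path_1 = [(1, 1)] * 5 + [(2, 1), (1, 1), (2, 1), (1, 1), (1, 1)]
# Path 2: nine words; "methane" is specialized (weight 0.5).
path_2 = [(1, 1)] * 5 + [(2, 1), (1, 0.5), (1, 1), (1, 1)]
# Path 3: seven words; "chloroform" is specialized (weight 0.5).
path_3 = [(1, 1)] * 4 + [(1, 0.5), (1, 1), (1, 1)]

lengths = [path_length(p) for p in (path_1, path_2, path_3)]
shortest = lengths.index(min(lengths)) + 1   # 1-based path number
```

The computed lengths agree with the sums listed above, and the minimum falls on path 3.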
In the Chinese word segmentation scheme based on specialized vocabulary according to the invention, a flag indicating whether an entry is specialized vocabulary is added when the dictionary is built. During segmentation, a candidate word judged to be specialized vocabulary can then be given a smaller segmentation weight, the length of each segmentation path is calculated from the segmentation weights and the path, and the shortest path is chosen as the segmentation result. Introducing this scoring mechanism resolves the path-selection problem that may arise and ensures the accuracy of the segmentation result. It not only handles overlapping ambiguity better but also achieves a higher recognition rate for specialized vocabulary in professional domains, so applying the technique in different industries can yield higher segmentation accuracy.
In the specification provided in this place, numerous specific details are set forth.It is to be appreciated, however, that the implementation of the present invention Example can be put into practice without these specific details.In some instances, well known method, knot is not been shown in detail Structure and technology, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units, or components of the device in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
In addition, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
The various techniques described herein may be implemented in hardware, software, or a combination of both. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy disks, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code executes on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to perform the method of the present invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc. to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisaged within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (10)

1. A Chinese word segmentation method based on specialized vocabulary, the method being suitable for execution in a computing device and comprising the steps of:
constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries in the dictionary that share the same first character are arranged in ascending order of Unicode code, a plurality of first arrays are established to store the entries that share the same first character, and at least one second array is established in each first array to store the entry content and a flag, the flag identifying whether the entry belongs to specialized vocabulary;
searching the dictionary, using binary search, for one or more character strings in the sentence to be segmented, to obtain a plurality of participles to be determined after preliminary cutting;
setting a participle weight for each participle to be determined according to the flag corresponding to that participle to be determined; and
constructing cutting routes according to the plurality of participles to be determined and their participle weights, and choosing the shortest route as the word segmentation result.
2. the method for claim 1, wherein each the corresponding flag of participle to be determined is to be determined to this for the basis The step of participle setting participle weight, includes:
If the corresponding flag of participle to be determined indicates that the participle to be determined belongs to specialized vocabulary, the first participle is set to weigh it Weight;
If the corresponding flag of participle to be determined indicates that the participle to be determined is not belonging to specialized vocabulary, the second participle is set to it Weight,
Wherein, the first participle weight is less than the second participle weight.
3. The method of claim 1 or 2, wherein the step of constructing cutting routes according to the plurality of participles to be determined and their participle weights and choosing the shortest route as the word segmentation result comprises:
taking each character in the sentence to be segmented as a node, wherein the first character of the sentence to be segmented is the start node and the last character is the terminal node;
sequentially constructing a plurality of cutting routes between the start node and the terminal node according to the participles to be determined;
calculating the length of each cutting route from the participle weight of each participle to be determined; and
choosing the cutting route with the shortest length as the word segmentation result.
4. The method of any one of claims 1-3, wherein the step of constructing a dictionary with a predetermined structure by reading in entries one by one comprises:
establishing an input stream to read in entries in sequence;
judging whether a first array exists for storing entries whose first character is the first character of the entry read in;
if no such first array exists, creating, according to the first character of the entry read in, a first array for storing all entries having that first character;
establishing a second array in the first array to store the entry content;
judging whether the entry belongs to specialized vocabulary and, if it is specialized vocabulary, assigning a first value to its flag; and
if it is not specialized vocabulary, assigning a second value to its flag.
5. The method of any one of claims 1-4, wherein, before the step of searching the dictionary, using binary search, for one or more character strings in the sentence to be segmented to obtain a plurality of participles to be determined after preliminary cutting, the method further comprises the steps of:
identifying the non-Chinese characters in a source sentence to be processed; and
removing the identified non-Chinese characters from the source sentence to be processed, to obtain the sentence to be segmented.
6. The method of claim 5, wherein the non-Chinese characters include punctuation marks, numeric characters, English characters, and non-visible characters that are to be ignored.
7. The method of any one of claims 1-6, wherein the step of searching the dictionary, using binary search, for one or more character strings in the sentence to be segmented to obtain a plurality of participles to be determined after preliminary cutting comprises:
for each character in the sentence to be segmented:
looking up, according to the Unicode code of the character, the first array that stores the entries having the character as their first character;
forming at least one character string with the character as its first character, and searching for the character string among all entries of the first array by binary search; and
when an entry corresponding to the character string is found, taking the character string as a participle to be determined.
8. The method of claim 7, wherein the step of forming at least one character string with the character as its first character and searching for the character string among all entries of the first array by binary search further comprises:
if the first array contains an entry consisting only of the character, judging the character to be a whole word; and
taking the character as a participle to be determined.
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods according to claims 1-8.
10. A computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any of the methods according to claims 1-8.
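For illustration only, the dictionary construction, non-Chinese character removal, and binary search steps recited in the method claims might be sketched in Python as follows. Everything not named in the claims is an assumption: the first arrays are held in a `dict` keyed by first character instead of being indexed by Unicode code, each second array is a `(entry, flag)` tuple, the flag values `1` and `0`, the `max_len` limit, and the restriction of "Chinese character" to the basic CJK Unified Ideographs range are all hypothetical choices.

```python
import bisect

FIRST_VALUE, SECOND_VALUE = 1, 0  # hypothetical flag values


def build_dictionary(entries):
    """Group (entry, is_specialized) pairs into first arrays by first character."""
    first_arrays = {}
    for entry, is_specialized in entries:
        flag = FIRST_VALUE if is_specialized else SECOND_VALUE
        bucket = first_arrays.setdefault(entry[0], [])
        # Python compares strings by code point, so each first array
        # stays in ascending Unicode order.
        bisect.insort(bucket, (entry, flag))
    return first_arrays


def strip_non_chinese(source):
    """Keep only characters in U+4E00..U+9FFF (an assumed definition)."""
    return "".join(ch for ch in source if "\u4e00" <= ch <= "\u9fff")


def lookup(first_arrays, text):
    """Binary-search the first array for text; return its flag or None."""
    bucket = first_arrays.get(text[0], [])
    i = bisect.bisect_left(bucket, (text,))
    if i < len(bucket) and bucket[i][0] == text:
        return bucket[i][1]
    return None


def candidates(first_arrays, sentence, max_len=4):
    """Return (start, string, flag) for every dictionary entry in sentence."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 1, min(len(sentence), start + max_len) + 1):
            text = sentence[start:end]
            flag = lookup(first_arrays, text)
            if flag is not None:
                found.append((start, text, flag))
    return found


toy = build_dictionary(
    [("甲烷", True), ("三氯甲烷", True), ("三", False), ("泄漏", False)]
)
sentence = strip_non_chinese("三氯甲烷 leak, 2018!泄漏")
print(sentence)                   # 三氯甲烷泄漏
print(candidates(toy, sentence))
```

The resulting participles to be determined, together with their flags, are what the weighting and shortest-route steps of claims 2 and 3 would then consume.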
CN201810050618.7A 2018-01-18 2018-01-18 Chinese word segmentation method based on professional vocabulary and computing equipment Expired - Fee Related CN108170682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810050618.7A CN108170682B (en) 2018-01-18 2018-01-18 Chinese word segmentation method based on professional vocabulary and computing equipment

Publications (2)

Publication Number Publication Date
CN108170682A true CN108170682A (en) 2018-06-15
CN108170682B CN108170682B (en) 2021-09-07

Family

ID=62515230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810050618.7A Expired - Fee Related CN108170682B (en) 2018-01-18 2018-01-18 Chinese word segmentation method based on professional vocabulary and computing equipment

Country Status (1)

Country Link
CN (1) CN108170682B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522740A (en) * 2018-10-16 2019-03-26 易保互联医疗信息科技(北京)有限公司 Health data goes privacy processing method and system
CN110825608A (en) * 2018-08-08 2020-02-21 北京京东尚科信息技术有限公司 Key semantic testing method and device, storage medium and electronic equipment
CN114429130A (en) * 2022-01-14 2022-05-03 福建众创车联网络科技有限公司 A method and system for segmentation of auto parts names
CN114510548A (en) * 2021-12-29 2022-05-17 北京空间飞行器总体设计部 Method and device for dictionary construction and classification for spacecraft test identification and evaluation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6879951B1 (en) * 1999-07-29 2005-04-12 Matsushita Electric Industrial Co., Ltd. Chinese word segmentation apparatus
CN103838794A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 A word segmentation method suitable for professional search engines
CN105159949A (en) * 2015-08-12 2015-12-16 北京京东尚科信息技术有限公司 Chinese address word segmentation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张华平等: ""基于N-最短路径方法的中文词语粗分模型"", 《中文信息学报》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825608A (en) * 2018-08-08 2020-02-21 北京京东尚科信息技术有限公司 Key semantic testing method and device, storage medium and electronic equipment
CN110825608B (en) * 2018-08-08 2024-08-16 北京京东尚科信息技术有限公司 Critical semantic testing method and device, storage medium and electronic equipment
CN109522740A (en) * 2018-10-16 2019-03-26 易保互联医疗信息科技(北京)有限公司 Health data goes privacy processing method and system
CN109522740B (en) * 2018-10-16 2021-04-20 易保互联医疗信息科技(北京)有限公司 Health data privacy removal processing method and system
CN114510548A (en) * 2021-12-29 2022-05-17 北京空间飞行器总体设计部 Method and device for dictionary construction and classification for spacecraft test identification and evaluation
CN114429130A (en) * 2022-01-14 2022-05-03 福建众创车联网络科技有限公司 A method and system for segmentation of auto parts names

Also Published As

Publication number Publication date
CN108170682B (en) 2021-09-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210907