CN108170682A - Chinese word segmentation method and computing device based on specialized vocabulary
- Publication number
- CN108170682A (application number CN201810050618.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Abstract
The invention discloses a Chinese word segmentation method based on specialized vocabulary. The method is suitable for execution in a computing device and includes: constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries in the dictionary that share the same first character are arranged in ascending order of Unicode code, multiple first arrays are established to store entries with the same first character, and at least one second array is established in each first array to store the entry content and a flag, the flag identifying whether the entry belongs to the specialized vocabulary; searching the dictionary by binary search for one or more character strings in the sentence to be segmented, to obtain multiple candidate words after an initial segmentation; setting a segmentation weight for each candidate word according to its flag; and constructing segmentation paths from the candidate words and their segmentation weights and choosing the shortest path as the word segmentation result. The invention also discloses a computing device for performing the method.
Description
Technical field
The present invention relates to the technical field of information processing and, in particular, to a Chinese word segmentation method and computing device based on specialized vocabulary.
Background
Chinese information processing technology is widely used in computing fields such as computer networks, database technology, and software engineering, and automatic Chinese word segmentation is an important foundation of Chinese information processing. Many Chinese information processing projects involve the segmentation problem, for example machine translation, automatic summarization, automatic classification, full-text retrieval of Chinese document databases, and search engines. Because Chinese text is written continuously, with no spaces between words, the first problem encountered in Chinese text processing is segmentation, and correct segmentation is a necessary condition for processing Chinese text. Moreover, Chinese word segmentation methods are not limited to Chinese: they can also be applied to English processing. In handwriting recognition, for example, the spaces between words are not always clear, and Chinese segmentation methods can help identify the boundaries of English words. Research on Chinese word segmentation technology is therefore of great significance.
Although the basic unit of expression in Modern Chinese is the "word", and two-character and multi-character words are in the majority, the boundary between words and phrases is very hard to draw because people differ in their level of understanding. For example, in "punish those who spit everywhere", different people apply different standards as to whether "those who spit everywhere" is a word or a phrase; the same holds for "sea", "brewery", and so on, and even the same person may make different judgments. In addition, the dictionaries used by existing Chinese word segmentation techniques are fairly general; without a dictionary dedicated to specialized vocabulary, the segmentation result is very likely to be inaccurate.
A Chinese word segmentation method that can recognize specialized vocabulary is therefore needed, so as to further improve segmentation accuracy.
Summary of the invention
To this end, the present invention provides a Chinese word segmentation method and computing device based on specialized vocabulary, in an effort to solve, or at least alleviate, the problems described above.
According to one aspect of the invention, a Chinese word segmentation method based on specialized vocabulary is provided. The method is suitable for execution in a computing device and includes the steps of: constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries in the dictionary that share the same first character are arranged in ascending order of Unicode code, multiple first arrays are established to store entries with the same first character, and at least one second array is established in each first array to store the entry content and a flag, the flag identifying whether the entry belongs to the specialized vocabulary; searching the dictionary by binary search for one or more character strings in the sentence to be segmented, to obtain multiple candidate words after an initial segmentation; setting a segmentation weight for each candidate word according to its flag; and constructing segmentation paths from the candidate words and their segmentation weights and choosing the shortest path as the word segmentation result.
Optionally, in the method according to the invention, the step of setting a segmentation weight for each candidate word according to its flag includes: if the flag of a candidate word indicates that the candidate word belongs to the specialized vocabulary, setting a first segmentation weight for it; and if the flag indicates that the candidate word does not belong to the specialized vocabulary, setting a second segmentation weight for it, wherein the first segmentation weight is less than the second segmentation weight.
Optionally, in the method according to the invention, the step of constructing segmentation paths from the candidate words and their segmentation weights and choosing the shortest path as the word segmentation result includes: taking each character in the sentence to be segmented as a node, where the first character of the sentence is the start node and the last character is the end node; constructing, in order, multiple segmentation paths between the start node and the end node according to the candidate words; calculating the length of each segmentation path in combination with the segmentation weight of each candidate word; and choosing the segmentation path with the shortest length as the word segmentation result.
Optionally, in the method according to the invention, the step of constructing the dictionary with a predetermined structure by reading in entries one by one includes: establishing an input stream to read in entries in turn; determining whether a first array exists for storing entries that begin with the first character of the entry; if no such first array exists, creating, according to the first character of the entry read in, a first array for storing all entries that begin with that character; establishing a second array in the first array to store the entry content; determining whether the entry belongs to the specialized vocabulary and, if so, assigning a first value to its flag; and, if not, assigning a second value to its flag.
Optionally, in the method according to the invention, before the step of searching the dictionary by binary search for one or more character strings in the sentence to be segmented to obtain multiple candidate words after an initial segmentation, the method further includes the steps of: identifying non-Chinese characters in the source sentence to be processed; and removing the identified non-Chinese characters from the source sentence to obtain the sentence to be segmented.
Optionally, in the method according to the invention, non-Chinese characters include punctuation marks, numeric characters, English characters, and non-visible characters that are ignored.
Optionally, in the method according to the invention, the step of searching the dictionary by binary search for one or more character strings in the sentence to be segmented to obtain multiple candidate words after an initial segmentation includes, for each character in the sentence to be segmented: locating, according to the Unicode code of the character, the first array that stores entries beginning with that character; forming at least one character string that begins with the character and searching for the character string among all entries of the first array by binary search; and, when an entry corresponding to the character string is found, taking the character string as a candidate word.
Optionally, in the method according to the invention, the step of forming at least one character string that begins with the character and searching for the character string among all entries of the first array by binary search further includes: if the first array contains an entry consisting only of that character, judging the character to be a whole word; and taking the character as a candidate word.
According to another aspect of the invention, a computing device is provided, including: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.
According to a further aspect of the invention, a computer-readable storage medium storing one or more programs is provided. The one or more programs include instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above.
In the Chinese word segmentation scheme based on specialized vocabulary according to the invention, a flag indicating whether an entry is specialized vocabulary is added when the dictionary is built. During segmentation, a candidate word judged to be specialized vocabulary is given a smaller segmentation weight, the length of each segmentation path is calculated from the segmentation weights and the path, and the shortest path is chosen as the word segmentation result. Introducing this scoring mechanism resolves the path selection problem that may arise and ensures the accuracy of the segmentation result. The scheme not only handles overlapping ambiguity better but also recognizes specialized vocabulary in professional fields at a higher rate, so applying the technique in different industries can yield higher segmentation accuracy.
Brief description of the drawings
To achieve the foregoing and related purposes, certain illustrative aspects are described herein in conjunction with the following description and drawings. These aspects indicate various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, the same reference numerals generally refer to the same components or elements.
Fig. 1 shows a structural diagram of a computing device 100 according to an embodiment of the invention;
Fig. 2 shows a flowchart of a Chinese word segmentation method 200 based on specialized vocabulary according to an embodiment of the invention; and
Fig. 3 shows a flow diagram of constructing the dictionary with a predetermined structure according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a structural diagram of a computing device 100 according to an embodiment of the invention.
In a basic configuration 102, the computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 can be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 can be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 can include one or more levels of cache, such as a level-1 cache 110 and a level-2 cache 112, a processor core 114, and registers 116. An example processor core 114 can include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 can be used together with the processor 104, or in some implementations the memory controller 118 can be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 can be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 106 can include an operating system 120, one or more applications 122, and program data 124. In some embodiments, the applications 122 may be arranged to execute instructions on the operating system 120 by the one or more processors 104 using the program data 124.
The computing device 100 can also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which can be configured to communicate with various external devices such as a display or speakers via one or more A/V ports 152. Example peripheral interfaces 144 can include a serial interface controller 154 and a parallel interface controller 156, which can be configured to communicate via one or more I/O ports 158 with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 146 can include a network controller 160, which can be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link is one example of a communication medium. Communication media can typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery medium. A "modulated data signal" is a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media can include wired media, such as a wired network or a dedicated-line network, and various wireless media, such as acoustic, radio-frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein can include both storage media and communication media.
The computing device 100 can be implemented as a server, such as a file server, database server, application server, or web server, or as part of a small-form-factor portable (or mobile) electronic device, such as a cellular phone, a personal digital assistant (PDA), a personal media player, a wireless web-browsing device, a personal headset, an application-specific device, or a hybrid device that includes any of the above functions. The computing device 100 can also be implemented as a personal computer, including desktop and notebook configurations.
In an implementation according to the invention, the computing device 100 is configured to perform the Chinese word segmentation method based on specialized vocabulary according to the invention. The one or more applications 122 of the computing device 100 include instructions for performing the Chinese word segmentation method 200 based on specialized vocabulary according to the invention.
Fig. 2 shows a flowchart of the Chinese word segmentation method 200 based on specialized vocabulary according to an embodiment of the invention.
The method 200 starts at step S210, in which a dictionary with a predetermined structure is constructed by reading in entries one by one.
According to one embodiment of the invention, entries with the same first character in the constructed dictionary are arranged in ascending order of Java's built-in Unicode codes, that is, in the order from "one" to "tortoise".
Because Unicode contains some rarely used characters, such as "Pie" and "Fu", loading all Unicode codes at once without screening would waste space and would also increase the number of matching operations in later lookups. Therefore, in the dictionary with a predetermined structure according to the invention, multiple first arrays are established to store entries with the same first character, and at least one second array is established in each first array; each second array stores the content and flag of one entry, and the flag identifies whether the entry belongs to the specialized vocabulary. In other words, all entries with the same first character form one word block (that is, a first array), and multiple second arrays are built in each first array. Each second array contains one string constant and one integer constant: the string constant stores the content of the entry, and the integer constant stores the flag. Table 1 shows one form of the dictionary structure according to an embodiment of the invention.
Table 1
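By way of illustration only, the two-level structure described above might be sketched in Python as follows. The class name `Entry`, the constant names, and the sample entries (back-translated Chinese strings) are assumptions made for this sketch and do not come from the source; the flag values 9 and 1 are one of the optional pairs the description mentions.

```python
from dataclasses import dataclass

# One optional pair of flag values from the description; names are hypothetical.
FLAG_SPECIALIZED = 9
FLAG_GENERAL = 1


@dataclass(frozen=True)
class Entry:
    """One 'second array': the entry content plus its flag."""
    text: str   # string constant holding the entry content
    flag: int   # integer constant holding the flag


# The dictionary maps a first character to its 'first array' (word block):
# the entries that begin with that character, kept in ascending order.
dictionary = {
    "甲": [Entry("甲", FLAG_GENERAL), Entry("甲烷", FLAG_SPECIALIZED)],
    "泄": [Entry("泄漏", FLAG_GENERAL)],
}

# Every entry in a block shares the block's first character.
for first_char, block in dictionary.items():
    assert all(entry.text[0] == first_char for entry in block)
```

Keeping each block sorted is what makes the later binary search possible; a real implementation may choose a different container.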
An embodiment of the invention also provides a process for constructing the dictionary with a predetermined structure by reading in entries one by one, as shown in Fig. 3.
Entries are read in the form of a file stream. First, in step S310, an input stream is established to read in the entries in turn, and it is determined whether the end of the input stream has been reached. If the end has been reached, all entries have been read in; otherwise, the following steps are performed.
Then, in step S320, it is determined for the entry read in whether a first array exists for storing entries that begin with the first character of that entry. For example, if the entry read in is "reason" (道理), it is necessary to determine whether the current dictionary contains a first array whose first character is "road" (道).
In step S330, if no such first array exists, a first array is created, according to the first character of the entry read in, to store all entries that begin with that character. That is, if there is no first array for the first character "road", one is created in the dictionary to store all entries that begin with "road".
Next, in step S340, a second array is established in that first array to store the content of the entry. Of course, if it is determined that a first array for the first character "road" already exists in the dictionary, the process goes directly to step S340 and creates a second array in that first array to store the entry "reason".
Then, in step S350, it is determined whether the current entry belongs to the specialized vocabulary. If it does, a first value is assigned to its flag; if it does not, a second value is assigned to its flag. The flag is then written into the second array. Optionally, the first value is represented by 00 and the second value by 01, or the first value by 9 and the second value by 1; any values may be used as long as the flag clearly distinguishes specialized from non-specialized vocabulary, and the embodiments of the invention are not limited in this respect.
Optionally, specialized vocabulary from different professional fields can also be distinguished by assigning different values to the flag. For example, the flag is set to 9 for specialized vocabulary of the hazardous chemicals industry and to 8 for specialized vocabulary of the radio, film, and television industry. The embodiments of the invention are not limited in this respect.
The process then loops back to step S310 to read in the next entry and performs steps S320 to S350, until the end of the input stream is reached and the dictionary has been created.
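The loop over steps S310 to S350 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the source: the function name `build_dictionary`, the one-entry-per-line input format, and the `specialized_terms` set (standing in for whatever test decides that an entry is specialized vocabulary) are all hypothetical.

```python
import io
from bisect import insort

FLAG_SPECIALIZED, FLAG_GENERAL = 9, 1  # one optional value pair from the text


def build_dictionary(stream, specialized_terms):
    """Group entries by first character, one (text, flag) record per entry.

    Assumes one entry per line; `specialized_terms` is a hypothetical set
    used here to decide the flag in step S350.
    """
    dictionary = {}
    for line in stream:                                # S310: read until the stream ends
        entry = line.strip()
        if not entry:
            continue
        block = dictionary.setdefault(entry[0], [])    # S320/S330: find or create first array
        flag = FLAG_SPECIALIZED if entry in specialized_terms else FLAG_GENERAL  # S350
        record = (entry, flag)                         # S340: the "second array"
        if record not in block:
            insort(block, record)                      # keep the block sorted for binary search
    return dictionary


entries = io.StringIO("甲烷\n泄漏\n甲\n")
built = build_dictionary(entries, specialized_terms={"甲烷"})
```

Inserting in sorted order at build time is a design choice of this sketch; sorting each block once after all entries are read would work equally well.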
Then, in step S220, one or more character strings in the sentence to be segmented are searched for in the dictionary by binary search, yielding multiple candidate words after an initial segmentation.
According to one implementation of the invention, the non-Chinese characters in the source sentence to be processed are first identified and then removed from the source sentence, yielding the sentence to be segmented. Optionally, non-Chinese characters include punctuation marks, numeric characters, English characters, and ignored non-visible characters such as line feeds, carriage returns, and horizontal tabs. This provides basic language information for the subsequent algorithm and improves processing efficiency.
Specifically, step S220 can be performed as follows. For each character in the sentence to be segmented, the first array that stores entries beginning with that character is located according to the character's Unicode code; at least one character string beginning with the character is formed, and the character string is searched for among all entries of the first array by binary search; when an entry corresponding to the character string is found, the character string is taken as a candidate word.
For example, the source sentence to be processed is:
"Group Personal Accident Insurance Material Benefit Plan
Accidental injury: refers to an external, sudden, unintended, non-disease objective event that causes bodily harm."
Identifying and removing the non-Chinese characters yields the sentence to be segmented:
"Group Personal Accident Insurance Material Benefit Plan accidental injury refers to an external sudden unintended non-disease objective event that causes bodily harm"
Then, taking the first character of the sentence to be segmented, "团" ("group"), as an example, the first array that stores entries beginning with that character is located in the dictionary. Binary search is then used to look in that first array for entries such as the single character "团" or the two-character word "团体" ("group"). The search finds the entry "团体" in the first array, so the character string "团体" is taken as a candidate word. The same search is performed for each of the other characters, up to the last character, yielding the candidate words after the initial segmentation.
According to another embodiment of the invention, if the first array contains an entry consisting only of that character, the character is judged to be a whole word (in general, a single character that can form a word by itself is treated as a whole word), and the character is then taken as a candidate word.
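The lookup in step S220 might be sketched with Python's standard `bisect` module, as below. The record layout (sorted `(text, flag)` tuples per first character), the function names, the `max_len` cap on candidate length, and the sample dictionary are assumptions of this sketch rather than details given in the source.

```python
from bisect import bisect_left


def lookup(block, text):
    """Binary-search a sorted list of (text, flag) records; return the flag or None."""
    i = bisect_left(block, (text,))          # (text,) sorts just before (text, flag)
    if i < len(block) and block[i][0] == text:
        return block[i][1]
    return None


def candidates(sentence, dictionary, max_len=4):
    """Return (start, end, flag) for every dictionary entry found in `sentence`.

    `max_len` is a hypothetical cap on how many characters a candidate may span.
    """
    found = []
    for start, ch in enumerate(sentence):
        block = dictionary.get(ch, [])       # first array for this first character
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last + 1):
            flag = lookup(block, sentence[start:end])
            if flag is not None:             # single characters count when listed as entries
                found.append((start, end, flag))
    return found


sample = {"甲": [("甲", 1), ("甲烷", 9)], "泄": [("泄漏", 1)]}
hits = candidates("甲烷泄漏", sample)
```

Each hit records the span of a candidate word and its flag, which is what the weighting step needs next.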
After the processing of step S220, the source sentence above yields the following candidate words:
"group, person, accident, wound, injury, insurance, material benefit, plan, accident, wound, injury, refers to, by, external, of, sudden, non-, intent, non-, disease, causing, body, harm, objective, event"
The flag of each candidate word is obtained from the second array that stores it. Then, in step S230, a segmentation weight is set for each candidate word according to its flag.
According to one implementation, if the flag of a candidate word indicates that it belongs to the specialized vocabulary, a first segmentation weight is set for it; if the flag indicates that it does not, a second segmentation weight is set for it, where the first segmentation weight is less than the second segmentation weight. Optionally, the first segmentation weight is set to 0.5 and the second segmentation weight is set to 1.
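The mapping from flag to weight in step S230 is small enough to sketch directly. The optional weights 0.5 and 1 come from the description; the constant and function names, and the choice of 9 as the specialized flag value, are assumptions of this sketch.

```python
FLAG_SPECIALIZED = 9          # hypothetical name; 9 is one value the text allows
WEIGHT_SPECIALIZED = 0.5      # first segmentation weight (optional value in the text)
WEIGHT_GENERAL = 1.0          # second segmentation weight (optional value in the text)


def weight_for(flag: int) -> float:
    """Return the smaller weight for specialized vocabulary, the larger otherwise."""
    return WEIGHT_SPECIALIZED if flag == FLAG_SPECIALIZED else WEIGHT_GENERAL
```

Because the shortest path wins later on, the smaller weight makes paths through specialized terms cheaper and therefore preferred.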
Then, in step S240, segmentation paths are constructed from the candidate words and their segmentation weights, and the shortest path is chosen as the word segmentation result.
According to one implementation, segmentation uses a shortest-path word segmentation algorithm. According to one embodiment of the invention, the process of constructing segmentation paths is as follows:
1) Each character in the sentence to be segmented is taken as a node, where the first character of the sentence is the start node and the last character is the end node.
2) Multiple segmentation paths between the start node and the end node are constructed in order from the candidate words obtained in step S230.
3) The length of each segmentation path is calculated in combination with the segmentation weight of each candidate word; the length of a path is obtained by summing the scores of the edges that correspond to the words segmented along that path.
If segmentation weights are not considered, the edge of each word scores 1 point. If, however, a character tends to combine with other characters to form words (that is, it is a morpheme that cannot form a word on its own), its edge scores 1 additional point (that is, 2 points), for example "people" and "reality". On this basis, if the edge of a word scores x points and its segmentation weight is y, then once the weight is taken into account the score of the edge is x*y.
4) The segmentation path with the shortest length is selected as the word segmentation result.
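Steps 1) to 4) might be sketched as a shortest-path search over a directed acyclic graph, as below. As a simplification, this sketch uses the boundaries between characters (positions 0 to n) as nodes rather than the characters themselves, and it expects the caller to supply each edge's cost already computed as x*y; both choices, the function name, and the back-translated sample string are assumptions, not details from the source.

```python
def shortest_segmentation(sentence, edges):
    """Pick the cheapest segmentation of `sentence`.

    `edges` is an iterable of (start, end, cost) with start < end, where
    sentence[start:end] is a candidate word and cost is its score x * y.
    Returns (words, total_length).
    """
    n = len(sentence)
    inf = float("inf")
    best = [inf] * (n + 1)      # best[i]: cheapest cost to reach boundary i
    back = [None] * (n + 1)     # back[i]: boundary the cheapest path came from
    best[0] = 0.0
    outgoing = {}
    for start, end, cost in edges:
        outgoing.setdefault(start, []).append((end, cost))
    for start in range(n):      # boundaries are already in topological order
        if best[start] == inf:
            continue
        for end, cost in outgoing.get(start, []):
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    if best[n] == inf:
        raise ValueError("no segmentation path covers the whole sentence")
    words, pos = [], n
    while pos > 0:              # walk back from the end node to recover the words
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1], best[n]


# Edge costs follow the scoring rule: base score x (1 or 2) times weight y.
edges = [(0, 1, 1 * 1.0), (1, 2, 2 * 1.0), (2, 3, 1 * 1.0), (3, 4, 2 * 1.0),
         (2, 4, 1 * 0.5), (0, 4, 1 * 0.5)]
words, length = shortest_segmentation("三氯甲烷", edges)
```

Because every edge points forward in the sentence, a single left-to-right pass suffices and there is no need to enumerate all paths explicitly.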
Suppose the sentence to be segmented is: "Jiangxi Ji'an one tank truck chloroform leakage accident".
With each character as a node, the segmentation paths constructed between the start node "river" (the first character of "Jiangxi") and the end node "event" (the last character of "accident") are:
1. Jiangxi/Ji'an/one/tank truck/tri/chlorine/meth/alkane/leakage/accident
2. Jiangxi/Ji'an/one/tank truck/tri/chlorine/methane/leakage/accident
3. Jiangxi/Ji'an/one/tank truck/chloroform/leakage/accident
Here "methane" and "chloroform" belong to the specialized vocabulary and correspond to the first segmentation weight (e.g., 0.5); the other words correspond to the second segmentation weight (e.g., 1).
The lengths of the three segmentation paths are, respectively:
1. 1+1+1+1+1+2+1+2+1+1=12;
2. 1+1+1+1+1+2+1*0.5+1+1=9.5;
3. 1+1+1+1+1*0.5+1+1=6.5.
In summary, the segmentation corresponding to path 3, which has the shortest length, is chosen as the word segmentation result.
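The three path lengths in the example can be checked with a few lines of Python. The Chinese strings are a back-translation of the example sentence and are an assumption of this sketch, as are the set names; which characters score 2 points is inferred from the position of the 2s in the sums above.

```python
SPECIALIZED = {"甲烷", "三氯甲烷"}   # methane, chloroform: weight 0.5
NON_WORD = {"氯", "烷"}              # characters scored 2 points in the example


def path_length(words):
    """Sum x * y over the words of one segmentation path."""
    total = 0.0
    for word in words:
        base = 2 if word in NON_WORD else 1          # x: edge score
        weight = 0.5 if word in SPECIALIZED else 1.0  # y: segmentation weight
        total += base * weight
    return total


paths = [
    "江西/吉安/一/槽罐车/三/氯/甲/烷/泄漏/事故",
    "江西/吉安/一/槽罐车/三/氯/甲烷/泄漏/事故",
    "江西/吉安/一/槽罐车/三氯甲烷/泄漏/事故",
]
lengths = [path_length(p.split("/")) for p in paths]
shortest = paths[lengths.index(min(lengths))]
```

The third path wins because it has the fewest edges and its one specialized term is discounted by the smaller weight.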
In the Chinese word segmentation scheme based on specialized vocabulary according to the invention, a flag indicating whether an entry is specialized vocabulary is added when the dictionary is built. During segmentation, a candidate word judged to be specialized vocabulary is given a smaller segmentation weight, the length of each segmentation path is calculated from the segmentation weights and the path, and the shortest path is chosen as the word segmentation result. Introducing this scoring mechanism resolves the path selection problem that may arise and ensures the accuracy of the segmentation result. The scheme not only handles overlapping ambiguity better but also recognizes specialized vocabulary in professional fields at a higher rate, so applying the technique in different industries can yield higher segmentation accuracy.
Numerous specific details are set forth in the specification provided here. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein can be arranged in a device as described in the embodiment, or alternatively can be located in one or more devices different from the device in the example. The modules in the foregoing examples can be combined into one module or can be divided into multiple submodules.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in the embodiments can be combined into one module, unit, or component, and can furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) can be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
In addition, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
The various techniques described herein can be implemented in hardware, in software, or in a combination of the two. Thus, the method and device of the invention, or certain aspects or portions thereof, can take the form of program code (i.e., instructions) embodied in tangible media, such as floppy disks, CD-ROMs, hard disk drives, or any other machine-readable storage medium, wherein, when the program is loaded into a machine such as a computer and executed by that machine, the machine becomes a device for practicing the invention.
Where the program code executes on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to perform the method of the invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", and so on to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisaged within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, rather than to delineate or circumscribe the subject matter of the invention. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.
Claims (10)
1. A Chinese word segmentation method based on specialized vocabulary, the method being suitable for execution in a computing device and comprising the steps of:
constructing a dictionary with a predetermined structure by reading in entries one by one, wherein entries with the same first character are arranged in the dictionary in ascending order of Unicode code, a plurality of first arrays are established for storing entries with the same first character, and at least one second array is established in each first array for storing entry content and a flag, the flag identifying whether the entry belongs to the specialized vocabulary;
searching the dictionary, using binary search, for one or more character strings in a sentence to be segmented, to obtain a plurality of candidate words (segments to be determined) after an initial cut;
setting a segmentation weight for each candidate word according to the flag corresponding to that candidate word; and
constructing segmentation paths from the plurality of candidate words and their segmentation weights, and choosing the shortest path as the word segmentation result.
2. The method of claim 1, wherein the step of setting a segmentation weight for each candidate word according to the flag corresponding to that candidate word comprises:
if the flag corresponding to the candidate word indicates that the candidate word belongs to the specialized vocabulary, setting a first segmentation weight for it;
if the flag corresponding to the candidate word indicates that the candidate word does not belong to the specialized vocabulary, setting a second segmentation weight for it,
wherein the first segmentation weight is less than the second segmentation weight.
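The weighting rule of claim 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the constants `SPECIALIZED_WEIGHT` and `ORDINARY_WEIGHT` and their values are hypothetical, since the claim only requires that the first weight be less than the second.

```python
# Hypothetical values; claim 2 only requires first weight < second weight.
SPECIALIZED_WEIGHT = 1.0  # "first segmentation weight"
ORDINARY_WEIGHT = 2.0     # "second segmentation weight"


def segmentation_weight(is_specialized: bool) -> float:
    """Return the weight for a candidate word, given its dictionary flag."""
    return SPECIALIZED_WEIGHT if is_specialized else ORDINARY_WEIGHT
```

Because path length is minimized later, the smaller weight makes paths through specialized terms more likely to be selected.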
3. The method of claim 1 or 2, wherein the step of constructing segmentation paths from the plurality of candidate words and their segmentation weights and choosing the shortest path as the word segmentation result comprises:
taking each character in the sentence to be segmented as a node, wherein the first character of the sentence is the start node and the last character is the end node;
sequentially constructing a plurality of segmentation paths between the start node and the end node according to the candidate words;
calculating the length of each segmentation path from the segmentation weights of the candidate words; and
choosing the segmentation path with the shortest length as the word segmentation result.
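The path selection of claim 3 can be sketched as a shortest-path search over a directed acyclic graph. The sketch below makes several assumptions that are not in the claim: nodes are the boundary positions `0..len(sentence)` rather than the characters themselves, path length is the sum of word weights, and the name `shortest_segmentation` and the `(start, end, weight)` candidate format are hypothetical.

```python
from typing import List, Optional, Tuple

# (start index, end index exclusive, segmentation weight) -- hypothetical format
Candidate = Tuple[int, int, float]


def shortest_segmentation(sentence: str, candidates: List[Candidate]) -> List[str]:
    """Pick the minimum-total-weight chain of candidates covering the sentence."""
    n = len(sentence)
    dist = [float("inf")] * (n + 1)   # best known length from position 0
    back: List[Optional[int]] = [None] * (n + 1)
    dist[0] = 0.0
    # Every edge runs forward (start < end), so relaxing edges in increasing
    # order of start position finalizes dist[start] before it is used.
    for start, end, weight in sorted(candidates):
        if dist[start] + weight < dist[end]:
            dist[end] = dist[start] + weight
            back[end] = start
    if dist[n] == float("inf"):
        raise ValueError("candidates do not cover the whole sentence")
    # Walk the back-pointers from the end node to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        prev = back[pos]
        words.append(sentence[prev:pos])
        pos = prev
    return words[::-1]


# Hypothetical example: "生命" carries the smaller (specialized) weight.
example = [(0, 2, 2.0), (0, 3, 2.0), (2, 4, 1.0), (3, 4, 2.0), (4, 6, 2.0)]
print(shortest_segmentation("研究生命起源", example))  # ['研究', '生命', '起源']
```

A single forward pass suffices here because the graph has no cycles; a general shortest-path algorithm such as Dijkstra's would also work but is not needed.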
4. The method of any one of claims 1-3, wherein the step of constructing a dictionary with a predetermined structure by reading in entries one by one comprises:
establishing an input stream to read in entries in sequence;
determining whether there exists a first array for storing entries whose first character is the first character of the entry;
if no such first array exists, creating, according to the first character of the entry read in, a first array for storing all entries having that first character;
establishing a second array in the first array to store the entry content;
determining whether the entry belongs to the specialized vocabulary and, if it does, assigning a first value to its flag; and
if it does not, assigning a second value to its flag.
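The dictionary construction of claim 4 can be sketched as follows. This is an illustrative layout only: a Python `dict` keyed by first character stands in for the collection of "first arrays", a `(content, flag)` tuple stands in for each "second array", an in-memory iterable stands in for the input stream, and the flag values `1` and `0` are hypothetical choices for the first and second values.

```python
import bisect
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, int]  # "second array": (entry content, flag)

SPECIALIZED_FLAG = 1  # hypothetical "first value"
ORDINARY_FLAG = 0     # hypothetical "second value"


def build_dictionary(entries: Iterable[Tuple[str, bool]]) -> Dict[str, List[Entry]]:
    """Group entries by first character, each group sorted by code point."""
    dictionary: Dict[str, List[Entry]] = {}
    for word, is_specialized in entries:
        if not word:
            continue  # skip empty lines in the input
        flag = SPECIALIZED_FLAG if is_specialized else ORDINARY_FLAG
        # Create the "first array" for this first character if it is missing.
        bucket = dictionary.setdefault(word[0], [])
        # Python compares str by Unicode code point, so insort keeps the
        # bucket in ascending Unicode order as entries arrive.
        bisect.insort(bucket, (word, flag))
    return dictionary
```

Keeping each bucket sorted at insertion time is what allows the later lookup step to use binary search.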
5. The method of any one of claims 1-4, wherein, before the step of searching the dictionary, using binary search, for one or more character strings in the sentence to be segmented to obtain a plurality of candidate words after an initial cut, the method further comprises the steps of:
identifying non-Chinese characters in a source sentence to be processed; and
removing the identified non-Chinese characters from the source sentence to obtain the sentence to be segmented.
6. The method of claim 5, wherein the non-Chinese characters include punctuation marks, numeric characters, English characters, and non-visible characters that have no effect.
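The preprocessing of claims 5 and 6 can be sketched as a simple filter. The claims do not say how a Chinese character is recognized; the sketch assumes the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the function name `strip_non_chinese` is hypothetical.

```python
def strip_non_chinese(source: str) -> str:
    """Drop punctuation, digits, Latin letters, and whitespace/control characters.

    Assumption: a "Chinese character" is any code point in the basic CJK
    Unified Ideographs block, U+4E00..U+9FFF.
    """
    return "".join(ch for ch in source if "\u4e00" <= ch <= "\u9fff")


print(strip_non_chinese("研究 生命, 2018 AI起源。\t"))  # 研究生命起源
```

Characters in CJK extension blocks would also be removed by this filter, so a production version might need a wider range.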
7. The method of any one of claims 1-6, wherein the step of searching the dictionary, using binary search, for one or more character strings in the sentence to be segmented to obtain a plurality of candidate words after an initial cut comprises:
for each character in the sentence to be segmented:
locating, according to the Unicode code of the character, the first array that stores entries having the character as their first character;
forming at least one character string having the character as its first character, and searching for the character string among all entries of the first array by binary search; and
when an entry corresponding to the character string is found, taking the character string as a candidate word.
8. The method of claim 7, wherein the step of forming at least one character string having the character as its first character and searching for the character string among all entries of the first array by binary search further comprises:
if the first array contains an entry consisting only of the character, determining that the character is a complete word; and
taking the character as a candidate word.
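The lookup of claims 7 and 8 can be sketched as follows. The sketch assumes the dictionary is a mapping from first character to a list of `(content, flag)` entries sorted in ascending code-point order; the names `binary_search` and `find_candidates`, the `(start, end, flag)` result format, and the `max_len` bound on string length are hypothetical and not taken from the claims.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, int]  # (entry content, flag)


def binary_search(bucket: List[Entry], target: str) -> int:
    """Return the index of target in a sorted bucket, or -1 if absent."""
    lo, hi = 0, len(bucket) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        word = bucket[mid][0]
        if word == target:
            return mid
        if word < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


def find_candidates(sentence: str, dictionary: Dict[str, List[Entry]],
                    max_len: int = 8) -> List[Tuple[int, int, int]]:
    """Return (start, end, flag) for every dictionary word found in sentence."""
    found = []
    for start, ch in enumerate(sentence):
        bucket = dictionary.get(ch)  # the "first array" for this first character
        if not bucket:
            continue
        # Try every string that begins at this character, shortest first;
        # a length-1 hit is the single-character word case of claim 8.
        for end in range(start + 1, min(len(sentence), start + max_len) + 1):
            idx = binary_search(bucket, sentence[start:end])
            if idx >= 0:
                found.append((start, end, bucket[idx][1]))
    return found
```

The returned spans are the overlapping candidates from which segmentation paths are later built; the flag carried with each span determines its weight.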
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-8.
10. A computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810050618.7A CN108170682B (en) | 2018-01-18 | 2018-01-18 | Chinese word segmentation method based on professional vocabulary and computing equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810050618.7A CN108170682B (en) | 2018-01-18 | 2018-01-18 | Chinese word segmentation method based on professional vocabulary and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108170682A (en) | 2018-06-15 |
CN108170682B (en) | 2021-09-07 |
Family
ID=62515230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810050618.7A Expired - Fee Related CN108170682B (en) | 2018-01-18 | 2018-01-18 | Chinese word segmentation method based on professional vocabulary and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108170682B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522740A (en) * | 2018-10-16 | 2019-03-26 | 易保互联医疗信息科技(北京)有限公司 | Health data goes privacy processing method and system |
CN110825608A (en) * | 2018-08-08 | 2020-02-21 | 北京京东尚科信息技术有限公司 | Key semantic testing method and device, storage medium and electronic equipment |
CN114429130A (en) * | 2022-01-14 | 2022-05-03 | 福建众创车联网络科技有限公司 | A method and system for segmentation of auto parts names |
CN114510548A (en) * | 2021-12-29 | 2022-05-17 | 北京空间飞行器总体设计部 | Method and device for dictionary construction and classification for spacecraft test identification and evaluation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6879951B1 (en) * | 1999-07-29 | 2005-04-12 | Matsushita Electric Industrial Co., Ltd. | Chinese word segmentation apparatus |
CN103838794A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | A word segmentation method suitable for professional search engines |
CN105159949A (en) * | 2015-08-12 | 2015-12-16 | 北京京东尚科信息技术有限公司 | Chinese address word segmentation method and system |
Non-Patent Citations (1)
Title |
---|
张华平等: ""基于N-最短路径方法的中文词语粗分模型"", 《中文信息学报》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110825608A (en) * | 2018-08-08 | 2020-02-21 | 北京京东尚科信息技术有限公司 | Key semantic testing method and device, storage medium and electronic equipment |
CN110825608B (en) * | 2018-08-08 | 2024-08-16 | 北京京东尚科信息技术有限公司 | Critical semantic testing method and device, storage medium and electronic equipment |
CN109522740A (en) * | 2018-10-16 | 2019-03-26 | 易保互联医疗信息科技(北京)有限公司 | Health data goes privacy processing method and system |
CN109522740B (en) * | 2018-10-16 | 2021-04-20 | 易保互联医疗信息科技(北京)有限公司 | Health data privacy removal processing method and system |
CN114510548A (en) * | 2021-12-29 | 2022-05-17 | 北京空间飞行器总体设计部 | Method and device for dictionary construction and classification for spacecraft test identification and evaluation |
CN114429130A (en) * | 2022-01-14 | 2022-05-03 | 福建众创车联网络科技有限公司 | A method and system for segmentation of auto parts names |
Also Published As
Publication number | Publication date |
---|---|
CN108170682B (en) | 2021-09-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
US10706230B2 (en) | System and method for inputting text into electronic devices | |
CN108647205B (en) | Fine-grained emotion analysis model construction method and device and readable storage medium | |
CN106202382B (en) | Link instance method and system | |
US10460029B2 (en) | Reply information recommendation method and apparatus | |
US8335787B2 (en) | Topic word generation method and system | |
CN108170682A (en) | A kind of Chinese word cutting method and computing device based on specialized vocabulary | |
CN111626048A (en) | Text error correction method, device, equipment and storage medium | |
US20120259615A1 (en) | Text prediction | |
CN114818891B (en) | Small sample multi-label text classification model training method and text classification method | |
US20150199609A1 (en) | Self-learning system for determining the sentiment conveyed by an input text | |
CN110377882B (en) | Method, apparatus, system and storage medium for determining pinyin of text | |
CN110795628A (en) | Search term processing method and device based on correlation and computing equipment | |
CN114330303B (en) | Text error correction method and related equipment | |
CN111737464B (en) | Text classification method and device and electronic equipment | |
CN107958039A (en) | A kind of term error correction method, device and server | |
CN105814556A (en) | Context sensitive input tools | |
US12423509B2 (en) | Automated citations and assessment for automatically generated text | |
CN111046627B (en) | Chinese character display method and system | |
CN114462401A (en) | New word discovery method and computing device for field | |
CN111159526B (en) | Query statement processing method, device, equipment and storage medium | |
CN108255808A (en) | The method, apparatus and storage medium and electronic equipment that text divides | |
CN117493558A (en) | Text classification method and device, electronic equipment and computer readable storage medium | |
CN110083679B (en) | Search request processing method and device, electronic equipment and storage medium | |
KR102863616B1 (en) | Apparatus and method for search |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210907 |