
WO2019100458A1 - Method and device for segmenting thai syllables - Google Patents


Info

Publication number
WO2019100458A1
Authority
WO
WIPO (PCT)
Prior art keywords
thai
syllable
character
segmentation
segmented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/116082
Other languages
French (fr)
Chinese (zh)
Inventor
张睦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Transn Iol Technology Co Ltd
Original Assignee
Transn Iol Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Transn Iol Technology Co Ltd
Publication of WO2019100458A1
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G06F18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • The invention relates to the technical field of information retrieval, and in particular to a method and a device for segmenting Thai syllables.
  • Thai, also known as the Dai language, is the language of the Thai people and belongs to the East Asian / Sino-Tibetan language family. About 68 million people worldwide use Thai. In Thai text there is no punctuation and no space between words; a sentence is written continuously from beginning to end. Generally, a space marks the boundary between two sentences or a short pause within a sentence. Although the syllable is a basic unit that is clearly defined in the grammar, there are likewise no visible spaces between syllables in the text. Therefore, the processing of Thai text must begin by segmenting the text into syllables. This segmentation provides an important foundation for Thai lexical analysis, syntactic analysis, and more complex natural language processing tasks.
  • Thai syllables can be segmented according to more than 200 rules in which Thai researchers have summarized the grammatical patterns of syllables.
  • However, these grammar rules are complex and difficult to understand, and many of them may conflict with one another, so rule-based Thai syllable segmentation is slow and its accuracy is not very high.
  • Embodiments of the present invention provide a method and apparatus for segmentation of Thai syllables.
  • A brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.
  • A method for segmenting Thai syllables, including:
  • Extracting each to-be-segmented syllable in the Thai text to be processed, wherein the to-be-segmented syllable is composed of n consecutive Thai characters and one to-be-segmented identifier, and n is a positive integer;
  • Segmenting the syllables in the given Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.
  • Preprocessing the Thai text to be processed and determining the non-Thai character strings and the position syllable type information of each Thai character includes:
  • Labeling the boundary between each pair of characters in the Thai text to be processed includes:
  • Marking a boundary between a Thai syllable character and a non-Thai character string as the to-be-segmented identifier.
  • Determining the segmentation probability of each to-be-segmented syllable includes:
  • The segmentation probability corresponding to a to-be-segmented syllable containing the Thai character and the to-be-segmented identifier arranged in the first order is zero;
  • The segmentation probability corresponding to a to-be-segmented syllable containing the to-be-segmented identifier and the Thai character arranged in the second order is zero.
  • Segmenting the syllables in the given Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability includes:
  • Segmenting the syllables in the given Thai sentence according to the size of the pre-processing segmentation probability.
  • An apparatus for segmenting Thai syllables, comprising:
  • A preprocessing unit configured to preprocess the Thai text to be processed obtained from the Thai corpus, and to determine the non-Thai character strings and the position syllable type information of each Thai character;
  • An identifier unit configured to label the boundary between each pair of characters in the Thai text to be processed, wherein a boundary adjoining at least one Thai syllable character is marked as a to-be-segmented identifier;
  • An extracting unit configured to extract each to-be-segmented syllable in the Thai text to be processed, wherein the to-be-segmented syllable is composed of n consecutive Thai characters and one to-be-segmented identifier, and n is a positive integer;
  • A probability determining unit configured to determine the segmentation probability of each to-be-segmented syllable by using a Markov chain probability speech model, according to the position syllable type information of the Thai characters in the to-be-segmented syllable;
  • A segmentation unit configured to segment the syllables in the given Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.
  • The preprocessing unit is further configured to: identify the non-Thai character strings in the Thai text to be processed; and determine the position syllable type information of each Thai character in the Thai text to be processed according to a saved correspondence between Thai characters and position syllable type information, wherein the position syllable type information is generated according to the basic grammatical rules of Thai and includes syllable type information and position information.
  • The identifier unit is further configured to mark a boundary between two non-Thai syllable characters as a first segmentation identifier, and to mark a boundary between two Thai syllable characters, or between a Thai syllable character and a non-Thai character string, as the to-be-segmented identifier.
  • The probability determining unit is further configured so that: if the position syllable type information of the Thai character is a consonant character not at the end of the syllable, the segmentation probability corresponding to a to-be-segmented syllable containing the Thai character and the to-be-segmented identifier arranged in the first order is zero; and if the position syllable type information of the Thai character is a vowel character not at the beginning of the syllable, the segmentation probability corresponding to a to-be-segmented syllable containing the to-be-segmented identifier and the Thai character arranged in the second order is zero.
  • The segmentation unit is further configured to determine, according to each to-be-segmented syllable and its corresponding segmentation probability, the pre-processing segmentation probability of each pre-processing syllable in the given Thai sentence to be processed, and to segment the syllables in that sentence according to the size of the pre-processing segmentation probability.
  • In the embodiments, the Markov chain probability speech model is used to determine the segmentation probability of each syllable from the position syllable type information of the Thai characters, and the syllables in the Thai sentence are segmented according to that probability. This provides another way to segment Thai syllables. Because the method is based on the basic grammar rules of Thai and uses the n-gram grammar of the Markov chain probability speech model for probability statistics, it improves both the accuracy and the speed of Thai syllable segmentation.
  • FIG. 1 is a flowchart of a Thai syllable segmentation method according to an exemplary embodiment.
  • FIG. 2 is a flowchart of a Thai syllable segmentation method according to an exemplary embodiment.
  • FIG. 3 is a block diagram of a Thai syllable segmentation device according to an exemplary embodiment.
  • FIG. 4 is a block diagram of a Thai syllable segmentation device according to an exemplary embodiment.
  • FIG. 1 is a flow chart showing a Thai syllable segmentation method according to an exemplary embodiment. As shown in Figure 1, the process of segmentation of Thai syllables includes:
  • Step 101 Preprocess the pending Thai text obtained from the Thai corpus to determine a non-Thai character string and position syllable type information of each Thai character.
  • The Thai corpus includes various types of Thai text, such as news, encyclopedia entries, novels, and essays.
  • The Thai text to be processed is obtained from this Thai corpus.
  • The Thai text to be processed contains not only Thai characters but also non-Thai characters.
  • Non-Thai characters can include foreign characters (such as English letters), Arabic numerals, punctuation, and whitespace. The non-Thai character strings in the Thai text are therefore identified first, so that the Thai characters and the non-Thai characters in the text to be processed can be distinguished.
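The separation of Thai and non-Thai character strings described above can be sketched as follows. This is only a minimal illustration, not the patent's implementation: it assumes that "Thai character" means any code point in the Unicode Thai block (U+0E00 to U+0E7F), and the names `is_thai` and `split_runs` are hypothetical.

```python
from itertools import groupby

# Assumption: a character counts as Thai if it lies in the Unicode Thai block.
THAI_BLOCK = range(0x0E00, 0x0E80)  # U+0E00 .. U+0E7F


def is_thai(ch: str) -> bool:
    """Return True if the single character ch is in the Unicode Thai block."""
    return ord(ch) in THAI_BLOCK


def split_runs(text: str) -> list[tuple[bool, str]]:
    """Split text into maximal runs; each run is (is_thai_flag, substring)."""
    return [(flag, "".join(group)) for flag, group in groupby(text, key=is_thai)]


if __name__ == "__main__":
    for flag, run in split_runs("ไทยABC 123ไทย"):
        print("Thai" if flag else "non-Thai", repr(run))
```

Grouping into maximal runs keeps each non-Thai string (letters, digits, punctuation, whitespace) together as one unit, which matches the source's notion of a "non-Thai character string".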
  • A Thai syllable is mainly composed of vowel characters, consonant characters, and tones. The 15 Thai vowel characters can be combined into at least 28 vowel forms, such as diphthongs, that appear in syllables. The 44 Thai consonants can serve as the head of a syllable (initial consonant) or as its tail (final consonant). A syllable carries one of five tones. Based on these basic grammar rules of Thai, a correspondence between each Thai character and its position syllable type information can be generated and saved, wherein the position syllable type information is generated according to the basic grammatical rules of Thai and includes syllable type information and position information.
  • The correspondence between Thai characters and position syllable type information can be as shown in Table 1.
  • The position syllable type information of each Thai character in the Thai text to be processed can then be determined according to the saved correspondence between Thai characters and position syllable type information.
  • For example, for a Thai character contained in the Thai text to be processed, its position syllable type information may be determined according to Table 1 to be a consonant not at the end of the syllable.
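A lookup of this kind might be sketched as below. Table 1 is not reproduced in this text, so the sketch does not attempt to recreate it: it only assigns coarse, hypothetical categories (`consonant`, `vowel_initial`, `vowel_non_initial`, `tone`, `other`) from Unicode code point ranges. Which consonants can or cannot end a syllable would have to come from the real table and is deliberately left out.

```python
def position_syllable_type(ch: str) -> str:
    """Return a coarse, illustrative category for one character.

    Assumption: categories are derived from Unicode Thai block ranges only;
    they stand in for, and are much coarser than, the patent's Table 1.
    """
    cp = ord(ch)
    if 0x0E01 <= cp <= 0x0E2E:      # KO KAI .. HO NOKHUK
        return "consonant"
    if 0x0E40 <= cp <= 0x0E44:      # vowels written before the initial consonant
        return "vowel_initial"
    if 0x0E30 <= cp <= 0x0E39:      # vowels written after, above, or below it
        return "vowel_non_initial"
    if 0x0E48 <= cp <= 0x0E4B:      # the four tone marks
        return "tone"
    return "other"


if __name__ == "__main__":
    for ch in "เกา":
        print(ch, position_syllable_type(ch))
```

In a fuller implementation the function body would be replaced by a dictionary loaded from the saved correspondence, with one entry per Thai character.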
  • Step 102: Label the boundary between each pair of characters in the Thai text to be processed, wherein a boundary adjoining at least one Thai syllable character is marked as a to-be-segmented identifier.
  • The boundaries may be labeled manually; that is, after the Thai syllables have been segmented by hand, the boundaries between the characters are marked.
  • The boundaries may also be labeled automatically, with the Thai syllable segments identified by intelligent matching and the boundaries between the characters marked accordingly. Either way, the label set {S, B} can be used to label the boundaries between characters.
  • In one approach, the boundary between two non-Thai syllable characters is marked as the first segmentation identifier, while the boundary between two Thai syllable characters, and the boundary between a Thai syllable character and a non-Thai character string, are marked as the to-be-segmented identifier.
  • Alternatively, only the boundary between two Thai syllable characters and the boundary between a Thai syllable character and a non-Thai character string are marked, both as the to-be-segmented identifier.
  • After labeling, the Thai text to be processed can be written as $C_1 I_1 C_2 I_2 \ldots C_i I_i \ldots C_n I_n$, where $C_i$ is a character in the Thai text to be processed and $I_i$ is the boundary that follows it.
  • Labeling the boundaries between the characters in the Thai text is a preprocessing step.
  • The Thai syllable characters above can be determined by manual identification or by intelligent recognition, but the actual syllable segmentation is carried out in the subsequent steps.
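The labeling rule in Step 102 can be illustrated with a short sketch. It assumes that `S` stands for the first segmentation identifier and `B` for the to-be-segmented identifier (the source confirms only the meaning of `B`), and that Thai characters are those in the Unicode Thai block. The function name `label_boundaries` is hypothetical, and it labels only the boundaries between adjacent characters, so a text of n characters yields n - 1 labels.

```python
def is_thai(ch: str) -> bool:
    """Assumption: Thai characters are those in the Unicode Thai block."""
    return "\u0e00" <= ch <= "\u0e7f"


def label_boundaries(text: str) -> list[str]:
    """Label each boundary between adjacent characters with 'S' or 'B'.

    'S': both neighbours are non-Thai (already a known segmentation point).
    'B': at least one neighbour is Thai (still to be decided by the model).
    """
    labels = []
    for left, right in zip(text, text[1:]):
        labels.append("B" if is_thai(left) or is_thai(right) else "S")
    return labels


if __name__ == "__main__":
    text = "abกขc"
    print(list(zip(text, label_boundaries(text) + [""])))
```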
  • Step 103: Extract each to-be-segmented syllable in the Thai text to be processed, wherein the to-be-segmented syllable is composed of n consecutive Thai characters and one to-be-segmented identifier.
  • Each string composed of Thai characters and a to-be-segmented identifier is extracted from the Thai text to be processed. The extracted string is a to-be-segmented syllable, and it contains n consecutive Thai characters and one to-be-segmented identifier.
  • Here n is a positive integer.
  • If the to-be-segmented syllable consists of one Thai character and one to-be-segmented identifier, it can be represented by CB or BC. If it consists of two Thai characters and one identifier, it can be represented by CCB, CBC, or BCC. If it consists of three Thai characters and one identifier, it can be represented by CCCB, CCBC, CBCC, or BCCC. If it consists of four Thai characters and one identifier, it can be represented by CCCCB, CCCBC, CCBCC, CBCCC, or BCCCC, and so on. Here C is a Thai character and B is the to-be-segmented identifier.
  • The extracted to-be-segmented syllables may be those containing one Thai character, those containing two consecutive Thai characters, or those containing three consecutive Thai characters, or they may include all three kinds together.
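To make the pattern notation concrete, the sketch below enumerates, for one boundary, every window of up to three characters around it and names each window with the C/B pattern used above. The function `context_patterns` and the `|` marker for the boundary position are illustrative choices, not part of the source.

```python
def context_patterns(chars: str, boundary: int, max_n: int = 3) -> dict[str, str]:
    """Return {pattern: window} for the boundary between
    chars[boundary - 1] and chars[boundary].

    A pattern such as 'CCBC' means two characters before the boundary and
    one after it; the window shows those characters with '|' at the boundary.
    Windows that would run past either end of chars are skipped.
    """
    out = {}
    for n in range(1, max_n + 1):
        for left in range(n, -1, -1):      # characters taken before the boundary
            right = n - left               # characters taken after the boundary
            if boundary - left < 0 or boundary + right > len(chars):
                continue
            pattern = "C" * left + "B" + "C" * right
            window = chars[boundary - left:boundary] + "|" + chars[boundary:boundary + right]
            out[pattern] = window
    return out


if __name__ == "__main__":
    for pattern, window in context_patterns("กขคงจฉ", 3).items():
        print(pattern, window)
```

With at least three characters on each side of the boundary, this yields the nine pattern types CB, BC, CCB, CBC, BCC, CCCB, CCBC, CBCC, and BCCC in that order.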
  • Step 104 Determine a segmentation probability of each syllable to be segmented by using a Markov chain probability speech model according to the position syllable type information of the Thai character in the syllable to be segmented.
  • In mathematics, a Markov chain is a discrete-time stochastic process that has the Markov property.
  • A Markov chain describes a sequence of states $X_1, X_2, X_3, \ldots$, each of which depends on a finite number of preceding states.
  • In other words, a Markov chain is a sequence of random variables with the Markov property. The range of these variables, that is, the set of all their possible values, is called the "state space", and the value of $X_n$ is the state at time n. If the conditional probability distribution of $X_{n+1}$ given the past states is a function of $X_n$ alone, then
  $$P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n),$$
  where x is a state in the process.
  • The identity above can be regarded as the Markov property.
  • The position syllable type information of the Thai characters in each to-be-segmented syllable has already been determined. For certain position syllable type information the corresponding segmentation probability is therefore fixed: if the position syllable type information of the Thai character is a consonant character not at the end of the syllable, the segmentation probability corresponding to a to-be-segmented syllable containing the Thai character and the to-be-segmented identifier arranged in the first order is zero; if the position syllable type information of the Thai character is a vowel character not at the beginning of the syllable, the segmentation probability corresponding to a to-be-segmented syllable containing the to-be-segmented identifier and the Thai character arranged in the second order is zero.
  • Take, for example, to-be-segmented syllables containing one, two, or three consecutive Thai characters. If the Thai character C is a consonant character not at the end of the syllable, the values of $P_{CB}$, $P_{CCB}$, $P_{CBC}$, $P_{CCCB}$, $P_{CCBC}$, and $P_{CBCC}$ are all zero. That is, in the first order, C comes before B.
  • If the Thai character C is a vowel character not at the beginning of the syllable, the values of $P_{BC}$, $P_{BCC}$, $P_{CBC}$, $P_{CCBC}$, $P_{BCCC}$, and $P_{CBCC}$ are all zero. That is, in the second order, B comes before C.
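The two fixed-zero rules can be expressed as a small filter over pattern probabilities. In this sketch the pattern lists are copied from the text above, while the type labels `consonant_not_final` and `vowel_not_initial` and the function names are hypothetical.

```python
# Patterns forced to zero for each position syllable type (lists taken from the text).
ZERO_PATTERNS = {
    "consonant_not_final": {"CB", "CCB", "CBC", "CCCB", "CCBC", "CBCC"},
    "vowel_not_initial": {"BC", "BCC", "CBC", "CCBC", "BCCC", "CBCC"},
}


def apply_constraints(probs: dict[str, float], position_type: str) -> dict[str, float]:
    """Return a copy of probs with the grammar-forced patterns set to 0.0.

    probs maps a pattern name such as 'CB' to its estimated segmentation
    probability; any other position_type leaves the values unchanged.
    """
    zero = ZERO_PATTERNS.get(position_type, set())
    return {pattern: (0.0 if pattern in zero else p) for pattern, p in probs.items()}


if __name__ == "__main__":
    estimated = {"CB": 0.4, "BC": 0.7, "CCB": 0.2}
    print(apply_constraints(estimated, "consonant_not_final"))
```

Applying the rules as a filter after estimation keeps the grammar knowledge separate from the counting step, so the constraints hold even when the corpus counts are noisy.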
  • The number of times each Thai character appears immediately before a segmentation identifier in the Thai text to be processed is counted, together with the total number of times each Thai character appears in that text. From these counts, the Markov chain probability speech model described above can determine the segmentation probability of each to-be-segmented syllable.
  • In this way, the segmentation model of to-be-segmented syllables and their corresponding segmentation probabilities can be determined.
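The counting step can be sketched for the simplest pattern, CB, as a relative frequency: how often a character is immediately followed by a syllable boundary, divided by how often it occurs at all. This is an assumed reading of the description, not the patent's exact estimator; the function name and the corpus format (each sentence given as a list of already segmented syllables) are hypothetical.

```python
from collections import Counter


def estimate_cb_probabilities(segmented_corpus: list[list[str]]) -> dict[str, float]:
    """Estimate P_CB(c) = count(c directly before a boundary) / count(c).

    segmented_corpus is a list of sentences, each a list of syllable strings.
    Assumption: the end of every syllable, including the last one in a
    sentence, counts as a boundary.
    """
    before_cut = Counter()
    total = Counter()
    for syllables in segmented_corpus:
        for syllable in syllables:
            total.update(syllable)             # count every character occurrence
            if syllable:
                before_cut[syllable[-1]] += 1  # last character precedes a boundary
    return {ch: before_cut[ch] / total[ch] for ch in total}


if __name__ == "__main__":
    corpus = [["กา", "มา"], ["กาก"]]
    for ch, p in estimate_cb_probabilities(corpus).items():
        print(ch, round(p, 3))
```

The same counting scheme extends to the longer patterns by keying the counters on character windows instead of single characters.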
  • Step 105: Segment the syllables in the given Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.
  • The segmentation model of to-be-segmented syllables and their corresponding segmentation probabilities has already been determined. Therefore, once a given Thai sentence to be processed is input, the pre-processing segmentation probability of each pre-processing syllable in the sentence can be determined according to each to-be-segmented syllable and its corresponding segmentation probability, and the syllables in the sentence are then segmented according to the size of the pre-processing segmentation probability.
  • Specifically, the sentence can be matched against the segmentation model in which the to-be-segmented syllables and their segmentation probabilities have been determined, so that each pre-processing syllable in the sentence and its corresponding pre-processing segmentation probability are obtained; the sizes of these probabilities are then compared.
  • The syllables in the given Thai sentence can be segmented according to the largest pre-processing segmentation probability.
  • Alternatively, the syllables in the given Thai sentence can be segmented wherever the pre-processing segmentation probability is greater than a set value.
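The threshold variant of Step 105 can be sketched as follows. The sketch assumes that one segmentation probability per boundary has already been computed by the model; the function name `segment` and the default threshold of 0.5 are illustrative, since the source only speaks of "a set value".

```python
def segment(chars: str, cut_probs: list[float], threshold: float = 0.5) -> list[str]:
    """Cut chars at every boundary whose probability exceeds threshold.

    cut_probs[i] is the probability of a boundary between chars[i] and
    chars[i + 1], so it must hold len(chars) - 1 values.
    """
    if not chars:
        return []
    if len(cut_probs) != len(chars) - 1:
        raise ValueError("expected one probability per boundary between characters")
    syllables, start = [], 0
    for i, p in enumerate(cut_probs, start=1):
        if p > threshold:
            syllables.append(chars[start:i])
            start = i
    syllables.append(chars[start:])  # the remainder after the last cut
    return syllables


if __name__ == "__main__":
    print(segment("กามา", [0.1, 0.9, 0.2]))
```

The maximum-probability variant would instead compare the candidate segmentations of the whole sentence and keep the one with the highest overall probability.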
  • In this embodiment, the Markov chain probability speech model is used to determine the segmentation probability of each syllable from the position syllable type information of the Thai characters, and the syllables in the Thai sentence are segmented according to that probability.
  • The segmentation method is based on the basic grammar rules of Thai.
  • The n-gram grammar of the Markov chain probability speech model is used for probability statistics, which improves the accuracy and speed of Thai syllable segmentation.
  • To-be-segmented syllables containing one, two, and three consecutive Thai characters correspond respectively to the unigram, bigram, and trigram grammars of the Markov chain probability speech model.
  • FIG. 2 is a flowchart of a Thai syllable segmentation method according to an exemplary embodiment. As shown in Figure 2, the Thai syllable segmentation process includes:
  • Step 201: Identify the non-Thai character strings in the Thai text to be processed obtained from the Thai corpus.
  • Step 202 Determine position syllable type information of each Thai character in the Thai text to be processed according to the correspondence between the saved Thai characters and the position syllable type information.
  • the position syllable type information is generated according to the basic grammar rules of the Thai language, including syllable type information and position information.
  • the correspondence between the saved Thai characters and the position syllable type information can be as shown in Table 1.
  • Step 203: In the Thai text to be processed, label the boundary between two non-Thai syllable characters as the first segmentation identifier, and label the boundary between two Thai syllable characters, and the boundary between a Thai syllable character and a non-Thai character string, as the to-be-segmented identifier.
  • The Thai text to be processed can then be written as $C_1 I_1 C_2 I_2 \ldots C_i I_i \ldots C_n I_n$, where $C_i$ is a character in the Thai text to be processed and $I_i$ is the boundary that follows it.
  • Step 204: Extract each to-be-segmented syllable in the Thai text to be processed.
  • The extracted to-be-segmented syllables include those containing one Thai character, those containing two consecutive Thai characters, and those containing three consecutive Thai characters. They correspond to the nine representation types CB, BC, CCB, CBC, BCC, CCCB, CCBC, CBCC, and BCCC, where C is a Thai character and B is the to-be-segmented identifier.
  • Step 205 Determine a segmentation probability of each syllable to be segmented by using a Markov chain probability speech model according to position syllable type information of the Thai character in the syllable to be segmented.
  • For certain position syllable type information the corresponding segmentation probability is fixed. If the Thai character C is a consonant character not at the end of the syllable, the values of $P_{CB}$, $P_{CCB}$, $P_{CBC}$, $P_{CCCB}$, $P_{CCBC}$, and $P_{CBCC}$ are all zero. If the Thai character C is a vowel character not at the beginning of the syllable, the values of $P_{BC}$, $P_{BCC}$, $P_{CBC}$, $P_{CCBC}$, $P_{BCCC}$, and $P_{CBCC}$ are all zero.
  • By counting the to-be-segmented syllables and their segmentation probabilities in the Thai documents of the Thai corpus, the segmentation model of to-be-segmented syllables and their corresponding segmentation probabilities can be determined.
  • Step 206 Determine, according to each syllable to be sliced and its corresponding segmentation probability, a pre-processing segmentation probability of each pre-processed syllable in the pending Thai sentence.
  • Step 207 Split the syllables in the set Thai sentence according to the size of the pre-processed segmentation probability.
  • In this embodiment, the Markov chain probability speech model is used with unigram, bigram, and trigram grammars to determine the segmentation probability of each syllable, and the syllables in the Thai sentence are segmented according to that probability, which improves the accuracy and speed of Thai syllable segmentation.
  • a device for segmentation of Thai syllables can be constructed.
  • FIG. 3 is a block diagram of a Thai syllable segmentation device according to an exemplary embodiment.
  • The apparatus includes a pre-processing unit 100, an identification unit 200, an extracting unit 300, a probability determining unit 400, and a segmentation unit 500, where:
  • the pre-processing unit 100 is configured to preprocess the Thai text to be processed obtained from the Thai corpus, determine a non-Thai character string, and position syllable type information of each Thai character.
  • the identifying unit 200 is configured to perform labeling on a boundary between each character in the Thai text to be processed, wherein a boundary composed of at least one Thai syllable character is marked as a to-be-segmented identifier.
  • the extracting unit 300 is configured to extract each to-be-segmented syllable in the Thai text to be processed, wherein the to-be-segmented syllable is composed of Thai characters that appear n times consecutively, and a to-be-divided identifier, where n is a positive integer.
  • the probability determining unit 400 is configured to determine a segmentation probability of each syllable to be segmented by using a Markov chain probability speech model according to the position syllable type information of the Thai character in the syllable to be segmented.
  • The segmentation unit 500 is configured to segment the syllables in the given Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.
  • The pre-processing unit 100 is further configured to identify the non-Thai character strings in the Thai text to be processed, and to determine the position syllable type information of each Thai character in the Thai text to be processed according to the saved correspondence between Thai characters and position syllable type information.
  • The identification unit 200 is further configured to mark a boundary between two non-Thai syllable characters as the first segmentation identifier, and to mark a boundary between two Thai syllable characters, or between a Thai syllable character and a non-Thai character string, as the to-be-segmented identifier.
  • The probability determining unit 400 is further configured so that: if the position syllable type information of the Thai character is a consonant character not at the end of the syllable, the segmentation probability corresponding to a to-be-segmented syllable containing the Thai character and the to-be-segmented identifier arranged in the first order is zero; and if the position syllable type information of the Thai character is a vowel character not at the beginning of the syllable, the segmentation probability corresponding to a to-be-segmented syllable containing the to-be-segmented identifier and the Thai character arranged in the second order is zero.
  • The segmentation unit 500 is further configured to determine, according to each to-be-segmented syllable and its corresponding segmentation probability, the pre-processing segmentation probability of each pre-processing syllable in the given Thai sentence to be processed, and to segment the syllables in that sentence according to the size of the pre-processing segmentation probability.
  • FIG. 4 is a block diagram of a Thai syllable segmentation device according to an exemplary embodiment. As shown in FIG. 4, the apparatus includes a pre-processing unit 100, an identification unit 200, an extracting unit 300, a probability determining unit 400, and a segmentation unit 500. A storage unit 600 can also be included.
  • The storage unit 600 stores the correspondence between Thai characters and position syllable type information, as shown in Table 1. The pre-processing unit 100 can thus identify the non-Thai character strings in the Thai text to be processed and determine the position syllable type information of each Thai character in that text according to the correspondence saved by the storage unit 600.
  • the extracting unit 300 extracts each syllable to be sliced in the Thai text to be processed.
  • The extracted to-be-segmented syllables include those containing one Thai character, those containing two consecutive Thai characters, and those containing three consecutive Thai characters. They correspond to CB, BC, CCB, CBC, BCC, CCCB, CCBC, CBCC, and BCCC, where C is a Thai character and B is the to-be-segmented identifier.
  • If the Thai character C is a consonant character not at the end of the syllable, the probability determining unit 400 determines that the values of $P_{CB}$, $P_{CCB}$, $P_{CBC}$, $P_{CCCB}$, $P_{CCBC}$, and $P_{CBCC}$ are all zero. If the Thai character C is a vowel character not at the beginning of the syllable, the probability determining unit 400 determines that the values of $P_{BC}$, $P_{BCC}$, $P_{CBC}$, $P_{CCBC}$, $P_{BCCC}$, and $P_{CBCC}$ are all zero.
  • The probability determining unit 400 may carry out the probability statistics of the Markov chain probability speech model and determine the segmentation probability of each of the nine types of to-be-segmented syllable listed in Table 2.
  • The segmentation unit 500 can determine, according to each to-be-segmented syllable and its corresponding segmentation probability, the pre-processing segmentation probability of each pre-processing syllable in the given Thai sentence to be processed, and then segment the syllables in that sentence according to the size of the pre-processing segmentation probability.
  • the Markov chain probability speech model can be used to determine the segmentation probability of each syllable by using the position syllable type information of the Thai character, and the syllables in the Thai sentence are segmented according to the segmentation probability. Another way to segment the Thai syllables is provided. And the segmentation method is based on the basic grammar rules of Thai language. The n-gram grammar of Markov chain probability speech model is used to carry out probability statistics, which improves the accuracy and segmentation speed of Thai syllable segmentation.
  • Embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
  • The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device. The instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.


Abstract

The present invention relates to the technical field of information retrieval and discloses a method and device for segmenting Thai syllables. The method comprises: pre-processing a Thai text to be processed, acquired from a Thai corpus, to determine non-Thai character strings and the position syllable type information of each Thai character; labeling the boundaries between characters in the Thai text to be processed, wherein each boundary adjacent to at least one Thai syllable character is labeled with a to-be-segmented identifier; extracting each syllable to be segmented from the Thai text to be processed, wherein each syllable to be segmented consists of n consecutive Thai characters and one to-be-segmented identifier, n being a positive integer; determining, according to the position syllable type information of the Thai characters in the syllables to be segmented, the segmentation probability of each syllable to be segmented by using a Markov chain probabilistic speech model; and segmenting the syllables in a given Thai sentence to be processed according to each syllable to be segmented and its corresponding segmentation probability.

Description

Method and device for segmenting Thai syllables

Technical field

The invention relates to the technical field of information retrieval, and in particular to a method and device for segmenting Thai syllables.

Background art

Thai (Figure PCTCN2017116082-appb-000001), also known as the Dai language, is the language of the Dai-Thai peoples and belongs to the East Asian/Sino-Tibetan language family. About 68 million people worldwide speak Thai. In Thai text, words are not separated by punctuation or spaces, and a sentence is written continuously from beginning to end; in general, a gap of about two letters, or a short pause within a sentence, marks a sentence break. However, although the syllable is a basic unit that is clearly defined in the grammar, there are no visible spaces between Thai syllables in text. Processing Thai text therefore has to begin with syllable segmentation, which provides an important foundation for Thai lexical and syntactic analysis and for more complex natural language processing tasks.

At present, Thai text can be segmented into syllables using more than 200 rules that Thai scholars have derived by summarizing the grammatical patterns of syllable formation. However, because these grammar rules are complex and hard to understand, and a large number of rules may conflict with one another, syllable segmentation of Thai is relatively slow and not very accurate.

Summary of the invention

Embodiments of the present invention provide a method and device for segmenting Thai syllables. A brief summary is given below to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.

According to a first aspect of the embodiments of the present invention, a method for segmenting Thai syllables is provided, comprising:

pre-processing a Thai text to be processed, acquired from a Thai corpus, to determine non-Thai character strings and the position syllable type information of each Thai character;

labeling the boundaries between characters in the Thai text to be processed, wherein a boundary adjacent to at least one Thai syllable character is labeled with a to-be-segmented identifier;

extracting each syllable to be segmented from the Thai text to be processed, wherein the syllable to be segmented consists of n consecutive Thai characters and one to-be-segmented identifier, n being a positive integer;

determining the segmentation probability of each syllable to be segmented by using a Markov chain probabilistic speech model, according to the position syllable type information of the Thai characters in the syllable to be segmented;

segmenting the syllables in a given Thai sentence to be processed according to each syllable to be segmented and its corresponding segmentation probability.

In an embodiment of the invention, pre-processing the Thai text to be processed to determine non-Thai character strings and the position syllable type information of each Thai character comprises:

identifying the non-Thai character strings in the Thai text to be processed;

determining the position syllable type information of each Thai character in the Thai text to be processed according to a stored correspondence between Thai characters and position syllable type information, wherein the position syllable type information is generated from the basic grammar rules of Thai and includes syllable type information and position information.

In an embodiment of the invention, labeling the boundaries between characters in the Thai text to be processed comprises:

labeling the boundary between two non-Thai-syllable characters with a first segmentation identifier;

labeling the boundary between two Thai syllable characters with the to-be-segmented identifier;

labeling the boundary between a Thai syllable character and a non-Thai character string with the to-be-segmented identifier.

In an embodiment of the invention, determining the segmentation probability of each syllable to be segmented by using the Markov chain model, according to the position syllable type information of the Thai characters in the syllable to be segmented, comprises:

if the position syllable type information of a Thai character indicates a consonant character that is not at the end of a syllable, the segmentation probability of any syllable to be segmented in which the Thai character and the to-be-segmented identifier are arranged in a first order is zero;

if the position syllable type information of a Thai character indicates a vowel character that is not at the start of a syllable, the segmentation probability of any syllable to be segmented in which the to-be-segmented identifier and the Thai character are arranged in a second order is zero.

In an embodiment of the invention, segmenting the syllables in the given Thai sentence to be processed according to each syllable to be segmented and its corresponding segmentation probability comprises:

determining the pre-processing segmentation probability of each pre-processed syllable in the given Thai sentence to be processed according to each syllable to be segmented and its corresponding segmentation probability;

segmenting the syllables in the given Thai sentence to be processed according to the magnitude of the pre-processing segmentation probabilities.

According to a second aspect of the embodiments of the present invention, a device for segmenting Thai syllables is provided, comprising:

a pre-processing unit, configured to pre-process a Thai text to be processed, acquired from a Thai corpus, to determine non-Thai character strings and the position syllable type information of each Thai character;

a labeling unit, configured to label the boundaries between characters in the Thai text to be processed, wherein a boundary adjacent to at least one Thai syllable character is labeled with a to-be-segmented identifier;

an extracting unit, configured to extract each syllable to be segmented from the Thai text to be processed, wherein the syllable to be segmented consists of n consecutive Thai characters and one to-be-segmented identifier, n being a positive integer;

a probability determining unit, configured to determine the segmentation probability of each syllable to be segmented by using a Markov chain probabilistic speech model, according to the position syllable type information of the Thai characters in the syllable to be segmented;

a segmentation unit, configured to segment the syllables in a given Thai sentence to be processed according to each syllable to be segmented and its corresponding segmentation probability.

In an embodiment of the invention, the pre-processing unit is further configured to identify the non-Thai character strings in the Thai text to be processed, and to determine the position syllable type information of each Thai character in the Thai text to be processed according to a stored correspondence between Thai characters and position syllable type information, wherein the position syllable type information is generated from the basic grammar rules of Thai and includes syllable type information and position information.

In an embodiment of the invention, the labeling unit is further configured to label the boundary between two non-Thai-syllable characters with a first segmentation identifier, to label the boundary between two Thai syllable characters with the to-be-segmented identifier, and to label the boundary between a Thai syllable character and a non-Thai character string with the to-be-segmented identifier.

In an embodiment of the invention, the probability determining unit is further configured such that, if the position syllable type information of a Thai character indicates a consonant character that is not at the end of a syllable, the segmentation probability of any syllable to be segmented in which the Thai character and the to-be-segmented identifier are arranged in a first order is zero; and, if the position syllable type information of a Thai character indicates a vowel character that is not at the start of a syllable, the segmentation probability of any syllable to be segmented in which the to-be-segmented identifier and the Thai character are arranged in a second order is zero.

In an embodiment of the invention, the segmentation unit is further configured to determine the pre-processing segmentation probability of each pre-processed syllable in the given Thai sentence to be processed according to each syllable to be segmented and its corresponding segmentation probability, and to segment the syllables in the given Thai sentence to be processed according to the magnitude of the pre-processing segmentation probabilities.

The technical solutions provided by the embodiments of the present invention may have the following beneficial effects:

In the embodiments of the present invention, the position syllable type information of Thai characters is used with a Markov chain probabilistic speech model to determine the segmentation probability of each syllable, and the syllables in a Thai sentence are segmented according to those probabilities, which provides an alternative way of segmenting Thai syllables. Because the approach is based on the basic grammar rules of Thai and uses the n-gram statistics of the Markov chain probabilistic speech model, it improves both the accuracy and the speed of Thai syllable segmentation.

It should be understood that the general description above and the detailed description below are merely exemplary and explanatory and do not limit the invention.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a flowchart of a Thai syllable segmentation method according to an exemplary embodiment;

FIG. 2 is a flowchart of a Thai syllable segmentation method according to an exemplary embodiment;

FIG. 3 is a block diagram of a Thai syllable segmentation device according to an exemplary embodiment;

FIG. 4 is a block diagram of a Thai syllable segmentation device according to an exemplary embodiment.

Detailed description

The following description and drawings illustrate specific embodiments of the invention sufficiently to enable those skilled in the art to practice them. The embodiments represent only possible variations. Unless explicitly required, individual components and functions are optional, and the order of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. The scope of the embodiments of the invention includes the full scope of the claims and all available equivalents of the claims. Herein, the embodiments may be referred to, individually or collectively, by the term "invention" merely for convenience; if more than one invention is in fact disclosed, this is not intended to automatically limit the scope of the application to any single invention or inventive concept. Relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed. The embodiments herein are described in a progressive manner: each embodiment focuses on its differences from the others, and for parts that are the same or similar the embodiments may be referred to one another. Because the structures, products, and the like disclosed in the embodiments correspond to the method parts disclosed in the embodiments, they are described relatively briefly; for related details, see the description of the method.

In Thai documents, words are not separated by punctuation or spaces, and a sentence is written continuously from beginning to end, so it is difficult to identify Thai syllables in a Thai document. In the embodiments of the present invention, the position syllable type information of Thai characters is used with a Markov chain probabilistic speech model to determine the segmentation probability of each syllable, and the syllables in a Thai sentence are segmented according to those probabilities, which provides an alternative way of segmenting Thai syllables. Because the approach is based on the basic grammar rules of Thai and uses the n-gram statistics of the Markov chain probabilistic speech model, it improves both the accuracy and the speed of Thai syllable segmentation.

FIG. 1 is a flowchart of a Thai syllable segmentation method according to an exemplary embodiment. As shown in FIG. 1, the Thai syllable segmentation process includes:

Step 101: Pre-process the Thai text to be processed, acquired from a Thai corpus, to determine non-Thai character strings and the position syllable type information of each Thai character.

In the embodiments of the present invention, the Thai corpus contains several types of Thai text, which may include news, encyclopedia entries, novels, short essays, and so on. The Thai text to be processed is obtained from this corpus.

The Thai text to be processed may contain non-Thai characters as well as Thai characters. Non-Thai characters fall into four types: foreign-language characters (for example, English letters), Arabic numerals, punctuation marks, and whitespace. The non-Thai character strings in the Thai text to be processed can therefore be identified first, so that the Thai characters and the non-Thai characters in the text are distinguished.
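As a minimal sketch of the identification step described above, the following Python groups a text into runs of Thai and non-Thai characters. It assumes that "Thai character" can be approximated by membership in the Unicode Thai block (U+0E00 to U+0E7F), which also contains Thai digits and the baht sign; the function names `is_thai` and `split_runs` are illustrative and do not come from the source.

```python
def is_thai(ch: str) -> bool:
    """Assumption: a character is Thai if it lies in the Unicode Thai block."""
    return "\u0e00" <= ch <= "\u0e7f"


def split_runs(text: str):
    """Group text into maximal runs of Thai or non-Thai characters."""
    runs = []
    for ch in text:
        kind = "thai" if is_thai(ch) else "non_thai"
        if runs and runs[-1][0] == kind:
            # Same kind as the previous character: extend the current run.
            runs[-1] = (kind, runs[-1][1] + ch)
        else:
            runs.append((kind, ch))
    return runs


if __name__ == "__main__":
    # Three Thai letters followed by a space and Arabic numerals.
    print(split_runs("\u0e44\u0e17\u0e22 2017"))
```

The non-Thai runs could be further split into the four types named in the text (foreign letters, numerals, punctuation, whitespace) with `str.isalpha`, `str.isdigit`, and `str.isspace`.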

A Thai syllable consists mainly of vowel characters, consonant characters, and a tone. The 15 Thai vowel characters can combine with one another into at least 28 vowel forms, such as diphthongs and triphthongs, that appear in syllables. The 44 Thai consonants can serve as the head of a syllable (initial consonant) or its tail (final consonant). A syllable can carry one of 5 tones. From these basic grammar rules of Thai, a correspondence between Thai characters and position syllable type information can be generated and stored, where the position syllable type information is generated from the basic grammar rules of Thai and includes syllable type information and position information.

For example, the correspondence between Thai characters and position syllable type information may be as shown in Table 1.

Figure PCTCN2017116082-appb-000002

Table 1

The position syllable type information of each Thai character in the Thai text to be processed can thus be determined from the stored correspondence between Thai characters and position syllable type information. For example, if the Thai text to be processed contains the Thai character shown in Figure PCTCN2017116082-appb-000003, then according to Table 1 the position syllable type information of that character is determined to be a consonant that is not at the end of a syllable.

Step 102: Label the boundaries between characters in the Thai text to be processed, wherein a boundary adjacent to at least one Thai syllable character is labeled with the to-be-segmented identifier.

In the embodiments of the present invention, the boundaries between characters can be labeled manually, that is, after the Thai syllables have been segmented by hand, the boundaries between characters are labeled. The boundaries can also be labeled automatically, that is, after the Thai syllables have been identified by intelligent matching, the boundaries between characters are labeled. In either case, the label set {S, B} can be used to label the boundaries between characters.

The boundary between two non-Thai-syllable characters is labeled with the first segmentation identifier, the boundary between two Thai syllable characters is labeled with the to-be-segmented identifier, and the boundary between a Thai syllable character and a non-Thai character string is also labeled with the to-be-segmented identifier.

Here, the boundary between adjacent characters in the Thai text to be processed is denoted $I_i$. The boundary between two non-Thai-syllable characters is labeled with the first segmentation identifier, that is, $I_i = S$, where S is the first segmentation identifier. The boundary between two Thai syllable characters, and the boundary between a Thai syllable character and a non-Thai character string, are labeled with the to-be-segmented identifier, that is, $I_i = B$, where B is the to-be-segmented identifier. After this labeling, the Thai text to be processed can be written as $C_1 I_1 C_2 I_2 \ldots C_i I_i \ldots C_n I_n$, where $C_i$ is a character in the Thai text to be processed, $I_i$ is the boundary following it, $I_i = S$ or $I_i = B$, and n is a positive integer.
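The labeling rule above can be sketched as follows. This is an illustrative reading, not the patent's implementation: Thai characters are approximated by the Unicode Thai block, only the boundaries between adjacent characters are labeled (the trailing boundary after the last character is omitted), and the names `is_thai`, `label_boundaries`, and `interleave` are assumptions.

```python
def is_thai(ch: str) -> bool:
    """Assumption: a character is Thai if it lies in the Unicode Thai block."""
    return "\u0e00" <= ch <= "\u0e7f"


def label_boundaries(text: str):
    """Return one label per boundary between adjacent characters.

    "S": both neighbours are non-Thai (first segmentation identifier).
    "B": at least one neighbour is Thai (to-be-segmented identifier).
    """
    return [
        "B" if is_thai(left) or is_thai(right) else "S"
        for left, right in zip(text, text[1:])
    ]


def interleave(text: str, labels):
    """Build the sequence C1 I1 C2 I2 ... from characters and labels."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i < len(labels):
            out.append(labels[i])
    return out


if __name__ == "__main__":
    sample = "ab\u0e01\u0e02"  # two Latin letters, then two Thai consonants
    print(interleave(sample, label_boundaries(sample)))
```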

Here, labeling the boundaries between characters in the Thai text to be processed is a pre-processing step. The Thai syllable characters mentioned above may be determined by manual labeling or by intelligent recognition, but the actual syllable division still requires the subsequent steps.

Step 103: Extract each syllable to be segmented from the Thai text to be processed, wherein the syllable to be segmented consists of n consecutive Thai characters and a to-be-segmented identifier.

Because the boundaries between Thai characters are not easy to determine, syllables consisting of Thai characters and a to-be-segmented identifier are extracted from the Thai text to be processed. The extracted syllables are the syllables to be segmented; each contains n consecutive Thai characters and one to-be-segmented identifier, where n is a positive integer.

If the syllable to be segmented consists of one Thai character and one to-be-segmented identifier, it can be written as CB or BC. If it consists of two Thai characters and one to-be-segmented identifier, it can be written as CCB, CBC, or BCC. If it consists of three Thai characters and one to-be-segmented identifier, it can be written as CCCB, CCBC, CBCC, or BCCC. If it consists of four Thai characters and one to-be-segmented identifier, it can be written as CCCCB, CCCBC, CCBCC, CBCCC, or BCCCC. And so on. Here C denotes a Thai character and B denotes the to-be-segmented identifier.
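The pattern shapes listed above follow a simple rule: n copies of C with a single B placed at each of the n + 1 possible positions. A short sketch (the function name `patterns` is an assumption, not from the source) enumerates them:

```python
def patterns(n: int):
    """All shapes made of n Thai characters (C) and one identifier (B).

    B is moved from the last position to the first, giving n + 1 shapes.
    """
    return ["C" * i + "B" + "C" * (n - i) for i in range(n, -1, -1)]


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, patterns(n))
```

For n = 1, 2, and 3 this yields 2 + 3 + 4 = 9 shapes in total.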

The extracted syllables to be segmented may all contain one Thai character, may all contain two consecutive Thai characters, or may all contain three consecutive Thai characters; alternatively, they may include syllables with one Thai character, syllables with two consecutive Thai characters, and syllables with three consecutive Thai characters together. Other choices are of course possible, depending on the specific application scenario, and are not enumerated here.

Step 104: Determine the segmentation probability of each syllable to be segmented by using the Markov chain probabilistic speech model, according to the position syllable type information of the Thai characters in the syllable to be segmented.

In mathematics, a Markov chain is a discrete-event stochastic process that has the Markov property. It describes a sequence of states $X_1, X_2, X_3, \ldots$ in which each state value depends only on a finite number of preceding states. A Markov chain is thus a sequence of random variables with the Markov property. The range of these variables, that is, the set of all values they can take, is called the "state space", and the value of $X_n$ is the state at time n. If the conditional probability distribution of $X_{n+1}$ given the past states is a function of $X_n$ alone, then

$$P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)$$

Here x is some state in the process. The identity above can be taken as the Markov property.

The position syllable type information of the Thai characters in a syllable to be segmented has already been determined, so for certain position syllable types the corresponding segmentation probability is fixed: if the position syllable type information of a Thai character indicates a consonant character that is not at the end of a syllable, the segmentation probability of any syllable to be segmented in which the Thai character and the to-be-segmented identifier are arranged in a first order is zero; if the position syllable type information of a Thai character indicates a vowel character that is not at the start of a syllable, the segmentation probability of any syllable to be segmented in which the to-be-segmented identifier and the Thai character are arranged in a second order is zero.

Take as an example syllables to be segmented that contain one, two, or three consecutive Thai characters. If a Thai character C is a consonant character that is not at the end of a syllable, the values of $P_{CB}$, $P_{CCB}$, $P_{CBC}$, $P_{CCCB}$, $P_{CCBC}$, and $P_{CBCC}$ are all 0; that is, in the first order, C precedes B. If a Thai character C is a vowel character that is not at the start of a syllable, the values of $P_{BC}$, $P_{BCC}$, $P_{CBC}$, $P_{CCBC}$, $P_{BCCC}$, and $P_{CBCC}$ are all 0; that is, in the second order, B precedes C.
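The two lists of zero-probability patterns above can be reproduced at the level of pattern shapes. The sketch below is one reading of the rule: the text does not say which C within a pattern carries the position syllable type, so the code simply selects shapes in which some C precedes B (first order) or some C follows B (second order). All function names are assumptions.

```python
def patterns(max_n: int = 3):
    """All shapes with 1..max_n Thai characters (C) and one identifier (B)."""
    out = []
    for n in range(1, max_n + 1):
        out.extend("C" * i + "B" + "C" * (n - i) for i in range(n, -1, -1))
    return out


def zero_for_non_final_consonant(pats):
    """First order: at least one C stands before B in the shape."""
    return [p for p in pats if p.index("B") > 0]


def zero_for_non_initial_vowel(pats):
    """Second order: at least one C stands after B in the shape."""
    return [p for p in pats if p.index("B") < len(p) - 1]


if __name__ == "__main__":
    shapes = patterns()
    print(zero_for_non_final_consonant(shapes))
    print(zero_for_non_initial_vowel(shapes))
```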

For syllables to be segmented whose Thai characters have other position syllable types, count the number of times each Thai character appears before the first segmentation identifier in the Thai text to be processed, and the total number of times each Thai character appears in that text; the segmentation probability of each syllable to be segmented can then be determined with the Markov chain probabilistic speech model described above.
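The counting described above can be illustrated with a relative-frequency estimate. This is a sketch under stated assumptions, not the patent's exact statistic: the training data is assumed to be text already split into syllables, and the probability of a break after a character is estimated as (number of times the character ends a syllable) / (total occurrences of the character), which corresponds to the one-character shape CB. The function name `break_after_probability` is hypothetical.

```python
from collections import Counter


def break_after_probability(syllables):
    """Estimate, per character, the probability that a break follows it."""
    total = Counter()    # total occurrences of each character
    at_end = Counter()   # occurrences directly before a syllable break
    for syl in syllables:
        total.update(syl)
        if syl:
            at_end[syl[-1]] += 1
    # A Counter returns 0 for characters that never end a syllable.
    return {ch: at_end[ch] / total[ch] for ch in total}


if __name__ == "__main__":
    # Latin placeholders stand in for Thai characters in this toy corpus.
    print(break_after_probability(["ab", "cb", "ba"]))
```

Longer shapes (two or three characters of context) could be handled the same way by counting character tuples instead of single characters.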

By collecting statistics on the syllables to be segmented and their segmentation probabilities over the Thai documents in the Thai corpus, a segmentation model of syllables and their corresponding segmentation probabilities can thus be obtained.

步骤105:根据每个待切分音节及其对应的切分概率,切分设定待处理泰语句子中的音节。Step 105: Splitting the syllables in the pending Thai sentence according to each syllable to be sliced and its corresponding segmentation probability.

已经确定了音节以及对应的切分概率的切分模型,因此,输入了设定 待处理泰语句子后,可根据每个待切分音节及其对应的切分概率,确定设定待处理泰语句子中每个预处理音节的预处理切分概率,然后,根据预处理切分概率的大小,对设定待处理泰语句子中的音节进行切分。The segmentation model of the syllable and the corresponding segmentation probability has been determined. Therefore, after the input of the pending Thai sentence is input, the setting of the pending Thai sentence can be determined according to each syllable to be sliced and its corresponding segmentation probability. The pre-processed segmentation probability of each pre-processed syllable is then segmented according to the size of the pre-processed segmentation probability.

具体,可在已经确定了音节以及对应的切分概率的切分模型中匹配,确定设定待处理泰语句子中每种可能存在的预处理音节及其对应的预处理切分概率,然后,比较预处理切分概率的大小,可根据最大一组切分概率,对设定待处理泰语句子中的音节进行切分。或者,根据大于设定值的预处理切分概率,对设定待处理泰语句子中的音节进行切分。Specifically, it may be matched in a segmentation model in which the syllables and the corresponding segmentation probabilities have been determined, and it is determined that each of the pre-processing syllables and their corresponding pre-processing segmentation probabilities in the pending Thai sentence are set, and then compared. The size of the pre-processed segmentation probability can be segmented according to the maximum set of segmentation probabilities for the syllables in the set Thai sentence. Alternatively, the syllables in the set Thai sentence are segmented according to the pre-processing splitting probability greater than the set value.
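The second alternative, cutting wherever the probability exceeds a set value, can be sketched as follows. This is an illustrative sketch only: `segment_by_threshold` is a hypothetical name, the default threshold of 0.5 is an assumption, and the per-boundary probabilities are taken as already computed from the segmentation model.

```python
def segment_by_threshold(chars, boundary_probs, threshold=0.5):
    """Split chars into syllables at boundaries whose probability exceeds threshold.

    boundary_probs[i] is the probability that a syllable boundary lies between
    chars[i] and chars[i + 1], so it must have len(chars) - 1 entries.
    """
    if not chars:
        return []
    if len(boundary_probs) != len(chars) - 1:
        raise ValueError("expected one probability per gap between characters")
    syllables = []
    start = 0
    for i, prob in enumerate(boundary_probs):
        if prob > threshold:
            syllables.append(chars[start:i + 1])
            start = i + 1
    syllables.append(chars[start:])  # the last syllable runs to the end
    return syllables
```

For instance, `segment_by_threshold("abcde", [0.1, 0.9, 0.2, 0.8])` cuts after the second and fourth characters.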

It can be seen that, in this embodiment of the present invention, the segmentation probability of each syllable is determined from the position syllable type information of the Thai characters using the Markov chain probabilistic speech model, and the syllables in a Thai sentence are segmented according to those probabilities, which provides another way of segmenting Thai syllables. Because this approach is based on the basic grammatical rules of Thai and uses the n-grams of the Markov chain probabilistic speech model for probability statistics, it improves the accuracy and speed of Thai syllable segmentation.

The operation flow is now combined into a specific embodiment to illustrate the method provided by the embodiments of the present disclosure.

In this embodiment, a to-be-segmented syllable may contain one, two, or three consecutive Thai characters, corresponding respectively to the unigram, bigram, and trigram of the Markov chain probabilistic speech model.

FIG. 2 is a flowchart of a Thai syllable segmentation method according to an exemplary embodiment. As shown in FIG. 2, the Thai syllable segmentation process includes:

Step 201: Identify the non-Thai character strings in the Thai text to be processed obtained from the Thai corpus.
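One simple way to carry out Step 201 is to treat every maximal run of characters outside the Unicode Thai block (U+0E00 to U+0E7F) as a non-Thai character string. The text does not say how the identification is performed, so this is an assumed approach, and `find_non_thai_strings` is a hypothetical name.

```python
import re

# Any maximal run of characters outside the Unicode Thai block U+0E00..U+0E7F.
NON_THAI_RUN = re.compile(r"[^\u0E00-\u0E7F]+")


def find_non_thai_strings(text):
    """Return (start, end, substring) for each non-Thai run in text."""
    return [(m.start(), m.end(), m.group()) for m in NON_THAI_RUN.finditer(text)]
```

Note that under this rule whitespace, digits, and Latin letters all fall into the same run; a real system may want to treat them separately.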

Step 202: Determine the position syllable type information of each Thai character in the Thai text to be processed according to the stored correspondence between Thai characters and position syllable type information.

Here, the position syllable type information is generated according to the basic grammatical rules of Thai and includes syllable type information and position information. The stored correspondence between Thai characters and position syllable type information may be as shown in Table 1.
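A lookup of this kind might be sketched as below. Table 1 is not reproduced in this section, so the categories here are hypothetical stand-ins derived only from the Unicode layout of the Thai block, not the patent's actual table; the function name `position_syllable_type` is likewise an assumption.

```python
# Hypothetical labels; the actual Table 1 entries are not reproduced here.
CONSONANT = "consonant"
LEADING_VOWEL = "vowel_written_before_initial_consonant"
NON_INITIAL_VOWEL = "vowel_not_at_syllable_start"
OTHER = "other"


def position_syllable_type(ch):
    """Map a single character to a coarse, illustrative position syllable type."""
    cp = ord(ch)
    # Letters KO KAI (U+0E01) .. HO NOKHUK (U+0E2E), excluding RU and LU.
    if 0x0E01 <= cp <= 0x0E2E and cp not in (0x0E24, 0x0E26):
        return CONSONANT
    # SARA E .. SARA AI MAIMALAI are written before the initial consonant.
    if 0x0E40 <= cp <= 0x0E44:
        return LEADING_VOWEL
    # Vowel signs written above or below a consonant cannot start a syllable.
    if cp == 0x0E31 or 0x0E34 <= cp <= 0x0E39:
        return NON_INITIAL_VOWEL
    return OTHER
```

A full table would also have to record whether a given consonant may close a syllable, which cannot be read off the code point alone.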

Step 203: Label the boundary between two non-Thai syllable characters in the Thai text to be processed with the first segmentation identifier, and label both the boundary between two Thai syllable characters and the boundary between a Thai syllable character and a non-Thai character string with the to-be-segmented identifier.

The Thai text to be processed can thus be represented as C_1 I_1 C_2 I_2 … C_i I_i … C_n I_n, where C_i is a character in the Thai text to be processed and I_i is the boundary between characters, with I_i = S or I_i = B. S is the first segmentation identifier and B is the to-be-segmented identifier.
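The labelling rule of Step 203 can be sketched as follows. This is a minimal sketch with assumptions: a character counts as Thai if it lies in the Unicode block U+0E00 to U+0E7F, only the gaps between adjacent characters are labelled (the text does not say how the final boundary I_n is labelled), and `label_boundaries` is a hypothetical name.

```python
def is_thai(ch):
    """Assumption: a Thai character is any code point in U+0E00..U+0E7F."""
    return "\u0e00" <= ch <= "\u0e7f"


def label_boundaries(text):
    """Return one label per gap between adjacent characters.

    "S" (first segmentation identifier): both neighbours are non-Thai.
    "B" (to-be-segmented identifier): at least one neighbour is Thai.
    """
    labels = []
    for left, right in zip(text, text[1:]):
        if is_thai(left) or is_thai(right):
            labels.append("B")
        else:
            labels.append("S")
    return labels
```

Interleaving the characters with these labels yields the C_1 I_1 C_2 I_2 … form used in the text.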

Step 204: Extract each to-be-segmented syllable from the Thai text to be processed.

Here, the extracted to-be-segmented syllables include those containing one Thai character, those containing two consecutive Thai characters, and those containing three consecutive Thai characters. They correspond to nine representation types: CB, BC, CCB, CBC, BCC, CCCB, CCBC, CBCC, and BCCC, where C denotes a Thai character and B denotes the to-be-segmented identifier.
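The nine types can be read as context windows around one candidate boundary: the position of B in the pattern says how many characters are taken from each side. The sketch below follows that reading; `extract_candidates` is a hypothetical name, and the input is assumed to be a run of Thai characters that has already been isolated.

```python
PATTERNS = ("CB", "BC", "CCB", "CBC", "BCC", "CCCB", "CCBC", "CBCC", "BCCC")


def extract_candidates(chars, gap):
    """Extract the context of every pattern type around one candidate boundary.

    gap -- index i of the boundary lying between chars[i] and chars[i + 1]
    Returns {pattern: (characters before B, characters after B)}; a pattern is
    omitted when the text is too short to supply its context.
    """
    candidates = {}
    for pattern in PATTERNS:
        n_before = pattern.index("B")            # characters before B
        n_after = len(pattern) - n_before - 1    # characters after B
        start = gap + 1 - n_before
        end = gap + 1 + n_after
        if start >= 0 and end <= len(chars):
            candidates[pattern] = (chars[start:gap + 1], chars[gap + 1:end])
    return candidates
```

With four characters and the boundary after the second one, seven of the nine patterns fit; CCCB and BCCC need a third character on one side and are skipped.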

Step 205: Determine the segmentation probability of each to-be-segmented syllable using the Markov chain probabilistic speech model according to the position syllable type information of the Thai characters in the to-be-segmented syllable.

Figure PCTCN2017116082-appb-000004

Figure PCTCN2017116082-appb-000005

Table 2

For certain position syllable type information, the corresponding segmentation probability is fixed. If the Thai character C is a consonant that is not at the end of a syllable, the values of P_CB, P_CCB, P_CBC, P_CCCB, P_CCBC, and P_CBCC are all 0. If the Thai character C is a vowel that is not at the start of a syllable, the values of P_BC, P_BCC, P_CBC, P_CCBC, P_BCCC, and P_CBCC are all 0.

For to-be-segmented syllables whose Thai characters carry other position syllable type information, two counts are taken for each Thai character: the number of times it appears before the first segmentation identifier in the Thai text to be processed, and the total number of times it appears in that text. The Markov chain probabilistic speech model described above is then used to compute the segmentation probability of each to-be-segmented syllable for the nine types in Table 2.

According to Table 2, statistics are compiled on the to-be-segmented syllables and their segmentation probabilities across the Thai documents in the Thai corpus, which determines the segmentation model of syllables and their corresponding segmentation probabilities.

Step 206: Determine the preprocessed segmentation probability of each preprocessed syllable in the specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.

Step 207: Segment the syllables in the specified Thai sentence to be processed according to the magnitude of the preprocessed segmentation probabilities.

For example, if CCBC has the largest preprocessed segmentation probability, it can be determined that I_i = B = 1, that is, this position is a syllable boundary.

It can be seen that, in this embodiment, the segmentation probability of each syllable is determined through the unigram, bigram, and trigram of the Markov chain probabilistic speech model, and the syllables in a Thai sentence are segmented according to those probabilities, which improves the accuracy and speed of Thai syllable segmentation.

The following are apparatus embodiments of the present disclosure, which can be used to carry out the method embodiments of the present disclosure.

An apparatus for Thai syllable segmentation can be constructed according to the Thai syllable segmentation process described above.

FIG. 3 is a block diagram of a Thai syllable segmentation apparatus according to an exemplary embodiment. As shown in FIG. 3, the apparatus includes a preprocessing unit 100, an identification unit 200, an extraction unit 300, a probability determination unit 400, and a segmentation unit 500, wherein:

The preprocessing unit 100 is configured to preprocess the Thai text to be processed obtained from the Thai corpus, determining the non-Thai character strings and the position syllable type information of each Thai character.

The identification unit 200 is configured to label the boundary between every pair of adjacent characters in the Thai text to be processed, wherein a boundary formed by at least one Thai syllable character is labelled with the to-be-segmented identifier.

The extraction unit 300 is configured to extract each to-be-segmented syllable from the Thai text to be processed, wherein a to-be-segmented syllable consists of n consecutive Thai characters and one to-be-segmented identifier, n being a positive integer.

The probability determination unit 400 is configured to determine the segmentation probability of each to-be-segmented syllable using the Markov chain probabilistic speech model according to the position syllable type information of the Thai characters in the to-be-segmented syllable.

The segmentation unit 500 is configured to segment the syllables in the specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.

In an embodiment of the present invention, the preprocessing unit 100 is further configured to identify the non-Thai character strings in the Thai text to be processed, and to determine the position syllable type information of each Thai character in the Thai text to be processed according to the stored correspondence between Thai characters and position syllable type information, wherein the position syllable type information is generated according to the basic grammatical rules of Thai and includes syllable type information and position information.

In an embodiment of the present invention, the identification unit 200 is further configured to label the boundary between two non-Thai syllable characters with the first segmentation identifier, to label the boundary between two Thai syllable characters with the to-be-segmented identifier, and to label the boundary between a Thai syllable character and a non-Thai character string with the to-be-segmented identifier.

In an embodiment of the present invention, the probability determination unit 400 is further configured such that, if the position syllable type information of a Thai character indicates a consonant that is not at the end of a syllable, the segmentation probability of any to-be-segmented syllable that contains the Thai character and the to-be-segmented identifier arranged in the first order is zero; and if the position syllable type information of a Thai character indicates a vowel that is not at the start of a syllable, the segmentation probability of any to-be-segmented syllable that contains the to-be-segmented identifier and the Thai character arranged in the second order is zero.

In an embodiment of the present invention, the segmentation unit 500 is further configured to determine the preprocessed segmentation probability of each preprocessed syllable in the specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability, and to segment the syllables in that sentence according to the magnitude of the preprocessed segmentation probabilities.

The apparatus provided by the embodiments of the present disclosure is illustrated by the following example.

FIG. 4 is a block diagram of a Thai syllable segmentation apparatus according to an exemplary embodiment. As shown in FIG. 4, the apparatus includes a preprocessing unit 100, an identification unit 200, an extraction unit 300, a probability determination unit 400, and a segmentation unit 500. It may further include a storage unit 600.

The storage unit 600 stores the correspondence between Thai characters and position syllable type information shown in Table 1. The preprocessing unit 100 can thus identify the non-Thai character strings in the Thai text to be processed and, according to the correspondence stored in the storage unit 600, determine the position syllable type information of each Thai character in the Thai text to be processed.

The identification unit 200 labels the boundary between two non-Thai syllable characters in the Thai text to be processed with the first segmentation identifier, and labels both the boundary between two Thai syllable characters and the boundary between a Thai syllable character and a non-Thai character string with the to-be-segmented identifier. The Thai text to be processed can thus be represented as C_1 I_1 C_2 I_2 … C_i I_i … C_n I_n, where C_i is a character in the Thai text to be processed and I_i is the boundary between characters, with I_i = S or I_i = B. S is the first segmentation identifier and B is the to-be-segmented identifier.

The extraction unit 300 then extracts each to-be-segmented syllable from the Thai text to be processed. The extracted to-be-segmented syllables include those containing one Thai character, those containing two consecutive Thai characters, and those containing three consecutive Thai characters. They correspond to nine representation types: CB, BC, CCB, CBC, BCC, CCCB, CCBC, CBCC, and BCCC, where C denotes a Thai character and B denotes the to-be-segmented identifier.

If the Thai character C is a consonant that is not at the end of a syllable, the probability determination unit 400 determines that the values of P_CB, P_CCB, P_CBC, P_CCCB, P_CCBC, and P_CBCC are all 0. If the Thai character C is a vowel that is not at the start of a syllable, the probability determination unit 400 determines that the values of P_BC, P_BCC, P_CBC, P_CCBC, P_BCCC, and P_CBCC are all 0. For to-be-segmented syllables whose Thai characters carry other position syllable type information, the probability determination unit 400 performs the probability statistics of the Markov chain probabilistic speech model to determine the segmentation probability of each to-be-segmented syllable for the nine types in Table 2 above.

The segmentation unit 500 can then determine the preprocessed segmentation probability of each preprocessed syllable in the specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability, and segment the syllables in that sentence according to the magnitude of the preprocessed segmentation probabilities.

It can be seen that, in this embodiment, the segmentation probability of each syllable is determined from the position syllable type information of the Thai characters using the Markov chain probabilistic speech model, and the syllables in a Thai sentence are segmented according to those probabilities, which provides another way of segmenting Thai syllables. Because this approach is based on the basic grammatical rules of Thai and uses the n-grams of the Markov chain probabilistic speech model for probability statistics, it improves the accuracy and speed of Thai syllable segmentation.

Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

It should be understood that the present invention is not limited to the flows and structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. A method for Thai syllable segmentation, characterized by comprising:
preprocessing Thai text to be processed obtained from a Thai corpus to determine non-Thai character strings and position syllable type information of each Thai character;
labelling the boundary between every pair of adjacent characters in the Thai text to be processed, wherein a boundary formed by at least one Thai syllable character is labelled with a to-be-segmented identifier;
extracting each to-be-segmented syllable from the Thai text to be processed, wherein the to-be-segmented syllable consists of n consecutive Thai characters and one said to-be-segmented identifier, n being a positive integer;
determining a segmentation probability of each to-be-segmented syllable using a Markov chain probabilistic speech model according to the position syllable type information of the Thai characters in the to-be-segmented syllable;
segmenting the syllables in a specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.

2. The method according to claim 1, characterized in that preprocessing the Thai text to be processed to determine the non-Thai character strings and the position syllable type information of each Thai character comprises:
identifying the non-Thai character strings in the Thai text to be processed;
determining the position syllable type information of each Thai character in the Thai text to be processed according to a stored correspondence between Thai characters and position syllable type information, wherein the position syllable type information is generated according to the basic grammatical rules of Thai and includes syllable type information and position information.

3. The method according to claim 1, characterized in that labelling the boundary between every pair of adjacent characters in the Thai text to be processed comprises:
labelling the boundary between two non-Thai syllable characters with a first segmentation identifier;
labelling the boundary between two Thai syllable characters with the to-be-segmented identifier;
labelling the boundary between a Thai syllable character and a non-Thai character string with the to-be-segmented identifier.

4. The method according to claim 1 or 2, characterized in that determining the segmentation probability of each to-be-segmented syllable using the Markov chain model according to the position syllable type information of the Thai characters in the to-be-segmented syllable comprises:
if the position syllable type information of the Thai character indicates a consonant that is not at the end of a syllable, the segmentation probability of a to-be-segmented syllable that contains the Thai character and the to-be-segmented identifier arranged in a first order is zero;
if the position syllable type information of the Thai character indicates a vowel that is not at the start of a syllable, the segmentation probability of a to-be-segmented syllable that contains the to-be-segmented identifier and the Thai character arranged in a second order is zero.

5. The method according to claim 1, characterized in that segmenting the syllables in the specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability comprises:
determining a preprocessed segmentation probability of each preprocessed syllable in the specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability;
segmenting the syllables in the specified Thai sentence to be processed according to the magnitude of the preprocessed segmentation probabilities.

6. An apparatus for Thai syllable segmentation, characterized by comprising:
a preprocessing unit, configured to preprocess Thai text to be processed obtained from a Thai corpus to determine non-Thai character strings and position syllable type information of each Thai character;
an identification unit, configured to label the boundary between every pair of adjacent characters in the Thai text to be processed, wherein a boundary formed by at least one Thai syllable character is labelled with a to-be-segmented identifier;
an extraction unit, configured to extract each to-be-segmented syllable from the Thai text to be processed, wherein the to-be-segmented syllable consists of n consecutive Thai characters and one said to-be-segmented identifier, n being a positive integer;
a probability determination unit, configured to determine a segmentation probability of each to-be-segmented syllable using a Markov chain probabilistic speech model according to the position syllable type information of the Thai characters in the to-be-segmented syllable;
a segmentation unit, configured to segment the syllables in a specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability.

7. The apparatus according to claim 6, characterized in that
the preprocessing unit is further configured to identify the non-Thai character strings in the Thai text to be processed, and to determine the position syllable type information of each Thai character in the Thai text to be processed according to a stored correspondence between Thai characters and position syllable type information, wherein the position syllable type information is generated according to the basic grammatical rules of Thai and includes syllable type information and position information.

8. The apparatus according to claim 6, characterized in that
the identification unit is further configured to label the boundary between two non-Thai syllable characters with a first segmentation identifier, to label the boundary between two Thai syllable characters with the to-be-segmented identifier, and to label the boundary between a Thai syllable character and a non-Thai character string with the to-be-segmented identifier.

9. The apparatus according to claim 6 or 7, characterized in that
the probability determination unit is further configured such that, if the position syllable type information of the Thai character indicates a consonant that is not at the end of a syllable, the segmentation probability of a to-be-segmented syllable that contains the Thai character and the to-be-segmented identifier arranged in a first order is zero; and if the position syllable type information of the Thai character indicates a vowel that is not at the start of a syllable, the segmentation probability of a to-be-segmented syllable that contains the to-be-segmented identifier and the Thai character arranged in a second order is zero.

10. The apparatus according to claim 6, characterized in that
the segmentation unit is further configured to determine a preprocessed segmentation probability of each preprocessed syllable in the specified Thai sentence to be processed according to each to-be-segmented syllable and its corresponding segmentation probability, and to segment the syllables in the specified Thai sentence to be processed according to the magnitude of the preprocessed segmentation probabilities.
PCT/CN2017/116082 2017-11-27 2017-12-14 Method and device for segmenting thai syllables Ceased WO2019100458A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711204590.XA CN107967259A (en) 2017-11-27 2017-11-27 The method and device of Thai syllable splitting
CN201711204590.X 2017-11-27

Publications (1)

Publication Number Publication Date
WO2019100458A1 true WO2019100458A1 (en) 2019-05-31

Family

ID=61998959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/116082 Ceased WO2019100458A1 (en) 2017-11-27 2017-12-14 Method and device for segmenting thai syllables

Country Status (2)

Country Link
CN (1) CN107967259A (en)
WO (1) WO2019100458A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871537B (en) * 2019-01-31 2022-12-27 沈阳雅译网络技术有限公司 High-precision Thai sentence segmentation method
CN111627421B (en) * 2020-05-13 2023-08-11 广州国音智能科技有限公司 Speech recognition method, device, equipment and computer-readable storage medium
CN112905024B (en) * 2021-01-21 2023-10-27 李博林 Syllable recording method and device for word
CN114254638B (en) * 2021-12-22 2025-06-06 科大讯飞股份有限公司 Devanagari word segmentation and recognition method, device, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1346126A (en) * 2000-09-27 2002-04-24 中国科学院自动化研究所 Three-tone model with tune and training method
CN103324607A (en) * 2012-03-20 2013-09-25 北京百度网讯科技有限公司 Method and device for word segmentation of Thai texts
CN103324621A (en) * 2012-03-21 2013-09-25 北京百度网讯科技有限公司 Method and device for correcting spelling of Thai texts
CN103914569A (en) * 2014-04-24 2014-07-09 百度在线网络技术(北京)有限公司 Input prompt method and device and dictionary tree model establishing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5806021A (en) * 1995-10-30 1998-09-08 International Business Machines Corporation Automatic segmentation of continuous text using statistical approaches
US8165869B2 (en) * 2007-12-10 2012-04-24 International Business Machines Corporation Learning word segmentation from non-white space languages corpora
CN103678282B (en) * 2014-01-07 2016-05-25 苏州思必驰信息科技有限公司 A kind of segmenting method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460766A (en) * 2020-03-31 2020-07-28 云知声智能科技股份有限公司 Method and device for identifying contradictory speech block boundaries
CN111460766B (en) * 2020-03-31 2023-05-26 云知声智能科技股份有限公司 Contradictory language block boundary recognition method and device
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
CN112883726B (en) * 2021-01-21 2021-12-28 昆明理工大学 Multi-task Thai word segmentation method based on joint learning of syllable segmentation and word segmentation

Also Published As

Publication number Publication date
CN107967259A (en) 2018-04-27

Similar Documents

Publication Publication Date Title
WO2019100458A1 (en) Method and device for segmenting Thai syllables
Zitouni et al. Maximum entropy based restoration of Arabic diacritics
CN108536654B (en) Method and device for displaying identification text
CN111859964B (en) Method and device for identifying named entities in sentences
Freeman et al. Cross linguistic name matching in English and Arabic
KR20110004625A (en) A system and method for converting native phonetic pronunciation strings for Chinese characters using statistical methods
CN111046660B (en) Method and device for identifying text professional terms
Shaalan et al. A hybrid approach for building Arabic diacritizer
EP1675019B1 (en) System and method for disambiguating non diacritized arabic words in a text
CN111178009A (en) A text multilingual recognition method based on feature word weighting
Paripremkul et al. Segmenting words in Thai language using Minimum text units and conditional random Field
CN103744837B (en) Many texts contrast method based on keyword abstraction
Li et al. Chinese prosody phrase break prediction based on maximum entropy model.
US8335681B2 (en) Machine-translation apparatus using multi-stage verbal-phrase patterns, methods for applying and extracting multi-stage verbal-phrase patterns
Alghamdi et al. Automatic restoration of Arabic diacritics: a simple, purely statistical approach
CN107168953A (en) The new word discovery method and system that word-based vector is characterized in mass text
CN110866390B (en) Method and device for recognizing Chinese grammar error, computer equipment and storage medium
Nederhof et al. A probabilistic model of Ancient Egyptian writing
Chiu et al. Chinese spell checking based on noisy channel model
CN113919326A (en) Text error correction method and device
KR20120042381A (en) Apparatus and method for classifying sentence pattern of speech recognized sentence
Wray et al. Best practices for crowdsourcing dialectal arabic speech transcription
CN112084777B (en) Entity linking method
Tongtep et al. Multi-stage automatic NE and POS annotation using pattern-based and statistical-based techniques for Thai corpus construction
CN109344389A (en) A method and system for constructing a Chinese-blind bilingual corpus

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17932855; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17932855; Country of ref document: EP; Kind code of ref document: A1)