
CN108628813B - Processing method and device for processing - Google Patents

Processing method and device for processing

Info

Publication number
CN108628813B
CN108628813B (granted patent; application CN201710162165.2A)
Authority
CN
China
Prior art keywords
punctuation
optimal
semantic
language model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710162165.2A
Other languages
Chinese (zh)
Other versions
CN108628813A (en)
Inventor
郑宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201710162165.2A
Publication of CN108628813A
Application granted
Publication of CN108628813B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a processing method, a processing apparatus, and a device for processing. The method specifically comprises the following steps: acquiring a text to be processed; performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed; performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed, wherein the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises at least one semantic segment, the semantic segment comprising consecutive words of the global word sequence and/or consecutive words with added punctuation marks; and outputting the optimal punctuation addition result. The embodiment of the invention can improve the accuracy of punctuation addition.

Description

Processing method and device for processing
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a processing method and apparatus, and an apparatus for processing.
Background
In the information processing technology fields such as the communication field and the internet field, in some application scenarios, punctuation needs to be added to some files lacking punctuation, for example, punctuation is added to a text corresponding to a voice recognition result.
In an existing scheme, punctuation can be added to the text corresponding to a speech recognition result according to the silence intervals of the speech signal. Specifically, a threshold on silence length may be set first; if the length of a silence interval while the speaking user speaks exceeds the threshold, a punctuation mark is added at the corresponding position; conversely, if the length of the silence interval does not exceed the threshold, no punctuation mark is added.
However, in implementing the embodiments of the present invention, the inventor found that different speaking users often have different speech rates, so adding punctuation to the text corresponding to the speech recognition result according to the silence intervals of the speech signal, as in the existing scheme, affects the accuracy of punctuation addition. For example, if the speaking user speaks too fast, there is no interval between sentences, or the interval is so short that it falls below the threshold, and no punctuation is added to the text; conversely, if the speaking user speaks too slowly, approaching a pause after every word, far too many punctuation marks are added to the text. Both cases cause punctuation addition errors, i.e. low punctuation addition accuracy.
Disclosure of Invention
In view of the above problems, embodiments of the present invention have been made to provide a processing method, a processing apparatus, and an apparatus for processing that overcome or at least partially solve the above problems, and can improve accuracy of punctuation addition.
In order to solve the above problem, the present invention discloses a processing method, comprising:
acquiring a text to be processed;
performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises: at least one semantic segment, the semantic segment comprising: consecutive words of the global word sequence and/or consecutive words with added punctuation marks;
and outputting the optimal punctuation addition result.
In another aspect, the present invention discloses a processing apparatus comprising:
the text to be processed acquisition module is used for acquiring a text to be processed;
the word segmentation module is used for segmenting the text to be processed to obtain a global word sequence corresponding to the text to be processed;
the punctuation addition processing module is used for performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises: at least one semantic segment, the semantic segment comprising: consecutive words of the global word sequence and/or consecutive words with added punctuation marks; and
and the result output module is used for outputting the optimal punctuation addition result.
Optionally, the punctuation addition processing module includes:
and the dynamic programming processing submodule is used for performing punctuation addition processing on the global word sequence by utilizing a dynamic programming algorithm so as to obtain an optimal punctuation addition result corresponding to the text to be processed.
Optionally, the dynamic programming processing submodule includes:
the set acquisition unit is used for acquiring a word sequence set corresponding to the global word sequence;
a first recursion unit, configured to determine, in a recursion manner, target punctuation marks of punctuation addition results of the optimal subsets corresponding to the subsets according to the order of the subsets of the word sequence set from small to large; the language model corresponding to the optimal subset punctuation addition result is optimal in probability;
and the first optimal result acquisition unit is used for acquiring an optimal punctuation addition result corresponding to the text to be processed according to the punctuation addition result of the optimal subset corresponding to the subset of the word sequence set.
Optionally, the subset of the word sequence set comprises: the first i consecutive words of the text to be processed, where i is greater than 0 and less than or equal to the number M of words contained in the text to be processed, and the first recursion unit includes:
an adding subunit, configured to add punctuation marks between adjacent words in the first i consecutive words according to a target punctuation mark of an optimal subset punctuation addition result corresponding to the first k consecutive words, so as to obtain at least one subset punctuation addition path corresponding to the first i consecutive words; wherein 0< k < i, k being a positive integer;
the first language model probability determining subunit is used for determining the language model probability of the first semantic segment corresponding to the subset punctuation adding path by utilizing a neural network language model;
the first selection subunit is used for selecting an optimal subset punctuation adding path with optimal language model probability from the at least one subset punctuation adding path according to the language model probability of the first semantic segment;
and the target punctuation mark obtaining subunit is configured to obtain, according to punctuation marks included in the optimal subset punctuation adding path, target punctuation marks of the optimal subset punctuation adding result corresponding to the first i consecutive words.
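The prefix recursion described by these subunits can be sketched as a small dynamic program: best[i] holds the optimal punctuation addition result for the first i consecutive words and is computed from the best results for shorter prefixes plus one candidate mark. The segment scorer and per-mark penalties below are invented stand-ins for a trained neural network language model; only the recursion structure follows the text.

```python
# Dynamic-programming sketch of the prefix recursion (first recursion unit).
# segment_logprob and MARK_COST are toy stand-ins for a neural-network
# language model; a real system would score each first semantic segment
# with a trained LM.
def segment_logprob(segment_words):
    # Toy rule: segments of 2-4 words read as "fluent"; others are penalized.
    return -0.5 if 2 <= len(segment_words) <= 4 else -3.0

MARK_COST = {"，": 0.1, "。": 0.2}  # toy per-mark penalties (assumption)

def best_punctuation(words):
    # best[i] = (score, punctuated token list) for the first i words
    best = {0: (0.0, [])}
    for i in range(1, len(words) + 1):
        candidates = []
        for k in range(i):                 # split point: first k words already done
            prev_score, prev_seq = best[k]
            seg = words[k:i]               # first semantic segment ending at word i
            if k == 0:
                candidates.append((prev_score + segment_logprob(seg), seg))
            else:
                for mark, cost in MARK_COST.items():
                    score = prev_score + segment_logprob(seg) - cost
                    candidates.append((score, prev_seq + [mark] + seg))
        best[i] = max(candidates, key=lambda c: c[0])
    return "".join(best[len(words)][1])

print(best_punctuation(["今天", "天气", "很好", "我们", "出去", "玩", "吧"]))
```

With the toy scorer, the seven-word sequence gets a single comma after the third word, i.e. "今天天气很好，我们出去玩吧"; swapping in a real LM changes only the scoring calls, not the recursion.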
Optionally, the dynamic programming processing submodule includes:
a global path obtaining unit, configured to add punctuation marks between adjacent words in the global word sequence to obtain a global punctuation adding path corresponding to the global word sequence;
the moving acquisition unit is used for acquiring, by moving through the global punctuation adding path from front to back, a local punctuation adding path and a second semantic segment corresponding to the local punctuation adding path; different second semantic segments contain the same number of character units, and adjacent second semantic segments share overlapping character units, where a character unit comprises: a word and/or a punctuation mark;
the second recursion unit is used for determining the target punctuation marks corresponding to the optimal second semantic segment in a recursion mode according to the sequence from front to back; the language model corresponding to the optimal second semantic fragment has optimal probability;
and the second optimal result acquisition unit is used for acquiring an optimal punctuation addition result corresponding to the text to be processed according to the target punctuation symbols corresponding to the optimal second semantic segments.
Optionally, the second recursion unit comprises:
the second language model probability determining subunit is used for determining the language model probability corresponding to the current second semantic segment by utilizing the N-element grammar language model and/or the neural network language model;
the second selection subunit is used for selecting the optimal current second semantic fragment from the multiple current second semantic fragments according to the language model probability corresponding to the current second semantic fragment;
a target punctuation mark determining subunit, configured to use punctuation marks included in the optimal current second semantic segment as target punctuation marks corresponding to the optimal current second semantic segment;
and the second semantic segment determining subunit is used for obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
Optionally, the second optimal result obtaining unit includes:
and the adding subunit is configured to add punctuation marks to the global word sequence according to the target punctuation marks corresponding to the optimal second semantic segments in the order from back to front or the order from front to back, so as to obtain an optimal punctuation adding result corresponding to the text to be processed.
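A greedy sketch of this sliding-window variant follows, assuming a toy bigram table in place of the N-gram/neural language model: a fixed-length window of character units (words plus already-chosen marks) moves from front to back, and at each gap the best-scoring mark, or none, is kept, so adjacent second semantic segments overlap.

```python
# Toy bigram log-probabilities standing in for an N-gram or neural LM
# (assumption); unseen pairs get a flat back-off score.
TOY_BIGRAM = {
    ("你好", "，"): -0.5, ("，", "请"): -0.5,
    ("你好", "请"): -4.0, ("请", "坐"): -0.5,
}

def toy_logprob(units):
    # Sum bigram log-probs over adjacent character units (words and marks).
    return sum(TOY_BIGRAM.get(pair, -2.0) for pair in zip(units, units[1:]))

def window_punctuate(words, window=5, marks=("", "，", "。")):
    out = [words[0]]
    for word in words[1:]:
        best_mark, best_score = "", float("-inf")
        for mark in marks:
            candidate = out + ([mark] if mark else []) + [word]
            segment = candidate[-window:]   # fixed-size second semantic segment
            score = toy_logprob(segment)
            if score > best_score:
                best_mark, best_score = mark, score
        if best_mark:                       # keep the winning mark, then advance
            out.append(best_mark)
        out.append(word)
    return "".join(out)

print(window_punctuate(["你好", "请", "坐"]))
```

Because each window reuses the units already emitted, earlier mark decisions constrain later ones, matching the front-to-back recursion of the second recursion unit.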
Optionally, the punctuation addition processing module includes:
the result exhaustion submodule is used for acquiring various punctuation addition results corresponding to the global word sequence;
the language model probability determining submodule is used for determining the language model probability corresponding to the punctuation addition result; and
and the result selection submodule is used for selecting a punctuation addition result with the optimal language model probability from the multiple punctuation addition results corresponding to the global word sequence as the optimal punctuation addition result corresponding to the text to be processed.
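The exhaustion submodule can be sketched directly: enumerate every combination of candidate marks in every gap of the global word sequence, score each complete result, and keep the one with the best language model probability. The bigram scorer and candidate set are illustrative assumptions; note that the enumeration is exponential in the number of gaps, which is what the dynamic programming variants avoid.

```python
from itertools import product

# Toy bigram log-probs standing in for a trained language model (assumption).
TOY_BIGRAM = {
    ("你好", "，"): -0.5, ("，", "我"): -0.5, ("你好", "我"): -4.0,
    ("我", "是"): -0.5, ("是", "小明"): -0.5,
}

def lm_score(units):
    return sum(TOY_BIGRAM.get(pair, -2.0) for pair in zip(units, units[1:]))

def exhaustive_punctuate(words, marks=("", "，")):
    best_seq, best_score = None, float("-inf")
    # Try every candidate mark ("" = no mark) in every gap between words.
    for combo in product(marks, repeat=len(words) - 1):
        seq = [words[0]]
        for mark, word in zip(combo, words[1:]):
            if mark:
                seq.append(mark)
            seq.append(word)
        score = lm_score(seq)
        if score > best_score:
            best_seq, best_score = seq, score
    return "".join(best_seq)

print(exhaustive_punctuate(["你好", "我", "是", "小明"]))
```

With 2 candidate marks and M words this enumerates 2^(M-1) punctuation addition results, so it is only practical for short texts.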
In yet another aspect, an apparatus for processing is disclosed that includes a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
acquiring a text to be processed;
performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises: at least one semantic segment, the semantic segment comprising: consecutive words of the global word sequence and/or consecutive words with added punctuation marks;
and outputting the optimal punctuation addition result.
The embodiment of the invention has the following advantages:
in the embodiment of the present invention, target punctuation marks are added between adjacent words in the global word sequence corresponding to the text to be processed through punctuation addition processing, and the language model probability corresponding to the optimal punctuation addition result obtained through the punctuation addition processing is optimal, where the optimal punctuation addition result may include at least one semantic segment, and a semantic segment may include consecutive words of the global word sequence and/or consecutive words with added punctuation marks. The optimal punctuation addition result of the embodiment of the present invention achieves a globally optimal language model probability, where "global" refers to the punctuation addition result of the text to be processed as a whole; therefore, the optimal punctuation addition result of the embodiment of the present invention can improve the accuracy of punctuation addition.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of a processing method of the present invention;
FIG. 2 is a schematic diagram of a path plan of a global word sequence corresponding to a text to be processed according to an embodiment of the present invention;
FIG. 3 is a block diagram of a processing device according to an embodiment of the present invention;
fig. 4 is a block diagram illustrating an apparatus for information processing as a terminal according to an exemplary embodiment; and
fig. 5 is a block diagram illustrating an apparatus for information processing as a server according to an example embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
An embodiment of the present invention provides a processing scheme that can perform word segmentation on a text to be processed to obtain a corresponding global word sequence, perform punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed, and output the optimal punctuation addition result. In the embodiment of the present invention, the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, where a target punctuation mark represents the optimal candidate punctuation mark added between adjacent words, and the language model probability corresponding to the optimal punctuation addition result obtained by the punctuation addition processing is optimal. The optimal punctuation addition result may include at least one semantic segment, which may include consecutive words of the global word sequence and/or consecutive words with added punctuation marks; the language model probability of a punctuation addition result may be the synthesis of the language model probabilities corresponding to all the semantic segments it contains. The optimal punctuation addition result of the embodiment of the present invention achieves a globally optimal language model probability, where "global" refers to the punctuation addition result of the text to be processed as a whole; therefore, it can improve the accuracy of punctuation addition.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a processing method according to the present invention is shown, which may specifically include the following steps:
Step 101, acquiring a text to be processed;
Step 102, performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
Step 103, performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, where the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result may include: at least one semantic segment, which may include: consecutive words of the global word sequence and/or consecutive words with added punctuation marks;
Step 104, outputting the optimal punctuation addition result.
The embodiment of the present invention can be applied to any application scenario in which punctuation needs to be added, such as speech recognition and machine translation; it can be understood that the embodiment of the present invention does not limit the specific application scenario. For example, in a speech recognition scenario, punctuation marks can be added to the text corresponding to a speech recognition result.
the processing method provided by the embodiment of the invention can be applied to the application environment of computing equipment such as a terminal or a server. Optionally, the terminal may include, but is not limited to: smart phones, tablets, laptop portable computers, in-vehicle computers, desktop computers, smart televisions, wearable devices, and the like. The server can be a cloud server or a common server and is used for providing a processing service of the text to be processed for the client.
The processing method provided by the embodiment of the invention can be suitable for processing Chinese, Japanese, Korean and other languages, and is used for improving the accuracy of punctuation addition. It will be appreciated that any language in which punctuation is desired is within the scope of applicability of the processing method of embodiments of the present invention.
In the embodiment of the present invention, the text to be processed represents a text that needs processing; it may be derived from text or speech input by a user through a computing device, or from other computing devices. It should be noted that the text to be processed may include one language or more than one language; for example, it may include Chinese, or a mixture of Chinese and other languages such as English. The embodiment of the present invention does not limit the specific text to be processed.
In practical applications, the computing device according to the embodiment of the present invention may execute the processing method flow according to the embodiment of the present invention through a client APP (Application), and the client APP may run on the computing device, for example, the client APP may be any APP running on a terminal, and the client APP may obtain a text to be processed from other applications of the computing device. Alternatively, the computing device in the embodiment of the present invention may execute the processing method flow in the embodiment of the present invention through a function device of the client application, and then the function device may obtain the text to be processed from another function device. Alternatively, the computing device of the embodiment of the present invention may be used as a server to execute the processing method of the embodiment of the present invention.
In an alternative embodiment of the present invention, step 101 may obtain the text to be processed according to the voice signal of the speaking user, in this case, step 101 may convert the voice signal of the speaking user into text information, and obtain the text to be processed from the text information. Alternatively, step 101 may directly receive text information corresponding to the voice signal of the user from the voice recognition device, and obtain the text to be processed from the text information. In practical applications, the speaking user may include: a user who speaks and sends a voice signal in the simultaneous interpretation scene, and/or a user who generates a voice signal through a terminal, etc., the voice signal of the speaking user can be received through a microphone or other voice acquisition devices.
Alternatively, speech recognition techniques may be employed to convert the speech signal of the speaking user into text information. If the speech signal of the speaking user is denoted S, processing S yields a corresponding speech feature sequence O, denoted O = {O1, O2, …, Oi, …, OT}, where Oi is the i-th speech feature and T is the total number of speech features. A sentence corresponding to the speech signal S can be regarded as a word string composed of many words, denoted W = {w1, w2, …, wn}. The process of speech recognition is to find the most likely word string W given the known speech feature sequence O.
Specifically, the speech recognition is a model matching process, in which a speech model is first established according to the speech characteristics of a person, and a template required for the speech recognition is established by extracting required features through analysis of an input speech signal; the process of recognizing the voice input by the user is a process of comparing the characteristics of the voice input by the user with the template, and finally determining the best template matched with the voice input by the user so as to obtain a voice recognition result. The specific speech recognition algorithm may adopt a training and recognition algorithm based on a statistical hidden markov model, or may adopt other algorithms such as a training and recognition algorithm based on a neural network, a recognition algorithm based on dynamic time warping matching, and the like.
In another alternative embodiment of the present invention, step 101 may obtain the text to be processed according to the text input by the user. For example, the text input by the user in the scenes of instant messaging, office documents and the like may not contain punctuation marks or contain fewer punctuation marks, and therefore, the text can be used as a source of the text to be processed.
In practical application, step 101 may obtain a text to be processed from a text corresponding to a voice signal or a text input by a user according to practical application requirements. Optionally, the text to be processed may be obtained from the text corresponding to the voice signal S according to the interval time of the voice signal S; for example, when the interval time of the voice signal S is greater than the time threshold, a corresponding first demarcation point may be determined according to the time point, a text corresponding to the voice signal S before the first demarcation point is used as a text to be processed, and a text corresponding to the voice signal S after the first demarcation point is processed to continue to obtain the text to be processed therefrom. Or optionally, the text to be processed may be obtained from the text corresponding to the voice signal or the text input by the user according to the number of words contained in the text corresponding to the voice signal or the text input by the user; for example, when the text corresponding to the voice signal or the text input by the user includes a number of words greater than a word number threshold, the corresponding second demarcation point may be determined according to the word number threshold, the text corresponding to the voice signal S before the second demarcation point may be used as the text to be processed, and the text corresponding to the voice signal S after the second demarcation point may be processed to continue to obtain the text to be processed therefrom. It can be understood that the embodiment of the present invention does not impose any limitation on the specific process of obtaining the text to be processed from the text corresponding to the voice signal or the text input by the user.
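The word-count variant of obtaining the text to be processed can be sketched as follows; the threshold value and the word-list representation are illustrative assumptions:

```python
def split_pending(words, threshold=50):
    """Cut the incoming word stream at second demarcation points: each chunk
    holds at most `threshold` words; the remainder carries over to the next
    round of processing."""
    chunks = []
    while len(words) > threshold:
        chunks.append(words[:threshold])   # text before the demarcation point
        words = words[threshold:]          # text after it is processed next
    if words:
        chunks.append(words)
    return chunks

lengths = [len(c) for c in split_pending(list(range(120)), threshold=50)]
print(lengths)
```

The silence-interval variant is analogous: the first demarcation point is chosen wherever the interval time exceeds the time threshold instead of at a fixed word count.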
Since the corpus used for training the language model is usually a corpus subjected to word segmentation, in order to obtain the language model probability corresponding to the semantic segment included in the optimal punctuation addition result, in the embodiment of the present invention, the word segmentation may be performed on the to-be-processed text in step 102 to obtain the global word sequence corresponding to the to-be-processed text.
Word segmentation is the process of dividing a text into individual words and recombining the continuous text into a word sequence according to a given specification. Taking Chinese word segmentation as an example, the goal is to split a text into individual Chinese words. Segmenting sentences into individual words is the first step in machine understanding of human language, so word segmentation is widely applied in natural language processing tasks such as text-to-speech conversion, machine translation, speech recognition, text summarization, and text retrieval.
In this embodiment of the present invention, step 102 performs word segmentation on the text to be processed; the word segmentation methods that may be adopted include: word segmentation based on character string matching, word segmentation based on understanding, word segmentation based on statistics, and the like. It can be understood that the embodiment of the present invention does not limit the specific word segmentation process. In an application example of the present invention, the text to be processed is "你好我是小明很高兴认识你" ("Hello, I am Xiao Ming, very glad to meet you"), and the corresponding global word sequence may be: "你好/我是/小明/很高兴/认识你".
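The character-string-matching family of segmentation methods can be illustrated with forward maximum matching; the tiny dictionary below is a stand-in for a real lexicon, and the sentence follows the application example in the text (assumed to be the original Chinese of that example).

```python
# Toy lexicon (assumption); real segmenters use large dictionaries
# and statistical models.
DICT = {"你好", "我是", "小明", "很高兴", "认识你"}

def forward_max_match(text, max_len=4):
    """Greedily match the longest dictionary word starting at each position,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in DICT or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print("/".join(forward_max_match("你好我是小明很高兴认识你")))
```

Forward maximum matching is only one member of the string-matching family; understanding-based and statistics-based methods trade this greedy rule for syntactic or probabilistic disambiguation.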
It should be noted that the process of segmenting the text to be processed in step 102 and the speech recognition process may be independent processes, and the process of segmenting the text to be processed in step 102 may not be affected by the speech recognition process, for example, the sentence W corresponding to the speech signal S may be segmented in step 102.
In an optional embodiment of the present invention, the method of the embodiment of the present invention may further include: writing the at least one text to be processed acquired in the step 101 into a cache region; step 102 may first read the text to be processed from the buffer and perform word segmentation on the read text to be processed. Optionally, a data structure such as a queue, an array, or a linked list may be established in a memory area of the computing device as the cache area, and the specific cache area is not limited in the embodiment of the present invention. The above-mentioned manner of storing the text to be processed by using the cache region can improve the processing efficiency of the text to be processed, and it can be understood that a manner of storing the text to be processed by using a disk is also feasible, and the embodiment of the present invention does not limit the specific storage manner of the text to be processed.
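The cache region described above can be sketched with a standard in-memory queue; the FIFO policy and example texts are illustrative assumptions:

```python
from collections import deque

cache = deque()  # cache region established in a memory area of the device

# Step 101 writes each acquired to-be-processed text into the cache region.
cache.append("你好我是小明很高兴认识你")
cache.append("今天天气很好")

# Step 102 reads the oldest pending text first, then segments it.
pending = cache.popleft()
print(pending, len(cache))
```

An array or linked list works equally well, as the text notes; the queue simply makes the read-in-arrival-order behavior explicit.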
In the embodiment of the present invention, multiple candidate punctuation marks can be added between adjacent words in the global word sequence corresponding to the text to be processed; that is, punctuation addition processing can consider every candidate punctuation mark between every pair of adjacent words. One global word sequence therefore corresponds to multiple punctuation addition schemes and corresponding punctuation addition results, and the embodiment of the present invention finally obtains the optimal punctuation addition result with the optimal language model probability. The language model probability of an (arbitrary) punctuation addition result may be the synthesis of the language model probabilities corresponding to all the semantic segments it contains.
In the field of natural language processing, a language model is a probabilistic model built for one or more languages in order to describe the probability distribution of a given global word sequence occurring in a language. In the embodiment of the present invention, that probability is referred to as the language model probability, and the given global word sequence described by the language model may contain punctuation. Optionally, corpus sentences may be obtained from a corpus, word segmentation may be performed on them, and the language model may be trained on the resulting global word sequences containing punctuation. For example, the sentence "我喜欢狗，狗玩球。" ("I like dogs; the dog plays ball.") corresponds to the global word sequence "我/喜欢/狗/，/狗/玩/球/。". It can be understood that the embodiment of the present invention does not impose any limitation on the specific global word sequences used for training the language model.
In the embodiment of the present invention, the language model may include: an N-gram language model and/or a neural network language model, wherein the neural network language model may further include: an RNNLM (Recurrent Neural Network Language Model), a CNNLM (Convolutional Neural Network Language Model), a DNNLM (Deep Neural Network Language Model), and the like.
The N-gram language model is based on the assumption that the occurrence of the Nth word is related only to the preceding N-1 words and not to any other words, so that the probability of a complete sentence is the product of the occurrence probabilities of its words, each conditioned on its preceding N-1 words.
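Under this assumption, the sentence probability decomposes into one conditional probability per word. The following minimal sketch illustrates the computation for N=2 (a bigram model) in log space; the bigram probabilities and the unseen-pair floor are hand-specified illustrative assumptions, whereas a real N-gram model would estimate them from corpus counts with smoothing.

```python
import math

# Hypothetical hand-specified bigram (N=2) log probabilities for illustration only.
BIGRAM_LOGPROB = {
    ("<s>", "i"): math.log(0.6),
    ("i", "like"): math.log(0.4),
    ("like", "dogs"): math.log(0.3),
    ("dogs", "</s>"): math.log(0.5),
}

def sentence_logprob(words, lm=BIGRAM_LOGPROB, floor=math.log(1e-6)):
    """log P(sentence) = sum over i of log P(w_i | w_{i-1}); pairs unseen in
    the table fall back to a small floor probability."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(lm.get((prev, cur), floor) for prev, cur in zip(tokens, tokens[1:]))

score = sentence_logprob(["i", "like", "dogs"])
```

Working in log space replaces the product of probabilities with a sum, which avoids numerical underflow for long sequences.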
Since the N-gram language model predicts the Nth word from a limited number (N-1) of preceding words, the N-gram language model has the capability of describing the language model probability of semantic segments of length N, where N may be a fixed positive integer smaller than a first length threshold, such as 3, 5, etc. One advantage of neural network language models such as the RNNLM over N-gram language models is that the entire preceding context can be fully utilized to predict the next word, so an RNNLM can describe the language model probability of semantic segments of variable length; that is, an RNNLM is suitable for semantic segments within a wide length range, for example, from 1 to a second length threshold, where the second length threshold is greater than the first length threshold.
In this embodiment of the present invention, a semantic segment may be used to represent a portion of a global word sequence to which punctuation has been added, where the semantic segment may include: consecutive words of the global word sequence (i.e., containing no punctuation marks) and/or consecutive words to which punctuation marks have been added. Optionally, a part of the global word sequence may be truncated to obtain the consecutive words. For example, for the global word sequence "hello/I am/Xiaoming/happy/to know you", the corresponding semantic segments may include: "hello/,/I am", "I am/Xiaoming/happy", etc., wherein "/" is a symbol provided for convenience of description to indicate a boundary between words and/or between words and punctuation marks; in practical applications, "/" may not have any meaning.
It should be noted that a person skilled in the art may determine the candidate punctuation marks that need to be added according to actual application requirements. Optionally, the candidate punctuation marks may include: a comma, a question mark, a period, an exclamation mark, a space, and the like, wherein the space may play a word-segmentation role or play no role at all; for example, for English, a space may be used to separate different words, while for Chinese, a space may be a punctuation mark that plays no role.
The embodiment of the invention can provide the following technical solutions for performing punctuation addition processing on the global word sequence to obtain the optimal punctuation addition result corresponding to the text to be processed:
Technical solution 1
Technical solution 1 may include: obtaining multiple punctuation addition results corresponding to the global word sequence; determining the language model probability corresponding to each punctuation addition result; and selecting, from the multiple punctuation addition results corresponding to the global word sequence, the punctuation addition result with the optimal language model probability as the optimal punctuation addition result corresponding to the text to be processed.
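Technical solution 1 can be sketched as a brute-force enumeration. In the following sketch, the candidate punctuation set and the scoring function are illustrative assumptions rather than the concrete implementation of the embodiment; every way of placing one candidate punctuation in each gap between adjacent words is enumerated, and the result with the optimal score is kept:

```python
import itertools

CANDIDATES = [",", " ", "!", "?", "."]  # hypothetical candidate punctuation set

def all_punctuation_results(words, candidates=CANDIDATES):
    """Enumerate every punctuation addition result: one candidate punctuation
    is placed in each of the len(words) - 1 gaps between adjacent words."""
    for choice in itertools.product(candidates, repeat=len(words) - 1):
        result = [words[0]]
        for punct, word in zip(choice, words[1:]):
            result += [punct, word]
        yield result

def best_result(words, score_fn, candidates=CANDIDATES):
    """Select the punctuation addition result with the optimal (highest) score;
    score_fn stands in for the language model probability."""
    return max(all_punctuation_results(words, candidates), key=score_fn)
```

The number of enumerated results is |candidates|**(M-1) for M words, which is what motivates the dynamic programming of technical solution 2.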
In practical application, a path planning algorithm may be adopted to obtain the multiple punctuation addition results corresponding to the global word sequence. The principle of a path planning algorithm is to find, in an environment with obstacles, a collision-free path from an initial state to a target state according to a certain evaluation criterion. Specifically, in the embodiment of the present invention, the obstacles may represent the candidate punctuation marks added between adjacent words of the global word sequence corresponding to the text to be processed, and the initial state and the target state may respectively represent the punctuation positions after the first word and after the last word of the global word sequence.
Referring to fig. 2, a schematic diagram illustrating path planning for a global word sequence corresponding to a text to be processed according to an embodiment of the present invention is shown. The global word sequence corresponding to the text to be processed is "hello/I am/Xiaoming/happy/to know you", and candidate punctuations may be added between adjacent words of this sequence; in fig. 2, the words "hello", "I am", "Xiaoming", "happy", and "to know you" are respectively represented by rectangles, and punctuation marks such as the comma, space, exclamation mark, question mark, and period are respectively represented by circles; multiple paths can then be provided between the punctuation positions after the first word "hello" and after the last word "to know you" of the global word sequence.
It can be understood that the path planning algorithm is only an optional embodiment of the present invention; in fact, a person skilled in the art may obtain the multiple punctuation addition results corresponding to the text to be processed by using other algorithms according to actual application requirements, and the embodiment of the present invention does not limit the specific algorithm for obtaining the multiple punctuation addition results.
In practical applications, a language model probability corresponding to the punctuation addition result may be determined by using a language model, and the corresponding language model may include: an N-gram language model, and/or a neural network language model, etc.
In an optional embodiment of the present invention, the determining the language model probability corresponding to the punctuation addition result may include: determining, for each third semantic segment contained in each punctuation addition result, a corresponding language model probability; and fusing the language model probabilities corresponding to all third semantic segments contained in each punctuation addition result to obtain the language model probability of that result; the punctuation addition result with the highest language model probability can then be obtained from all punctuation addition results and used as the optimal punctuation addition result corresponding to the text to be processed.
Optionally, the third semantic segments may be obtained from the punctuation addition result by moving a window in front-to-back order; different third semantic segments may contain the same number of character units, and adjacent third semantic segments may have repeated character units, where a character unit may be a word or a punctuation mark. In this case, the language model probability corresponding to a third semantic segment can be determined by the N-gram language model and/or the neural network language model. Assuming that N is 5 and the number of the first character unit is 1, third semantic segments of length 5 may be obtained from the punctuation addition result in the numbering order 1-5, 2-6, 3-7, 4-8, and so on, and the language model probability corresponding to each third semantic segment may be determined by the N-gram language model; for example, if each third semantic segment is input into the N-gram model, the N-gram model can output the corresponding language model probability.
Optionally, the process of fusing the language model probabilities corresponding to all the third semantic segments included in each punctuation addition result may include: summing, multiplying, or weighted-averaging the language model probabilities corresponding to all the third semantic segments included in each punctuation addition result, and the like. It can be understood that the embodiment of the present invention does not limit the specific fusion process.
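The window extraction and fusion described above can be sketched as follows; the per-segment scorer is a hypothetical stand-in for the N-gram model, and summation of log probabilities (equivalent to multiplying probabilities) is used as one of the fusion options mentioned:

```python
def third_segments(units, n=5, stride=1):
    """Slide a length-n window front to back over the character units (words
    and/or punctuation) of one punctuation addition result; adjacent segments
    overlap whenever stride < n."""
    return [units[i:i + n] for i in range(0, len(units) - n + 1, stride)]

def fused_score(units, segment_logprob, n=5):
    """Fuse the per-segment language model probabilities by summing log
    probabilities; segment_logprob stands in for the N-gram model."""
    return sum(segment_logprob(seg) for seg in third_segments(units, n))
```

With stride 1 and N=5 this yields exactly the 1-5, 2-6, 3-7, ... numbering described above.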
In another optional embodiment of the present invention, the determining the language model probability corresponding to the punctuation addition result may include: determining, by using a neural network language model, the language model probability corresponding to all semantic segments of each punctuation addition result; the punctuation addition result with the highest language model probability can then be obtained from all punctuation addition results and used as the optimal punctuation addition result corresponding to the text to be processed. Because an RNNLM is suitable for semantic segments within a wide length range, all semantic segments of each punctuation addition result can be taken as a whole, and the RNNLM determines the language model probability corresponding to the whole punctuation addition result; for example, if all character units included in the punctuation addition result are input into the RNNLM, the RNNLM can output the corresponding language model probability.
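Whole-sequence scoring in the spirit of an RNNLM can be sketched via the chain rule, with each character unit conditioned on the full history rather than a fixed-length window. Here `next_logprob` is a hypothetical stand-in for a trained recurrent model, not a real RNNLM:

```python
import math

def sequence_logprob(units, next_logprob):
    """Score an entire punctuation addition result as one sequence: accumulate
    log P(token | full history) over all character units.  next_logprob is a
    hypothetical callable standing in for a trained recurrent model."""
    total, history = 0.0, []
    for token in units:
        total += next_logprob(tuple(history), token)
        history.append(token)
    return total

# Hypothetical stand-in model: uniform over a 10-token vocabulary.
uniform = lambda history, token: math.log(0.1)
```

A real RNNLM would carry the history in its hidden state instead of an explicit list, but the scoring interface is the same: the whole sequence in, one language model probability out.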
Technical solution 2
Technical solution 2 may include: performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain the optimal punctuation addition result corresponding to the text to be processed.
The principle of the dynamic programming algorithm is to split the problem and define the problem states and the relationships between states so that the problem can be solved in a recursive (or divide-and-conquer) manner. Specifically, in the embodiment of the present invention, the problem may be: searching for the optimal punctuation addition result corresponding to the text to be processed; a state may be the result of decomposing the punctuation addition processing of the global word sequence, namely the optimal punctuation addition result corresponding to a local part of the text to be processed and the target punctuation marks corresponding to that optimal punctuation addition result, where a target punctuation mark can be used to represent the best candidate punctuation to be added between adjacent words. Compared with technical solution 1, which exhausts the multiple punctuation addition results corresponding to the global word sequence and selects the one with the optimal language model probability, the dynamic programming algorithm adopted by technical solution 2 can reduce the amount of computation, and the reduction grows with the length of the global word sequence corresponding to the text to be processed.
The embodiment of the present invention may provide the following dynamic programming schemes for performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain the optimal punctuation addition result corresponding to the text to be processed:
Dynamic programming scheme 1
In the dynamic programming scheme 1, the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain an optimal punctuation addition result corresponding to the text to be processed may specifically include:
acquiring a word sequence set corresponding to the global word sequence;
determining, in a recursive manner and in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset, wherein the optimal subset punctuation addition result has the optimal language model probability;
and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set.
The word sequence set may be used to represent a set of word sequences comprising consecutive words contained in the global word sequence, and optionally a subset of the word sequence set may comprise the first i consecutive words of the global word sequence. For example, the word sequence set corresponding to the global word sequence [C1 C2 … CM] may include: {C1, C1C2, C1C2C3, …, C1C2…CM}, and the subsets included in the word sequence set can be represented, in order of subset length (i.e., the number of words included in the subset) from small to large, as: {C1}, {C1 C2}, {C1 C2 C3} … {C1 C2 … CM}, where Ci is used to represent the ith word contained in the text to be processed, i is a positive integer greater than 0, M represents the number of words of the text to be processed (namely the length of the global word sequence), and M is a positive integer. It will be appreciated that the length difference of 1 between adjacent subsets in the word sequence set corresponding to the above global word sequence [C1 C2 … CM] is only an alternative embodiment; in fact, the length difference between adjacent subsets of the corresponding word sequence set may also be greater than 1.
For each subset of the word sequence set, each corresponding subset punctuation addition result corresponds to a language model probability, so the embodiment of the invention can determine the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset; the target punctuation marks of the optimal subset punctuation addition result can be used to represent which punctuation marks divide adjacent words when the subset punctuation addition result corresponding to the subset is optimal. Suppose that the optimal subset punctuation addition result corresponding to a subset {C1 C2 C3} is {(C1), (C2 C3)}, indicating that the adjacent words C1 and C2 in the subset {C1 C2 C3} are separated by a comma, and the adjacent words C2 and C3 in the subset {C1 C2 C3} are separated by a space; the corresponding target punctuation marks can then be represented as: "comma after C1, space after C2". It should be understood that the embodiment of the present invention does not limit the specific representation manner of the target punctuation marks.
The embodiment of the invention can determine, in a recursive manner and in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset. Suppose the subsets, in order from small to large, are represented as: G1, G2, G3 … Gu, where u is a positive integer; the target punctuation marks of the optimal subset punctuation addition results corresponding to G1, G2, G3 … Gu can then be obtained in sequence. Moreover, for Go (1 ≤ o ≤ u), the optimal subset punctuation addition results of the subsets before Go (e.g., Go-1, Go-2, etc.) are needed to determine the target punctuation marks of the optimal subset punctuation addition result corresponding to Go; in particular, Go may reuse the optimal subset punctuation addition result of a subset before Go. For example, for the subset {C1 C2 C3 C4}, the punctuation addition processing between the first 3 consecutive words can reuse the optimal subset punctuation addition result of the subset {C1 C2 C3}.
In an optional embodiment of the present invention, a subset of the word sequence set may comprise: the first i consecutive words of the global word sequence, where 0 < i ≤ M and M is the number of words included in the text to be processed. The determining, in a recursive manner and in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset may specifically include:
adding punctuation marks between adjacent words in the first i continuous words according to target punctuation marks of the punctuation addition result of the optimal subset corresponding to the first k continuous words so as to obtain at least one subset punctuation addition path corresponding to the first i continuous words; wherein 0< k < i, k being a positive integer;
determining, by using a neural network language model, the language model probability of the first semantic segment corresponding to each subset punctuation addition path;
selecting an optimal subset punctuation adding path with optimal language model probability from the at least one subset punctuation adding path according to the language model probability of the first semantic segment;
and obtaining target punctuations of the punctuation addition results of the optimal subset corresponding to the first i continuous words according to punctuation marks contained in the punctuation addition path of the optimal subset.
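The steps above can be sketched as follows. This is a simplified sketch under the assumption that each subset extends the recorded optimal result of the immediately preceding subset, and `score_fn` is a hypothetical stand-in for the neural network language model probability of the first semantic segment:

```python
def dp_punctuate(words, candidates, score_fn):
    """For each prefix (the first i consecutive words), extend the recorded
    optimal subset punctuation addition result of the previous prefix with each
    candidate punctuation, score the resulting subset punctuation addition
    paths, and keep the optimal one.  Only |candidates| paths are scored per
    gap instead of |candidates|**(i-1) full enumerations."""
    best = [words[0]]                          # optimal result for the first word
    for word in words[1:]:
        paths = [best + [punct, word] for punct in candidates]
        best = max(paths, key=score_fn)        # optimal subset punctuation addition path
    return best
```

For M words and c candidates this performs on the order of M·c scorings, versus c**(M-1) results in the exhaustive enumeration of technical solution 1.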
The subset punctuation addition path can be used to represent a path whose initial state is the first word of the subset and whose target state is the punctuation position after the last word of the subset. Optionally, punctuation marks are added between the adjacent words from the (k+1)th word to the ith word according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k consecutive words, so that at least one subset punctuation addition path corresponding to the first i consecutive words can be obtained. Each subset punctuation addition path may correspond to a first semantic segment, and the first semantic segment may be used to represent the punctuation addition result corresponding to the first i consecutive words.
Since an RNNLM is suitable for semantic segments within a wide length range (for example, from 1 to a second length threshold), for any i with 0 < i ≤ M, where M is the number of words contained in the text to be processed, the embodiment of the present invention may determine the language model probability of the first semantic segment corresponding to each subset punctuation addition path by using a neural network language model.
Since multiple candidate punctuations can be added between a pair of adjacent words of the first i consecutive words, under normal circumstances the number of subset punctuation addition paths corresponding to the first i consecutive words is greater than 1. Therefore, in the embodiment of the present invention, the optimal subset punctuation addition path with the optimal language model probability is selected from the at least one subset punctuation addition path according to the language model probability of the first semantic segment, and the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i consecutive words are obtained according to the punctuation marks contained in the optimal subset punctuation addition path. Optionally, according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i consecutive words, punctuation marks may then be added between adjacent words in the first j consecutive words, so as to obtain at least one subset punctuation addition path corresponding to the first j consecutive words, where j > i and j is a positive integer.
Optionally, the obtaining the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i consecutive words according to the punctuation marks included in the optimal subset punctuation addition path may include: recording the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset; or recording the mapping relationship between the information of each subset and the target punctuation marks of the optimal subset punctuation addition result corresponding to that subset, so as to obtain corresponding recorded content. The information of a subset may include: the number information of the end word corresponding to the subset, and/or the number information corresponding to the subset, etc. For example, for the first i consecutive words, the corresponding number information may be i, and the number information of the end word may correspond to the last word, i.e., the ith word. It is to be understood that the embodiment of the present invention does not impose limitations on the specific information of the subsets. In the process of recording the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset, all target punctuation marks of each subset's optimal subset punctuation addition result can be recorded, or only the partial target punctuation marks that differ from those of the adjacent previous subset can be recorded.
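The recording option described above can be illustrated as follows. This sketch keys the mapping by the subset's number information i, stores only the newly decided target punctuation (the part differing from the previous subset), and rebuilds the final result from the record; `score_fn` is again a hypothetical stand-in for the language model probability:

```python
def record_targets(words, candidates, score_fn):
    """Record, for each subset (the first i consecutive words), only the target
    punctuation that differs from the adjacent previous subset, keyed by the
    subset's number information i; return the mapping and the result rebuilt
    from it."""
    targets = {}                   # i -> punctuation placed before the i-th word
    best = [words[0]]
    for i, word in enumerate(words[1:], start=1):
        paths = {p: best + [p, word] for p in candidates}
        chosen = max(paths, key=lambda p: score_fn(paths[p]))
        targets[i] = chosen
        best = paths[chosen]
    rebuilt = [words[0]]           # reconstruct the optimal result from the record
    for i, word in enumerate(words[1:], start=1):
        rebuilt += [targets[i], word]
    return targets, rebuilt
```

Recording only the incremental target punctuation keeps the record size linear in the number of gaps, while still allowing the full optimal punctuation addition result to be reconstructed.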
In an optional embodiment of the present invention, the obtaining, according to the optimal subset punctuation addition result corresponding to the subset of the word sequence set, the optimal punctuation addition result corresponding to the text to be processed may specifically include:
taking the optimal subset punctuation addition result corresponding to the maximum subset of the word sequence set as the optimal punctuation addition result corresponding to the text to be processed; and/or
Performing sentence breaking on the word sequence according to all target punctuations of the optimal subset punctuation addition result corresponding to the maximum subset of the word sequence set so as to obtain the optimal punctuation addition result corresponding to the text to be processed; and/or
And carrying out sentence breaking on the word sequence according to part of target punctuations of the punctuation addition result of the optimal subset corresponding to each subset of the word sequence set so as to obtain the optimal punctuation addition result corresponding to the text to be processed.
In summary, dynamic programming scheme 1 determines, in a recursive manner and in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset, and obtains the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set. Since the optimal subset punctuation addition result has the optimal language model probability, a later subset, which covers an earlier subset, can reuse the target punctuation marks of the optimal subset punctuation addition result corresponding to the earlier subset, so the computation required to obtain the optimal punctuation addition result can be reduced in this recursive manner; moreover, the subsets gradually cover, from small to large, the semantic segments contained in the word sequence, so the optimal language model probability of the semantic segments contained in the word sequence can be achieved gradually.
Dynamic programming scheme 2
In the dynamic programming scheme 2, the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain an optimal punctuation addition result corresponding to the text to be processed may specifically include:
adding punctuation marks between adjacent words in the global word sequence to obtain a global punctuation adding path corresponding to the global word sequence;
according to the sequence from front to back, a local punctuation adding path and a second semantic fragment corresponding to the local punctuation adding path are obtained from the global punctuation adding path in a moving mode; the number of character units contained in different second semantic fragments is the same, and the adjacent second semantic fragments have repeated character units, wherein the character units include: words and/or punctuation;
determining, in front-to-back order and in a recursive manner, the target punctuation marks corresponding to each optimal second semantic segment, wherein the optimal second semantic segment has the optimal language model probability;
and obtaining an optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments.
Dynamic programming scheme 2 obtains, in front-to-back order and in a moving manner, second semantic segments that have the same length (contain the same number of character units) and overlap from the global punctuation addition path, and determines, in front-to-back order and in a recursive manner, the target punctuation marks corresponding to each optimal second semantic segment. For the process of obtaining the global punctuation addition path, reference may be made to fig. 2; the embodiment of the present invention does not limit the specific process of obtaining the global punctuation addition path. A local punctuation addition path may be used to represent a portion of the global punctuation addition path, and each local punctuation addition path may correspond to a second semantic segment.
In practical applications, the language model probability corresponding to a second semantic segment can be determined by the N-gram language model. Assuming that N is 5, the length of the second semantic segment may be 5; assuming that the number of the first character unit of the word sequence is 1, second semantic segments of length 5 may be obtained from the punctuation addition result in the numbering order 1-5, 2-6, 3-7, 4-8, and so on, and the language model probability corresponding to each second semantic segment may be determined by the N-gram language model; for example, if a second semantic segment is input into the N-gram model, the N-gram model can output the corresponding language model probability. Of course, the language model probability corresponding to the second semantic segment may also be determined by a neural network language model (e.g., a recurrent neural network language model); the embodiment of the present invention does not limit the specific determination process of the language model probability corresponding to the second semantic segment. It is understood that a moving distance of 1 between adjacent second semantic segments is only an example; in fact, a person skilled in the art may determine the moving distance between adjacent second semantic segments according to actual application requirements, for example, the moving distance may also be 2, 3, etc.
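The sliding-window decision of scheme 2 can be sketched as follows. This is a simplification: each gap's punctuation is decided in front-to-back order, earlier target punctuation marks are reused, and each decision scores only the trailing window of n character units; `window_logprob` is a hypothetical window scorer standing in for the N-gram language model:

```python
def window_punctuate(words, candidates, window_logprob, n=5):
    """Decide each gap's punctuation in front-to-back order.  The target
    punctuation marks already fixed in earlier windows are reused, and each
    new decision scores only the length-n second semantic segment that ends
    at the newly appended word."""
    units = [words[0]]
    for word in words[1:]:
        def local_score(punct):
            extended = units + [punct, word]
            return window_logprob(extended[-n:])   # only the trailing window matters
        best = max(candidates, key=local_score)
        units += [best, word]
    return units
```

Because only a bounded window is scored per decision, the work per gap is constant in the sequence length, which is the source of the computation savings claimed for this scheme.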
In an optional embodiment of the present invention, the determining, in order from front to back, the target punctuation mark corresponding to the optimal second semantic segment in a recursive manner specifically includes:
determining the language model probability corresponding to the current second semantic segment by using an N-gram language model and/or a neural network language model;
selecting an optimal current second semantic fragment from the multiple current second semantic fragments according to the language model probability corresponding to the current second semantic fragment;
punctuation marks contained in the optimal current second semantic segment are used as target punctuation marks corresponding to the optimal current second semantic segment;
and obtaining a next second semantic fragment according to the target punctuation marks corresponding to the optimal current second semantic fragment.
The current second semantic segment can be used to represent the second semantic segment corresponding to a local punctuation addition path in the recursion process. If the number of the current second semantic segment is k, where k is a positive integer, the language model probability corresponding to the kth second semantic segment can be determined by using the N-gram language model and/or the neural network language model, the optimal kth second semantic segment with the optimal language model probability is selected from the multiple kth second semantic segments, and the punctuation marks contained in the optimal kth second semantic segment are used as the corresponding target punctuation marks; the (k+1)th second semantic segment is then obtained according to the target punctuation marks corresponding to the optimal kth second semantic segment, and the (k+1)th second semantic segment can reuse those target punctuation marks. Taking fig. 2 as an example, assuming that the length of the second semantic segment is 5 and the optimal 1st second semantic segment is "hello/,/I am/space/Xiaoming", the 2nd second semantic segment "punctuation/I am/punctuation/Xiaoming/punctuation" may reuse the target punctuation marks corresponding to the optimal 1st second semantic segment; thus, on the basis of ",/I am/space/Xiaoming", a punctuation mark can be added after "Xiaoming" in the 2nd second semantic segment, so that the optimal punctuation mark can be selected from the multiple candidate punctuation marks after "Xiaoming".
In practical applications, the obtaining an optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic fragments may specifically include: and adding punctuation marks to the global word sequence according to the sequence from back to front or the sequence from front to back and the target punctuation marks corresponding to the optimal second semantic segments so as to obtain an optimal punctuation adding result corresponding to the text to be processed. That is, the target punctuations corresponding to the positions (between adjacent words) of the punctuations of the global punctuation adding path may be determined according to a certain sequence, and the optimal punctuation adding result corresponding to the text to be processed may be obtained according to the target punctuations.
In summary, dynamic programming scheme 2 obtains the local punctuation addition paths and the corresponding second semantic segments from the global punctuation addition path in a moving manner and in front-to-back order, and determines, in a recursive manner and in front-to-back order, the target punctuation marks corresponding to each optimal second semantic segment. Because adjacent second semantic segments have repeated character units, the next second semantic segment can reuse the target punctuation marks corresponding to the optimal current second semantic segment, so the computation required to obtain the optimal punctuation addition result can be reduced in this recursive manner; in addition, because there is a moving distance between adjacent second semantic segments, the embodiment of the invention can achieve the optimization of the language model probability corresponding to all second semantic segments through the optimization of the language model probability of each second semantic segment.
Step 104 may output the optimal punctuation addition result obtained in step 103. It can be understood that, a person skilled in the art can output the optimal punctuation addition result obtained in step 103 according to the actual application requirements. For example, the optimal punctuation addition result obtained in step 103 may be displayed on a display device of the current computing device; for another example, the optimal punctuation addition result obtained in step 103 may be sent by the current computing device to other computing devices, for example, when the current computing device is a server, the other computing devices may be clients, other servers, or the like.
To sum up, in the processing method of the embodiment of the present invention, target punctuation marks are added between adjacent words in the global word sequence corresponding to the text to be processed through punctuation addition processing, and the language model probability corresponding to the optimal punctuation addition result obtained through the punctuation addition processing is optimal, where the optimal punctuation addition result may include: at least one semantic fragment, and the semantic fragment may include: consecutive words of the global word sequence and/or consecutive words to which punctuation marks have been added. Since the language model probability can be a synthesis of the language model probabilities corresponding to all the semantic fragments contained in the optimal punctuation addition result, the optimal punctuation addition result can achieve global optimization of the language model probability, where "global" refers to the entirety of the punctuation addition result corresponding to the text to be processed; the optimal punctuation addition result of the embodiment of the present invention can therefore improve the accuracy of punctuation addition.
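The "synthesis" of per-fragment language model probabilities is typically a product, which in practice is computed as a sum of log-probabilities to avoid floating-point underflow. A minimal sketch, using made-up fragment probabilities:

```python
import math

def combine(fragment_probs):
    """Global score of a punctuation addition result as the product of
    the language-model probabilities of its semantic fragments,
    accumulated in log space to avoid underflow."""
    return sum(math.log(p) for p in fragment_probs)

# Comparing two candidate punctuation addition results by the
# (hypothetical) probabilities of their semantic fragments:
better = combine([0.5, 0.25]) > combine([0.4, 0.2])
```

Because `log` is monotonic, ranking candidates by summed log-probabilities gives the same winner as ranking by the product of probabilities.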
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 3, a block diagram of a processing apparatus according to an embodiment of the present invention is shown, which may specifically include: a text to be processed acquisition module 301, a word segmentation module 302, a punctuation addition processing module 303 and a result output module 304.
The to-be-processed text acquisition module 301 is configured to acquire a to-be-processed text;
a word segmentation module 302, configured to perform word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
a punctuation addition processing module 303, configured to perform punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation adding process adds target punctuation marks between adjacent words in the global word sequence, where the language model probability corresponding to the optimal punctuation adding result is optimal, and the optimal punctuation adding result may include: at least one semantic segment, which may include: continuous words of the global word sequence and/or continuous words added with punctuation marks; and
and a result output module 304, configured to output the optimal punctuation addition result.
Optionally, the punctuation addition processing module 303 may include:
and the dynamic programming processing submodule is used for performing punctuation addition processing on the global word sequence by utilizing a dynamic programming algorithm so as to obtain an optimal punctuation addition result corresponding to the text to be processed.
Optionally, the dynamic programming processing sub-module may include:
the set acquisition unit is used for acquiring a word sequence set corresponding to the global word sequence;
a first recursion unit, configured to determine, in a recursive manner, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset, in order of the subsets of the word sequence set from small to large; the language model probability corresponding to the optimal subset punctuation addition result is optimal;
and the first optimal result acquisition unit is used for acquiring an optimal punctuation addition result corresponding to the text to be processed according to the punctuation addition result of the optimal subset corresponding to the subset of the word sequence set.
Optionally, a subset of the word sequence set may comprise: the first i consecutive words of the text to be processed, where 0 < i ≤ M, M being the number of words contained in the text to be processed; the first recursion unit may include:
an adding subunit, configured to add punctuation marks between adjacent words in the first i consecutive words according to a target punctuation mark of an optimal subset punctuation addition result corresponding to the first k consecutive words, so as to obtain at least one subset punctuation addition path corresponding to the first i consecutive words; wherein 0< k < i, k being a positive integer;
the first language model probability determining subunit is used for determining the language model probability of the first semantic segment corresponding to the subset punctuation adding path by utilizing a neural network language model;
the first selection subunit is used for selecting an optimal subset punctuation adding path with optimal language model probability from the at least one subset punctuation adding path according to the language model probability of the first semantic segment;
and the target punctuation mark obtaining subunit is configured to obtain, according to punctuation marks included in the optimal subset punctuation adding path, target punctuation marks of the optimal subset punctuation adding result corresponding to the first i consecutive words.
Optionally, the dynamic programming processing sub-module may include:
a global path obtaining unit, configured to add punctuation marks between adjacent words in the global word sequence to obtain a global punctuation adding path corresponding to the global word sequence;
the moving acquisition unit is used for acquiring, in a moving manner and in front-to-back order, a local punctuation addition path and a second semantic fragment corresponding to the local punctuation addition path from the global punctuation addition path; the number of character units included in different second semantic fragments is the same, and adjacent second semantic fragments have repeated character units, where the character units may include: words and/or punctuation;
the second recursion unit is used for determining, in a recursive manner and in front-to-back order, the target punctuation marks corresponding to the optimal second semantic fragments; the language model probability corresponding to the optimal second semantic fragment is optimal;
and the second optimal result acquisition unit is used for acquiring an optimal punctuation addition result corresponding to the text to be processed according to the target punctuation symbols corresponding to the optimal second semantic segments.
Optionally, the second recursion unit may include:
the second language model probability determining subunit is used for determining the language model probability corresponding to the current second semantic segment by utilizing the N-element grammar language model and/or the neural network language model;
the second selection subunit is used for selecting the optimal current second semantic fragment from the multiple current second semantic fragments according to the language model probability corresponding to the current second semantic fragment;
a target punctuation mark determination subunit, configured to use punctuation marks included in the optimal current second semantic fragment as target punctuation marks corresponding to the optimal current second semantic fragment;
and the second semantic segment determining subunit is used for obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
Optionally, the second optimal result obtaining unit may include:
and the adding subunit is configured to add punctuation marks to the global word sequence according to the target punctuation marks corresponding to the optimal second semantic segments in the order from back to front or the order from front to back, so as to obtain an optimal punctuation adding result corresponding to the text to be processed.
Optionally, the punctuation addition processing module 303 may include:
the result exhaustion submodule is used for acquiring various punctuation addition results corresponding to the global word sequence;
the language model probability determining submodule is used for determining the language model probability corresponding to the punctuation addition result; and
and the result selection sub-module is used for selecting a punctuation addition result with the optimal language model probability from the multiple punctuation addition results corresponding to the global word sequence as the optimal punctuation addition result corresponding to the text to be processed.
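The exhaustive variant handled by these sub-modules can be sketched as follows: enumerate every punctuation assignment over the slots of the global word sequence, score each result, and keep the best. This is a minimal illustration with a hypothetical toy `lm_score` standing in for the language model; its exponential cost (|candidates|^slots) is exactly what the dynamic programming schemes avoid.

```python
from itertools import product

CANDIDATES = ["", ",", "."]  # "" means: no punctuation at this slot

def lm_score(text):
    # Hypothetical toy stand-in for the language model probability.
    s = 0.0
    if text.startswith("hello ,"):
        s += 1.0                            # reward a comma after the greeting
    if text.endswith("."):
        s += 1.0                            # reward a sentence-final period
    s -= text[:-1].count(".")               # penalise non-final periods
    s -= 0.5 * max(0, text.count(",") - 1)  # penalise extra commas
    return s

def best_by_enumeration(words):
    """Try every punctuation addition result for the global word
    sequence and return the one with the optimal score."""
    best_text, best_score = "", float("-inf")
    for combo in product(CANDIDATES, repeat=len(words)):
        tokens = []
        for word, p in zip(words, combo):
            tokens.append(word)
            if p:
                tokens.append(p)
        text = " ".join(tokens)
        score = lm_score(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```

For four words and three candidates this already scores 3^4 = 81 complete results, which is why the exhaustive scheme only suits short texts.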
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating an apparatus for information processing as a terminal according to an exemplary embodiment. For example, the terminal 900 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, terminal 900 can include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
Processing component 902 generally controls overall operation of terminal 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
Memory 904 is configured to store various types of data to support operation at terminal 900. Examples of such data include instructions for any application or method operating on terminal 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile and non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power components 906 provide power to the various components of the terminal 900. The power components 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal 900.
The multimedia components 908 include a screen providing an output interface between the terminal 900 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the terminal 900 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when terminal 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
Sensor component 914 includes one or more sensors for providing various aspects of state evaluation for terminal 900. For example, sensor assembly 914 can detect an open/closed state of terminal 900, a relative positioning of components, such as a display and keypad of terminal 900, a change in position of terminal 900 or a component of terminal 900, the presence or absence of user contact with terminal 900, an orientation or acceleration/deceleration of terminal 900, and a change in temperature of terminal 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communication between the terminal 900 and other devices in a wired or wireless manner. Terminal 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as memory 904 comprising instructions, executable by processor 920 of terminal 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 5 is a block diagram illustrating an apparatus for information processing as a server according to an example embodiment. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a sequence of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided that includes instructions, such as the memory 1932 that includes instructions executable by the processor 1922 of the server 1900 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a method of processing, the method comprising: acquiring a text to be processed; performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed; performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises: at least one semantic segment, the semantic segment comprising: continuous words of the global word sequence and/or continuous words added with punctuation marks; and outputting the optimal punctuation addition result.
Optionally, the performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed includes: and performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain an optimal punctuation addition result corresponding to the text to be processed.
Optionally, the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain an optimal punctuation addition result corresponding to the text to be processed includes: acquiring a word sequence set corresponding to the global word sequence; determining, in a recursive manner and in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset; the language model probability corresponding to the optimal subset punctuation addition result is optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition result corresponding to the subsets of the word sequence set.
Optionally, the subsets of the word sequence set comprise: the first i consecutive words of the text to be processed, where 0 < i ≤ M, M being the number of words contained in the text to be processed; the determining, in a recursive manner and in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset includes: adding punctuation marks between adjacent words in the first i consecutive words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k consecutive words, so as to obtain at least one subset punctuation addition path corresponding to the first i consecutive words; wherein 0 < k < i, k being a positive integer; determining, by using a neural network language model, the language model probability of the first semantic fragment corresponding to the subset punctuation addition path; selecting, according to the language model probability of the first semantic fragment, an optimal subset punctuation addition path with the optimal language model probability from the at least one subset punctuation addition path; and obtaining the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i consecutive words according to the punctuation marks contained in the optimal subset punctuation addition path.
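Under the same toy assumptions as before (a made-up `lm_score` standing in for the neural network language model), the recursion over subsets can be sketched as a dynamic program over prefixes: the best punctuated text for the first i words is extended, per candidate punctuation at the newest slot, from stored results for shorter prefixes instead of re-enumerating them.

```python
CANDIDATES = ["", ",", "."]  # "" means: no punctuation at this slot

def lm_score(text, final):
    # Hypothetical toy stand-in for the neural network language model.
    s = 0.0
    if text.startswith("hello ,"):
        s += 1.0                            # reward a comma after the greeting
    periods = text.count(".")
    if final and text.endswith("."):
        s += 1.0                            # reward a sentence-final period
        periods -= 1
    s -= periods                            # penalise mid-sentence periods
    s -= 0.5 * max(0, text.count(",") - 1)  # penalise extra commas
    return s

def best_by_prefix_dp(words):
    """dp maps the punctuation chosen at the newest slot to the
    best-scoring punctuated prefix ending with that choice; each prefix
    result reuses the stored results for shorter prefixes."""
    dp = {None: ([], 0.0)}  # last punctuation -> (tokens, score)
    for i, word in enumerate(words):
        final = (i == len(words) - 1)
        nxt = {}
        for tokens, _ in dp.values():
            for p in CANDIDATES:
                cand = tokens + [word] + ([p] if p else [])
                score = lm_score(" ".join(cand), final)
                if p not in nxt or score > nxt[p][1]:
                    nxt[p] = (cand, score)
        dp = nxt
    tokens, _ = max(dp.values(), key=lambda v: v[1])
    return " ".join(tokens)
```

On the toy scorer this recovers the same result as exhaustive enumeration while keeping only a constant number of candidate prefixes per step.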
Optionally, the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain an optimal punctuation addition result corresponding to the text to be processed includes: adding punctuation marks between adjacent words in the global word sequence to obtain a global punctuation addition path corresponding to the global word sequence; acquiring, in a moving manner and in front-to-back order, a local punctuation addition path and a second semantic fragment corresponding to the local punctuation addition path from the global punctuation addition path; the number of character units contained in different second semantic fragments is the same, and adjacent second semantic fragments have repeated character units, wherein the character units comprise: words and/or punctuation; determining, in a recursive manner and in front-to-back order, the target punctuation marks corresponding to the optimal second semantic fragment; the language model probability corresponding to the optimal second semantic fragment is optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic fragments.
Optionally, the determining, in a recursive manner, the target punctuation marks corresponding to the optimal second semantic segment according to the sequence from front to back includes: determining the language model probability corresponding to the current second semantic fragment by using an N-element grammar language model and/or a neural network language model; selecting an optimal current second semantic fragment from the multiple current second semantic fragments according to the language model probability corresponding to the current second semantic fragment; taking punctuation marks contained in the optimal current second semantic segment as target punctuation marks corresponding to the optimal current second semantic segment; and obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
Optionally, the obtaining an optimal punctuation addition result corresponding to the text to be processed according to the target punctuation symbols corresponding to the optimal second semantic fragments includes: adding punctuation marks to the global word sequence according to the sequence from back to front or the sequence from front to back and the target punctuation marks corresponding to the optimal second semantic segments so as to obtain the optimal punctuation adding result corresponding to the text to be processed.
Optionally, the performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed includes: obtaining various punctuation addition results corresponding to the global word sequence; determining the language model probability corresponding to the punctuation addition result; and selecting a punctuation addition result with the optimal language model probability from the multiple punctuation addition results corresponding to the global word sequence as the optimal punctuation addition result corresponding to the text to be processed.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes can be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The processing method, processing apparatus, and apparatus for processing provided by the present invention are described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method and core ideas of the present invention; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the ideas of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (13)

1. A method of processing, comprising:
acquiring a text to be processed;
performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises: at least one semantic segment, the semantic segment comprising: continuous words of the global word sequence and/or continuous words added with punctuation marks;
outputting the optimal punctuation addition result;
the punctuation adding processing on the global word sequence comprises the following steps:
adding punctuation marks between adjacent words in the global word sequence to obtain a global punctuation addition path corresponding to the global word sequence; acquiring, in a moving manner and in front-to-back order, a local punctuation addition path and a second semantic segment corresponding to the local punctuation addition path from the global punctuation addition path; the number of character units contained in different second semantic fragments is the same, and adjacent second semantic fragments have repeated character units, wherein the character units comprise: words and/or punctuation; determining, in a recursive manner and in front-to-back order, the target punctuation marks corresponding to the optimal second semantic segment; the language model probability corresponding to the optimal second semantic fragment is optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic fragments; or
Acquiring a word sequence set corresponding to the global word sequence, the subsets of the word sequence set comprising: the first i consecutive words of the text to be processed, where 0 < i ≤ M, M being the number of words contained in the text to be processed; adding punctuation marks between adjacent words in the first i consecutive words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k consecutive words, so as to obtain at least one subset punctuation addition path corresponding to the first i consecutive words; wherein 0 < k < i, k being a positive integer; determining, by using a neural network language model, the language model probability of the first semantic fragment corresponding to the subset punctuation addition path; selecting, according to the language model probability of the first semantic fragment, an optimal subset punctuation addition path with the optimal language model probability from the at least one subset punctuation addition path; obtaining the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i consecutive words according to the punctuation marks contained in the optimal subset punctuation addition path; the language model probability corresponding to the optimal subset punctuation addition result being optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition result corresponding to the subsets of the word sequence set.
2. The method according to claim 1, wherein the determining, in a recursive manner and in front-to-back order, the target punctuation marks corresponding to the optimal second semantic segment comprises:
determining the language model probability corresponding to the current second semantic fragment by using an N-element grammar language model and/or a neural network language model;
selecting an optimal current second semantic fragment from the multiple current second semantic fragments according to the language model probability corresponding to the current second semantic fragment;
punctuation marks contained in the optimal current second semantic segment are used as target punctuation marks corresponding to the optimal current second semantic segment;
and obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
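The overlapping-window recursion of claim 2 can be sketched as a greedy front-to-back pass. This is illustrative only and not the claimed method itself: `lm_logprob` is a placeholder for the N-gram or neural language model, and the window size and mark set are invented for the example.

```python
MARKS = ["", ","]  # "" = no mark in the gap (illustrative set)

def lm_logprob(tokens):
    # Placeholder for the N-gram / neural LM named in the claim.
    return -0.1 * len(tokens)

def window_punctuate(words, window=3):
    # Score each candidate mark only on the trailing window of character
    # units (words and already-inserted marks), so that successive windows
    # contain the same number of units and overlap, as in the claim.
    out = [words[0]]
    for word in words[1:]:
        scored = []
        for mark in MARKS:
            tail = out[-(window - 1):] + ([mark] if mark else []) + [word]
            scored.append((lm_logprob(tail), mark))
        best_mark = max(scored, key=lambda s: s[0])[1]
        if best_mark:
            out.append(best_mark)
        out.append(word)
    return out
```

Scoring only a fixed-size window keeps the cost per gap constant, at the price of a local rather than global optimum; the prefix recursion of claim 1 trades the other way.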
3. The method according to claim 1, wherein obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments comprises:
adding punctuation marks to the global word sequence, in back-to-front or front-to-back order, according to the target punctuation marks corresponding to the optimal second semantic segments, so as to obtain the optimal punctuation addition result corresponding to the text to be processed.
4. The method according to claim 1, wherein performing punctuation addition processing on the global word sequence to obtain the optimal punctuation addition result corresponding to the text to be processed further comprises:
obtaining various punctuation addition results corresponding to the global word sequence;
determining the language model probability corresponding to each punctuation addition result;
and selecting a punctuation addition result with the optimal language model probability from the multiple punctuation addition results corresponding to the global word sequence as the optimal punctuation addition result corresponding to the text to be processed.
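The exhaustive alternative of claim 4 enumerates every way of placing a mark (or nothing) in each of the M − 1 gaps of the global word sequence and keeps the result whose language model probability is optimal. The sketch below is illustrative only; `lm_logprob` is a placeholder scorer, not the real language model.

```python
from itertools import product

MARKS = ["", ","]  # "" = no mark in the gap (illustrative set)

def lm_logprob(tokens):
    # Placeholder: favours fewer tokens, i.e. fewer inserted marks.
    return -0.1 * len(tokens)

def exhaustive_punctuate(words):
    best_score, best_tokens = float("-inf"), None
    # One candidate mark per gap between adjacent words: |MARKS|^(M-1) results
    for gaps in product(MARKS, repeat=len(words) - 1):
        tokens = [words[0]]
        for mark, word in zip(gaps, words[1:]):
            if mark:
                tokens.append(mark)
            tokens.append(word)
        score = lm_logprob(tokens)
        if score > best_score:
            best_score, best_tokens = score, tokens
    return best_tokens
```

The candidate count grows exponentially in the number of gaps, which is why claims 1 and 2 restrict the search with dynamic programming and windowed recursion.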
5. A processing apparatus, comprising:
the text to be processed acquisition module is used for acquiring a text to be processed;
the word segmentation module is used for segmenting words of the text to be processed to obtain a global word sequence corresponding to the text to be processed;
the punctuation adding processing module is used for performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm, so as to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises: at least one semantic segment, the semantic segment comprising: consecutive words of the global word sequence and/or consecutive words with punctuation marks added; and
the result output module is used for outputting the optimal punctuation addition result;
wherein, the punctuation adding processing module comprises: the system comprises a global path acquisition unit, a mobile acquisition unit, a second recursion unit and a second optimal result acquisition unit; or, the punctuation addition processing module comprises: the system comprises a set acquisition unit, a first recursion unit and a first optimal result acquisition unit;
the global path obtaining unit is configured to add punctuation marks between adjacent words in the global word sequence to obtain a global punctuation adding path corresponding to the global word sequence;
the mobile acquisition unit is used for acquiring, in a front-to-back order and in a moving manner, a local punctuation addition path and a second semantic segment corresponding to the local punctuation addition path from the global punctuation addition path; different second semantic segments contain the same number of character units, and adjacent second semantic segments have overlapping character units, wherein the character units include words and/or punctuation marks;
the second recursion unit is used for determining, in a front-to-back order and in a recursive manner, the target punctuation marks corresponding to the optimal second semantic segment; the language model probability corresponding to the optimal second semantic segment is optimal;
the second optimal result acquisition unit is used for obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments;
the set acquisition unit is used for acquiring a word sequence set corresponding to the global word sequence;
the first recursion unit is used for determining target punctuation marks of the punctuation addition results of the optimal subsets corresponding to each subset in a recursion mode according to the sequence of the subsets of the word sequence set from small to large; the language model corresponding to the optimal subset punctuation addition result is optimal in probability;
the first optimal result obtaining unit is used for obtaining an optimal punctuation addition result corresponding to the text to be processed according to an optimal subset punctuation addition result corresponding to the subset of the word sequence set;
the subset of the word sequence set comprises: the first i consecutive words of the text to be processed, wherein 0 < i ≤ M, M being the number of words contained in the text to be processed; the first recursion unit includes:
an adding subunit, configured to add punctuation marks between adjacent words in the first i consecutive words according to a target punctuation mark of an optimal subset punctuation addition result corresponding to the first k consecutive words, so as to obtain at least one subset punctuation addition path corresponding to the first i consecutive words; wherein 0< k < i, k being a positive integer;
the first language model probability determining subunit is used for determining the language model probability of the first semantic segment corresponding to the subset punctuation adding path by utilizing a neural network language model;
the first selection subunit is used for selecting an optimal subset punctuation adding path with optimal language model probability from the at least one subset punctuation adding path according to the language model probability of the first semantic segment;
and the target punctuation mark obtaining subunit is configured to obtain, according to punctuation marks included in the optimal subset punctuation adding path, target punctuation marks of the optimal subset punctuation adding result corresponding to the first i consecutive words.
6. The apparatus of claim 5, wherein the second recursion unit comprises:
the second language model probability determining subunit is used for determining the language model probability corresponding to the current second semantic segment by using an N-gram language model and/or a neural network language model;
the second selection subunit is used for selecting an optimal current second semantic segment from multiple current second semantic segments according to the language model probabilities corresponding to the current second semantic segments;
a target punctuation mark determining subunit, configured to use punctuation marks included in the optimal current second semantic segment as target punctuation marks corresponding to the optimal current second semantic segment;
and the second semantic segment determining subunit is used for obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
7. The apparatus of claim 5, wherein the second optimal result obtaining unit comprises:
and the adding subunit is configured to add punctuation marks to the global word sequence according to the target punctuation marks corresponding to the optimal second semantic segments in the order from back to front or the order from front to back, so as to obtain an optimal punctuation adding result corresponding to the text to be processed.
8. The apparatus of claim 5, wherein the punctuation addition processing module further comprises:
the result exhaustion submodule is used for acquiring various punctuation addition results corresponding to the global word sequence;
the language model probability determining submodule is used for determining the language model probability corresponding to the punctuation addition result; and
and the result selection submodule is used for selecting a punctuation addition result with the optimal language model probability from the multiple punctuation addition results corresponding to the global word sequence as the optimal punctuation addition result corresponding to the text to be processed.
9. An apparatus for processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring a text to be processed;
performing word segmentation on the text to be processed to obtain a global word sequence corresponding to the text to be processed;
performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain an optimal punctuation addition result corresponding to the text to be processed; the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises: at least one semantic segment, the semantic segment comprising: consecutive words of the global word sequence and/or consecutive words with punctuation marks added;
outputting the optimal punctuation addition result;
the punctuation adding processing on the global word sequence comprises the following steps:
adding punctuation marks between adjacent words in the global word sequence to obtain a global punctuation addition path corresponding to the global word sequence; acquiring, in a front-to-back order and in a moving manner, a local punctuation addition path and a second semantic segment corresponding to the local punctuation addition path from the global punctuation addition path; different second semantic segments contain the same number of character units, and adjacent second semantic segments have overlapping character units, wherein the character units include words and/or punctuation marks; determining, in a front-to-back order and in a recursive manner, the target punctuation marks corresponding to the optimal second semantic segment, wherein the language model probability corresponding to the optimal second semantic segment is optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments; or
Acquiring a word sequence set corresponding to the global word sequence; the subset of the word sequence set comprises: the first i consecutive words of the text to be processed, wherein 0 < i ≤ M, M being the number of words contained in the text to be processed; adding punctuation marks between adjacent words in the first i consecutive words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k consecutive words, so as to obtain at least one subset punctuation addition path corresponding to the first i consecutive words, wherein 0 < k < i and k is a positive integer; determining the language model probability of the first semantic segment corresponding to each subset punctuation addition path by using a neural network language model; selecting, according to the language model probability of the first semantic segment, an optimal subset punctuation addition path with the optimal language model probability from the at least one subset punctuation addition path; obtaining, according to the punctuation marks contained in the optimal subset punctuation addition path, the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i consecutive words, wherein the language model probability corresponding to the optimal subset punctuation addition result is optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set.
10. The apparatus according to claim 9, wherein determining, in a front-to-back order and in a recursive manner, the target punctuation marks corresponding to the optimal second semantic segment comprises:
determining the language model probability corresponding to the current second semantic segment by using an N-gram language model and/or a neural network language model;
selecting an optimal current second semantic segment from multiple current second semantic segments according to the language model probabilities corresponding to the current second semantic segments;
taking the punctuation marks contained in the optimal current second semantic segment as the target punctuation marks corresponding to the optimal current second semantic segment;
and obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
11. The apparatus according to claim 9, wherein obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments comprises:
adding punctuation marks to the global word sequence, in back-to-front or front-to-back order, according to the target punctuation marks corresponding to the optimal second semantic segments, so as to obtain the optimal punctuation addition result corresponding to the text to be processed.
12. The apparatus according to claim 9, wherein performing punctuation addition processing on the global word sequence to obtain the optimal punctuation addition result corresponding to the text to be processed further comprises:
obtaining various punctuation addition results corresponding to the global word sequence;
determining the language model probability corresponding to each punctuation addition result;
and selecting a punctuation addition result with the optimal language model probability from the multiple punctuation addition results corresponding to the global word sequence as the optimal punctuation addition result corresponding to the text to be processed.
13. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-4.
CN201710162165.2A 2017-03-17 2017-03-17 Processing method and device for processing Active CN108628813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710162165.2A CN108628813B (en) 2017-03-17 2017-03-17 Processing method and device for processing

Publications (2)

Publication Number Publication Date
CN108628813A CN108628813A (en) 2018-10-09
CN108628813B true CN108628813B (en) 2022-09-23

Family

ID=63686639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710162165.2A Active CN108628813B (en) 2017-03-17 2017-03-17 Processing method and device for processing

Country Status (1)

Country Link
CN (1) CN108628813B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410949B (en) * 2018-10-11 2021-11-16 厦门大学 Text content punctuation adding method based on weighted finite state converter
CN111046649A (en) * 2019-11-22 2020-04-21 北京捷通华声科技股份有限公司 Text segmentation method and device
CN110908583B (en) * 2019-11-29 2022-10-14 维沃移动通信有限公司 Symbol display method and electronic equipment
CN111241810B (en) * 2020-01-16 2023-08-01 百度在线网络技术(北京)有限公司 Punctuation prediction method and device
CN112685996B (en) * 2020-12-23 2024-03-22 北京有竹居网络技术有限公司 Text punctuation prediction method, device, readable medium and electronic device
CN113053390B (en) * 2021-03-22 2022-12-02 深圳如布科技有限公司 Text processing method, device, electronic equipment and medium based on speech recognition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on voice identification
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4860265B2 (en) * 2004-01-16 2012-01-25 日本電気株式会社 Text processing method / program / program recording medium / device


Also Published As

Publication number Publication date
CN108628813A (en) 2018-10-09

Similar Documents

Publication Publication Date Title
CN107632980B (en) Voice translation method and device for voice translation
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN108628813B (en) Processing method and device for processing
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN107291704B (en) Processing method and device for processing
CN111368541B (en) Named entity identification method and device
CN107564526B (en) Processing method, apparatus and machine-readable medium
US20190340233A1 (en) Input method, input device and apparatus for input
CN107274903B (en) Text processing method and device for text processing
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN108304412B (en) Cross-language search method and device for cross-language search
CN107291260B (en) Information input method and device for inputting information
CN111369978B (en) A data processing method, a data processing device and a data processing device
CN110069624B (en) Text processing method and device
CN108073572B (en) Information processing method and device, simultaneous interpretation system
CN108628819B (en) Processing method and device for processing
CN114154459A (en) Speech recognition text processing method, device, electronic device and storage medium
CN109887492B (en) Data processing method and device and electronic equipment
CN110633017A (en) Input method, input device and input device
CN109979435B (en) Data processing method and device for data processing
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN107422872B (en) Input method, input device and input device
CN113409766A (en) Recognition method, device for recognition and voice synthesis method
CN113589949A (en) Input method and device and electronic equipment
CN113589954B (en) Data processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant