Disclosure of Invention
The invention provides a method, an apparatus, a device and a storage medium for extracting character action related data. Text data is syntactically analyzed and part-of-speech tagged by a Chinese natural language processing (HanLP) algorithm, and data related to actions that are actually occurring is selected based on the subject-predicate-object grammatical relation and on modal verbs, thereby improving the accuracy of data extraction and reducing noise in the extracted data set.
A first aspect of the invention provides a method for extracting character action related data, which comprises the following steps: acquiring preset text data, wherein the preset text data is text data from novels that contains character actions; classifying the preset text data and selecting the text data containing character information to obtain initial text data; performing word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data; performing dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset HanLP algorithm to generate analysis text data; and filtering the analysis text data to obtain target text data containing a plurality of character actions.
Optionally, in a first implementation manner of the first aspect of the present invention, classifying the preset text data and selecting the text data containing character information to obtain initial text data includes: classifying the preset text data according to preset classification rules, selecting text data containing character pronouns or character names, and generating classified text data; and identifying target punctuation marks in the classified text data, deleting text data containing character dialogue according to the target punctuation marks, and generating the initial text data, wherein the target punctuation marks indicate character dialogue.
Optionally, in a second implementation manner of the first aspect of the present invention, performing word segmentation and part-of-speech tagging on the initial text data based on the preset HanLP algorithm to generate intermediate text data includes: splitting the initial text data into sentences at punctuation marks to obtain a sentence segmentation result; performing word segmentation on the sentence segmentation result based on the preset HanLP algorithm to obtain a word segmentation result; and performing part-of-speech tagging on the word segmentation result based on the preset HanLP algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data.
Optionally, in a third implementation manner of the first aspect of the present invention, performing dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset HanLP algorithm to generate analysis text data includes: calling the preset HanLP algorithm to identify and analyze the relations between grammatical components in the intermediate text data, and, when the core relation points to a predicate verb, extracting the core subject-predicate relation to generate first analysis text data; calling the preset HanLP algorithm to analyze semantic associations in the intermediate text data, determining the relation type, selecting text data containing the agent relation, and generating second analysis text data; and combining the first analysis text data and the second analysis text data to generate the analysis text data.
Optionally, in a fourth implementation manner of the first aspect of the present invention, filtering the analysis text data to obtain target text data containing a plurality of character actions includes: acquiring the analysis text data, filtering out the text data containing modal verbs in the analysis text data, and generating filtered text data; and normalizing the filtered text data to generate the target text data containing a plurality of character actions.
Optionally, in a fifth implementation manner of the first aspect of the present invention, filtering out the text data containing modal verbs in the analysis text data and generating filtered text data includes: identifying text data containing modal verbs in the analysis text data, wherein the modal verbs indicate character actions that have not occurred; and deleting the text data containing the modal verbs to generate the filtered text data.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the dependency syntax analysis and semantic dependency analysis are performed on the intermediate text data based on the preset HanLP algorithm to generate the analysis text data, and before the analysis text data is filtered to generate the target text data, the method further includes: identifying whether the analysis text data contains character actions that occurred in the past; if it does not, retaining the analysis text data; and if it does, deleting the related data containing the character actions that occurred in the past.
A second aspect of the present invention provides an apparatus for extracting character action related data, comprising: an acquisition module, configured to acquire preset text data, wherein the preset text data is text data from novels that contains character actions; a classification module, configured to classify the preset text data and select the text data containing character information to obtain initial text data; a word segmentation module, configured to perform word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data; an analysis module, configured to perform dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset HanLP algorithm to generate analysis text data; and a filtering module, configured to filter the analysis text data to obtain target text data containing a plurality of character actions.
Optionally, in a first implementation manner of the second aspect of the present invention, the classification module includes: a classification unit, configured to classify the preset text data according to preset classification rules, select text data containing character pronouns or character names, and generate classified text data; and a deleting unit, configured to identify target punctuation marks in the classified text data, delete text data containing character dialogue according to the target punctuation marks, and generate the initial text data, wherein the target punctuation marks indicate character dialogue.
Optionally, in a second implementation manner of the second aspect of the present invention, the word segmentation module includes: a sentence splitting unit, configured to split the initial text data into sentences at punctuation marks to obtain a sentence segmentation result; a word segmentation unit, configured to perform word segmentation on the sentence segmentation result based on the preset HanLP algorithm to obtain a word segmentation result; and a part-of-speech tagging unit, configured to perform part-of-speech tagging on the word segmentation result based on the preset HanLP algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data.
Optionally, in a third implementation manner of the second aspect of the present invention, the analysis module includes: a first analysis unit, configured to call the preset HanLP algorithm to identify and analyze the relations between grammatical components in the intermediate text data, and, when the core relation points to a predicate verb, extract the core subject-predicate relation to generate first analysis text data; a second analysis unit, configured to call the preset HanLP algorithm to analyze semantic associations in the intermediate text data, determine the relation type, select text data containing the agent relation, and generate second analysis text data; and a merging unit, configured to combine the first analysis text data and the second analysis text data to generate the analysis text data.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the filtering module includes: a filtering unit, configured to filter out the text data containing modal verbs in the analysis text data to generate filtered text data; and a normalization unit, configured to normalize the filtered text data to generate target text data containing a plurality of character actions.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the filtering unit is specifically configured to: identify text data containing modal verbs in the analysis text data, wherein the modal verbs indicate character actions that have not occurred; and delete the text data containing the modal verbs to generate the filtered text data.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the apparatus further includes: an identification module, configured to identify whether the analysis text data contains character actions that occurred in the past, retain the analysis text data when it does not, and delete the related data containing the character actions that occurred in the past when it does.
A third aspect of the present invention provides a device for extracting character action related data, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor calls the instructions in the memory to cause the device to execute the above method for extracting character action related data.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to execute the above method for extracting character action related data.
In the technical scheme provided by the invention, preset text data is acquired, wherein the preset text data is text data from novels that contains character actions; the preset text data is classified and the text data containing character information is selected to obtain initial text data; word segmentation and part-of-speech tagging are performed on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data; dependency syntax analysis and semantic dependency analysis are performed on the intermediate text data based on the preset HanLP algorithm to generate analysis text data; and the analysis text data is filtered to obtain target text data containing a plurality of character actions. In the embodiment of the invention, the text data is syntactically analyzed and part-of-speech tagged by the HanLP algorithm, and data related to actions that are actually occurring is selected based on the subject-predicate-object grammatical relation and on modal verbs, so that the accuracy of data extraction is improved and noise in the extracted data set is reduced.
Detailed Description
The embodiment of the invention provides a method, an apparatus, a device and a storage medium for extracting character action related data. Text data is syntactically analyzed and part-of-speech tagged by a Chinese natural language processing (HanLP) algorithm, and data related to actions that are actually occurring is selected based on the subject-predicate-object grammatical relation and on modal verbs, thereby improving the accuracy of data extraction and reducing noise in the extracted data set.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a detailed flow of the embodiment of the present invention is described below. Referring to fig. 1, an embodiment of the method for extracting character action related data includes:
101. Acquire preset text data, wherein the preset text data is text data from novels that contains character actions.
The server acquires preset text data, wherein the preset text data is text data from novels that contains character actions. The server uses a web crawler to obtain a plurality of novel texts under specified tags from the network, and builds a preset data set from these novel texts.
It is to be understood that the execution subject of the present invention may be the apparatus for extracting character action related data, or may be a terminal or a server, which is not limited herein. The embodiment of the present invention is described with a server as the execution subject.
102. Classify the preset text data and select the text data containing character information to obtain initial text data.
The server classifies the preset text data and selects the text data containing character information to obtain initial text data. Specifically, the server classifies the preset text data according to preset classification rules, selects text data containing character pronouns or character names, and generates classified text data; the server then identifies target punctuation marks in the classified text data and deletes text data containing character dialogue to generate the initial text data, where the target punctuation marks indicate character dialogue. The server divides the preset text data into two classes according to whether character information is present. Text data without character information is eliminated, for example "the dog runs in the yard", "birds chirp outside the window" and "the squirrel uses its big fluffy tail as a quilt". Text data containing character pronouns or character names is retained, where the character pronouns include I (we), you, he (they) and she (they). The target punctuation mark is the combination of a colon and double quotation marks, which indicates character dialogue. Although text data with character dialogue contains character information, it is not suitable for analyzing and extracting character action related data in this scheme, so it is removed.
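The screening in step 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the name list, the exact pronoun set and the regular expression for the colon plus double-quote marker are assumptions made for the example.

```python
import re

# Hypothetical resources: the text names the pronouns but gives no name list.
PERSON_PRONOUNS = ("我", "你", "他", "她")  # also matches 我们, 你们, 他们, 她们
PERSON_NAMES = ("小明",)                     # assumed example name list

# Target punctuation: a colon followed by an opening double quotation mark.
DIALOGUE_MARK = re.compile(r'[:：]\s*["“]')


def has_person(text: str) -> bool:
    """Return True if the text mentions a person pronoun or a known name."""
    return any(p in text for p in PERSON_PRONOUNS + PERSON_NAMES)


def is_dialogue(text: str) -> bool:
    """Return True if the text contains the colon + double-quote marker."""
    return DIALOGUE_MARK.search(text) is not None


def build_initial_text(lines):
    """Keep lines that mention a person and are not character dialogue."""
    return [t for t in lines if has_person(t) and not is_dialogue(t)]


sample = [
    "小狗在院子里奔跑。",        # no character information, removed
    "小明正在房间里玩游戏。",    # character name in narration, kept
    "他说：“我们走吧。”",        # character dialogue, removed
]
initial_text = build_initial_text(sample)
```

Plain substring matching is only an approximation: a pronoun character can also occur inside an unrelated word, so a real system would likely match on segmented tokens instead.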
103. Perform word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data.
The server performs word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data. Specifically, the server splits the initial text data into sentences at punctuation marks to obtain a sentence segmentation result; the server performs word segmentation on the sentence segmentation result based on the preset HanLP algorithm to obtain a word segmentation result; and the server performs part-of-speech tagging on the word segmentation result based on the preset HanLP algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data. The word is the most basic unit of text, and word segmentation is the most basic step in natural language processing. Word segmentation algorithms are divided into dictionary methods and statistical methods: a method based on a dictionary and hand-written rules matches the string to be analyzed against dictionary entries according to a certain strategy, while a statistical method relies on the frequency with which basic character strings occur in a corpus. Each punctuation mark has a corresponding regular expression; splitting the initial text data at the punctuation marks divides long sentences into several short sentences and yields the sentence segmentation result.
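A minimal sketch of the punctuation-based sentence splitting described above. The set of sentence-ending marks is an assumption for the example; the text only states that each punctuation mark has a corresponding regular expression.

```python
import re

# Assumed sentence-ending marks (Chinese and ASCII forms).
SENTENCE_PATTERN = re.compile(r"[^。！？；!?;]+[。！？；!?;]?")


def split_sentences(text: str):
    """Split text into short sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]


sentences = split_sentences("小明起床了。他走出房间！然后呢")
```

Each match consumes a run of ordinary characters followed by at most one closing mark, so a trailing fragment without punctuation is still returned as its own sentence.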
HanLP is a Chinese natural language processing toolkit consisting of a series of models and algorithms. It aims to promote the application of natural language processing in production environments and is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora and customizability. In this scheme, the text data is first segmented by HanLP; for example, given the input "Xiaoming is eating", the word segmentation result is "Xiaoming", "now", "eating". Part-of-speech tagging is the process of labeling each word in the word segmentation result with its correct part of speech, that is, determining whether each word is a noun, a verb, an adjective or another part of speech. In this scheme, the word segmentation result is tagged using a preset HanLP part-of-speech tag set: "Xiaoming" is tagged as a noun, "now" as an adverb, and "eating" as a verb.
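The dictionary method mentioned in step 103 can be illustrated with forward maximum matching over a toy dictionary. This is a sketch of the general technique only; it is not HanLP's actual segmenter, and the dictionary entries and plain-English tag names are hypothetical (the real HanLP tag set uses its own codes).

```python
# Toy dictionary mapping words to part-of-speech labels (hypothetical entries).
DICTIONARY = {"小明": "noun", "正在": "adverb", "吃饭": "verb", "吃": "verb"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment_and_tag(sentence: str):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(sentence):
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            word = sentence[i:i + size]
            if word in DICTIONARY:
                tokens.append((word, DICTIONARY[word]))
                i += size
                break
        else:
            # No dictionary entry starts here: emit one character as unknown.
            tokens.append((sentence[i], "unknown"))
            i += 1
    return tokens


tagged = segment_and_tag("小明正在吃饭")
```

Because the longest candidate is tried first, "吃饭" is preferred over the shorter entry "吃" at the same position.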
104. Perform dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset HanLP algorithm to generate analysis text data.
The server performs dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset HanLP algorithm to generate analysis text data. Dependency parsing (DP) analyzes the dependency relations between the components of a language unit to reveal its syntactic structure; that is, it identifies grammatical components in the sentence such as subject, predicate and object, and attributive, adverbial and complement, and analyzes the relations between these components. Semantic dependency parsing (SDP) analyzes the semantic associations between the language units of a sentence and presents them as a dependency structure. Semantic dependency parsing is not affected by syntactic structure: language units with a direct semantic association are connected directly by dependency arcs labeled with the corresponding semantic relations, which is an important difference between semantic dependency parsing and syntactic parsing. For example, "Xiaoming has eaten an apple" and "an apple has been eaten by Xiaoming" have different syntactic structures and produce different syntactic analysis results, but the semantic relations among the language units do not change and the same semantic information is expressed: Xiaoming performs the action of eating, and the action is performed on an apple.
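The contrast between the two analyses can be sketched with hand-written arcs for the two example sentences above. The arcs and label names below are illustrative placeholders typed in by hand, not the output of HanLP or any other parser.

```python
# Each arc is (dependent, relation, head). All labels are illustrative.
active = {
    "syntax": [("Xiaoming", "subject", "eat"), ("apple", "object", "eat")],
    "semantics": [("Xiaoming", "agent", "eat"), ("apple", "patient", "eat")],
}
passive = {
    "syntax": [("apple", "subject", "eat"), ("Xiaoming", "by-phrase", "eat")],
    "semantics": [("Xiaoming", "agent", "eat"), ("apple", "patient", "eat")],
}


def to_triple(semantic_arcs):
    """Reduce semantic arcs to an (agent, action, patient) triple."""
    agent, action = next((d, h) for d, r, h in semantic_arcs if r == "agent")
    patient = next(d for d, r, h in semantic_arcs if r == "patient")
    return (agent, action, patient)


syntax_differs = set(active["syntax"]) != set(passive["syntax"])
same_meaning = to_triple(active["semantics"]) == to_triple(passive["semantics"])
```

The syntactic arcs differ between the active and passive forms, while both reduce to the same semantic triple.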
105. Filter the analysis text data to obtain target text data containing a plurality of character actions.
The server filters the analysis text data to obtain target text data containing a plurality of character actions. Specifically, the server acquires the analysis text data, filters out the text data containing modal verbs, and generates filtered text data; the server then normalizes the filtered text data to generate the target text data, which comprises the extracted character actions. After the subject-predicate character actions have been selected, a sentence in which a modal verb modifies the predicate verb does not satisfy the condition: the modal verb makes the sentence describe an action or state at some future time, so the character action has not yet occurred. For example, in "Xiaoming will go out to play on the swing", the swinging has not yet happened, so the related text data is filtered out and deleted.
In the embodiment of the invention, the text data is syntactically analyzed and part-of-speech tagged by the HanLP algorithm, and data related to actions that are actually occurring is selected based on the subject-predicate-object grammatical relation and on modal verbs, so that the accuracy of data extraction is improved and noise in the extracted data set is reduced.
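The modal-verb filter in step 105 can be sketched as below. The modal-verb lexicon and the rule "a modal verb followed later by a verb" are assumptions for illustration; the text does not enumerate the modal verbs or state how modification is detected, and the normalization step is not shown because it is not specified.

```python
# Assumed modal-verb lexicon (hypothetical; not taken from the patent).
MODAL_VERBS = {"将", "将要", "会", "要", "想", "能", "可以", "应该"}


def modal_modifies_verb(tokens) -> bool:
    """tokens: list of (word, pos). True if a modal verb precedes some verb."""
    for i, (word, _pos) in enumerate(tokens):
        if word in MODAL_VERBS and any(p == "verb" for _w, p in tokens[i + 1:]):
            return True
    return False


def filter_occurring_actions(records):
    """Drop token sequences whose action is marked as not yet occurring."""
    return [tokens for tokens in records if not modal_modifies_verb(tokens)]


records = [
    [("小明", "noun"), ("将要", "modal"), ("出去", "verb"), ("荡秋千", "verb")],
    [("小明", "noun"), ("正在", "adverb"), ("吃饭", "verb")],
]
filtered = filter_occurring_actions(records)
```

The positional check is a rough stand-in for a real dependency test; words such as "要" can also act as main verbs, which this sketch would wrongly filter.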
Referring to fig. 2, another embodiment of the method for extracting data related to human actions according to the embodiment of the present invention includes:
201. Acquire preset text data, wherein the preset text data is text data from novels that contains character actions.
The server acquires preset text data, wherein the preset text data is text data from novels that contains character actions. The server uses a web crawler to obtain a plurality of novel texts under specified tags from the network, and builds a preset data set from these novel texts.
It is to be understood that the execution subject of the present invention may be the apparatus for extracting character action related data, or may be a terminal or a server, which is not limited herein. The embodiment of the present invention is described with a server as the execution subject.
202. Classify the preset text data and select the text data containing character information to obtain initial text data.
The server classifies the preset text data and selects the text data containing character information to obtain initial text data. Specifically, the server classifies the preset text data according to preset classification rules, selects text data containing character pronouns or character names, and generates classified text data; the server then identifies target punctuation marks in the classified text data and deletes text data containing character dialogue to generate the initial text data, where the target punctuation marks indicate character dialogue. The server divides the preset text data into two classes according to whether character information is present. Text data without character information is eliminated, for example "the dog runs in the yard", "birds chirp outside the window" and "the squirrel uses its big fluffy tail as a quilt". Text data containing character pronouns or character names is retained, where the character pronouns include I (we), you, he (they) and she (they). The target punctuation mark is the combination of a colon and double quotation marks, which indicates character dialogue. Although text data with character dialogue contains character information, it is not suitable for analyzing and extracting character action related data in this scheme, so it is removed.
203. Perform word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data.
The server performs word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data. Specifically, the server splits the initial text data into sentences at punctuation marks to obtain a sentence segmentation result; the server performs word segmentation on the sentence segmentation result based on the preset HanLP algorithm to obtain a word segmentation result; and the server performs part-of-speech tagging on the word segmentation result based on the preset HanLP algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data. The word is the most basic unit of text, and word segmentation is the most basic step in natural language processing. Word segmentation algorithms are divided into dictionary methods and statistical methods: a method based on a dictionary and hand-written rules matches the string to be analyzed against dictionary entries according to a certain strategy, while a statistical method relies on the frequency with which basic character strings occur in a corpus. Each punctuation mark has a corresponding regular expression; splitting the initial text data at the punctuation marks divides long sentences into several short sentences and yields the sentence segmentation result.
HanLP is a Chinese natural language processing toolkit consisting of a series of models and algorithms. It aims to promote the application of natural language processing in production environments and is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora and customizability. In this scheme, the text data is first segmented by HanLP; for example, given the input "Xiaoming is eating", the word segmentation result is "Xiaoming", "now", "eating". Part-of-speech tagging is the process of labeling each word in the word segmentation result with its correct part of speech, that is, determining whether each word is a noun, a verb, an adjective or another part of speech. In this scheme, the word segmentation result is tagged using a preset HanLP part-of-speech tag set: "Xiaoming" is tagged as a noun, "now" as an adverb, and "eating" as a verb.
204. Call the preset HanLP algorithm to identify and analyze the relations between grammatical components in the intermediate text data, and, when the core relation points to a predicate verb, extract the core subject-predicate relation to generate first analysis text data.
The server calls the preset HanLP algorithm to identify and analyze the relations between grammatical components in the intermediate text data and, when the core relation points to a predicate verb, extracts the core subject-predicate relation to generate first analysis text data. For example, in "Xiaoming is playing a game in the room", "Xiaoming" is the nominal subject, the progressive marker ("正") is labeled a nominal object, "in" ("在") is a prepositional modifier, "room" is a prepositional location modifier, "inside" ("里") is labeled a temporal preposition, "playing" is the verb predicate, and "game" is the direct object. The verb "play" is the core word, so the sentence can be reduced to "Xiaoming plays a game", which contains the subject-predicate relation.
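The extraction rule in step 204 can be sketched over a hand-written dependency parse of the example sentence. The `Token` layout and the relation names (`root`, `nsubj`, `dobj` and the rest) are assumptions for this sketch; they are not claimed to be the labels HanLP emits.

```python
from typing import NamedTuple, Optional, Tuple


class Token(NamedTuple):
    idx: int    # 1-based position in the sentence
    word: str
    pos: str
    head: int   # index of the governing token, 0 for the virtual root
    rel: str    # relation to the head (illustrative label)


def extract_subject_predicate(tokens) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (subject, predicate, object) when the core word is a verb."""
    root = next((t for t in tokens if t.head == 0 and t.rel == "root"), None)
    if root is None or root.pos != "verb":
        return None  # core relation does not point to a predicate verb

    def child(rel):
        return next((t.word for t in tokens
                     if t.head == root.idx and t.rel == rel), None)

    subject = child("nsubj")
    if subject is None:
        return None
    return (subject, root.word, child("dobj"))


# Hand-written parse of "小明正在房间里玩游戏" (not real parser output).
parse = [
    Token(1, "小明", "noun", 6, "nsubj"),
    Token(2, "正", "adverb", 6, "advmod"),
    Token(3, "在", "preposition", 6, "prep"),
    Token(4, "房间", "noun", 5, "nmod"),
    Token(5, "里", "localizer", 3, "pobj"),
    Token(6, "玩", "verb", 0, "root"),
    Token(7, "游戏", "noun", 6, "dobj"),
]
core = extract_subject_predicate(parse)
```

Modifiers such as the adverb and the prepositional phrase are ignored, which is what reduces the sentence to its core subject, predicate and object.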
205. Call the preset HanLP algorithm to analyze semantic associations in the intermediate text data, determine the relation type, select text data containing the agent relation, and generate second analysis text data.
The server calls the preset HanLP algorithm to analyze semantic associations in the intermediate text data, determines the relation type, selects text data containing the agent relation, and generates second analysis text data. The relation types include the agent relation, the experiencer relation, the affection relation, the possessor relation, the patient relation, the content relation, the product relation, the origin relation, the dative relation and the comparison role. For example, in "Xiaoming gives her flowers", the semantic relation type is the agent relation, and "giving flowers" is a specific action performed by a character, so the sentence meets the screening condition of this scheme. In "Xiaoming eats in the room while watching television and talking", the sentence contains several predicate verbs ("eat", "watch" and "talk") that stand in a succession relation, so it also meets the screening condition of this scheme.
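The relation-type screening in step 205 can be sketched as a filter over semantic arcs. In this sketch the relation that marks who performs an action is called `agent`; that name, the other labels and the hand-written arcs are placeholders rather than the labels of any particular parser.

```python
# Relations whose presence makes a sentence eligible (assumed for this sketch).
KEPT_RELATIONS = frozenset({"agent"})


def select_by_relation(parsed_sentences, kept=KEPT_RELATIONS):
    """Keep sentences that have at least one arc with a kept relation.

    parsed_sentences: iterable of (sentence, arcs), where each arc is
    (dependent, relation, head).
    """
    selected = []
    for sentence, arcs in parsed_sentences:
        if any(rel in kept for _dep, rel, _head in arcs):
            selected.append(sentence)
    return selected


parsed = [
    ("小明送她花", [("小明", "agent", "送"),
                    ("她", "dative", "送"),
                    ("花", "patient", "送")]),
    ("花很香", [("花", "experiencer", "香")]),
]
second_analysis = select_by_relation(parsed)
```

The `kept` set could be extended, for example with a label for the succession relation between several predicate verbs, if sentences of that kind should be selected on that basis as well.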
206. Combine the first analysis text data and the second analysis text data to generate the analysis text data.
The server combines the first analysis text data and the second analysis text data to generate the analysis text data. In this scheme, word segmentation, part-of-speech tagging, syntactic analysis and semantic analysis are all based on the HanLP algorithm. Each layer produces an independent data result, which can be used on its own or passed to the next layer for further analysis.
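One way to combine the two analysis results is sketched below. The text does not say how the merge is performed, so this example assumes a union keyed by sentence in which records for the same sentence are folded together; the record fields are hypothetical.

```python
def merge_analysis(first, second):
    """Union of two record lists, folding records that share a sentence."""
    merged = {}
    for record in [*first, *second]:
        # dict preserves insertion order, so output order follows first, then second
        merged.setdefault(record["sentence"], {}).update(record)
    return list(merged.values())


first_analysis = [
    {"sentence": "小明正在房间里玩游戏", "core": ("小明", "玩", "游戏")},
]
second_analysis = [
    {"sentence": "小明正在房间里玩游戏", "relations": ["agent"]},
    {"sentence": "小明送她花", "relations": ["agent"]},
]
analysis_text = merge_analysis(first_analysis, second_analysis)
```

Keeping each layer's fields under separate keys reflects the point above that every layer's result remains usable on its own after merging.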
207. And filtering the analysis text data to obtain target text data containing a plurality of character behaviors.
And the server filters the analysis text data to obtain target text data containing a plurality of character behaviors and actions. Specifically, the server acquires analysis text data, filters the text data containing the emotional verbs in the analysis text data, and generates filtered text data; and the server performs normalization processing on the filtered text data to generate target text data, wherein the target text data comprises the extracted multiple character behavior actions. After the screened main predicate person acts, when an emotional verb modifying the predicate verb appears in the sentence, the condition is not met, because the sentence presents an action or a state at a certain future time due to the appearance of the emotional verb, the person action does not occur yet, for example, "a little will go out to swing" and the action of swinging does not occur yet, and therefore, related text data needs to be filtered and deleted.
In the embodiment of the invention, syntactic analysis and part-of-speech tagging are performed on the text data through the Chinese natural language processing HanLP algorithm, and data related to behavior actions that are actually occurring is screened out based on subject-predicate-object grammatical relations and modal verbs, thereby improving the accuracy of data extraction and reducing the noise in the extracted data set.
The method for extracting character action-related data in the embodiment of the present invention is described above. The device for extracting character action-related data in the embodiment of the present invention is described below. Referring to Fig. 3, one embodiment of the device includes:
an acquisition module 301, configured to acquire preset text data, where the preset text data is novel text data containing character behavior actions;
a classification module 302, configured to classify the preset text data and screen out text data containing character information to obtain initial text data;
a word segmentation module 303, configured to perform word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing HanLP algorithm to generate intermediate text data;
an analysis module 304, configured to perform dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data;
and a filtering module 305, configured to filter the analysis text data to obtain target text data containing a plurality of character behavior actions.
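As a rough sketch of how modules 301 to 305 could be chained, the following uses toy stand-ins for each module. The class name `ExtractionDevice` and every stage function are hypothetical; in particular, the segmentation and analysis stand-ins do not call HanLP and only mark where those modules would sit.

```python
from typing import Any, Callable, List, Sequence, Tuple


class ExtractionDevice:
    """Chains the modules in order; each module is any single-argument callable."""

    def __init__(self, modules: Sequence[Callable[[Any], Any]]):
        self.modules = list(modules)

    def run(self, source: Any) -> Any:
        data = source
        for module in self.modules:
            data = module(data)  # output of one module feeds the next
        return data


# Toy stand-ins for the five modules (all hypothetical).
def acquire(raw: str) -> List[str]:                          # module 301
    return raw.splitlines()


def classify(lines: List[str]) -> List[str]:                 # module 302
    return [s for s in lines if "小明" in s]


def segment(lines: List[str]) -> List[Tuple[str, List[str]]]:  # module 303
    return [(s, list(s)) for s in lines]  # character split, not real segmentation


def analyze(pairs: List[Tuple[str, List[str]]]) -> List[str]:  # module 304
    return [s for s, _chars in pairs]     # placeholder: keeps every sentence


def filter_modal(lines: List[str]) -> List[str]:             # module 305
    return [s for s in lines if "将要" not in s]


device = ExtractionDevice([acquire, classify, segment, analyze, filter_modal])
result = device.run("小明送她花\n天气很好\n小明将要出去荡秋千")
```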
In the embodiment of the invention, syntactic analysis and part-of-speech tagging are performed on the text data through the Chinese natural language processing HanLP algorithm, and data related to behavior actions that are actually occurring is screened out based on subject-predicate-object grammatical relations and modal verbs, thereby improving the accuracy of data extraction and reducing the noise in the extracted data set.
Referring to Fig. 4, another embodiment of the device for extracting character action-related data in the embodiment of the present invention includes:
an acquisition module 301, configured to acquire preset text data, where the preset text data is novel text data containing character behavior actions;
a classification module 302, configured to classify the preset text data and screen out text data containing character information to obtain initial text data;
a word segmentation module 303, configured to perform word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing HanLP algorithm to generate intermediate text data;
an analysis module 304, configured to perform dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data;
and a filtering module 305, configured to filter the analysis text data to obtain target text data containing a plurality of character behavior actions.
Optionally, the classification module 302 includes:
a classification unit 3021 configured to classify preset text data according to preset classification rules, screen out text data including a character pronoun or a character name, and generate classified text data;
a deleting unit 3022, configured to identify target punctuation marks in the classified text data and delete the text data containing character dialogue according to the target punctuation marks to generate initial text data, where the target punctuation marks are used to indicate character dialogue.
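The two units above could be sketched as follows, for illustration only. The pronoun list, the name list, and the choice of Chinese double quotation marks as the target punctuation are all assumptions of this sketch, as are the function names.

```python
import re
from typing import List

# Hypothetical pronoun and name lists standing in for the preset classification rules.
PRONOUNS = ("他", "她", "我", "你")
NAMES = ("小明",)

# Assumption: Chinese double quotation marks are the target punctuation
# that indicates character dialogue.
DIALOGUE_PATTERN = re.compile(r"“[^”]*”")


def mentions_character(sentence: str) -> bool:
    """Return True if the sentence contains a listed pronoun or character name."""
    return any(token in sentence for token in PRONOUNS + NAMES)


def classify_and_delete_dialogue(sentences: List[str]) -> List[str]:
    """Keep sentences that mention a character, then drop those containing dialogue."""
    classified = [s for s in sentences if mentions_character(s)]            # unit 3021
    return [s for s in classified if not DIALOGUE_PATTERN.search(s)]        # unit 3022


initial_text = classify_and_delete_dialogue(
    ["小明送她花。", "天气很好。", "小明说：“你好。”"]
)
```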
Optionally, the word segmentation module 303 includes:
a sentence splitting unit 3031, configured to split the initial text data into sentences according to punctuation marks to obtain a sentence splitting result;
a word segmentation unit 3032, configured to perform word segmentation on the sentence splitting result based on the preset Chinese natural language processing HanLP algorithm to obtain a word segmentation result;
and a part-of-speech tagging unit 3033, configured to perform part-of-speech tagging on the word segmentation result based on the preset Chinese natural language processing HanLP algorithm and a preset HanLP part-of-speech tag set to generate intermediate text data.
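A minimal sketch of splitting text into sentences by punctuation is given below. The set of sentence-ending marks and the function name `split_sentences` are assumptions of this sketch; word segmentation and part-of-speech tagging are left to HanLP and are not shown.

```python
import re
from typing import List

# Assumed sentence-ending punctuation marks for this sketch.
_SENTENCE = re.compile(r"[^。！？；]+[。！？；]?")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping each closing punctuation mark."""
    parts = (match.strip() for match in _SENTENCE.findall(text))
    return [part for part in parts if part]


sentences = split_sentences("小明送她花。天气很好！小明在房间里吃饭")
```

Text that lacks a closing mark, such as the last fragment here, is still returned as its own sentence.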
Optionally, the analysis module 304 includes:
a first analysis unit 3041, configured to invoke the preset Chinese natural language processing HanLP algorithm to identify and analyze the relationships between grammatical elements in the intermediate text data, and, when a core relationship of an object points to a verb predicate, extract the core subject-predicate relationship to generate first analysis text data;
a second analysis unit 3042, configured to invoke the preset Chinese natural language processing HanLP algorithm to analyze the semantic associations in the intermediate text data, determine the relationship types, screen out the text data containing event relationships, and generate second analysis text data;
a merging unit 3043, configured to merge the first analysis text data and the second analysis text data to generate analysis text data.
Optionally, the filtering module 305 includes:
a filtering unit 3051, configured to filter out the text data containing modal verbs from the analysis text data to generate filtered text data;
and a normalization unit 3052, configured to perform normalization processing on the filtered text data, and generate target text data including a plurality of character behaviors.
Optionally, between the analysis module 304 and the filtering module 305, the device for extracting character action-related data further includes:
a recognition module 306, configured to recognize whether the analysis text data contains character behavior actions that occurred in the past; when it does not, the analysis text data is retained, and when it does, the related data containing the past character behavior actions is deleted.
Specifically, for example, in "Xiaoming has already eaten", "eat" is a verb predicate, but because the sentence is in the simple past, the semantic relationship expresses a past state of Xiaoming rather than an action currently being performed, so the related text data needs to be deleted.
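The recognition of past actions might be sketched as follows. The marker list is a hypothetical stand-in: the text does not say how a past action is detected, so this sketch simply looks for adverbs such as "已经" ("already") among the segmented words, and the function name is likewise an assumption.

```python
from typing import List

# Hypothetical markers of a past action; a real system would rely on the
# syntactic and semantic analysis results rather than a fixed word list.
PAST_MARKERS = {"已经", "曾经"}


def drop_past_actions(sentences: List[List[str]]) -> List[List[str]]:
    """Keep only the segmented sentences that contain no past-action marker."""
    return [words for words in sentences if not PAST_MARKERS.intersection(words)]


kept = drop_past_actions([
    ["小明", "已经", "吃", "过", "饭"],
    ["小明", "送", "她", "花"],
])
```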
In the embodiment of the invention, syntactic analysis and part-of-speech tagging are performed on the text data through the Chinese natural language processing HanLP algorithm, and data related to behavior actions that are actually occurring is screened out based on subject-predicate-object grammatical relations and modal verbs, thereby improving the accuracy of data extraction and reducing the noise in the extracted data set.
Figs. 3 and 4 above describe the device for extracting character action-related data in the embodiment of the present invention in detail from the perspective of modular functional entities. The device for extracting character action-related data in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a device for extracting character action-related data. The device 500 for extracting character action-related data may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may provide transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the device 500, the series of instruction operations stored in the storage medium 530.
The device 500 for extracting character action-related data may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the structure shown in Fig. 5 does not limit the device for extracting character action-related data, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The invention also provides a device for extracting character action-related data, which comprises a memory and a processor, wherein computer-readable instructions are stored in the memory and, when executed by the processor, cause the processor to execute the steps of the method for extracting character action-related data in the above embodiments.
The invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the steps of the method for extracting character action-related data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another using cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.