CN111813905A

CN111813905A - Corpus generation method and device, computer equipment and storage medium

Info

Publication number: CN111813905A
Application number: CN202010555008.XA
Authority: CN
Inventors: 黎旭东; 林桂
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-06-17
Filing date: 2020-06-17
Publication date: 2020-10-23
Anticipated expiration: 2040-06-17
Also published as: WO2021120588A1; CN111813905B

Abstract

The invention relates to the field of artificial intelligence and discloses a corpus generation method, a corpus generation device, a computer device and a storage medium. The method comprises: obtaining vaccine-related consultation texts and response texts from a medical inquiry library as initial texts; cleaning the initial texts to obtain original corpus data; clustering the original corpus data with a K-means clustering model to obtain at least two clusters of coarse-grained clustered corpora; and, for each cluster of coarse-grained clustered corpora, performing secondary clustering with a density clustering algorithm and taking the resulting density clustered corpora as the target corpora.

Description

Corpus generation method and device, computer equipment and storage medium

Technical Field

The invention relates to the field of artificial intelligence, in particular to a corpus generating method and device, computer equipment and a storage medium.

Background

As living standards improve, more people pay attention to their own health, and vaccine-related questions have become a prominent health topic. To relieve the pressure on hospital consultation windows, some hospitals have begun to adopt intelligent robot service systems that give consultants effective feedback through intelligent question-answering robots. Before use, such a robot must be trained on a large amount of corpus data from the relevant field to improve the accuracy of its answers.

At present, vaccine-related corpora are mainly crawled from relevant websites by web crawlers and then selected by regular-expression matching and keyword extraction. Corpora selected in this way are far from accurate enough for training a question-answering robot, so the robot's response accuracy is low and the user experience suffers. How to obtain training corpora of higher accuracy has therefore become an urgent problem.

Disclosure of Invention

The embodiment of the invention provides a corpus generating method, a corpus generating device, computer equipment and a storage medium, which are used for improving the accuracy of generating a corpus of a vaccine question-answering robot.

In order to solve the foregoing technical problem, an embodiment of the present application provides a corpus generating method, including:

acquiring a consultation text and a response text related to the vaccine from a medical inquiry library as initial texts;

carrying out data cleaning on the initial text to obtain original corpus data;

clustering the original corpus data by adopting a K-means clustering model to obtain at least two clusters of coarse-grained clustered corpuses;

and aiming at each cluster of coarse-grained clustered corpora, performing secondary clustering processing on the coarse-grained clustered corpora through a density clustering algorithm, and taking the obtained density clustered corpora as target corpora.

Optionally, the obtaining of the vaccine-related consultation text and the response text from the medical inquiry library as initial texts includes:

determining the page weight of each preset path in the medical inquiry library in a link analysis mode;

determining a target page according to the page weight of each preset path;

calculating a page ranking value of each target page based on a preset page ranking strategy, and sequencing the target pages according to the descending order of the page ranking values to obtain a target page queue;

and capturing the content in the target page based on the target page queue to obtain the vaccine-related consultation text and the response text.

Optionally, performing, for each cluster of coarse-grained clustered corpora, secondary clustering processing on the coarse-grained clustered corpora through a density clustering algorithm and taking the obtained density clustered corpora as target corpora includes:

acquiring a preset scanning radius eps and a preset minimum contained point number minPts;

for each corpus data in the coarse-grained clustered corpus, counting the number of other corpus data contained within the preset scanning radius eps of that corpus data, and taking the number as the number of neighborhood points corresponding to the corpus data;

using the corpus data with the neighborhood point number more than or equal to a preset minimum inclusion point number minPts as a core point;

using corpus data with the number of neighborhood points smaller than a preset minimum contained point number minPts and within a preset scanning radius eps of a core point as a boundary point;

and connecting boundary points with a distance not exceeding a preset scanning radius eps to form a density cluster, and adding core points in the range of the density cluster to obtain the target corpus.

Optionally, after clustering the original corpus data with the K-means clustering model to obtain at least two clusters of coarse-grained clustered corpora, and before performing, for each cluster of coarse-grained clustered corpora, secondary clustering on the coarse-grained clustered corpora through the density clustering algorithm and taking the obtained density clustered corpora as target corpora, the corpus generating method further includes:

setting a different category label for the coarse-grained clustered corpus of each cluster, and storing the coarse-grained clustered corpus of each cluster, the category labels, and the correspondence between them in an Elasticsearch engine.

Optionally, after performing secondary clustering processing on the coarse-grained clustered corpus by using a density clustering algorithm for each cluster of coarse-grained clustered corpuses and taking the obtained density clustered corpus as a target corpus, the corpus generating method further includes:

acquiring a preset threshold, and using an Elasticsearch engine to aggregate the target corpus according to the preset threshold to obtain a clustering result;

and screening out the irrelevant corpora according to the clustering result, and removing the irrelevant corpora to obtain the updated target corpora.

Optionally, after performing secondary clustering processing on the coarse-grained clustered corpus by using a density clustering algorithm for each cluster of coarse-grained clustered corpuses and taking the obtained density clustered corpus as a target corpus, the method further includes: and storing the target corpus in a block chain network node.

In order to solve the foregoing technical problem, an embodiment of the present application further provides a corpus generating device, including:

the data acquisition module is used for acquiring a consultation text and a response text related to the vaccine from the medical inquiry library as initial texts;

the data cleaning module is used for cleaning the data of the initial text to obtain original corpus data;

the coarse-grained clustering module is used for clustering the original corpus data by adopting a K-means clustering model to obtain at least two clusters of coarse-grained clustered corpuses;

and the corpus determining module is used for performing secondary clustering processing on the coarse-grained clustering corpus by a density clustering algorithm aiming at each cluster of coarse-grained clustering corpuses, and taking the obtained density clustering corpuses as target corpuses.

Optionally, the data obtaining module includes:

the link analysis unit is used for determining the page weight of each preset path in the medical inquiry library in a link analysis mode;

the target page determining unit is used for determining a target page according to the page weight of each preset path;

the page ordering unit is used for calculating the page ordering value of each target page based on a preset page ordering strategy, and ordering the target pages according to the descending order of the page ordering values to obtain a target page queue;

and the content acquisition unit is used for capturing the content in the target page based on the target page queue to obtain the vaccine-related consultation text and the response text.

Optionally, the corpus determining module includes:

a preset parameter obtaining unit, configured to obtain a preset scanning radius eps and a preset minimum inclusion point number minPts;

a neighborhood point number determining unit, configured to count, for each corpus data in the coarse-grained clustered corpus, the number of other corpus data contained within the preset scanning radius eps of that corpus data, and use the number as the number of neighborhood points corresponding to the corpus data;

a core point determining unit, configured to use corpus data in which the number of neighborhood points is greater than or equal to a preset minimum inclusion point number minPts as a core point;

a boundary point determining unit, configured to use corpus data, which is smaller than a preset minimum inclusion point minPts in the number of neighborhood points and is within a preset scanning radius eps of the core point, as a boundary point;

and the target corpus acquiring unit is used for interconnecting boundary points with the distance not exceeding the preset scanning radius eps to form a density clustering cluster, and adding core points in the range of the density clustering cluster into the density clustering cluster to obtain the target corpus.

Optionally, the corpus generating device further includes:

the first storage module is used for setting a different category label for the coarse-grained clustered corpus of each cluster, and storing the coarse-grained clustered corpus of each cluster, the category labels, and the correspondence between them in an Elasticsearch engine.

Optionally, the corpus generating device further includes:

the aggregation module is used for acquiring a preset threshold value, and aggregating the target corpus with an Elasticsearch engine according to the preset threshold value to obtain a clustering result;

and the updating module is used for screening out the irrelevant corpora according to the clustering result and removing the irrelevant corpora to obtain the updated target corpora.

Optionally, the corpus generating device further includes:

and the second storage module is used for storing the target corpus in the block chain network node.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the corpus generating method when executing the computer program.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the corpus generating method.

According to the corpus generating method, device, computer equipment and storage medium provided by the embodiment of the invention, consultation texts and response texts related to vaccines are obtained from a medical inquiry library and are used as initial texts, the initial texts are subjected to data cleaning to obtain original corpus data, a K-means clustering model is further adopted to perform clustering processing on the original corpus data to obtain coarse-grained clustered corpuses of at least two clusters, and for each cluster of coarse-grained clustered corpuses, the coarse-grained clustered corpuses are subjected to secondary clustering processing through a density clustering algorithm to realize multi-level clustering processing to obtain more accurate classification, and the obtained density clustered corpuses are used as target corpuses to ensure more accurate classification of the target corpuses and simultaneously improve the accuracy degree of the target corpuses for vaccine inquiry and response.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a corpus generation method of the present application;

FIG. 3 is a schematic block diagram of an embodiment of a corpus generating device according to the present application;

FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.

The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the corpus generating method provided in the embodiment of the present application is executed by a server, and accordingly, the corpus generating device is disposed in the server.

It should be understood that the number of terminal devices, networks, and servers in FIG. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.

Referring to fig. 2, fig. 2 shows a corpus generating method according to an embodiment of the present invention, which is described by taking the method applied to the server in fig. 1 as an example, and is detailed as follows:

S201: Acquiring a consultation text and a response text related to the vaccine from the medical inquiry library as initial texts.

Specifically, a medical inquiry library is queried by retrieving preset keywords, and a vaccine-related consultation text and a response text are obtained as initial texts.

The preset keywords may be words or phrases covering the vaccination time, vaccination schedule, precautions, suitable populations and the like of the vaccine.

The medical inquiry library is a resource library for storing information (text information and voice information) of vaccine-related questions consulted through a network or a telephone.

As a preferable mode, in order to facilitate the query, the voice information may be converted into text information through a third-party voice conversion text tool, and then the text information is stored in the medical inquiry library.

It should be noted that the medical inquiry library in this embodiment corresponds to a plurality of site pages, and the site pages provide query and reading of the record information of the medical inquiry.

Preferably, in the embodiment, a crawler mode is adopted, the consultation text and the response text related to the vaccine are quickly and accurately crawled from the site page of the medical inquiry library, the acquisition speed of the initial text is increased, and the generation efficiency of the training corpus is favorably improved.

S202: Carrying out data cleaning on the initial text to obtain original corpus data.

Specifically, the acquired initial text includes punctuation, text formatting, invalid expressions, pictures, and the like, so data cleaning is required before the data is processed.

Data cleaning includes, but is not limited to: removing punctuation and images, segmenting the text, extracting key sentences, and the like.
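As a minimal sketch of the cleaning operations listed above, the following Python function removes image markup and punctuation and splits the text into sentences. The regular expressions, the assumption that crawled text may contain HTML tags, and the function name clean_text are all illustrative assumptions; the embodiment does not specify concrete cleaning rules, and key-sentence extraction is not covered here.

```python
import re

# Hypothetical patterns: the embodiment does not give concrete cleaning rules.
IMAGE_TAG = re.compile(r"<img[^>]*>", re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")
SENTENCE_END = re.compile(r"[.!?。！？]+")
PUNCTUATION = re.compile(r"[^\w\s]")


def clean_text(raw):
    """Strip images and markup, split into sentences, and drop punctuation."""
    text = IMAGE_TAG.sub(" ", raw)   # remove embedded pictures first
    text = HTML_TAG.sub(" ", text)   # remove remaining formatting tags
    sentences = []
    for part in SENTENCE_END.split(text):
        part = PUNCTUATION.sub(" ", part)  # drop punctuation inside a sentence
        part = " ".join(part.split())      # collapse runs of whitespace
        if part:
            sentences.append(part)
    return sentences


if __name__ == "__main__":
    sample = '<p>When is the flu vaccine given? <img src="a.png"> Ask a doctor!</p>'
    print(clean_text(sample))
```

Each returned sentence could then be passed to the vectorization step that follows.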

Furthermore, the text after data cleaning is vectorized, and the obtained word vectors are taken as the original corpus data.

Specifically, the text after data cleaning is mapped into a vector, and the vectors are connected together to form a word vector space, wherein each vector corresponds to a point in the space.

For example, the product names of a certain automobile sales company contain two keywords, "BMW" and "Gallop", and all possible classifications of the two keywords are obtained according to a preset corpus: "car", "luxury", "animal", "action", and "food". Therefore, a vector representation is introduced for these two keywords:

<car, luxury, animal, action, food>

The probability that each keyword belongs to each classification is calculated by a statistical learning method, and the probabilities learned by the computer are as follows:

BMW = <0.5, 0.2, 0.2, 0.0, 0.1>

Gallop = <0.7, 0.2, 0.0, 0.1, 0.0>

It is understood that the value of each dimension of the base word vector represents a feature that has certain semantic and grammatical interpretations, and thus each dimension of the base word vector may be referred to as a keyword feature.
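To illustrate how such vectors can be compared, the sketch below computes the Euclidean distance and the cosine similarity between the two example vectors. The embodiment only states that clustering is distance-based; including cosine similarity here is an assumption made for comparison, and the helper names are hypothetical.

```python
import math

# Vectors copied from the example above; the dimension order is
# <car, luxury, animal, action, food>.
bmw = [0.5, 0.2, 0.2, 0.0, 0.1]
gallop = [0.7, 0.2, 0.0, 0.1, 0.0]


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


if __name__ == "__main__":
    print(round(euclidean(bmw, gallop), 3))  # 0.316
    print(round(cosine(bmw, gallop), 3))     # 0.91
```

A small distance (or a cosine close to 1) indicates that the two keywords have similar classification profiles.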

It should be noted that, in this embodiment, the unit represented by a word vector may specifically be a segmented word, a short sentence, or a question-and-answer sentence, which is not elaborated here.

S203: Clustering the original corpus data by adopting a K-means clustering model to obtain at least two clusters of coarse-grained clustered corpuses.

Specifically, a K-means clustering model is adopted to perform clustering processing on the original corpus data, and the original corpus data corresponding to each clustering center is used as a cluster of coarse-grained clustering corpuses to obtain at least two clusters of coarse-grained clustering corpuses.

The coarse-grained clustered corpora refer to clustered corpora of low precision: they share some common semantics, but their final meanings are not necessarily the same. For example, suppose two pieces of original corpus data are "I ate only a little at the meal and my belly is hungry" and "my belly hurts after I eat". After clustering by the K-means clustering model, the two pieces of corpus data are grouped into one cluster, so they belong to the coarse-grained clustered corpus. To ensure the accuracy of the classification, the coarse-grained clustered corpus data needs to be classified more finely in a subsequent step.

The K-means algorithm is a distance-based clustering algorithm that uses distance as the evaluation index of similarity, that is, the closer two objects are, the greater their similarity. The algorithm considers a cluster to be composed of closely spaced objects, and therefore takes compact, well-separated clusters as its final objective.
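The distance-based procedure described above can be sketched as the classic assignment/update loop (Lloyd's algorithm). This is an illustrative sketch only: seeding the centers with the first k points and the iteration cap are simplifying assumptions, the function name kmeans is hypothetical, and the embodiment does not state which K-means variant or library is used.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


if __name__ == "__main__":
    data = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
    labels, centers = kmeans(data, k=2)
    print(labels)  # [0, 1, 0, 0, 1, 1]
```

Each label value then identifies one cluster of coarse-grained clustered corpus that is passed on to the secondary clustering step.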

S204: For each cluster of coarse-grained clustered corpora, performing secondary clustering processing on the coarse-grained clustered corpora through a density clustering algorithm, and taking the obtained density clustered corpora as target corpora.

Specifically, vaccine question answering is highly specialized and requires training corpora with finer classification and higher accuracy, and the K-means algorithm alone cannot cluster every type of vaccine question cleanly. Therefore, the K-means clustering algorithm is first used to perform coarse-grained clustering on the original corpus. By adjusting the hyper-parameters of the algorithm during clustering, suitable text clusters can be obtained in which the texts share a certain similarity, so that each cluster roughly represents one class of vaccine questions; for example, different questions about the vaccination time of a certain vaccine may be concentrated in the same cluster. To further refine the classification and improve the accuracy of the corpus with respect to vaccine questions, this embodiment uses a density clustering algorithm to perform secondary clustering processing on the coarse-grained clustered corpus, and takes the obtained density clustered corpus as the target corpus.

Preferably, the density clustering algorithm adopted in this embodiment is DBSCAN, and specifically, the process of performing secondary clustering processing by using DBSCAN may refer to the description of the subsequent embodiments, and is not described herein again to avoid repetition.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the maximal set of density-connected points, can divide regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a spatial database containing noise.

In this embodiment, a consultation text and an answer text related to a vaccine are obtained from a medical inquiry library and are used as initial texts, data cleaning is performed on the initial texts to obtain original corpus data, a K-means clustering model is further adopted to perform clustering processing on the original corpus data to obtain coarse-grained clustered corpuses of at least two clusters, secondary clustering processing is performed on the coarse-grained clustered corpuses through a density clustering algorithm for each cluster of the coarse-grained clustered corpuses, multi-level clustering processing is achieved to obtain more accurate classification, the obtained density clustered corpuses are used as target corpuses, classification of the target corpuses is ensured to be more accurate, and meanwhile, the accuracy of the target corpuses for vaccine inquiry and answer is also improved.

In an embodiment, after the target corpora are obtained, each target corpus is stored in a blockchain network node, and data information is shared among different platforms through blockchain storage, so that data can be prevented from being tampered.

The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.

In some optional implementations of this embodiment, in step S201, acquiring the vaccine-related consultation text and response text from the medical inquiry library as initial texts includes:

determining the page weight of each preset path in the medical inquiry library by link analysis;

determining a target page according to the page weight of each preset path;

calculating a page ranking value of each target page based on a preset page ranking strategy, and sorting the target pages in descending order of page ranking value to obtain a target page queue;

and capturing the content of the target pages based on the target page queue to obtain the vaccine-related consultation text and response text.

Specifically, a plurality of preset paths are stored in the medical inquiry library in advance, each preset path stores one or more pages, and the corresponding information is obtained by crawling the page content. Before pages are crawled, link analysis is performed on the sites to be crawled to determine the weight of each site page, so that the target pages to be crawled can later be determined according to the weights. A reference weight is preset on the server side; when the calculated page weight is greater than the preset reference weight, the page is considered worth crawling and is determined to be a target page. The page ranking value of each target page is then calculated by the preset page ranking strategy, and the target pages are sorted in descending order of page ranking value to obtain the target page queue. Finally, the content of the target pages is crawled in the order of the pages in the target page queue to obtain the basic data contained in each target page and the user information corresponding to the basic data.

The link analysis refers to analyzing the basic features of the page corresponding to each preset path in the medical inquiry library. In this embodiment, the basic features selected for analysis include, but are not limited to: vaccine relevance, network topology, page content, and the like.

The network topology analysis comprises the analysis of data such as external links, layers and levels of the web pages.

The page content analysis comprises the analysis of content characteristic data such as appearance, text and the like of a webpage.

In the embodiment, three analysis results are obtained by analyzing the relevant vaccine text, the network topology and the webpage content, and the three analysis results are comprehensively evaluated to obtain the page weight of the website. The specific manner of the comprehensive evaluation may be realized by a preset weighting formula, or may be set according to actual needs, which is not limited herein.
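One possible form of the preset weighting formula is a weighted sum of the three analysis results, compared against the preset reference weight to select target pages. In the sketch below, the coefficients, the reference weight of 0.6, the assumption that each score is normalized to [0, 1], and all names are hypothetical; the embodiment leaves the formula open.

```python
from dataclasses import dataclass


@dataclass
class PageAnalysis:
    path: str
    vaccine_relevance: float  # each score is assumed to lie in [0, 1]
    topology: float
    content: float


# Hypothetical coefficients and threshold; not specified by the embodiment.
WEIGHTS = (0.5, 0.3, 0.2)
REFERENCE_WEIGHT = 0.6


def page_weight(page):
    """Combine the three analysis results into a single page weight."""
    w_rel, w_topo, w_content = WEIGHTS
    return (w_rel * page.vaccine_relevance
            + w_topo * page.topology
            + w_content * page.content)


def select_target_pages(pages):
    """Keep only pages whose weight exceeds the preset reference weight."""
    return [p for p in pages if page_weight(p) > REFERENCE_WEIGHT]


if __name__ == "__main__":
    pages = [
        PageAnalysis("/vaccine/faq", 0.9, 0.8, 0.7),
        PageAnalysis("/news/misc", 0.2, 0.5, 0.4),
    ]
    print([p.path for p in select_target_pages(pages)])
```

Pages that pass this filter are the target pages that are subsequently ranked and crawled.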

The preset page ranking policy includes, but is not limited to: PageRank strategy, Hilltop algorithm, link relation based ranking (TrustRank) algorithm, ExpertRank and the like.

The PageRank strategy, also called web page ranking or Google left-side ranking, is a technique that computes a rank from the hyperlinks between web pages. It is one of the elements of web page ranking, reflects the relevance and importance of web pages, and is a factor often used to evaluate page optimization in search engine optimization. In this embodiment, pages are sorted in descending order of PageRank value so that pages of higher importance are ranked first, and the information of the top-ranked pages is acquired preferentially when content is crawled later.
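As an illustration of how a PageRank value can be computed and used to order the target pages, the sketch below runs a simple power iteration over a small link graph and sorts the pages in descending order of rank. The damping factor of 0.85, the fixed iteration count, the requirement that every linked page also appears as a key of the graph, and the function names are assumptions; the embodiment does not describe its ranking computation at this level of detail.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Every linked page is assumed to appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


def target_page_queue(links):
    """Order pages by descending PageRank value."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(target_page_queue(graph))  # ['c', 'a', 'b', 'd']
```

Crawling would then proceed from the front of the returned queue.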

In the embodiment, the important information is preferentially crawled by constructing the page weight queue and then crawling according to the sequence in the page weight queue, so that the quality and the crawling efficiency of crawling content are improved.

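In some optional implementations of this embodiment, in step S204, performing, for each cluster of coarse-grained clustered corpora, secondary clustering processing on the coarse-grained clustered corpora through a density clustering algorithm and taking the obtained density clustered corpora as the target corpora includes:

acquiring a preset scanning radius eps and a preset minimum inclusion point number minPts;

for each corpus data in the coarse-grained clustered corpus, counting the number of other corpus data contained within the preset scanning radius eps of that corpus data, and taking the number as the number of neighborhood points corresponding to the corpus data;

using corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as core points;

using corpus data whose number of neighborhood points is smaller than the preset minimum inclusion point number minPts and which lies within the preset scanning radius eps of a core point as boundary points;

and connecting boundary points whose distance does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster to the density cluster to obtain the target corpus.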

Specifically, for each corpus data in the coarse-grained clustered corpus, the number of other corpus data contained within the preset scanning radius eps of that corpus data is counted and taken as the number of neighborhood points corresponding to the corpus data. Corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts is taken as a core point. Corpus data whose number of neighborhood points is smaller than minPts, but which lies within the preset scanning radius eps of a core point, is taken as a boundary point. Boundary points whose distance from one another does not exceed the preset scanning radius eps are connected to form a density cluster in the shape of a closed polygon, and the core points within the range of the density cluster are added to the density cluster to obtain the target corpus.

The preset scanning radius eps and the preset minimum inclusion point number minPts may be set according to actual needs and are not limited herein. For example, the preset scanning radius eps is set to 10 and the preset minimum inclusion point number minPts is set to 5.

It should be understood that boundary points whose distances do not exceed the preset scanning radius eps are connected to form a density cluster, and that one or more density clusters may finally be obtained. Each density cluster is a collection of the sub-questions of one category of vaccine question, and the categories and the number of sub-questions of a specific vaccine question depend on the content of the crawled initial text.

It should be noted that, in this embodiment, corpus data in the coarse-grained clustered corpus that is neither a core point nor a boundary point is treated as a noise point, and the noise points are removed, which improves the accuracy of the corpus.
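The core point, boundary point and noise point rules above can be sketched as follows. This is a minimal sketch under stated assumptions: corpus data is represented by hypothetical two-dimensional vectors, the distance is Euclidean, and the clusters are formed by the standard DBSCAN-style expansion from core points rather than by the closed-polygon construction described in this embodiment. The names `classify_points` and `density_clusters` and the toy values of eps and minPts are illustrative only.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'boundary' or 'noise'."""
    n = len(points)
    # Neighborhood of a point: the OTHER points within radius eps.
    neighbors = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("boundary")
        else:
            labels.append("noise")
    return labels, neighbors


def density_clusters(points, eps, min_pts):
    """Group core and boundary points into clusters; noise stays unassigned."""
    labels, neighbors = classify_points(points, eps, min_pts)
    cluster_of = {}
    next_id = 0
    for start in range(len(points)):
        if labels[start] != "core" or start in cluster_of:
            continue
        cluster_of[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if j in cluster_of:
                    continue
                cluster_of[j] = next_id
                # Only core points keep expanding the cluster.
                if labels[j] == "core":
                    stack.append(j)
        next_id += 1
    return labels, cluster_of


# Hypothetical sentence vectors from one coarse-grained cluster.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2.2, 1), (10, 10), (10, 11), (50, 50)]
labels, cluster_of = density_clusters(points, eps=1.5, min_pts=3)
target = [points[i] for i in sorted(cluster_of)]  # noise points are dropped
print(labels)
print(target)
```

In this toy data the first four points are core points, the fifth is a boundary point attached to the same cluster, and the last three are noise points that are removed from the target corpus.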

In this embodiment, the secondary clustering of the coarse-grained clustered corpus refines the classification of each type of vaccine question, which improves the accuracy of the training corpus. At the same time, noise points are filtered out, so that corpus data only weakly related to vaccine question answering does not interfere with subsequent vaccine question-answer training, which improves the accuracy of corpus generation.

In some optional implementations of the present embodiment, after step S203 and before step S204, the corpus generating method further includes:

setting a different category label for each cluster of coarse-grained clustered corpus, and storing the coarse-grained clustered corpus of each cluster, the category labels, and the correspondence between them in an Elasticsearch engine.

Specifically, a unique category label is set for the coarse-grained clustered corpus of each cluster, and the clustered corpus, the category labels, and the correspondence between them are stored in an Elasticsearch engine. The characteristics of the Elasticsearch engine allow this data to be stored and sorted quickly, so that the stored data and correspondences can later be extracted and aggregated quickly, which improves the efficiency of subsequent corpus screening.

The Elasticsearch engine works mainly as follows: a user first submits data to the Elasticsearch database; a tokenizer then splits each sentence into words, and the weights are stored together with the tokenization result; when the user searches the data, the results are scored and ranked according to the weights and then returned to the user in order of score.
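The submit, tokenize, weight and rank steps described above can be illustrated with a toy inverted index. This is not the Elasticsearch implementation or its API: the class `ToyIndex`, the term-frequency weighting and the ASCII-only tokenizer are assumptions made for the sketch, and Chinese text would need a real word segmenter.

```python
import re
from collections import Counter, defaultdict


class ToyIndex:
    """A toy inverted index: token -> {doc_id: weight}."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(dict)

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase ASCII words and digits only.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Store the document and the weight of each of its tokens."""
        tokens = self.tokenize(text)
        self.docs[doc_id] = text
        for token, count in Counter(tokens).items():
            # Assumed weight: relative term frequency within the document.
            self.postings[token][doc_id] = count / len(tokens)

    def search(self, query):
        """Score documents by summed token weight, highest score first."""
        scores = defaultdict(float)
        for token in self.tokenize(query):
            for doc_id, weight in self.postings.get(token, {}).items():
                scores[doc_id] += weight
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = ToyIndex()
index.add("q1", "Can children get the flu vaccine")
index.add("q2", "flu vaccine side effects after flu vaccine")
index.add("q3", "hospital opening hours")
results = index.search("flu vaccine")
print(results)
```

Documents that do not contain any query token, such as `q3` here, receive no score and are not returned.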

In this embodiment, a unique category label is set for the coarse-grained clustered corpus of each cluster, and the correspondence is established and stored in the Elasticsearch engine, which facilitates subsequent data fusion and the screening out of non-related corpus data through the Elasticsearch engine.

In some optional implementation manners of this embodiment, after step S204, the corpus generating method further includes:

Specifically, the Elasticsearch engine can be used to obtain texts with similar expressions. With a given search threshold, the aggregation function of the Elasticsearch engine retrieves, from the target corpus, the questions similar to a representative question. The target corpus is then screened again so that non-strongly related corpora are removed, which improves the quality of the corpus.

The given threshold is the preset threshold of this embodiment. It varies with the actual application scenario and may be set according to actual needs, for example to 0.6; it is not specifically limited herein.

The non-related corpora are the clusters or corpus data whose relevance is lower than the preset threshold after the target corpus has been aggregated by the Elasticsearch engine.
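The screening rule can be sketched as follows, assuming the aggregation step has already produced a relevance score for each corpus entry. The function name `screen_by_threshold`, the example questions and their scores are hypothetical; only the example threshold of 0.6 comes from the text above.

```python
def screen_by_threshold(relevance, threshold=0.6):
    """Split corpus entries into kept and removed by relevance score."""
    kept, removed = [], []
    for text, score in relevance.items():
        # Entries whose relevance is lower than the threshold are removed.
        (kept if score >= threshold else removed).append(text)
    return kept, removed


# Hypothetical relevance scores returned by the aggregation step.
relevance = {
    "Is the HPV vaccine safe for teenagers?": 0.91,
    "How many doses of the HPV vaccine are needed?": 0.74,
    "Where is the hospital parking lot?": 0.22,
}
updated_corpus, non_related = screen_by_threshold(relevance)
print(updated_corpus)
print(non_related)
```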

Optionally, in this embodiment, the distance between each non-strongly related corpus and every cluster center of the target corpus is calculated through a sentence similarity algorithm. If the distance of a non-strongly related corpus is smaller than a preset distance, that corpus is determined to be a weakly similar text, that is, a problem isolated point. The problem isolated point is treated as a category of problem on its own and added to the target corpus as a new corpus, which improves the target corpus's support for niche (rarely asked) vaccine questions.

The sentence similarity algorithm includes, but is not limited to: the brute-force (BF) algorithm, the RK (Rabin-Karp) algorithm, the KMP (Knuth-Morris-Pratt) algorithm, a string-correction similarity calculation method based on sound-shape codes, edit distance, and the like. The algorithm can be selected according to actual requirements and is not limited here.
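Of the algorithms listed, edit distance is the simplest to show in full. The following is a minimal sketch of the Levenshtein edit distance together with one possible way to turn it into a similarity in the range 0 to 1; the normalization by the longer string length and the function names are assumptions made for illustration, not part of this embodiment.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
print(similarity("flu vaccine", "flu vaccines"))
```

The distance between a candidate sentence and a cluster center sentence can then be compared against the preset distance to decide whether the candidate is a problem isolated point.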

In this embodiment, non-related corpora are removed through the Elasticsearch engine and the target corpus is updated, which keeps the target corpus concise and accurate and prevents low-relevance corpora from reducing the accuracy of subsequent vaccine question-answer training. Meanwhile, isolated problems are each treated as a category of their own and added to the target corpus, which improves its support for niche vaccine questions.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

Fig. 3 is a schematic block diagram of a corpus generating device corresponding to the corpus generating method according to the above embodiment. As shown in fig. 3, the corpus generating device includes a data acquiring module 31, a data cleaning module 32, a coarse-grained clustering module 33, and a corpus determining module 34. The functional modules are explained in detail as follows:

the data acquisition module 31 is configured to acquire a vaccine-related consultation text and a response text from the medical inquiry library as initial texts;

the data cleaning module 32 is configured to perform data cleaning on the initial text to obtain original corpus data;

the coarse-grained clustering module 33 is used for clustering the original corpus data by adopting a K-means clustering model to obtain at least two clusters of coarse-grained clustered corpuses;

and the corpus determining module 34 is configured to perform secondary clustering processing on the coarse-grained clustered corpus by using a density clustering algorithm for each cluster of coarse-grained clustered corpuses, and use the obtained density clustered corpuses as target corpuses.

Optionally, the data obtaining module 31 includes:

and the content acquisition unit is used for capturing the content in the target page based on the target page queue to obtain the consultation text and the response text related to the vaccine.

Optionally, the corpus determining module 34 includes:

a neighborhood point number determining unit, configured to count, for each corpus data in the coarse-grained clustered corpus, the number of other corpus data within a preset scanning radius eps of that corpus data, and use the number as the number of neighborhood points corresponding to the corpus data;

a core point determining unit, configured to take corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as a core point;

Optionally, the corpus generating device further includes:

the aggregation module is used for acquiring a preset threshold, and aggregating the target corpus by adopting an Elasticsearch engine according to the preset threshold to obtain a clustering result;

and the updating module is used for screening the non-related corpora according to the clustering result and removing the non-related corpora to obtain the updated target corpora.

Optionally, the corpus generating device further includes:

For the specific limitations of the corpus generating device, reference may be made to the above limitations on the corpus generating method, which are not repeated here. All or part of the modules in the corpus generating device can be realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the components memory 41, processor 42 and network interface 43 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as program codes for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing control of an electronic file.

The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.

The present application further provides another embodiment, which is to provide a computer-readable storage medium, wherein the computer-readable storage medium stores an interface display program, and the interface display program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the corpus generating method as described above.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application can be embodied in many different forms, and these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of their technical features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are within the protection scope of the present application.

Claims

1. A corpus generation method is applied to training corpus generation of a vaccine question-answering robot and is characterized by comprising the following steps:

carrying out data cleaning on the initial text to obtain original corpus data;

2. The corpus generating method according to claim 1, wherein said obtaining vaccine-related advisory text and response text from a medical consulting repository as initial text comprises:

determining a target page according to the page weight of each preset path;

3. The corpus generation method according to claim 1, wherein said performing a secondary clustering process on each cluster of coarse-grained clustered corpus by using a density clustering algorithm, and using the obtained density clustered corpus as a target corpus comprises:

using corpus data with the number of neighborhood points smaller than a preset minimum contained point number minPts and within a preset scanning radius eps of any core point as a boundary point;

4. The corpus generating method according to any one of claims 1 to 3, wherein after said clustering process is performed on said original corpus data by using a K-means clustering model to obtain at least two clusters of coarse-grained clustered corpuses, and before said performing a secondary clustering process on said coarse-grained clustered corpuses by using a density clustering algorithm for each cluster of coarse-grained clustered corpuses and using the obtained density clustered corpuses as a target corpus, said corpus generating method further comprises:

5. The corpus generating method according to claim 4, wherein after performing secondary clustering processing on the coarse-grained clustered corpus by using a density clustering algorithm for each cluster of coarse-grained clustered corpuses and using the obtained density clustered corpus as a target corpus, the corpus generating method further comprises:

6. The corpus generating method according to claim 1, wherein after performing secondary clustering processing on the coarse-grained clustered corpus by using a density clustering algorithm for each cluster of coarse-grained clustered corpuses and taking the obtained density clustered corpuses as target corpuses, the method further comprises: storing the target corpus in a blockchain network node.

7. A corpus generating device is applied to training corpus generation of a vaccine question-answering robot, and is characterized by comprising:

8. The corpus generation apparatus of claim 7, wherein said corpus determining module comprises:

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the corpus generation method according to any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the corpus generating method according to any one of claims 1 to 6.