CN118312816B

CN118312816B - Cluster weighted clustering integrated medical text processing method and system based on member selection

Info

Publication number: CN118312816B
Application number: CN202410553936.0A
Authority: CN
Inventors: 徐秀芳; 高婷; 徐森; 郭乃瑄; 许贺洋; 卞学胜; 花小朋; 陈博炜; 王志漩; 刘轩绮; 孙雯; 徐畅
Original assignee: Yancheng Institute of Technology
Current assignee: Yancheng Institute of Technology
Priority date: 2023-09-08
Filing date: 2024-05-07
Publication date: 2025-04-08
Anticipated expiration: 2044-05-07
Also published as: CN117195027A; CN118312816A

Abstract

The invention provides a cluster weighted clustering integrated medical data processing method and system based on member selection. The method comprises: constructing a cluster member set and inputting it into a pre-trained decision tree model; screening out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the screened cluster members; determining a target CA matrix of the target cluster set according to the cluster layer weighting coefficient; and executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result. The method and system can obtain better-optimized clustering results that better reveal the characteristics and similarity of patients' diseases, making it easier for doctors to select medical services for different patients or customize personalized treatment schemes, and helping to improve patients' medical experience and treatment effect.

Description

Cluster weighted clustering integrated medical text processing method and system based on member selection

Technical Field

The invention relates to the technical field of data processing, in particular to a cluster weighted clustering integrated medical data processing method and system based on member selection.

Background

Medical diagnostics are often faced with a large number of complex cases and clinical data, and doctors need to quickly and accurately classify patients in order to formulate personalized treatment regimens. However, rapid classification and prediction of large-scale data is challenging due to the complexity and diversity of the disease, and furthermore, different patients may exhibit different symptoms and characteristics, which are also factors of concern for classification.

Cluster analysis is one of the hot spots of machine learning research. It is widely used for data compression, information retrieval, image segmentation and text clustering, and is receiving more and more attention in fields such as biology, geology, geography and abnormal data detection. Cluster analysis is a form of unsupervised machine learning: without prior knowledge of the data set, it automatically divides the data set into several groups or clusters solely according to a similarity measure among the data points (samples, objects), so that the similarity among points belonging to the same cluster is as high as possible and the similarity among points belonging to different clusters is as low as possible. Introducing the idea of ensemble learning into cluster analysis gave rise to research on clustering integration. Clustering integration consists mainly of two steps. In the first, cluster member generation, a data set is taken as input, a clustering algorithm is run, and several different clustering results are output. In the second, clustering integration proper (that is, consensus function design), the set formed by all cluster members, called the cluster set, is taken as input, and the cluster members are combined to output a final clustering result. Clustering integration algorithms are generally used for classifying medical data.

However, most existing clustering integration algorithms treat every cluster member equally and pay no attention to the quality differences among cluster members. Even where some algorithms propose to treat cluster members differently and evaluate them, they ignore the local diversity of the clusters within the same cluster member and directly regard the cluster members as independent individuals, so that low-quality clusters or low-quality base clusterings can appear.

Disclosure of Invention

The invention aims to provide a cluster weighted clustering integrated medical data processing method and system based on member selection, which can obtain more optimized clustering results, can better show the characteristics and similarity of diseases of patients, is convenient for doctors to select medical services for different patients or customize personalized treatment schemes, and is beneficial to improving the medical experience and treatment effect of patients.

The embodiment of the invention provides a cluster weighted clustering integrated medical data processing method based on member selection, which comprises the following steps:

Constructing a cluster member set and inputting the cluster member set into a pre-trained decision tree model;

Screening out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the screened cluster members;

determining a target CA matrix of a target cluster set according to the cluster layer weighting coefficient;

and executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.

Preferably, the construction of the cluster member set comprises clustering medical data by adopting a K-Means algorithm to generate a plurality of cluster members.

Preferably, the training steps of the decision tree model are as follows:

calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;

Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, and marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label;

training is carried out based on sample cluster members with labels as a training set, and a trained decision tree model is obtained.
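The labeling rule in the training steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `davies_bouldin` and `label_members` are hypothetical, the sample points are invented, and a member whose index exactly equals the average is given the second label, a case the text does not specify.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index of one cluster member (lower is better)."""
    clusters = sorted(set(labels))
    groups = {c: [p for p, l in zip(points, labels) if l == c] for c in clusters}
    # Centroid of each cluster.
    cent = {c: tuple(sum(x) / len(g) for x in zip(*g)) for c, g in groups.items()}
    # Scatter: mean distance of a cluster's points to its centroid.
    scat = {c: sum(math.dist(p, cent[c]) for p in g) / len(g) for c, g in groups.items()}
    total = 0.0
    for i in clusters:
        # Worst-case ratio of combined scatter to centroid separation.
        total += max((scat[i] + scat[j]) / math.dist(cent[i], cent[j])
                     for j in clusters if j != i)
    return total / len(clusters)

def label_members(points, members):
    """Label each member "high" if its index is below the average, else "low"."""
    scores = [davies_bouldin(points, m) for m in members]
    mean = sum(scores) / len(scores)
    # Assumption: a score equal to the mean is labeled "low".
    return ["high" if s < mean else "low" for s in scores]

# Invented toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
members = [
    [0, 0, 0, 1, 1, 1],   # matches the two groups
    [0, 1, 0, 1, 0, 1],   # mixes the two groups
]
training_labels = label_members(points, members)
```

The resulting labels, paired with each member's feature attributes, would form the training set for the decision tree.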

Preferably, determining the target CA matrix of the target cluster according to the cluster layer weighting coefficient includes:

Constructing a CA matrix about the target cluster;

weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;

capturing high confidence information from the CA matrix to obtain an HC matrix;

and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.

Preferably, constructing a CA matrix for the target clusters includes:

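$$A_{ij} = \frac{1}{M}\sum_{m=1}^{M}\delta_{ij}^{m},\qquad \delta_{ij}^{m} = \begin{cases}1, & Cls^{m}(x_i) = Cls^{m}(x_j)\\ 0, & \text{otherwise}\end{cases}$$

Wherein A is the CA matrix; $\pi^{m}$ represents the m-th cluster member; M represents the total number of cluster members in the cluster set; $Cls^{m}(x_i)$ represents the cluster in which sample point $x_i$ is located in $\pi^{m}$; and $Cls^{m}(x_j)$ represents the cluster in which sample point $x_j$ is located in $\pi^{m}$.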
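As an illustration of the CA (co-association) matrix, the sketch below counts, for every pair of sample points, the fraction of cluster members that place both points in the same cluster. The function name `co_association` and the toy label lists are assumptions made for this example.

```python
def co_association(members):
    """Return the CA matrix: entry (i, j) is the fraction of cluster
    members in which sample points i and j share a cluster."""
    n = len(members[0])   # number of sample points
    m = len(members)      # number of cluster members
    return [[sum(lab[i] == lab[j] for lab in members) / m for j in range(n)]
            for i in range(n)]

# Three toy cluster members over four sample points (invented labels).
members = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
A = co_association(members)
```

Points 0 and 1 are grouped together by all three members, so their entry is 1; points 0 and 3 are never grouped together, so their entry is 0.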

The invention also provides a cluster weighted clustering integrated medical data processing system based on member selection, which comprises a construction unit, an input unit, a screening unit, a matrix determining unit and an executing unit;

The construction unit constructs a cluster member set; the input unit inputs the cluster member set into a pre-trained decision tree model; the screening unit screens out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label and generates a target cluster set from the screened cluster members; the matrix determining unit determines a target CA matrix of the target cluster set according to the cluster layer weighting coefficient; and the executing unit executes a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.

Preferably, the construction unit clusters the medical data by using a K-Means algorithm to generate a plurality of cluster members.

Preferably, the training steps of the decision tree model are as follows:

Preferably, the matrix determining unit performs the following operations:

Constructing a CA matrix about the target cluster;

Preferably, the matrix determining unit constructs a CA matrix for the target cluster group, performing the following operations:

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a schematic diagram of a cluster weighted clustering integrated medical data processing method based on member selection in an embodiment of the invention;

FIG. 2 is a flow diagram of building a cluster member set according to one embodiment of the invention;

FIG. 3 is a flow chart of a cluster weighted clustering integration method based on member selection in accordance with yet another embodiment of the invention;

FIG. 4 is a flow chart of training a decision tree according to one embodiment of the invention;

FIG. 5 is a schematic diagram of a trained decision tree model according to one embodiment of the invention;

FIG. 6 is a schematic diagram of a cluster weighted clustering integrated medical data processing system based on member selection in an embodiment of the invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

The embodiment of the invention provides a cluster weighted clustering integrated medical data processing method based on member selection, which is shown in figure 1 and comprises the following steps:

Wherein the preset label is the label marked "high", and the hierarchical clustering algorithm includes the average-linkage method;

The construction of the cluster member set comprises clustering the medical data with the K-Means algorithm to generate a plurality of cluster members. The number of cluster members r and the number of clusters k are set, where r ∈ N+ and k is drawn from a range whose lower bound is 2 and whose upper bound depends on N, the number of data sample points. The K-Means algorithm is then used to randomly generate r cluster members as the cluster set P, which yields the cluster member set. Specifically:

Obtaining the number r of cluster members and the number k of clusters;

initializing i=1;

judging whether i is less than or equal to r;

when the i is determined to be less than or equal to r, clustering by using a K-Means algorithm to generate cluster members, and obtaining a clustering result;

assigning i = i + 1 and repeating the judgment; once i is greater than r, the cluster member set is constructed.

In addition, the cluster member set can be constructed by using a hierarchical clustering method, a spectral clustering method and the like.
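The loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: the K-Means routine is a plain Lloyd iteration written for the example, the upper bound `k_max` on the number of clusters is a hypothetical parameter, and the sample points are invented.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-Means; returns one cluster label per point."""
    centers = rng.sample(points, k)   # k data points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            assigned = [p for p, l in zip(points, labels) if l == c]
            if assigned:   # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(assigned) for x in zip(*assigned))
    return labels

def build_cluster_member_set(points, r, k_max, seed=0):
    """Generate r cluster members, each with a randomly chosen k in [2, k_max]."""
    rng = random.Random(seed)
    members = []
    i = 1
    while i <= r:                      # judge whether i <= r
        k = rng.randint(2, k_max)      # assumed upper bound k_max
        members.append(kmeans(points, k, rng))
        i += 1                         # i = i + 1
    return members

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
P = build_cluster_member_set(points, r=4, k_max=3)
```

Varying k and the random initial centers between runs is what gives the cluster members their diversity.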

As shown in fig. 2 to 5, the training steps of the decision tree model are as follows:

Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label, wherein the first label is 'high', and the second label is 'low';

Training is carried out with the labeled sample cluster members as the training set to obtain a trained decision tree model. The ARI (adjusted Rand index), NMI (normalized mutual information) and F (Fowlkes-Mallows index) of each cluster member are taken as the feature attribute set. All three feature attributes are better the closer they are to 1; the value range of the ARI is [-1, 1], and the value ranges of the NMI and the F are [0, 1]. The importance of each feature attribute is calculated using the Gini coefficient to determine the root node and the internal nodes of the decision tree, as follows:

$$Gini(D) = 1 - \sum_{i} p_i^{2} \qquad (1.1)$$

$$Gini(D, a) = \sum_{v} \frac{|D^{v}|}{|D|}\, Gini(D^{v}) \qquad (1.2)$$

Wherein $p_i$ represents the probability that a sample point belongs to class i, a and v denote a feature attribute and a state of that feature attribute, respectively, and $D^{v}$ is the subset of D in which a takes the state v. The larger the Gini coefficient, the greater the uncertainty of the feature attribute.

First, the Gini coefficients of the labeled cluster members with respect to the three feature attributes ARI, NMI and F are obtained with formulas (1.1) and (1.2) and compared. The feature attribute with the smallest Gini coefficient is selected as the root node ("feature attribute 1"), and cluster members whose value of "feature attribute 1" is close to 1 receive the label "high". Then the labeled cluster members whose value of "feature attribute 1" is not close to 1 form a new labeled set, the Gini coefficients of the remaining two feature attributes are calculated on it, and the one with the smaller Gini coefficient is selected as an internal node ("feature attribute 2"). Finally, the remaining feature attribute becomes the last internal node, "feature attribute 3".

For the purpose of splitting, the value range [-1, 1] of the ARI is divided into two classes, greater than 0 and less than or equal to 0; the value range [0, 1] of the NMI and F-measure indexes is likewise divided into two classes, greater than 0.5 and less than or equal to 0.5.

The attributes of the root node and the internal nodes of the decision tree are measured using the Gini coefficient;

For example, the Gini coefficient of the ARI attribute is determined as follows:

according to the following formula:

$$Gini(D) = 1 - \sum_{i} p_i^{2}$$

first, the Gini coefficient when ARI is greater than 0 is calculated as:

$$Gini(D^{1}) = 1 - p_{1,high}^{2} - p_{1,low}^{2}$$

Here, $Gini(D^{1})$ represents the Gini impurity of the data subset $D^{1}$; $D^{1}$ represents the data subset obtained when the feature attribute ARI is greater than 0; $p_{1,high}$ represents the number of samples whose category label is "high" divided by the number of samples when ARI is greater than 0; $p_{1,low}$ represents the number of samples whose category label is "low" divided by the number of samples when ARI is greater than 0;

next, the Gini coefficient when ARI is less than or equal to 0 is calculated as:

$$Gini(D^{2}) = 1 - p_{2,high}^{2} - p_{2,low}^{2}$$

Here, $Gini(D^{2})$ represents the Gini impurity of the data subset $D^{2}$; $D^{2}$ represents the data subset obtained when the feature attribute ARI is less than or equal to 0; $p_{2,high}$ represents the number of samples whose category label is "high" divided by the number of samples when ARI is less than or equal to 0; $p_{2,low}$ represents the number of samples whose category label is "low" divided by the number of samples when ARI is less than or equal to 0;

finally, according to the following formula:

$$Gini(D, a) = \sum_{v} \frac{|D^{v}|}{|D|}\, Gini(D^{v})$$

the Gini coefficient of ARI is calculated as follows:

$$Gini(D, ARI) = \frac{|D^{1}|}{|D|}\, Gini(D^{1}) + \frac{|D^{2}|}{|D|}\, Gini(D^{2})$$

Here, $Gini(D, ARI)$ is the Gini coefficient of the data set D for the feature attribute ARI; $|D^{1}|$ indicates the number of samples when ARI is greater than 0, $|D^{2}|$ the number of samples when ARI is less than or equal to 0, and $|D|$ the number of samples of the data set D.
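The calculation for the ARI attribute can be worked through numerically. The sketch below is illustrative only: the helper names `gini` and `gini_split` and the six sample values are invented, and the split threshold of 0 follows the two ARI classes described in the text.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(values, labels, threshold):
    """Weighted Gini impurity after splitting on value > threshold."""
    upper = [l for v, l in zip(values, labels) if v > threshold]
    lower = [l for v, l in zip(values, labels) if v <= threshold]
    n = len(labels)
    # Each non-empty subset contributes its impurity weighted by its share of samples.
    return sum(len(s) / n * gini(s) for s in (upper, lower) if s)

# Invented ARI values and quality labels for six cluster members.
ari = [0.8, 0.6, 0.3, -0.1, -0.4, 0.2]
quality = ["high", "high", "high", "low", "low", "low"]
g_ari = gini_split(ari, quality, threshold=0.0)
```

Here four members have ARI greater than 0 (three "high", one "low", impurity 0.375) and two have ARI at most 0 (both "low", impurity 0), so the weighted value is 4/6 × 0.375 = 0.25. Repeating the calculation for NMI and F with a threshold of 0.5 and taking the smallest value would identify the root node.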

When the decision tree is constructed, the value range of each feature attribute is divided equally into two intervals, giving two branches.

The decision tree selects the best feature attribute as the root node according to the Gini coefficient, and continues to select the feature attributes of higher importance in the subsequent internal node divisions. After training is completed, the decision tree model can be used for classification prediction of new cluster members.

When the decision tree is constructed, the value range of a feature attribute can also be subdivided into more than two intervals, that is, multiple branches. This increases the branches of the decision tree and the number of decision classes, and further improves the classification precision.

In this way, the clustering algorithm and the decision tree are effectively combined to obtain the decision tree model, on the basis of which new cluster members can be classified and predicted.

The target cluster set is a set constructed from high-quality cluster members. When the cluster member set is processed and identified with the decision tree model, the ARI, NMI and F-measure indexes in the feature attribute set of each cluster member are obtained, and the decision tree model predicts the cluster quality label ("high" or "low") according to the learned node division rules. The cluster members labeled "high" form a new cluster set P', which participates in the subsequent processing; this is the generation of the target cluster set.
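The screening step can be sketched as a simple filter. In the sketch below, `toy_tree` is a hypothetical stand-in for the trained decision tree: it assumes the node order ARI, NMI, F and the split points 0 and 0.5 mentioned in the text, whereas a real model would learn its own structure from the training set. The member names and feature values are invented.

```python
def select_high_quality(members, feature_sets, predict):
    """Keep the cluster members whose predicted quality label is "high"."""
    return [m for m, f in zip(members, feature_sets) if predict(f) == "high"]

def toy_tree(f):
    # Hypothetical stand-in for the trained model: each node sends values in
    # the upper interval to "high" and passes the rest to the next node.
    if f["ARI"] > 0.0:
        return "high"
    if f["NMI"] > 0.5:
        return "high"
    return "high" if f["F"] > 0.5 else "low"

members = ["pi_1", "pi_2", "pi_3"]   # placeholders for cluster members
feature_sets = [
    {"ARI": 0.7, "NMI": 0.8, "F": 0.9},
    {"ARI": -0.2, "NMI": 0.3, "F": 0.4},
    {"ARI": -0.1, "NMI": 0.8, "F": 0.6},
]
P_prime = select_high_quality(members, feature_sets, toy_tree)
```

Passing the classifier in as the `predict` argument keeps the filter independent of how the tree was trained.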

The method for determining the target CA matrix of the target cluster group according to the cluster layer weighting coefficient comprises the following steps:

Constructing a CA matrix about the target cluster;

Constructing a CA matrix for a target cluster set, comprising:

Weighting the CA matrix based on the cluster layer weighting coefficient to obtain the processed data B, as follows:

Capturing high-confidence information from the CA matrix, and using the captured high-confidence information to supplement and refine the CA matrix into an ideal CA matrix; the high-confidence (HC) matrix is denoted H and is obtained as follows:

Here, the formula uses a term that records the positions of the highly reliable elements and an element-wise operator. When the ratio of the number of times two sample points are assigned to the same cluster to the total number of cluster members exceeds a predefined threshold, the corresponding position of the A matrix is regarded as a piece of highly reliable information (that is, one element of the H matrix).
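The thresholding rule described above can be sketched as follows. This is one plausible reading, not the formula from the source: it assumes that H keeps the value of A at every position whose entry exceeds the threshold and is zero elsewhere, which corresponds to an element-wise product of A with a 0/1 position mask. The function name `high_confidence`, the parameter `theta` and the sample matrix are hypothetical.

```python
def high_confidence(A, theta):
    """Keep entries of the CA matrix A that exceed theta; zero out the rest.

    Assumption: H is the element-wise product of A with a 0/1 mask that
    records the positions of the highly reliable entries.
    """
    mask = [[1.0 if a > theta else 0.0 for a in row] for row in A]
    return [[m * a for m, a in zip(mrow, arow)] for mrow, arow in zip(mask, A)]

# Invented CA matrix for three sample points.
A = [
    [1.0, 1.0, 0.4],
    [1.0, 1.0, 0.2],
    [0.4, 0.2, 1.0],
]
H = high_confidence(A, theta=0.7)
```

With a threshold of 0.7 only the pair (0, 1) and the diagonal survive; the weaker co-associations with point 2 are dropped.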

Determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix, wherein the target CA matrix comprises the following components:

The final CA matrix, denoted C, is obtained from the H matrix and the B matrix as follows:

Here, the formula involves a Laplacian matrix, a Lagrangian term whose default value is 1, a parameter that balances the error loss term, two Lagrange multipliers, and the Frobenius norm of a matrix; E represents the error term, and F is an intermediate matrix used to relax the value-range constraint and the symmetry constraint of the ideal CA matrix.

Information entropy is introduced to the target clusters, and an uncertainty index IEI of each cluster is calculated and used as a cluster layer weighting coefficient. The IEI calculation method is as follows:

Here, the formula uses the total number of clusters and a measure of the similarity between two clusters. The IEI index reflects the likelihood that the points in a cluster remain in the same cluster in the other base clusterings. H is the HC (high-confidence) matrix obtained by capturing high-confidence information from the CA matrix; the formula also uses the maximum value in the H matrix, the minimum value in the H matrix, and the value in the H matrix corresponding to the cluster;

or KL divergence is introduced to determine a cluster layer weighting coefficient;

or a plurality of evaluation indexes are introduced and fused together to form a new evaluation index to determine the cluster layer weighting coefficient.

The application introduces a decision tree model, trained for this purpose, to assist in selecting high-quality cluster members, so that the quality and the diversity of the cluster members are considered from multiple angles; at the same time, the internal diversity of the cluster members is considered, so that differentiated treatment of the cluster members is realized more comprehensively. Finally, the CA matrix is fine-tuned according to the weights and the high-confidence information to improve the accuracy and robustness of the clustering.

The cluster weighted clustering integrated medical data processing method based on member selection uses a clustering algorithm to cluster patient data to generate a plurality of cluster members. Each cluster member divides the patient into different clusters, each cluster representing a patient population with similar symptoms and features. Next, high quality cluster members are selected with the aid of a decision tree model. In order to further improve accuracy and interpretation of the clustering result, attention is given to the internal diversity of the clustering members, the diversity of the internal clusters is measured, fine adjustment is carried out on the CA matrix, and finally, hierarchical clustering analysis is used for obtaining the final clustering result. By the method, more optimized clustering results are obtained, the characteristics and the similarity of the diseases of the patients can be better displayed, doctors can conveniently select medical services for different patients or customize personalized treatment schemes, and the medical experience and treatment effect of the patients can be improved.
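The last step of the pipeline summarized above, hierarchical clustering on a CA matrix, can be sketched with the average-linkage method. This is a minimal sketch, assuming the CA matrix entries are used directly as similarities and that the desired number of final clusters is given; the function name `average_linkage` and the sample matrix are invented.

```python
def average_linkage(A, n_clusters):
    """Agglomerative clustering on a similarity matrix A with average linkage.

    Repeatedly merges the two groups with the highest mean pairwise
    similarity until n_clusters groups remain; returns one label per point.
    """
    clusters = [[i] for i in range(len(A))]

    def sim(c1, c2):
        # Average similarity over all cross pairs of the two groups.
        return sum(A[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = max(pairs, key=lambda xy: sim(clusters[xy[0]], clusters[xy[1]]))
        clusters[a] += clusters[b]   # merge group b into group a
        del clusters[b]              # b > a, so index a stays valid

    labels = [0] * len(A)
    for idx, group in enumerate(clusters):
        for i in group:
            labels[i] = idx
    return labels

# Invented target CA matrix for four patients.
C = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
final_labels = average_linkage(C, n_clusters=2)
```

In this toy matrix patients 0 and 1 co-occur often, as do patients 2 and 3, so they end up in two separate final clusters.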

In one embodiment, the processing method further comprises evaluating the final clustering result based on the external index, determining an evaluation value, and judging the validity of the final clustering result based on the evaluation value. The cluster result validity evaluation index is generally classified into an internal index and an external index. In most cases, class labels of the data set are known (not used in the clustering process), and an external index can be used to evaluate the effectiveness of clustering, where an F-measure is a relatively common comprehensive index for evaluating the quality of text clustering. The larger the F value is, the higher the clustering quality is, and when the clustering result is completely consistent with the real category, the F value reaches the maximum value, and the value is 1. In addition, NMI value is also a popular clustering result effectiveness evaluation index, and can quantify the matching degree of the clustering result and the real text category label.
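As an illustration of an external index, the sketch below computes NMI between known class labels and a clustering result. NMI has several normalizations; this example assumes normalization by the geometric mean of the two entropies, which the source does not specify. The function name `nmi` and the label lists are invented.

```python
import math
from collections import Counter

def nmi(true_labels, pred_labels):
    """Normalized mutual information, normalized by sqrt(H(true) * H(pred))."""
    n = len(true_labels)
    ct, cp = Counter(true_labels), Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    # Mutual information from the joint and marginal label counts.
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    ht = -sum(c / n * math.log(c / n) for c in ct.values())
    hp = -sum(c / n * math.log(c / n) for c in cp.values())
    if ht == 0 or hp == 0:
        # Degenerate case: a single class or a single cluster.
        return 1.0 if ht == hp else 0.0
    return mi / math.sqrt(ht * hp)

truth = [0, 0, 1, 1]
perfect = nmi(truth, [1, 1, 0, 0])     # same partition, different label names
unrelated = nmi(truth, [0, 1, 0, 1])   # partition independent of the truth
```

The first result is 1 because NMI ignores how the clusters are named; the second is 0 because the clustering carries no information about the true classes.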

In one embodiment, computing the Gini coefficients of the feature attribute set with respect to the ARI, NMI and F-measure indexes includes:

$$Gini(D) = 1 - \sum_{i} p_i^{2}$$

$$Gini(D, a) = \sum_{v} \frac{|D^{v}|}{|D|}\, Gini(D^{v})$$

Wherein $Gini(D)$ represents the Gini impurity of the data set D; $p_i$ represents the proportion of class label i in the data set D, that is, the number of samples belonging to class i divided by the total number of samples; $Gini(D, a)$ represents the Gini impurity of the data set D under the feature attribute a; $|D^{v}|$ represents the number of samples of the data subset obtained when the value of the feature attribute a is v; and $Gini(D^{v})$ represents the Gini impurity of the data subset $D^{v}$. Based on these formulas, the Gini coefficients of the labeled cluster members with respect to the ARI, NMI and F-measure indexes are obtained. This makes the calculation of each Gini coefficient straightforward and accurate, improves the accuracy of comparing the three Gini coefficients, and thus allows the root node and the internal nodes to be divided accurately.

In one embodiment, before the cluster member set is constructed, the method further processes the medical data as follows in order to mark data that interferes with the clustering result. The processing steps are:

Analyzing feedback data corresponding to the clustering result to determine a marking node;

outputting the marked nodes and the related data and training data of the decision tree model when the number of the marked nodes is more than or equal to the preset number (for example, any one value from 2 to 100);

When the number of the marking nodes is smaller than the preset number, screening and marking the medical data based on each marking node;

And when the clustering result is output, the marked medical data synchronously output corresponding marking information.

The feedback data acquisition steps are as follows:

the annotation information received from a preset doctor terminal after the clustering result of the patient is sent to that doctor terminal;

and/or,

after the clustering result of the patient and the treatment scheme of the patient are sent to the patient terminal, when an in-doubt correction instruction for the data is received from the patient, the clustering result, the treatment scheme and the doubt information are sent to a preset professional doctor terminal, and the annotation information returned by the professional doctor terminal is then received;

Annotation information, clustering results, treatment schemes and/or doubtful information are used as feedback data.

The analysis steps for the feedback data are as follows:

Screening the feedback data: keywords are extracted from the annotation information and doubt information in the feedback data based on a preset keyword library to obtain the keyword set corresponding to each piece of feedback data, and feedback data from which no keyword is extracted is deleted. Because some feedback data contains both annotation information and doubt information, the keywords extracted from the two may interfere with each other; interference monitoring is therefore carried out on the keywords in the keyword set according to a preset keyword interference association library, and the feedback data corresponding to any keyword set in which interference is detected is deleted;

Acquiring original medical data corresponding to a clustering result corresponding to the feedback data after screening and characteristic data during classification;

Associating the original medical data and the classification characteristic data with feedback data to form data to be analyzed;

Grouping the data to be analyzed; specifically, the data similarities of the original medical data and of the classification feature data between two pieces of data to be analyzed are calculated, and when the sum of these similarities is larger than a preset threshold, the two pieces of data to be analyzed are determined to belong to the same group;

For each piece of data to be analyzed in a group, the sum of the similarities between its original medical data and classification feature data and those of the other data to be analyzed in the group is calculated; the feature set constructed from the original medical data and classification feature data of the data to be analyzed with the largest sum is determined as the feature set of a marking node;

The composition and content of the data to be analyzed are scored with a preset scoring model, and the key data extracted from the labeling data in the data to be analyzed with the highest score is taken as the marking data. The more components the data to be analyzed has, the higher the score; the larger the total amount of data in the data to be analyzed, the higher the score; and the larger the total amount of extracted key data, the higher the score. The specific scoring rules are as follows:

Intercepting data in the data to be analyzed according to the source;

Extracting characteristics of the intercepted data according to characteristic extraction rules corresponding to the sources, wherein the characteristics comprise parameter characteristic values corresponding to the total amount of the data, parameter characteristic values corresponding to keywords contained in a preset characteristic library in the intercepted data and the like;

determining the aspect grading value of each intercepted data according to the grading rule corresponding to the source of each data and the data characteristics corresponding to each intercepted data;

determining weights corresponding to the scoring values of all aspects according to the sources of all the data; the weight corresponding to the classification characteristic data is a fixed value and is preset, the weight corresponding to the labeling data corresponds to the authority value of a doctor terminal or a professional doctor terminal which gives the data, and the weight corresponding to the medical data corresponds to the authority value of a terminal which uploads the medical data;

and calculating the score value from the weights and the aspect score values: the score value is the sum of the products of each aspect score value and its corresponding weight.
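The final scoring rule is a weighted sum, which can be sketched as follows. The source names, aspect scores and weights below are invented for illustration; in the scheme described above the weight of the classification feature data would be a preset constant, while the other weights would come from the authority values of the terminals involved.

```python
def score_value(aspect_scores, weights):
    """Sum of each aspect score multiplied by the weight of its source."""
    return sum(aspect_scores[src] * weights[src] for src in aspect_scores)

# Hypothetical aspect scores, one per data source.
aspect_scores = {
    "medical_data": 0.8,
    "classification_features": 0.5,
    "annotation": 0.9,
}
# Hypothetical weights: fixed for classification features, authority-based otherwise.
weights = {
    "medical_data": 0.5,
    "classification_features": 1.0,
    "annotation": 2.0,
}
total = score_value(aspect_scores, weights)
```

With these numbers the score is 0.8 × 0.5 + 0.5 × 1.0 + 0.9 × 2.0 = 2.7; the data to be analyzed with the highest such score would supply the marking data.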

According to the embodiment, the data are marked in advance through the marking nodes so as to assist in clustering classification of the medical data, so that the characteristics and the similarity of the diseases of the patients can be better displayed, doctors can conveniently select medical services for different patients or customize personalized treatment schemes, and the medical experience and treatment effect of the patients can be improved.

For the field application scenario of medical data processing, in one embodiment, the method for constructing medical data in the cluster weighted clustering integrated medical data processing method based on member selection is as follows:

The legal uploading time interval can be configured manually; for example, when the first terminal is a computer in a hospital outpatient clinic, the legal uploading time interval can be configured to run from half an hour before the doctor starts work to half an hour after the doctor finishes work;

Determining whether the data quantity of the first data meets a preset first data requirement;

When the requirement is not met, complementing the first data with standard data stored by the system;

Wherein the first data requirement is that the medical data contains data of a preset number of patients, for example 100, 200 or 300.

The association rules of the first terminals are that the first terminals of different hospitals in the same department are associated into a group, and the association rules are convenient for clustering;

in addition, the legal uploading time can be set in a segmented mode, and one legal time corresponds to one analysis by taking each half hour as a section, so that the requirement of an outpatient doctor on data processing is met.
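The data-quantity check and the half-hour segmentation described above might look like the following sketch. The record layout (a dict with a `patient_id` key) and the function names `complete_first_data` and `half_hour_segments` are hypothetical; how the system actually stores records and standard data is not specified here.

```python
from datetime import datetime, timedelta


def complete_first_data(first_data, standard_data, required_patients):
    """Top up uploaded records with stored standard records until the
    required number of distinct patients is reached (or standards run out)."""
    completed = list(first_data)
    seen = {rec["patient_id"] for rec in completed}
    for rec in standard_data:
        if len(seen) >= required_patients:
            break
        if rec["patient_id"] not in seen:
            completed.append(rec)
            seen.add(rec["patient_id"])
    return completed


def half_hour_segments(start, end):
    """Split [start, end) into consecutive 30-minute segments;
    the last segment is shorter if the interval is not a multiple of 30 minutes."""
    step = timedelta(minutes=30)
    segments = []
    current = start
    while current < end:
        segments.append((current, min(current + step, end)))
        current += step
    return segments


uploaded = [{"patient_id": 1}, {"patient_id": 2}]
standard = [{"patient_id": 2}, {"patient_id": 3}, {"patient_id": 4}, {"patient_id": 5}]
completed = complete_first_data(uploaded, standard, required_patients=4)

segments = half_hour_segments(datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 10, 0))
```

Each returned segment would then trigger one analysis run over the data uploaded within it.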

In the context of post-analysis and secondary analysis of medical data, in one embodiment, the method for constructing medical data in a cluster weighted clustering integrated medical data processing method based on member selection is as follows:

Acquiring second data uploaded by a second terminal, wherein the second terminal is terminal equipment registered in the system by each hospital (for example, a computer located in an analysis room);

When uploading, determining whether to integrate all the second data; if so, integrating all the second data and then determining whether the integrated second data meets the preset second data requirement; if not, directly determining whether the second data meets the preset second data requirement;

When the preset second data requirement is not met, outputting a data list based on the historical uploaded data of the second terminal, receiving the second terminal's selection of data from the data list, and supplementing the second data with the selected data so that it meets the preset second data requirement, wherein the second data requirement is that the medical data contain data for a preset number of patients, for example 100, 200, or 300.

Selecting from the historical uploaded data ensures, on the one hand, that the data meets the requirements of the clustering process and, on the other hand, allows the historical data to be included in the cluster analysis, so that the historical data is indirectly verified and analyzed.

The invention also provides a cluster weighted clustering integrated medical data processing system based on member selection, which is shown in figure 6 and comprises a construction unit 1, an input unit 2, a screening unit 3, a matrix determining unit 4 and an executing unit 5;

The construction unit 1 constructs a cluster member set; the input unit 2 inputs the cluster member set into a pre-trained decision tree model; the screening unit 3 screens out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generates a target cluster set from the screened cluster members; the matrix determining unit 4 determines a target CA matrix of the target cluster set according to the cluster layer weighting coefficient; and the executing unit 5 executes a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.
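The last step, performed by the executing unit, can be sketched as follows. This is only a minimal illustration: the linkage criterion is not stated here, so average linkage is assumed, and the function name `hierarchical_clustering` and the toy similarity matrix are hypothetical.

```python
def hierarchical_clustering(similarity, n_clusters):
    """Agglomerative clustering on a similarity matrix (average linkage assumed).

    Repeatedly merges the two clusters with the highest average pairwise
    similarity until n_clusters remain; returns one label per sample.
    """
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > n_clusters:
        best_pair, best_sim = None, -1.0
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(similarity[i][j] for i in clusters[a] for j in clusters[b])
                sim = total / (len(clusters[a]) * len(clusters[b]))
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    labels = [0] * len(similarity)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy stand-in for a target CA matrix: samples 0-1 and 2-3 always co-occur.
target_ca = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
final_labels = hierarchical_clustering(target_ca, n_clusters=2)
```

Because the CA matrix holds similarities rather than distances, the sketch merges on the largest value instead of the smallest.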

The construction unit 1 clusters the medical data using a K-Means algorithm, generating a plurality of cluster members.
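One way to generate several cluster members with K-Means is to vary the number of clusters and the random initialization, as in the sketch below. It is a plain-Python illustration of Lloyd's algorithm; the function names `kmeans` and `build_cluster_members`, the choice of `ks` and `seeds`, and the toy feature vectors are assumptions, and real medical text would first need to be converted into numeric vectors by some method not shown here.

```python
import random


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's K-Means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def build_cluster_members(points, ks, seeds):
    """One cluster member (a complete labeling of the data) per (k, seed) pair."""
    return [kmeans(points, k, seed) for k in ks for seed in seeds]


# Hypothetical 2-D feature vectors standing in for vectorized medical records.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
members = build_cluster_members(points, ks=[2, 3], seeds=[0, 1])
```

Varying both `k` and the seed gives the diversity among cluster members that the later selection step relies on.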

The training steps of the decision tree model are as follows:

The matrix determining unit 4 performs the following operations:

Constructing a CA matrix about the target cluster;

The matrix determination unit 4 constructs a CA matrix for the target cluster group, performing the following operations:
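As a rough illustration only, a CA (co-association) matrix is commonly built so that entry (i, j) is the fraction of cluster members that place samples i and j in the same cluster. The sketch below assumes that common definition; the function name `co_association_matrix` and the toy cluster members are hypothetical, and the cluster layer weighting and the HC matrix are not covered.

```python
def co_association_matrix(members):
    """A[i][j] = fraction of cluster members in which samples i and j
    fall in the same cluster (commonly used definition, assumed here)."""
    n_samples = len(members[0])
    n_members = len(members)
    return [
        [
            sum(labels[i] == labels[j] for labels in members) / n_members
            for j in range(n_samples)
        ]
        for i in range(n_samples)
    ]


# Three hypothetical cluster members over four samples.
members = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
A = co_association_matrix(members)
```

Label values are only compared within a single member, so members that number their clusters differently (as the first and third do here) still contribute consistently.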

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A cluster weighted clustering integrated medical text processing method based on member selection, characterized by comprising:

Construct a set of cluster members and input them into a pre-trained decision tree model;

Filter out cluster members with pre-labeled labels from the cluster member set output by the decision tree model, and generate a target cluster group with the filtered cluster members;

Determine the target CA matrix of the target clustering group according to the cluster layer weight coefficient;

Execute the hierarchical clustering algorithm based on the target CA matrix to obtain the final clustering result;

Evaluate the final clustering results based on external indicators, determine the evaluation value, and judge the effectiveness of the final clustering results based on the evaluation value;

Before constructing the cluster member set, the medical text data is processed to mark the data that interferes with the clustering results. The specific processing steps are as follows:

Analyze the feedback data corresponding to the clustering results and determine the marked nodes;

When the number of marked nodes is greater than or equal to the preset number, the marked nodes and related data are output, and the training data of the decision tree model is output; after receiving the re-marking from professionals, the decision tree model is retrained using the marked data;

When the number of marked nodes is less than a preset number, the medical text data is screened and marked based on each marked node;

When outputting the clustering results, the marked medical text data synchronously outputs the corresponding marking information;

The steps for obtaining feedback data are as follows:

After the clustering results of the patients are sent to the preset physician terminal, the annotation information of the physician terminal is received;

and/or,

After the patient's clustering results and treatment plan are sent to the patient terminal, when a correction instruction is received from the patient expressing doubts about the received data, the clustering results, treatment plan and doubt information are sent to the preset professional physician terminal, and the annotation information of the professional physician terminal is received;

Using annotation information, clustering results, treatment plans and/or doubt information as feedback data;

The analysis steps for feedback data are as follows:

Screening feedback data;

Obtaining the original medical text data corresponding to the clustering results corresponding to the filtered feedback data and the feature data during classification;

Associating the original medical text data, classification feature data and feedback data to form data to be analyzed;

The data to be analyzed are grouped according to the original medical text data and classification feature data;

According to the sum of similarities between each data to be analyzed in the group and the original medical text data and classification feature data of other data to be analyzed, a feature set constructed by the original medical text data and classification feature data corresponding to the data to be analyzed having the largest sum is determined as the feature set of the marking node;

Scoring the composition and content of the data to be analyzed based on a preset scoring model, using the key data extracted from the labeled data of the data to be analyzed with the largest scoring value as the labeled data;

The data to be analyzed are intercepted according to their sources;

According to the feature extraction rules corresponding to each data source, feature extraction is performed on the intercepted data;

Determine the aspect score value of each intercepted data according to the scoring rules corresponding to each data source and the data features corresponding to each intercepted data;

Determine the weights corresponding to the scores of each aspect based on the data sources;

Calculate the score based on the weights and scores of each aspect;

In most cases, the category labels of medical text data are known. In this case, external indicators are used to evaluate the effectiveness of clustering. The F value is a comprehensive indicator for evaluating the quality of clustering of medical text data. The larger the F value, the higher the clustering quality. When the clustering result is completely consistent with the category of the medical text data, the F value reaches its maximum value, which is 1. The NMI value is an evaluation indicator of the effectiveness of the clustering result, which quantifies the degree of match between the clustering result and the category label of the real medical text data.
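The two external indicators can be computed as in the sketch below. The exact formulas are not given in this passage, so two common definitions are assumed: the F value as the class-size-weighted best F1 of each true category against any cluster, and NMI as mutual information divided by the geometric mean of the two entropies. The function names `f_value` and `nmi` are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def f_value(true_labels, cluster_labels):
    """Class-size-weighted best F1 of each true category against any cluster."""
    n = len(true_labels)
    classes = Counter(true_labels)
    clusters = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_cls in classes.items():
        best = 0.0
        for clu, n_clu in clusters.items():
            n_both = joint[(cls, clu)]
            if n_both:
                precision = n_both / n_clu
                recall = n_both / n_cls
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_cls / n * best
    return total


def nmi(true_labels, cluster_labels):
    """Mutual information normalized by sqrt(H(true) * H(cluster))."""
    n = len(true_labels)
    classes = Counter(true_labels)
    clusters = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    mutual = sum(
        c / n * log(c * n / (classes[a] * clusters[b])) for (a, b), c in joint.items()
    )
    h_true = -sum(c / n * log(c / n) for c in classes.values())
    h_clu = -sum(c / n * log(c / n) for c in clusters.values())
    if h_true == 0 or h_clu == 0:
        # Degenerate case (a single category or a single cluster); convention assumed.
        return 1.0 if h_true == h_clu else 0.0
    return mutual / sqrt(h_true * h_clu)


truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]    # same partition, different cluster ids
unrelated = [0, 1, 0, 1]  # every cluster mixes both categories equally
```

Both indicators depend only on how samples are grouped, not on the cluster ids themselves, which is why `perfect` scores the maximum despite its swapped ids.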

2. The cluster weighted clustering integrated medical text processing method based on member selection according to claim 1, characterized in that constructing the cluster member set comprises: clustering the medical text data using the K-Means algorithm to generate multiple cluster members.

3. The method for processing medical texts by cluster weighted clustering integration based on member selection according to claim 1, characterized in that the training steps of the decision tree model are as follows:

Calculate the Davies-Bouldin index of each sample cluster member in the sample cluster member set and find the overall average value;

Compare the Davies-Bouldin index of each sample cluster member with the average value, label the sample cluster members whose Davies-Bouldin index is lower than the average value with the first label, and label the sample cluster members whose Davies-Bouldin index is higher than the average value with the second label;

The training is performed based on the sample cluster members with labels as the training set to obtain a trained decision tree model.
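The labeling rule in these training steps can be sketched as follows. The Davies-Bouldin index is computed with its usual definition (Euclidean distances, centroid-based scatter); the function names, the label strings `"first"` and `"second"`, and the toy data are assumptions. The case where an index equals the average exactly is not addressed in the text, so the sketch assigns it the second label.

```python
from math import dist


def davies_bouldin(points, labels):
    """Davies-Bouldin index of one cluster member (lower is better).
    Requires at least two clusters with distinct centroids."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {
        lab: tuple(sum(col) / len(ps) for col in zip(*ps)) for lab, ps in clusters.items()
    }
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }
    ids = list(clusters)
    return sum(
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in ids
            if j != i
        )
        for i in ids
    ) / len(ids)


def label_members(points, members, first="first", second="second"):
    """Label each sample cluster member by comparing its index with the average."""
    scores = [davies_bouldin(points, m) for m in members]
    average = sum(scores) / len(scores)
    # Ties with the average are assigned the second label (assumption).
    return [first if s < average else second for s in scores]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
members = [
    [0, 0, 1, 1],  # separates the two groups: low index
    [0, 1, 0, 1],  # mixes the two groups: high index
]
training_labels = label_members(points, members)
```

The labeled members would then serve as the training set for the decision tree; which features describe each member is not specified in this passage.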

4. The method for processing medical texts by cluster weighted clustering integration based on member selection according to claim 1, characterized in that the target CA matrix of the target clustering group is determined according to the cluster layer weight coefficient, comprising:

Construct the CA matrix of the target cluster collective;

The CA matrix is weighted based on the cluster layer weight coefficient to obtain the processed data B;

Capture high confidence information from the CA matrix to obtain the HC matrix;

The target CA matrix of the target cluster is determined based on the processed data B and the HC matrix.

5. The method for processing medical texts by cluster weighted clustering integration based on member selection according to claim 4, characterized in that the CA matrix of the target clustering group is constructed, comprising:

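$$A = (a_{ij}), \qquad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \qquad \delta_{ij}^{m} = \begin{cases} 1, & \mathrm{Cls}^{m}(x_i) = \mathrm{Cls}^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$$

where $A$ is the CA matrix; $\pi^{m}$ denotes the $m$-th cluster member; $M$ represents the total number of cluster members in the cluster collective; $\mathrm{Cls}^{m}(x_i)$ represents the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ represents the cluster of $\pi^{m}$ in which sample point $x_j$ is located.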

6. A cluster weighted clustering integrated medical text processing system based on member selection, characterized by comprising: a construction unit, an input unit, a screening unit, a matrix determination unit and an execution unit;

Wherein the construction unit constructs a cluster member set; the input unit inputs the cluster member set into a pre-trained decision tree model; the screening unit screens out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generates a target cluster group from the screened cluster members; the matrix determination unit determines the target CA matrix of the target cluster group according to the cluster layer weight coefficient; the execution unit executes the hierarchical clustering algorithm based on the target CA matrix to obtain the final clustering result;

The steps for obtaining feedback data are as follows:

and/or,

The analysis steps for feedback data are as follows:

Screening feedback data;

The data to be analyzed are intercepted according to their sources;

Calculate the score based on the weights and scores of each aspect;

The execution unit also evaluates the final clustering result based on the external indicator, determines the evaluation value, and judges the validity of the final clustering result based on the evaluation value;

7. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 6, characterized in that the construction unit uses the K-Means algorithm to cluster the medical text data to generate multiple cluster members.

8. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 6, characterized in that the training steps of the decision tree model are as follows:

9. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 6, characterized in that the matrix determination unit performs the following operations:

Construct the CA matrix of the target cluster collective;

Capture high confidence information from the CA matrix to obtain the HC matrix;

The target CA matrix of the target cluster group is determined based on the processed data B and the HC matrix.

10. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 9, characterized in that the matrix determination unit constructs a CA matrix about the target cluster collective, comprising:

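$$A = (a_{ij}), \qquad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \qquad \delta_{ij}^{m} = \begin{cases} 1, & \mathrm{Cls}^{m}(x_i) = \mathrm{Cls}^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$$

where $A$ is the CA matrix; $\pi^{m}$ denotes the $m$-th cluster member; $M$ represents the total number of cluster members in the cluster set; $\mathrm{Cls}^{m}(x_i)$ represents the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ represents the cluster of $\pi^{m}$ in which sample point $x_j$ is located.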