CN118312816B - Cluster weighted clustering integrated medical text processing method and system based on member selection - Google Patents
- Publication number
- CN118312816B; CN202410553936.0A
- Authority
- CN
- China
- Prior art keywords
- data
- cluster
- clustering
- matrix
- medical text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a cluster weighted clustering integrated medical data processing method and system based on member selection. The method comprises: constructing a cluster member set and inputting the cluster member set into a pre-trained decision tree model; screening out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the screened cluster members; determining a target CA matrix of the target cluster set according to the cluster layer weighting coefficient; and executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result. The method and system can obtain better clustering results and better reveal the characteristics and similarities of patients' diseases, which makes it easier for doctors to select medical services for different patients or to customize personalized treatment schemes, and helps improve the medical experience and treatment effect of patients.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a cluster weighted clustering integrated medical data processing method and system based on member selection.
Background
Medical diagnosis often involves large numbers of complex cases and clinical data, and doctors need to classify patients quickly and accurately in order to formulate personalized treatment regimens. However, rapid classification and prediction on large-scale data is challenging because diseases are complex and diverse. In addition, different patients may exhibit different symptoms and characteristics, which must also be taken into account during classification.
Cluster analysis is one of the hot topics of machine learning research. It is widely used for data compression, information retrieval, image segmentation and text clustering, and is receiving increasing attention in fields such as biology, geology, geography and abnormal data detection. Cluster analysis is a form of unsupervised machine learning: without prior knowledge of the data set, it automatically divides the data set into several groups, or clusters, solely according to a similarity measure among the data points (samples, objects), so that points belonging to the same cluster are as similar as possible and points belonging to different clusters are as dissimilar as possible. Cluster integration introduces the idea of ensemble learning into cluster analysis and mainly comprises two stages. In cluster member generation, a data set is taken as input, a clustering algorithm is run, and a plurality of different clustering results (cluster members) are output. In cluster integration, also called consensus function design, the set formed by all cluster members (the cluster set) is taken as input, the cluster members are combined, and a final clustering result is output. Cluster integration algorithms are generally adopted for classifying medical data.
However, most existing cluster integration algorithms treat every cluster member equally and pay no attention to quality differences among cluster members. Even where some algorithms propose treating cluster members differently and are able to evaluate them, they ignore the local diversity of the clusters within the same cluster member and regard each cluster member directly as an independent individual, so that low-quality clusters or low-quality base clusterings may appear.
Disclosure of Invention
The invention aims to provide a cluster weighted clustering integrated medical data processing method and system based on member selection, which can obtain more optimized clustering results, can better show the characteristics and similarity of diseases of patients, is convenient for doctors to select medical services for different patients or customize personalized treatment schemes, and is beneficial to improving the medical experience and treatment effect of patients.
The embodiment of the invention provides a cluster weighted clustering integrated medical data processing method based on member selection, which comprises the following steps:
Constructing a cluster member set and inputting the cluster member set into a pre-trained decision tree model;
Screening out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the screened cluster members;
determining a target CA matrix of a target cluster set according to the cluster layer weighting coefficient;
and executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.
Preferably, the construction of the cluster member set comprises clustering medical data by adopting a K-Means algorithm to generate a plurality of cluster members.
Preferably, the training steps of the decision tree model are as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, and marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label;
training is carried out based on sample cluster members with labels as a training set, and a trained decision tree model is obtained.
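The labelling step above can be sketched as follows. This is only an illustration, assuming the Davies-Bouldin index of each sample cluster member has already been computed. The function name `label_members`, the example index values, and the handling of a member whose index exactly equals the average (labelled "low" here) are assumptions, since the text does not specify that case. The first label is written "high" and the second "low".

```python
from statistics import mean

def label_members(db_indexes):
    """Label each sample cluster member from its Davies-Bouldin index.

    A lower Davies-Bouldin index means better-separated clusters, so members
    below the overall average receive the first label ("high").
    """
    average = mean(db_indexes)
    # Assumption: a member exactly at the average is labelled "low".
    return ["high" if value < average else "low" for value in db_indexes]

# Hypothetical Davies-Bouldin indexes of four sample cluster members.
labels = label_members([0.4, 0.9, 1.3, 0.6])
```

The resulting labels would then be paired with the feature attributes of each member to form the training set of the decision tree.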
Preferably, determining the target CA matrix of the target cluster according to the cluster layer weighting coefficient includes:
Constructing a CA matrix about the target cluster;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.
Preferably, constructing a CA matrix for the target clusters includes:
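$$A_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta\big(\mathrm{Cls}^{m}(x_i), \mathrm{Cls}^{m}(x_j)\big), \qquad \delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases}$$
Wherein A is the CA matrix; $\pi^{m}$ represents the m-th cluster member; M represents the total number of cluster members in the cluster set; $\mathrm{Cls}^{m}(x_i)$ represents the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ represents the cluster of $\pi^{m}$ in which sample point $x_j$ is located.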
The invention also provides a cluster weighted clustering integrated medical data processing system based on member selection, which comprises a construction unit, an input unit, a screening unit, a matrix determining unit and an executing unit;
The construction unit constructs a cluster member set; the input unit inputs the cluster member set into a pre-trained decision tree model; the screening unit screens out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generates a target cluster set from the screened cluster members; the matrix determining unit determines a target CA matrix of the target cluster set according to the cluster layer weighting coefficient; and the executing unit executes a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.
Preferably, the construction unit clusters the medical data by using a K-Means algorithm to generate a plurality of cluster members.
Preferably, the training steps of the decision tree model are as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, and marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label;
training is carried out based on sample cluster members with labels as a training set, and a trained decision tree model is obtained.
Preferably, the matrix determining unit performs the following operations:
Constructing a CA matrix about the target cluster;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.
Preferably, the matrix determining unit constructs a CA matrix for the target cluster group, performing the following operations:
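$$A_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta\big(\mathrm{Cls}^{m}(x_i), \mathrm{Cls}^{m}(x_j)\big), \qquad \delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases}$$
Wherein A is the CA matrix; $\pi^{m}$ represents the m-th cluster member; M represents the total number of cluster members in the cluster set; $\mathrm{Cls}^{m}(x_i)$ represents the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ represents the cluster of $\pi^{m}$ in which sample point $x_j$ is located.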
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of a cluster weighted clustering integrated medical data processing method based on member selection in an embodiment of the invention;
FIG. 2 is a flow diagram of building a cluster member set according to one embodiment of the invention;
FIG. 3 is a flow chart of a cluster weighted clustering integration method based on member selection in accordance with yet another embodiment of the invention;
FIG. 4 is a flow chart of training a decision tree according to one embodiment of the invention;
FIG. 5 is a schematic diagram of a trained decision tree model according to one embodiment of the invention;
FIG. 6 is a schematic diagram of a cluster weighted clustering integrated medical data processing system based on member selection in an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides a cluster weighted clustering integrated medical data processing method based on member selection, which is shown in figure 1 and comprises the following steps:
Constructing a cluster member set and inputting the cluster member set into a pre-trained decision tree model;
Screening out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the screened cluster members;
determining a target CA matrix of a target cluster set according to the cluster layer weighting coefficient;
and executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.
The hierarchical clustering algorithm includes the average linkage method, and the preset label is the label marked "high".
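The final step, hierarchical clustering with average linkage on a CA matrix, can be sketched as follows. This is a minimal pure-Python illustration rather than the patented implementation. The function name `average_linkage`, the stopping parameter `n_clusters` and the small example matrix are assumptions introduced for the example. Because the CA matrix holds similarities, the two clusters with the highest average pairwise similarity are merged at each step, which is equivalent to average linkage on the distance 1 minus similarity.

```python
def average_linkage(ca, n_clusters):
    """Agglomerative clustering on a co-association (CA) matrix.

    ca[i][j] is the similarity of samples i and j in [0, 1]. The two clusters
    with the highest average pairwise similarity are merged until only
    n_clusters remain.
    """
    clusters = [[i] for i in range(len(ca))]

    def avg_sim(a, b):
        return sum(ca[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = avg_sim(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    # Convert the list of clusters into one label per sample.
    labels = [0] * len(ca)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical target CA matrix for four samples.
ca = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
final_labels = average_linkage(ca, n_clusters=2)
```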
The construction of the cluster member set comprises clustering the medical data with the K-Means algorithm to generate a plurality of cluster members. The number of cluster members r and the number of clusters k are set, where r ∈ N+ and k is taken from an interval whose lower bound is 2 and whose upper bound depends on N, the number of data sample points. The K-Means algorithm is then used to randomly generate r cluster members as the cluster set P, thereby producing the cluster member set. Specifically:
obtaining the number r of cluster members and the number k of clusters;
initializing i=1;
judging whether i is less than or equal to r;
when i is less than or equal to r, clustering with the K-Means algorithm to generate a cluster member and obtain a clustering result;
assigning i=i+1 and continuing to judge, until i is no longer less than or equal to r, at which point the cluster member set has been constructed.
In addition, the cluster member set can be constructed by using a hierarchical clustering method, a spectral clustering method and the like.
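The generation loop described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the helper names `kmeans` and `build_cluster_member_set`, the parameters `k_min`, `k_max`, `n_iter` and `seed`, the uniform random choice of k within `[k_min, k_max]`, and the toy data points are all introduced for the example and are not taken from the source.

```python
import random

def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's K-Means on a list of equal-length tuples; returns labels."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def build_cluster_member_set(points, r, k_min, k_max, seed=0):
    """Generate r cluster members, each from one K-Means run."""
    rng = random.Random(seed)
    members = []
    i = 1
    while i <= r:  # mirrors the "i <= r" check in the steps above
        k = rng.randint(k_min, k_max)  # assumption: k drawn uniformly per run
        members.append(kmeans(points, k, rng))
        i += 1
    return members

# Hypothetical two-dimensional feature vectors for six patients.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
P = build_cluster_member_set(points, r=5, k_min=2, k_max=3)
```

Each element of `P` is one cluster member, stored as a list with one cluster label per sample point.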
As shown in fig. 2 to 5, the training steps of the decision tree model are as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label, wherein the first label is 'high', and the second label is 'low';
Training is carried out with the labeled sample cluster members as the training set, and a trained decision tree model is obtained. The ARI (adjusted Rand index), NMI (normalized mutual information) and F (Fowlkes-Mallows index) of each cluster member are taken as the feature attribute set. For all three feature attributes, values closer to 1 are better; the value range of the ARI is [-1, 1], and the value ranges of the NMI and the F are [0, 1]. The importance of each feature attribute is calculated using the Gini coefficient to determine the root node and internal nodes of the decision tree, as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^{2} \tag{1.1}$$
$$\mathrm{Gini}(D, a) = \sum_{v} \frac{|D^{v}|}{|D|}\,\mathrm{Gini}(D^{v}) \tag{1.2}$$
Wherein $p_i$ represents the probability that a sample point belongs to class i, and a and v represent a feature attribute and a state (value) of that feature attribute, respectively. The larger the Gini coefficient, the greater the uncertainty of the feature attribute.
First, formulas (1.1) and (1.2) are used to obtain the Gini coefficients of the labeled cluster members with respect to the three feature attributes ARI, NMI and F. The three Gini coefficients are compared, the feature attribute with the smallest Gini coefficient is selected as the root node ("feature attribute 1"), and the label for values of feature attribute 1 close to 1 is "high". Then, the labeled cluster members whose value of feature attribute 1 is not close to 1 are taken as a new labeled set, the Gini coefficients of the remaining two feature attributes are calculated, and the one with the smaller Gini coefficient is selected as an internal node ("feature attribute 2"). Finally, the remaining feature attribute becomes the last internal node ("feature attribute 3").
For all three feature attributes, values closer to 1 are better. The value range of the ARI is [-1, 1], and ARI values are divided into two classes: greater than 0, and less than or equal to 0. The value ranges of the NMI and F-measure indexes are [0, 1], and their values are divided into two classes: greater than 0.5, and less than or equal to 0.5. The importance of each feature attribute is calculated using the Gini coefficient to determine the root node and internal nodes of the decision tree. The larger the Gini coefficient, the greater the uncertainty of the feature attribute.
The attributes of the root node and the internal nodes of the decision tree are measured using the Gini coefficient;
For example, the Gini coefficient of the ARI attribute is determined as follows.
According to the following formula:
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^{2}$$
first, the Gini coefficient when the ARI is greater than 0 is calculated as:
$$\mathrm{Gini}(D^{1}) = 1 - p_{1,\mathrm{high}}^{2} - p_{1,\mathrm{low}}^{2}$$
Here, $\mathrm{Gini}(D^{1})$ represents the Gini impurity of the data subset $D^{1}$; $D^{1}$ represents the data subset obtained when the feature attribute ARI is greater than 0; $p_{1,\mathrm{high}}$ represents the number of samples whose category label is "high" divided by the number of samples when the ARI is greater than 0; and $p_{1,\mathrm{low}}$ represents the number of samples whose category label is "low" divided by the number of samples when the ARI is greater than 0;
next, the Gini coefficient when the ARI is less than or equal to 0 is calculated as:
$$\mathrm{Gini}(D^{2}) = 1 - p_{2,\mathrm{high}}^{2} - p_{2,\mathrm{low}}^{2}$$
Here, $\mathrm{Gini}(D^{2})$ represents the Gini impurity of the data subset $D^{2}$; $D^{2}$ represents the data subset obtained when the feature attribute ARI is less than or equal to 0; $p_{2,\mathrm{high}}$ represents the number of samples whose category label is "high" divided by the number of samples when the ARI is less than or equal to 0; and $p_{2,\mathrm{low}}$ represents the number of samples whose category label is "low" divided by the number of samples when the ARI is less than or equal to 0;
finally, according to the following formula:
$$\mathrm{Gini}(D, a) = \sum_{v} \frac{|D^{v}|}{|D|}\,\mathrm{Gini}(D^{v})$$
the Gini coefficient of the ARI is calculated as follows:
$$\mathrm{Gini}(D, \mathrm{ARI}) = \frac{|D^{1}|}{|D|}\,\mathrm{Gini}(D^{1}) + \frac{|D^{2}|}{|D|}\,\mathrm{Gini}(D^{2})$$
Here, $\mathrm{Gini}(D, \mathrm{ARI})$ is the Gini coefficient of the data set D for the feature attribute ARI; $|D^{1}|$ represents the number of samples when the ARI is greater than 0; $|D^{2}|$ represents the number of samples when the ARI is less than or equal to 0; and $|D|$ represents the number of samples of the data set D.
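The worked ARI example above can be checked numerically with a short sketch. The function names `gini` and `gini_of_attribute`, the sample ARI values and their "high"/"low" labels are hypothetical and serve only to illustrate the weighted two-branch Gini calculation with the split point at 0.

```python
def gini(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_of_attribute(values, labels, threshold):
    """Weighted Gini of the split value > threshold versus value <= threshold."""
    upper = [l for v, l in zip(values, labels) if v > threshold]
    lower = [l for v, l in zip(values, labels) if v <= threshold]
    n = len(labels)
    return len(upper) / n * gini(upper) + len(lower) / n * gini(lower)

# Hypothetical ARI values of six labeled cluster members.
ari = [0.6, 0.4, 0.2, -0.1, -0.3, 0.1]
quality = ["high", "high", "high", "low", "low", "low"]
gini_ari = gini_of_attribute(ari, quality, threshold=0.0)
```

In this example four members have an ARI greater than 0 (three "high", one "low", impurity 0.375) and two have an ARI less than or equal to 0 (both "low", impurity 0), so the weighted Gini coefficient of the ARI is 4/6 × 0.375 = 0.25.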
When the decision tree is constructed, the value range of each feature attribute is divided evenly into two intervals, giving two branches.
The decision tree selects the best feature attribute as the root node according to the Gini coefficient, and continues to select the feature attribute with higher importance in each subsequent internal node division. After training is completed, the decision tree model can be used for classification prediction of new cluster members.
When the decision tree is constructed, the value range of a feature attribute can also be subdivided into more than two intervals, i.e. multiple branches. This increases the number of branches of the decision tree and the number of decision classes, and further improves the classification precision.
Based on the above, the clustering algorithm and the decision tree are effectively combined to obtain a decision tree model, which can be used for classification prediction of new cluster members.
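Root node selection by the smallest Gini coefficient can be sketched as follows. The split points 0 for the ARI and 0.5 for the NMI and F come from the text; the function names, the dictionary layout of each labeled cluster member and the four example rows are assumptions made for illustration.

```python
from collections import Counter

THRESHOLDS = {"ARI": 0.0, "NMI": 0.5, "F": 0.5}  # split points stated in the text

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def split_gini(rows, attribute):
    """Weighted Gini of splitting rows on attribute > threshold."""
    t = THRESHOLDS[attribute]
    upper = [r["label"] for r in rows if r[attribute] > t]
    lower = [r["label"] for r in rows if r[attribute] <= t]
    return (len(upper) * gini(upper) + len(lower) * gini(lower)) / len(rows)

def choose_root(rows):
    """Return the feature attribute with the smallest weighted Gini coefficient."""
    return min(THRESHOLDS, key=lambda a: split_gini(rows, a))

# Hypothetical labeled cluster members with their three feature attributes.
rows = [
    {"ARI": 0.7, "NMI": 0.8, "F": 0.9, "label": "high"},
    {"ARI": 0.5, "NMI": 0.4, "F": 0.7, "label": "high"},
    {"ARI": 0.2, "NMI": 0.6, "F": 0.4, "label": "low"},
    {"ARI": -0.2, "NMI": 0.3, "F": 0.3, "label": "low"},
]
root = choose_root(rows)
```

In this toy data the F attribute separates the two labels perfectly (weighted Gini 0), so it would be chosen as the root node; the same function could then be applied to the remaining attributes and samples to pick the internal nodes.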
The target cluster set is the set constructed from the high-quality cluster members. When the cluster member set is processed and identified based on the decision tree model, the ARI, NMI and F-measure indexes in the feature attribute set of each cluster member are output, and the decision tree model predicts the cluster quality label ("high" or "low") according to the learned node division rules. The cluster members labeled "high" form a new cluster set P', which participates in subsequent processing; this is the generation of the target cluster set.
The method for determining the target CA matrix of the target cluster group according to the cluster layer weighting coefficient comprises the following steps:
Constructing a CA matrix about the target cluster;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.
Constructing a CA matrix for a target cluster set, comprising:
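$$A_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta\big(\mathrm{Cls}^{m}(x_i), \mathrm{Cls}^{m}(x_j)\big), \qquad \delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases}$$
Wherein A is the CA matrix; $\pi^{m}$ represents the m-th cluster member; M represents the total number of cluster members in the cluster set; $\mathrm{Cls}^{m}(x_i)$ represents the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ represents the cluster of $\pi^{m}$ in which sample point $x_j$ is located.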
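Construction of the CA matrix can be sketched as follows, with each entry being the fraction of cluster members that place two sample points in the same cluster. The function name `co_association` and the three example cluster members are hypothetical; each cluster member is assumed to be stored as a list with one cluster label per sample point.

```python
def co_association(members):
    """CA matrix: fraction of cluster members placing samples i and j together."""
    m = len(members)          # total number of cluster members
    n = len(members[0])       # number of sample points
    return [
        [sum(labels[i] == labels[j] for labels in members) / m for j in range(n)]
        for i in range(n)
    ]

# Three hypothetical cluster members over four sample points.
members = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
A = co_association(members)
```

Cluster labels are only compared within the same cluster member, so the third member, which swaps the label names, still counts samples 0 and 1 as co-clustered.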
The CA matrix is weighted based on the cluster layer weighting coefficient to obtain the processed data B, as follows:
High-confidence information is captured from the CA matrix, and the captured high-confidence information is used to complement and refine the CA matrix into an ideal CA matrix. The high-confidence (HC) matrix is denoted H and is obtained as follows:
Wherein one term records the positions of the highly reliable elements, and the operator is an element-wise operator. When the ratio of the number of times two sample points are assigned to the same cluster to the total number of cluster members exceeds a predefined threshold, the corresponding position of the A matrix is considered a piece of highly reliable information (i.e. one element of the H matrix).
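One possible reading of the HC matrix step is sketched below. The formula itself is not available in the text, so this sketch rests on assumptions: a 0/1 mask marks the positions of the A matrix whose value exceeds the threshold, and H is the element-wise product of that mask with A. The function name `high_confidence`, the parameter name `theta`, its value 0.8 and the example matrix are hypothetical.

```python
def high_confidence(A, theta):
    """Keep entries of A above theta and set the rest to zero.

    Assumption: the mask of highly reliable positions is combined with A
    by an element-wise product.
    """
    mask = [[1.0 if a > theta else 0.0 for a in row] for row in A]
    return [[m * a for m, a in zip(mrow, arow)] for mrow, arow in zip(mask, A)]

# Hypothetical CA matrix for three sample points.
A = [
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
H = high_confidence(A, theta=0.8)
```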
Determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix comprises:
obtaining the final CA matrix, denoted C, from the H matrix and the B matrix, as follows:
Wherein the formula contains a Laplacian matrix; a coefficient of the Lagrangian term, whose default value is 1; a coefficient for balancing the error loss term; two Lagrange multipliers; and the Frobenius norm of a matrix. E represents the error term, and F is an intermediate matrix used to relax the value range constraint and the symmetry constraint of the ideal CA matrix.
Information entropy is introduced for the target cluster set, and an uncertainty index IEI of each cluster is calculated and used as the cluster layer weighting coefficient. The IEI is calculated as follows:
Wherein the formula uses the total number of clusters and a term measuring the similarity between two clusters. The IEI indicator reflects the likelihood that the points in a cluster remain in the same cluster in the other base clusterings. H is the HC matrix captured from the CA matrix as described above; the formula also uses the maximum value in the H matrix, the minimum value in the H matrix, and the value in the H matrix corresponding to the cluster;
or KL divergence is introduced to determine a cluster layer weighting coefficient;
or a plurality of evaluation indexes are introduced and fused together to form a new evaluation index to determine the cluster layer weighting coefficient.
The application introduces a trained decision tree model to assist in selecting high-quality cluster members, so that the quality and diversity of the cluster members are considered from multiple angles. The internal diversity of each cluster member is considered as well, so that cluster members are treated differently in a more comprehensive way. Finally, the CA matrix is fine-tuned according to the weights and the high-confidence information to improve the accuracy and robustness of the clustering.
The cluster weighted clustering integrated medical data processing method based on member selection uses a clustering algorithm to cluster patient data and generate a plurality of cluster members. Each cluster member divides the patients into different clusters, each cluster representing a patient population with similar symptoms and features. Next, high-quality cluster members are selected with the aid of a decision tree model. To further improve the accuracy and interpretability of the clustering result, attention is paid to the internal diversity of the cluster members: the diversity of their internal clusters is measured and the CA matrix is fine-tuned accordingly. Finally, hierarchical clustering analysis is used to obtain the final clustering result. In this way, better clustering results are obtained and the characteristics and similarities of patients' diseases are better revealed, which makes it easier for doctors to select medical services for different patients or to customize personalized treatment schemes, and helps improve the medical experience and treatment effect of patients.
In one embodiment, the processing method further comprises evaluating the final clustering result based on the external index, determining an evaluation value, and judging the validity of the final clustering result based on the evaluation value. The cluster result validity evaluation index is generally classified into an internal index and an external index. In most cases, class labels of the data set are known (not used in the clustering process), and an external index can be used to evaluate the effectiveness of clustering, where an F-measure is a relatively common comprehensive index for evaluating the quality of text clustering. The larger the F value is, the higher the clustering quality is, and when the clustering result is completely consistent with the real category, the F value reaches the maximum value, and the value is 1. In addition, NMI value is also a popular clustering result effectiveness evaluation index, and can quantify the matching degree of the clustering result and the real text category label.
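As an illustration of evaluation with an external index, the NMI between the known class labels and a clustering result can be computed as sketched below. Several normalizations of mutual information are in use; this sketch assumes normalization by the geometric mean of the two entropies, and the text does not state which variant is intended. The function name `nmi`, the convention of returning 0 when either labelling has a single class, and the example labels are likewise assumptions.

```python
from collections import Counter
from math import log, sqrt

def nmi(true_labels, pred_labels):
    """Normalized mutual information with geometric-mean normalization."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    a = Counter(true_labels)
    b = Counter(pred_labels)
    # Mutual information from the contingency counts.
    mi = sum(c / n * log(n * c / (a[u] * b[v])) for (u, v), c in joint.items())
    h_true = -sum(c / n * log(c / n) for c in a.values())
    h_pred = -sum(c / n * log(c / n) for c in b.values())
    if h_true == 0.0 or h_pred == 0.0:
        return 0.0  # assumption: NMI defined as 0 when a labelling has one class
    return mi / sqrt(h_true * h_pred)

truth = [0, 0, 1, 1]
same_up_to_renaming = nmi(truth, [1, 1, 0, 0])  # same partition, labels swapped
unrelated = nmi(truth, [0, 1, 0, 1])            # partition independent of truth
```

A clustering that matches the true classes up to a renaming of the clusters scores 1, while a clustering that is statistically independent of the true classes scores 0.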
In one embodiment, computing the Gini coefficients of the feature attribute set with respect to the ARI, NMI and F-measure indices includes:
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^{2}$$
$$\mathrm{Gini}(D, a) = \sum_{v} \frac{|D^{v}|}{|D|}\,\mathrm{Gini}(D^{v})$$
Wherein $\mathrm{Gini}(D)$ represents the Gini impurity of data set D; $p_i$ represents the proportion of class label i in data set D, i.e. the number of samples belonging to category i divided by the total number of samples; $\mathrm{Gini}(D, a)$ represents the Gini impurity of data set D under feature attribute a; $|D^{v}|$ represents the number of samples in the data subset obtained when feature attribute a takes the value v; and $\mathrm{Gini}(D^{v})$ represents the Gini impurity of the data subset $D^{v}$. Based on these formulas, the Gini coefficients of the labeled cluster members with respect to the ARI, NMI and F-measure indices are obtained. This allows the corresponding Gini coefficients to be calculated accurately, improves the accuracy of comparing the three Gini coefficients, and in turn allows the root node and internal nodes to be divided accurately.
In one embodiment, the cluster weighted clustering integrated medical data processing method based on member selection further processes the medical data, before the cluster member set is constructed, so as to mark data that interferes with the clustering result. The processing specifically comprises the following steps:
Analyzing the feedback data corresponding to the clustering results to determine marking nodes;
When the number of marking nodes is greater than or equal to a preset number (for example, any value from 2 to 100), outputting the marking nodes together with their related data and the training data of the decision tree model;
When the number of marking nodes is smaller than the preset number, screening and marking the medical data based on each marking node;
When the clustering result is output, outputting the corresponding marking information together with the marked medical data.
The feedback data is acquired as follows:
After the clustering result of a patient is sent to a preset doctor terminal, annotation information is received from the doctor terminal;
and/or,
After the clustering result and the treatment scheme of a patient are sent to the patient terminal, and an instruction from the patient raising doubts about the data and requesting correction is received, the clustering result, the treatment scheme, and the doubt information are sent to a preset professional doctor terminal, and annotation information is then received from the professional doctor terminal;
The annotation information, clustering results, treatment schemes, and/or doubt information are used as the feedback data.
The analysis steps for the feedback data are as follows:
Screening the feedback data: keywords are extracted from the annotation information and doubt information in the feedback data based on a preset keyword library to obtain the keyword set corresponding to each piece of feedback data, and feedback data from which no keyword can be extracted is deleted. Because some feedback data contains both annotation information and doubt information, the keywords extracted from the two may interfere with each other; the keywords in each keyword set are therefore checked for interference against a preset keyword interference association library, and the feedback data corresponding to any keyword set in which interference is detected is deleted;
Acquiring, for the screened feedback data, the original medical data underlying the corresponding clustering result and the feature data used during classification;
Associating the original medical data and the classification feature data with the feedback data to form data to be analyzed;
Grouping the data to be analyzed: the data similarities between the original medical data and between the classification feature data of two pieces of data to be analyzed are calculated, and when the sum of the similarities is greater than a preset threshold, the two pieces of data to be analyzed are determined to belong to the same group;
For each piece of data to be analyzed in a group, summing its similarities with the original medical data and classification feature data of the other data to be analyzed in the group, and taking the feature set constructed from the original medical data and classification feature data of the piece with the largest sum as the feature set of the marking node;
Scoring the composition and content of the data to be analyzed based on a preset scoring model, and taking the key data extracted from the annotation data of the data to be analyzed with the highest score as the marking data, wherein the more components the data to be analyzed has, the higher the score; the larger the total amount of data in the data to be analyzed, the higher the score; and the larger the total amount of extracted key data, the higher the score. The specific scoring rules are as follows:
Splitting the data to be analyzed into segments according to source;
Extracting features from each segment according to the feature extraction rule corresponding to its source, the features including a parameter feature value corresponding to the total amount of data, a parameter feature value corresponding to the keywords of a preset feature library contained in the segment, and the like;
Determining the aspect score value of each segment according to the scoring rule corresponding to its source and the data features of the segment;
Determining the weight corresponding to each aspect score value according to the source of the data: the weight for classification feature data is a preset fixed value, the weight for annotation data corresponds to the authority value of the doctor terminal or professional doctor terminal that provided it, and the weight for medical data corresponds to the authority value of the terminal that uploaded it;
Calculating the score value from the weights and the aspect score values, the score value being the sum of the products of each aspect score value and its corresponding weight.
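The weighted scoring rule above can be sketched in Python as follows. The per-source scoring rule is not specified in the text, so `aspect_score` is a hypothetical rule that merely rewards more data and more library keywords; the constant `FIXED_FEATURE_WEIGHT`, the source names, and the authority values are likewise assumptions made for illustration.

```python
FIXED_FEATURE_WEIGHT = 0.5  # assumed preset weight for classification feature data

def aspect_score(text, keyword_library):
    """Hypothetical rule: score grows with data volume and with keyword hits."""
    words = text.split()
    hits = sum(1 for w in words if w in keyword_library)
    return len(words) + 2 * hits

def total_score(segments, keyword_library):
    """segments: (source, text, authority) tuples obtained by splitting one item by source."""
    score = 0.0
    for source, text, authority in segments:
        # Feature data uses the fixed weight; other sources use the terminal's authority value.
        weight = FIXED_FEATURE_WEIGHT if source == "features" else authority
        score += weight * aspect_score(text, keyword_library)
    return score

library = {"glucose", "diabetes", "misclassification"}
segments = [
    ("features", "age bp glucose", None),
    ("annotation", "suspected misclassification of diabetes", 0.9),
    ("medical", "fasting glucose elevated", 0.6),
]
score = total_score(segments, library)  # 0.5*5 + 0.9*8 + 0.6*5
```

The item with the highest `score` in a group would then supply the marking data.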
In this embodiment, data is marked in advance through the marking nodes to assist the clustering of the medical data, so that the characteristics and similarities of patients' conditions are presented more clearly, doctors can more easily select medical services or customize personalized treatment schemes for different patients, and patients' medical experience and treatment outcomes are improved.
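The grouping step and the choice of a marking node's feature set in this embodiment can be sketched as below. The similarity measure is not specified in the text, so Jaccard similarity over sets is assumed here, and groups are formed transitively with a small union-find; both choices, and all names, are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the unspecified similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pair_similarity(x, y):
    # Sum of the raw-data similarity and the classification-feature similarity.
    return jaccard(x["raw"], y["raw"]) + jaccard(x["features"], y["features"])

def group_items(items, threshold):
    """Link two items when their similarity sum exceeds the threshold (union-find)."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if pair_similarity(items[i], items[j]) > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def representative(items, group):
    """Index of the item with the largest similarity sum to the rest of its group."""
    return max(group, key=lambda i: sum(pair_similarity(items[i], items[j])
                                        for j in group if j != i))

items = [
    {"raw": {"cough", "fever", "fatigue"}, "features": {"f1", "f2"}},
    {"raw": {"cough", "fever"}, "features": {"f1", "f2"}},
    {"raw": {"cough", "fever", "chills"}, "features": {"f1"}},
    {"raw": {"fracture", "swelling"}, "features": {"f9"}},
]
groups = group_items(items, threshold=1.0)
rep = representative(items, [0, 1, 2])
```

In this toy data the first three items end up in one group and the second item is its representative, so its raw data and feature data would form the marking node's feature set.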
For the on-site application scenario of medical data processing, in one embodiment, the medical data in the cluster weighted clustering integrated medical data processing method based on member selection is constructed as follows:
The legal uploading time interval can be configured manually; for example, when the first terminals are the computers of hospital outpatient clinics, the legal uploading time interval can be configured to run from half an hour before the doctor goes on duty to half an hour after the doctor goes off duty;
Determining whether the data volume of the first data meets a preset first data requirement;
When the requirement is not met, supplementing the first data with standard data stored by the system;
Wherein the first data requirement comprises the medical data containing data of a preset number of patients, for example 100, 200, or 300;
The association rule of the first terminals is that the first terminals of the same department in different hospitals are associated into one group, which facilitates clustering;
In addition, the legal uploading time can be set in segments: with each half hour as one segment, each legal time segment corresponds to one analysis, which meets the outpatient doctor's need for data processing.
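The upload-window check, the half-hour segmentation, and the top-up with standard data can be sketched as follows. This is a minimal illustration under assumed names (`in_legal_window`, `half_hour_segment`, `complete_first_data`) and assumed shift times; the patent does not prescribe this interface.

```python
from datetime import datetime, timedelta

def in_legal_window(upload, shift_start, shift_end, margin=timedelta(minutes=30)):
    """True if the upload falls between 30 min before shift start and 30 min after shift end."""
    return shift_start - margin <= upload <= shift_end + margin

def half_hour_segment(upload):
    """Start of the half-hour segment containing the upload; one analysis per segment."""
    return upload.replace(minute=0 if upload.minute < 30 else 30, second=0, microsecond=0)

def complete_first_data(first_data, standard_data, required):
    """Top up with system-stored standard records until the required patient count is met."""
    missing = max(0, required - len(first_data))
    return first_data + standard_data[:missing]

start = datetime(2024, 5, 7, 8, 0)   # assumed shift start
end = datetime(2024, 5, 7, 17, 0)    # assumed shift end
early_ok = in_legal_window(datetime(2024, 5, 7, 7, 45), start, end)
too_late = in_legal_window(datetime(2024, 5, 7, 18, 0), start, end)
segment = half_hour_segment(datetime(2024, 5, 7, 9, 47, 12))
batch = complete_first_data(list(range(80)), list(range(1000, 1100)), required=100)
```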
For the scenario of post-hoc analysis and secondary analysis of medical data, in one embodiment, the medical data in the cluster weighted clustering integrated medical data processing method based on member selection is constructed as follows:
Acquiring second data uploaded by a second terminal, wherein the second terminal is a terminal device registered in the system by each hospital (for example, a computer installed in an analysis room);
When the second data is uploaded in several batches, integrating all the second data and then determining whether it meets the preset second data requirement; otherwise, directly determining whether the second data meets the preset second data requirement;
When the preset second data requirement is not met, outputting a data list based on the historical uploads of the second terminal, receiving the second terminal's selection of data from the data list, and supplementing the second data until it meets the preset second data requirement, wherein the second data requirement comprises the medical data containing data of a preset number of patients, for example 100, 200, or 300.
Selecting from the historical uploads ensures, on the one hand, that the data meets the requirements of clustering and, on the other hand, allows the historical data to be included in the cluster analysis, which indirectly verifies and re-analyzes the historical data.
The invention also provides a cluster weighted clustering integrated medical data processing system based on member selection which, as shown in Figure 6, comprises a construction unit 1, an input unit 2, a screening unit 3, a matrix determining unit 4, and an executing unit 5;
The construction unit 1 constructs a cluster member set; the input unit 2 inputs the cluster member set into a pre-trained decision tree model; the screening unit 3 screens out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generates a target cluster set from the screened cluster members; the matrix determining unit 4 determines a target CA matrix of the target cluster set according to the cluster-level weighting coefficients; and the executing unit 5 executes a hierarchical clustering algorithm on the basis of the target CA matrix to obtain the final clustering result.
The construction unit 1 clusters the medical data using a K-Means algorithm, generating a plurality of cluster members.
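A minimal sketch of how several cluster members might be generated with K-Means is given below. It is a plain-Python Lloyd's iteration, and it assumes that diversity between members comes from different random initializations; the text only states that K-Means is used, so the seeding scheme, function names, and toy points are assumptions.

```python
import random

def kmeans_labels(points, k, seed, iters=20):
    """Plain Lloyd's K-Means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            assigned = [p for p, lab in zip(points, labels) if lab == c]
            if assigned:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(assigned) for col in zip(*assigned))
    return labels

def build_cluster_members(points, k, n_members):
    """One base clustering (cluster member) per random seed."""
    return [kmeans_labels(points, k, seed) for seed in range(n_members)]

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
members = build_cluster_members(points, k=2, n_members=5)
```

Each entry of `members` is one label vector over the same samples, which is the form a co-association matrix is built from.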
The training steps of the decision tree model are as follows:
Calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the average over all sample cluster members;
Comparing the Davies-Bouldin index of each sample cluster member with the average, marking the sample cluster members whose Davies-Bouldin index is lower than the average with a first label, and marking the sample cluster members whose Davies-Bouldin index is higher than the average with a second label;
Training on the labeled sample cluster members as the training set to obtain the trained decision tree model.
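The labeling rule used to prepare the training set can be sketched as follows. The Davies-Bouldin index is computed in its standard form (a lower value indicates more compact, better separated clusters). The text does not say how a member exactly equal to the average is labeled, so assigning it the second label is an assumption, as are the function names and toy data.

```python
import math

def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index of one clustering (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: tuple(sum(col) / len(ps) for col in zip(*ps))
                 for lab, ps in clusters.items()}
    scatter = {lab: sum(math.dist(p, centroids[lab]) for p in ps) / len(ps)
               for lab, ps in clusters.items()}
    ratios = [
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(ratios) / len(ratios)

def label_members(points, members):
    """First label for below-average index, second label otherwise (tie rule assumed)."""
    scores = [davies_bouldin(points, m) for m in members]
    mean = sum(scores) / len(scores)
    return ["first" if s < mean else "second" for s in scores]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]  # matches the two visible blobs
bad = [0, 1, 0, 1, 0, 1]   # mixes the blobs
member_labels = label_members(points, [good, bad])
```

The labeled members, described by their ARI, NMI, and F-measure values, would then serve as the decision tree's training set.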
The matrix determining unit 4 performs the following operations:
Constructing a CA (co-association) matrix for the target cluster set;
Weighting the CA matrix based on the cluster-level weighting coefficients to obtain processed data B;
Capturing high-confidence information from the CA matrix to obtain an HC matrix;
Determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix.
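The matrix determining unit 4 constructs the CA matrix for the target cluster set according to the following formula:
$$A_{ij} = \frac{1}{M}\sum_{m=1}^{M} \delta\big(Cls^{m}(x_i),\, Cls^{m}(x_j)\big), \qquad \delta(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases}$$
Wherein $A$ is the CA matrix; $\pi^{m}$ represents the $m$-th cluster member; $M$ represents the total number of cluster members in the cluster set; $Cls^{m}(x_i)$ represents the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $Cls^{m}(x_j)$ represents the cluster of $\pi^{m}$ in which sample point $x_j$ is located.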
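The unweighted CA matrix can be sketched in Python as below: each entry is the fraction of cluster members that place two samples in the same cluster. The function name and the toy cluster members are illustrative; the cluster-level weighting and the HC matrix are not covered by this sketch.

```python
def co_association(members):
    """members: list of label vectors, one per cluster member, over the same samples."""
    n = len(members[0])
    m_total = len(members)
    return [
        [sum(1 for labels in members if labels[i] == labels[j]) / m_total
         for j in range(n)]
        for i in range(n)
    ]

# Two cluster members over three samples.
members_demo = [[0, 0, 1], [0, 1, 1]]
ca = co_association(members_demo)
```

The matrix is symmetric with ones on the diagonal, and `1 - ca[i][j]` can serve as the distance fed to the hierarchical clustering step.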
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2023111662103 | 2023-09-08 | ||
| CN202311166210.3A CN117195027A (en) | 2023-09-08 | 2023-09-08 | Cluster weighted clustering integration method based on member selection |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118312816A (en) | 2024-07-09 |
| CN118312816B (en) | 2025-04-08 |
Family
ID=89001143
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311166210.3A Pending CN117195027A (en) | 2023-09-08 | 2023-09-08 | Cluster weighted clustering integration method based on member selection |
| CN202410553936.0A Active CN118312816B (en) | 2023-09-08 | 2024-05-07 | Cluster weighted clustering integrated medical text processing method and system based on member selection |
Country Status (1)
| Country | Link |
|---|---|
| CN (2) | CN117195027A (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117688412B (en) * | 2024-02-02 | 2024-05-07 | 中国人民解放军海军青岛特勤疗养中心 | An intelligent data processing system for orthopedic care |
| CN120452661B (en) * | 2025-07-10 | 2025-10-17 | 四川互慧软件有限公司 | Respiratory failure treatment scheme recommendation method and system based on artificial intelligence |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106599913A (en) * | 2016-12-07 | 2017-04-26 | 重庆邮电大学 | Cluster-based multi-label imbalance biomedical data classification method |
| CN117391456A (en) * | 2023-11-27 | 2024-01-12 | 浙江南斗数智科技有限公司 | Village management method and service platform system based on artificial intelligence |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| RU2603601C2 (en) * | 2010-12-20 | 2016-11-27 | Конинклейке Филипс Электроникс Н.В. | Methods and systems for identifying patients with mild cognitive impairment with risk of transition to alzheimer's disease |
| US11587678B2 (en) * | 2020-03-23 | 2023-02-21 | Clover Health | Machine learning models for diagnosis suspecting |
| US20230111999A1 (en) * | 2021-10-08 | 2023-04-13 | Microsoft Technology Licensing, Llc | Method and system of creating clusters for feedback data |
| US11860824B2 (en) * | 2022-04-27 | 2024-01-02 | Truist Bank | Graphical user interface for display of real-time feedback data changes |
| CN115910255A (en) * | 2022-09-29 | 2023-04-04 | 海南星捷安科技集团股份有限公司 | A diagnostic aid system |
| CN116072302A (en) * | 2023-02-17 | 2023-05-05 | 西安电子科技大学 | Medical unbalanced data classification method based on biased random forest model |
| CN116451100A (en) * | 2023-04-13 | 2023-07-18 | 盐城工学院 | Three-layer weighted clustering integration algorithm based on cluster member selection |
Non-Patent Citations (2)
| Title |
|---|
| "Ensemble Clustering via Co-association Matrix Self-enhancement";Yuheng Jia等;《IEEE Transactions on Neural Networks and Learning Systems》;20230306;第35卷(第8期);摘要、第Ⅱ-Ⅳ节 * |
| "一种基于成员选择的簇加权聚类集成算法";徐森等;《控制与决策》;20240411;第1-5页 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117195027A (en) | 2023-12-08 |
| CN118312816A (en) | 2024-07-09 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |