
CN118312816B - Cluster weighted clustering integrated medical text processing method and system based on member selection - Google Patents

Info

Publication number
CN118312816B
Authority
CN
China
Prior art keywords
data
cluster
clustering
matrix
medical text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410553936.0A
Other languages
Chinese (zh)
Other versions
CN118312816A (en)
Inventor
徐秀芳
高婷
徐森
郭乃瑄
许贺洋
卞学胜
花小朋
陈博炜
王志漩
刘轩绮
孙雯
徐畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Institute of Technology
Original Assignee
Yancheng Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Institute of Technology
Publication of CN118312816A
Application granted
Publication of CN118312816B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a cluster weighted clustering integrated medical data processing method and system based on member selection. The method comprises: constructing a cluster member set and inputting it into a pre-trained decision tree model; selecting, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the selected cluster members; determining a target CA matrix of the target cluster set according to the cluster layer weighting coefficient; and executing a hierarchical clustering algorithm based on the target CA matrix to obtain the final clustering result. The method and system yield better-optimized clustering results that show the characteristics and similarities of patients' diseases more clearly, which makes it easier for doctors to select medical services for different patients or to customize personalized treatment plans, and helps improve patients' medical experience and treatment outcomes.

Description

Cluster weighted clustering integrated medical text processing method and system based on member selection
Technical Field
The invention relates to the technical field of data processing, in particular to a cluster weighted clustering integrated medical data processing method and system based on member selection.
Background
Medical diagnostics are often faced with a large number of complex cases and clinical data, and doctors need to quickly and accurately classify patients in order to formulate personalized treatment regimens. However, rapid classification and prediction of large-scale data is challenging due to the complexity and diversity of the disease, and furthermore, different patients may exhibit different symptoms and characteristics, which are also factors of concern for classification.
Cluster analysis is one of the hot spots of machine learning research. It is widely used for data compression, information retrieval, image segmentation and text clustering, and is receiving growing attention in biology, geology, geography, abnormal data detection and other fields. Cluster analysis is a form of unsupervised machine learning: without prior knowledge of the data set, it automatically divides the data set into several groups, or clusters, using only a similarity measure between data points (samples, objects), so that the similarity between points in the same cluster is as high as possible and the similarity between points in different clusters is as low as possible. Cluster integration research began when the idea of ensemble learning was introduced into cluster analysis. It consists mainly of two stages. In cluster member generation, a data set is taken as input, a clustering algorithm is run, and a plurality of different clustering results are output. In cluster integration, i.e. consensus function design, the set formed by all cluster members (the cluster set) is taken as input, the cluster members are combined, and a final clustering result is output. Cluster integration algorithms are generally used for classifying medical data.
However, most existing cluster integration algorithms treat every cluster member equally and pay no attention to differences in member quality. Some algorithms do evaluate the cluster members and treat them differently, but they regard each cluster member as an independent, indivisible unit and ignore the local diversity of the clusters within a single cluster member, so low-quality clusters or low-quality base clusterings can still appear.
Disclosure of Invention
The invention aims to provide a cluster weighted clustering integrated medical data processing method and system based on member selection, which can obtain more optimized clustering results, can better show the characteristics and similarity of diseases of patients, is convenient for doctors to select medical services for different patients or customize personalized treatment schemes, and is beneficial to improving the medical experience and treatment effect of patients.
The embodiment of the invention provides a cluster weighted clustering integrated medical data processing method based on member selection, which comprises the following steps:
Constructing a cluster member set and inputting the cluster member set into a pre-trained decision tree model;
Selecting, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the selected cluster members;
determining a target CA matrix of a target cluster set according to the cluster layer weighting coefficient;
and executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.
Preferably, the construction of the cluster member set comprises clustering medical data by adopting a K-Means algorithm to generate a plurality of cluster members.
Preferably, the training steps of the decision tree model are as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, and marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label;
training is carried out based on sample cluster members with labels as a training set, and a trained decision tree model is obtained.
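The labeling step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the strings "high" and "low" stand in for the first and second labels (as named later in the detailed description), and a member whose index equals the mean is labeled "low" here because the source does not say how ties are handled. The Davies-Bouldin index is computed from its standard definition; lower values indicate more compact, better separated clusters.

```python
import math

def centroid(pts):
    """Component-wise mean of a list of points (tuples)."""
    return tuple(sum(x) / len(pts) for x in zip(*pts))

def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index; needs at least two clusters
    with distinct centroids."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {lab: sum(math.dist(p, cents[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)

def label_by_davies_bouldin(db_indices):
    """Label each member 'high' if its index is below the overall mean,
    otherwise 'low' (tie handling is an assumption)."""
    mean_db = sum(db_indices) / len(db_indices)
    return ["high" if db < mean_db else "low" for db in db_indices]

# Hypothetical toy data: two well separated 1-D clusters.
db = davies_bouldin([(0.0,), (2.0,), (10.0,), (12.0,)], [0, 0, 1, 1])
quality = label_by_davies_bouldin([0.5, 1.0, 1.5])
```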
Preferably, determining the target CA matrix of the target cluster according to the cluster layer weighting coefficient includes:
Constructing a CA matrix about the target cluster;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.
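Preferably, constructing a CA matrix for the target cluster set includes:
$A = (a_{ij})_{N \times N}, \quad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \quad \delta_{ij}^{m} = \begin{cases} 1, & \mathrm{Cls}^{m}(x_i) = \mathrm{Cls}^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$
where $A$ is the CA matrix; $\pi^{m}$ denotes the $m$-th cluster member; $M$ is the total number of cluster members in the cluster set; $\mathrm{Cls}^{m}(x_i)$ denotes the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ denotes the cluster of $\pi^{m}$ in which sample point $x_j$ is located.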
The invention also provides a cluster weighted clustering integrated medical data processing system based on member selection, which comprises a construction unit, an input unit, a screening unit, a matrix determining unit and an executing unit;
The construction unit constructs a cluster member set; the input unit inputs the cluster member set into a pre-trained decision tree model; the screening unit selects, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label and generates a target cluster set from the selected cluster members; the matrix determining unit determines a target CA matrix of the target cluster set according to the cluster layer weighting coefficient; and the executing unit executes a hierarchical clustering algorithm based on the target CA matrix to obtain the final clustering result.
Preferably, the construction unit clusters the medical data by using a K-Means algorithm to generate a plurality of cluster members.
Preferably, the training steps of the decision tree model are as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, and marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label;
training is carried out based on sample cluster members with labels as a training set, and a trained decision tree model is obtained.
Preferably, the matrix determining unit performs the following operations:
Constructing a CA matrix about the target cluster;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.
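Preferably, the matrix determining unit constructs a CA matrix for the target cluster set by performing the following operation:
$A = (a_{ij})_{N \times N}, \quad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \quad \delta_{ij}^{m} = \begin{cases} 1, & \mathrm{Cls}^{m}(x_i) = \mathrm{Cls}^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$
where $A$ is the CA matrix; $\pi^{m}$ denotes the $m$-th cluster member; $M$ is the total number of cluster members in the cluster set; $\mathrm{Cls}^{m}(x_i)$ denotes the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ denotes the cluster of $\pi^{m}$ in which sample point $x_j$ is located.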
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of a cluster weighted clustering integrated medical data processing method based on member selection in an embodiment of the invention;
FIG. 2 is a flow diagram of building a cluster member set according to one embodiment of the invention;
FIG. 3 is a flow chart of a cluster weighted clustering integration method based on member selection in accordance with yet another embodiment of the invention;
FIG. 4 is a flow chart of training a decision tree according to one embodiment of the invention;
FIG. 5 is a schematic diagram of a trained decision tree model according to one embodiment of the invention;
FIG. 6 is a schematic diagram of a cluster weighted clustering integrated medical data processing system based on member selection in an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides a cluster weighted clustering integrated medical data processing method based on member selection, which is shown in figure 1 and comprises the following steps:
Constructing a cluster member set and inputting the cluster member set into a pre-trained decision tree model;
Selecting, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the selected cluster members;
determining a target CA matrix of a target cluster set according to the cluster layer weighting coefficient;
and executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result.
The hierarchical clustering algorithm includes the average-linkage method, and the preset label is the label marked "high".
Constructing the cluster member set comprises clustering the medical data with the K-Means algorithm to generate a plurality of cluster members. The number of cluster members r and the number of clusters k are set, with k taken from an interval whose lower bound is 2 and r a positive integer; N denotes the number of data sample points. The K-Means algorithm is then used to randomly generate r cluster members as the cluster set P, which gives the cluster member set. Specifically:
Obtaining the number r of cluster members and the number k of clusters;
initializing i=1;
checking whether i ≤ r;
when i ≤ r, clustering with the K-Means algorithm to generate a cluster member and obtain its clustering result;
assigning i = i + 1 and repeating the check until i > r, at which point the cluster member set has been constructed.
In addition, the cluster member set can be constructed by using a hierarchical clustering method, a spectral clustering method and the like.
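The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `kmeans` is a plain Lloyd's-algorithm implementation written for this example, `k_min` and `k_max` are hypothetical parameters (the source only states that the lower bound of k is 2), and drawing k at random for each member is one possible way to make the r members differ.

```python
import math
import random

def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's K-Means; returns one cluster label per point."""
    centers = rng.sample(points, k)  # k distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def build_cluster_member_set(points, r, k_min, k_max, seed=0):
    """Generate r cluster members, mirroring the i = 1..r loop above."""
    rng = random.Random(seed)
    members = []
    i = 1
    while i <= r:
        k = rng.randint(k_min, k_max)  # assumption: k drawn at random per member
        members.append(kmeans(points, k, rng))
        i += 1
    return members

# Hypothetical toy data standing in for vectorized medical records.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1), (9.1, 0.0)]
P = build_cluster_member_set(points, r=5, k_min=2, k_max=3)
```

Each entry of `P` is one cluster member, stored as a list of cluster labels aligned with `points`.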
As shown in fig. 2 to 5, the training steps of the decision tree model are as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label, wherein the first label is 'high', and the second label is 'low';
Training is carried out with the labeled sample cluster members as the training set to obtain a trained decision tree model. The ARI (adjusted Rand index), NMI (normalized mutual information) and F (Fowlkes-Mallows index) of each cluster member form the feature attribute set. For all three attributes, values closer to 1 are better; the value range of ARI is [-1, 1], and the value ranges of NMI and F are [0, 1]. The importance of each feature attribute is calculated with the Gini coefficient to determine the root node and internal nodes of the decision tree, as follows:
$\mathrm{Gini}(D) = 1 - \sum_{i} p_i^{2}$ (1.1)
$\mathrm{Gini}(D, A) = \sum_{a} \frac{|D^{a}|}{|D|} \, \mathrm{Gini}(D^{a})$ (1.2)
where $p_i$ is the probability that a sample point belongs to class $i$; $A$ and $a$ denote a feature attribute and a state of that feature attribute, respectively; and $D^{a}$ is the subset of $D$ on which $A$ takes state $a$. The larger the Gini coefficient, the greater the uncertainty of the feature attribute.
First, formulas (1.1) and (1.2) are used to obtain the Gini coefficients of the labeled cluster members with respect to each of the three attributes ARI, NMI and F. The three coefficients are compared, the attribute with the smallest Gini coefficient is selected as the root node ("feature attribute 1"), and members whose value of feature attribute 1 is close to 1 are labeled "high". Next, the labeled cluster members whose value of feature attribute 1 is not close to 1 form a new labeled set, the Gini coefficients of the remaining two attributes are calculated on this set, and the attribute with the smaller coefficient is selected as an internal node ("feature attribute 2"). Finally, the remaining attribute becomes the last internal node ("feature attribute 3").
The value range of ARI is [-1, 1], and its values are divided into two classes: greater than 0, and less than or equal to 0. The value ranges of the NMI and F-measure indexes are [0, 1], and their values are divided into two classes: greater than 0.5, and less than or equal to 0.5. The importance of each feature attribute is calculated with the Gini coefficient to determine the root node and internal nodes of the decision tree; the larger the Gini coefficient, the greater the uncertainty of the feature attribute.
The attributes at the root node and the internal nodes of the decision tree are evaluated with the Gini coefficient.
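For example, the Gini coefficient of the ARI attribute is determined as follows.
According to the formula
$\mathrm{Gini}(D) = 1 - \sum_{i} p_i^{2}$
the Gini coefficient when ARI is greater than 0 is first calculated as:
$\mathrm{Gini}(D^{\mathrm{ARI}>0}) = 1 - \left(p_{\mathrm{high}}^{\mathrm{ARI}>0}\right)^{2} - \left(p_{\mathrm{low}}^{\mathrm{ARI}>0}\right)^{2}$
where $\mathrm{Gini}(D^{\mathrm{ARI}>0})$ is the Gini impurity of the data subset $D^{\mathrm{ARI}>0}$; $D^{\mathrm{ARI}>0}$ is the data subset obtained when the feature attribute ARI is greater than 0; $p_{\mathrm{high}}^{\mathrm{ARI}>0}$ is the number of samples labeled "high" divided by the number of samples when ARI is greater than 0; and $p_{\mathrm{low}}^{\mathrm{ARI}>0}$ is the number of samples labeled "low" divided by the number of samples when ARI is greater than 0.
Next, the Gini coefficient when ARI is less than or equal to 0 is calculated as:
$\mathrm{Gini}(D^{\mathrm{ARI}\le 0}) = 1 - \left(p_{\mathrm{high}}^{\mathrm{ARI}\le 0}\right)^{2} - \left(p_{\mathrm{low}}^{\mathrm{ARI}\le 0}\right)^{2}$
where $\mathrm{Gini}(D^{\mathrm{ARI}\le 0})$ is the Gini impurity of the data subset $D^{\mathrm{ARI}\le 0}$; $D^{\mathrm{ARI}\le 0}$ is the data subset obtained when the feature attribute ARI is less than or equal to 0; $p_{\mathrm{high}}^{\mathrm{ARI}\le 0}$ is the number of samples labeled "high" divided by the number of samples when ARI is less than or equal to 0; and $p_{\mathrm{low}}^{\mathrm{ARI}\le 0}$ is the number of samples labeled "low" divided by the number of samples when ARI is less than or equal to 0.
Finally, according to the formula
$\mathrm{Gini}(D, A) = \sum_{a} \frac{|D^{a}|}{|D|} \, \mathrm{Gini}(D^{a})$
the Gini coefficient of ARI is calculated as:
$\mathrm{Gini}(D, \mathrm{ARI}) = \frac{|D^{\mathrm{ARI}>0}|}{|D|} \, \mathrm{Gini}(D^{\mathrm{ARI}>0}) + \frac{|D^{\mathrm{ARI}\le 0}|}{|D|} \, \mathrm{Gini}(D^{\mathrm{ARI}\le 0})$
where $\mathrm{Gini}(D, \mathrm{ARI})$ is the Gini coefficient of data set $D$ when the feature attribute is ARI; $|D^{\mathrm{ARI}>0}|$ is the number of samples when ARI is greater than 0; $|D^{\mathrm{ARI}\le 0}|$ is the number of samples when ARI is less than or equal to 0; and $|D|$ is the number of samples of data set $D$.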
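The Gini calculation for the ARI attribute can be checked with a short sketch. The function names and the six-member toy data set are hypothetical; the computation is the standard Gini impurity of each subset, weighted by subset size, for a two-way split at a threshold (0 for ARI, as described above).

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_binary_split(values, labels, threshold):
    """Size-weighted Gini impurity of the two subsets obtained by
    splitting on value > threshold versus value <= threshold."""
    above = [lab for v, lab in zip(values, labels) if v > threshold]
    below = [lab for v, lab in zip(values, labels) if v <= threshold]
    n = len(labels)
    return len(above) / n * gini(above) + len(below) / n * gini(below)

# Hypothetical ARI values and quality labels of six cluster members.
ari = [0.8, 0.6, 0.3, -0.1, -0.2, 0.1]
quality = ["high", "high", "low", "low", "low", "high"]

# ARI > 0: 3 "high", 1 "low"  -> 1 - 0.75^2 - 0.25^2 = 0.375
# ARI <= 0: 2 "low"           -> 0
# Weighted: 4/6 * 0.375 + 2/6 * 0 = 0.25
gini_ari = gini_binary_split(ari, quality, threshold=0.0)
```

Running the same function on the NMI and F values with threshold 0.5 and picking the attribute with the smallest result gives the root-node choice described in the text.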
When the decision tree is constructed, the value range of each feature attribute is divided equally into two intervals, giving two branches per node.
The decision tree selects the best feature attribute as the root node according to the Gini coefficient and continues to select the more important feature attributes in the subsequent internal-node splits. After training is completed, the decision tree model can be used to classify new cluster members.
Alternatively, when the decision tree is constructed, the value range of each feature attribute can be subdivided into more than two intervals, i.e. multiple branches. This increases the branches at each node of the decision tree and the number of decision classes, and further improves classification precision.
In this way, the clustering algorithm and the decision tree are combined to obtain the decision tree model: the decision tree selects the best feature attribute as the root node according to the Gini coefficient and continues to select the more important feature attributes in the subsequent internal-node splits, so the model can be used to classify new cluster members.
The target cluster set is the set built from high-quality cluster members. When the cluster member set is processed by the decision tree model, the ARI, NMI and F-measure values in each cluster member's feature attribute set are provided to the model, and the model predicts the member's quality label ("high" or "low") according to the learned node-splitting rules. The cluster members labeled "high" form a new cluster set P', which takes part in the subsequent processing; this is the target cluster set.
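One possible reading of this selection step is sketched below. It is an assumption-laden illustration, not the trained model itself: the tree is represented as an ordered chain of (attribute, threshold) tests, a member is labeled "high" at the first test it passes and "low" if it passes none, and the attribute order in `ordered_splits` is hypothetical (in the method it would come from the Gini comparison).

```python
def predict_label(features, ordered_splits):
    """Walk the chain of tests from the root; the first attribute whose
    value exceeds its threshold yields 'high' (assumed rule)."""
    for name, threshold in ordered_splits:
        if features[name] > threshold:
            return "high"
    return "low"

def select_target_cluster_set(members, feature_table, ordered_splits):
    """Keep only the cluster members predicted 'high', i.e. the set P'."""
    return [m for m, f in zip(members, feature_table)
            if predict_label(f, ordered_splits) == "high"]

# Hypothetical attribute order and the thresholds named in the text
# (0 for ARI, 0.5 for NMI and F-measure).
ordered_splits = [("NMI", 0.5), ("F", 0.5), ("ARI", 0.0)]
members = ["m1", "m2", "m3"]
feature_table = [
    {"ARI": 0.7, "NMI": 0.8, "F": 0.9},
    {"ARI": -0.1, "NMI": 0.3, "F": 0.4},
    {"ARI": 0.2, "NMI": 0.4, "F": 0.6},
]
P_prime = select_target_cluster_set(members, feature_table, ordered_splits)
```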
The method for determining the target CA matrix of the target cluster group according to the cluster layer weighting coefficient comprises the following steps:
Constructing a CA matrix about the target cluster;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.
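Constructing a CA matrix for the target cluster set comprises:
$A = (a_{ij})_{N \times N}, \quad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \quad \delta_{ij}^{m} = \begin{cases} 1, & \mathrm{Cls}^{m}(x_i) = \mathrm{Cls}^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$
where $A$ is the CA matrix; $\pi^{m}$ denotes the $m$-th cluster member; $M$ is the total number of cluster members in the cluster set; $\mathrm{Cls}^{m}(x_i)$ denotes the cluster of $\pi^{m}$ in which sample point $x_i$ is located; and $\mathrm{Cls}^{m}(x_j)$ denotes the cluster of $\pi^{m}$ in which sample point $x_j$ is located.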
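A minimal sketch of the CA (co-association) matrix construction follows. It assumes each cluster member is stored as a list of cluster labels aligned with the sample points, which is an implementation choice made for this example; each entry is the fraction of cluster members that place the two sample points in the same cluster.

```python
def co_association_matrix(members):
    """Entry [i][j] is the share of cluster members in which sample
    points i and j carry the same cluster label."""
    m = len(members)
    n = len(members[0])
    counts = [[0] * n for _ in range(n)]
    for labels in members:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]

# Two hypothetical cluster members over three sample points.
A = co_association_matrix([[0, 0, 1], [0, 1, 1]])
```

Here points 0 and 1 share a cluster in one of the two members, so `A[0][1]` is 0.5, while points 0 and 2 never share a cluster, so `A[0][2]` is 0.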
The CA matrix is weighted by the cluster layer weighting coefficient to obtain the processed data B, as follows:
High-confidence information is captured from the CA matrix and used to supplement and refine it toward an ideal CA matrix. The high-confidence (HC) matrix is denoted H and is obtained as follows:
where one matrix records the positions of the highly reliable elements and the operator is element-wise. When the ratio of the number of times two sample points are assigned to the same cluster to the total number of cluster members exceeds a predefined threshold, the corresponding entry of the A matrix is regarded as a piece of highly reliable information (i.e. an element of the H matrix).
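The thresholding rule described above can be sketched as follows. The function name, the parameter name `theta` and the value 0.6 are hypothetical; the sketch assumes that "exceeds" means strictly greater than, and that entries not regarded as highly reliable are set to zero.

```python
def high_confidence_matrix(ca, theta):
    """Return (positions, H): positions[i][j] is 1 where the CA entry
    exceeds theta, and H is the element-wise product of ca and positions."""
    positions = [[1 if a > theta else 0 for a in row] for row in ca]
    h = [[a * p for a, p in zip(row_a, row_p)]
         for row_a, row_p in zip(ca, positions)]
    return positions, h

# Hypothetical 3 x 3 CA matrix and threshold.
ca = [[1.0, 0.5, 0.0],
      [0.5, 1.0, 0.8],
      [0.0, 0.8, 1.0]]
positions, H = high_confidence_matrix(ca, theta=0.6)
```

With this threshold only the diagonal and the 0.8 entries survive in `H`; the 0.5 entries are treated as unreliable and zeroed.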
Determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix comprises:
obtaining the final CA matrix, denoted C, from the H matrix and the B matrix, as follows:
where the formula involves a Laplacian matrix; a Lagrangian term whose default value is 1; a parameter that balances the error loss term; two Lagrange multipliers; and the Frobenius norm of a matrix. E denotes the error term, and F is an intermediate matrix used to relax the value-range constraint and the symmetry constraint of the ideal CA matrix.
Information entropy is introduced to the target clusters, and an uncertainty index IEI of each cluster is calculated and used as a cluster layer weighting coefficient. The IEI calculation method is as follows:
where the formula involves the total number of clusters and a term that measures the similarity between two clusters. The IEI reflects the likelihood that the points in a cluster remain together in the same cluster in the other base clusterings, so a larger IEI indicates a cluster whose points are more likely to stay together. The formula also uses the maximum value in the H matrix, the minimum value in the H matrix, and the value in the H matrix corresponding to the cluster;
or KL divergence is introduced to determine a cluster layer weighting coefficient;
or a plurality of evaluation indexes are introduced and fused together to form a new evaluation index to determine the cluster layer weighting coefficient.
The application introduces a trained decision tree model to assist in selecting high-quality cluster members, so that the quality and diversity of the cluster members are considered from several angles. It also considers the internal diversity of each cluster member, which allows the cluster members to be treated differently in a more comprehensive way. Finally, the CA matrix is fine-tuned according to the weights and the high-confidence information to improve the accuracy and robustness of the clustering.
The cluster weighted clustering integrated medical data processing method based on member selection uses a clustering algorithm to cluster patient data and generate a plurality of cluster members. Each cluster member divides the patients into different clusters, and each cluster represents a patient population with similar symptoms and features. High-quality cluster members are then selected with the aid of the decision tree model. To further improve the accuracy and interpretability of the clustering result, the method also attends to the internal diversity of the cluster members: the diversity of the clusters inside each member is measured and the CA matrix is fine-tuned accordingly. Finally, hierarchical clustering analysis is used to obtain the final clustering result. The method yields better-optimized clustering results that show the characteristics and similarities of patients' diseases more clearly, which makes it easier for doctors to select medical services for different patients or to customize personalized treatment plans, and helps improve patients' medical experience and treatment outcomes.
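The final hierarchical clustering step can be sketched with a small average-linkage routine that treats the target CA matrix as a similarity matrix. This is an illustrative sketch only: the function name is hypothetical, stopping at a requested number of clusters is an assumption (the source does not state the stopping rule), and the quadratic search over cluster pairs is chosen for clarity, not speed.

```python
def average_linkage(sim, n_clusters):
    """Agglomerative clustering on a similarity matrix: repeatedly merge
    the two clusters with the highest average pairwise similarity."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = (sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                       / (len(clusters[a]) * len(clusters[b])))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Convert the cluster lists into one label per sample point.
    labels = [0] * len(sim)
    for idx, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = idx
    return labels

# Hypothetical target CA matrix for four patients.
C = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.8],
     [0.1, 0.1, 0.8, 1.0]]
final_labels = average_linkage(C, n_clusters=2)
```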
In one embodiment, the processing method further comprises evaluating the final clustering result based on the external index, determining an evaluation value, and judging the validity of the final clustering result based on the evaluation value. The cluster result validity evaluation index is generally classified into an internal index and an external index. In most cases, class labels of the data set are known (not used in the clustering process), and an external index can be used to evaluate the effectiveness of clustering, where an F-measure is a relatively common comprehensive index for evaluating the quality of text clustering. The larger the F value is, the higher the clustering quality is, and when the clustering result is completely consistent with the real category, the F value reaches the maximum value, and the value is 1. In addition, NMI value is also a popular clustering result effectiveness evaluation index, and can quantify the matching degree of the clustering result and the real text category label.
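As an illustration of the external evaluation, the sketch below computes a clustering F-measure. The source does not give the formula, so this uses a commonly used definition as an assumption: for each true class, take the best F1 score over all clusters, then average these scores weighted by class size. The function name and toy labels are hypothetical.

```python
def clustering_f_measure(true_labels, pred_labels):
    """Class-size-weighted average of the best per-class F1 over clusters."""
    n = len(true_labels)
    total = 0.0
    for c in set(true_labels):
        n_c = true_labels.count(c)
        best = 0.0
        for k in set(pred_labels):
            n_k = pred_labels.count(k)
            n_ck = sum(1 for t, p in zip(true_labels, pred_labels)
                       if t == c and p == k)
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_c / n * best
    return total

truth = ["a", "a", "b", "b"]
f_perfect = clustering_f_measure(truth, [1, 1, 0, 0])  # clusters match classes
f_merged = clustering_f_measure(truth, [0, 0, 0, 0])   # everything in one cluster
```

A clustering that matches the true categories reaches the maximum value of 1, as stated above, while merging both classes into one cluster scores lower.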
In one embodiment, computing the Gini coefficient of the feature attribute set with respect to the three indices ARI, NMI, and F-measure includes:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2}$$

$$\mathrm{Gini}(D, a) = \sum_{v=1}^{V} \frac{|D^{v}|}{|D|}\,\mathrm{Gini}(D^{v})$$

Wherein $\mathrm{Gini}(D)$ represents the Gini impurity of dataset $D$; $p_k$ represents the proportion of class label $k$ in dataset $D$, i.e. the number of samples belonging to class $k$ divided by the total number of samples; $\mathrm{Gini}(D, a)$ represents the impurity of dataset $D$ under feature attribute $a$; $|D^{v}|$ represents the number of samples in the data subset obtained when feature attribute $a$ takes the value $v$; $\mathrm{Gini}(D^{v})$ represents the impurity of the data subset $D^{v}$. Based on these formulas, the Gini coefficients of the labeled cluster members with respect to the ARI, NMI, and F-measure indices are obtained. This allows the corresponding Gini coefficients to be calculated conveniently and accurately, makes the comparison of the three coefficients more reliable, and in turn allows the root node and the internal nodes to be divided accurately.
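As a worked illustration of the impurity calculation, the sketch below computes the Gini impurity of a label set (one minus the sum of squared class proportions) and the size-weighted impurity after splitting on one attribute. It assumes the attribute values (for example ARI, NMI, or F-measure scores) have already been discretised into bins, which the text does not specify; the names `gini` and `gini_index` are hypothetical. In CART-style trees the attribute with the smallest weighted impurity is usually chosen for the split.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(attribute_values, labels):
    """Impurity of the data set after splitting on one (discretised) attribute."""
    subsets = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        subsets[value].append(label)
    n = len(labels)
    # Weight each subset's impurity by its share of the samples.
    return sum(len(subset) / n * gini(subset) for subset in subsets.values())


# Hypothetical example: four cluster members labeled by quality,
# with an NMI score binned into "hi" / "lo".
labels = ["first", "first", "second", "second"]
nmi_bin = ["hi", "hi", "lo", "lo"]
ari_bin = ["hi", "lo", "hi", "lo"]
```

In this toy data the binned NMI separates the two labels perfectly (weighted impurity 0), while the binned ARI carries no information (weighted impurity 0.5), so NMI would be preferred for the split.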
In one embodiment, before the cluster member set is constructed, the cluster weighted clustering integrated medical data processing method based on member selection further processes the medical data as follows, in order to mark data that interferes with the clustering result:
Analyzing feedback data corresponding to the clustering result to determine marking nodes;
When the number of marking nodes is greater than or equal to a preset number (for example, any value from 2 to 100), outputting the marking nodes and their related data together with the training data of the decision tree model;
When the number of marking nodes is smaller than the preset number, screening and marking the medical data based on each marking node;
When the clustering result is output, outputting the corresponding marking information together with the marked medical data.
The feedback data is acquired as follows:
After the clustering result of a patient is sent to a preset doctor terminal, receiving annotation information from the doctor terminal;
And/or,
After the clustering result of the patient and the treatment plan of the patient are sent to the patient terminal, and a doubt-correction instruction concerning the received data is received from the patient, sending the clustering result, the treatment plan, and the doubt information to a preset professional doctor terminal, and then receiving annotation information from the professional doctor terminal;
The annotation information, clustering result, treatment plan, and/or doubt information are used as the feedback data.
The feedback data is analyzed as follows:
Screening the feedback data: keywords are extracted from the annotation information and doubt information in the feedback data based on a preset keyword library to obtain the keyword set corresponding to each piece of feedback data, and feedback data from which no keyword is extracted is deleted; because some feedback data contains both annotation information and doubt information, the keywords extracted from the two may interfere with each other, so the keywords in each keyword set are checked for interference against a preset keyword interference association library, and feedback data whose keyword set is found to contain interfering keywords is deleted;
Acquiring the original medical data corresponding to the clustering result that corresponds to the screened feedback data, and the feature data used during classification;
Associating the original medical data and the classification feature data with the feedback data to form data to be analyzed;
Grouping the data to be analyzed according to the original medical data and the classification feature data; specifically, the data similarity between the original medical data of two pieces of data to be analyzed and the data similarity between their classification feature data are calculated, and when the sum of these similarities is larger than a preset threshold, the two pieces are determined to belong to the same group;
Within each group, computing for each piece of data to be analyzed the sum of its similarities (in original medical data and classification feature data) to the other pieces in the group, and taking the feature set constructed from the original medical data and classification feature data of the piece with the largest sum as the feature set of the marking node;
Scoring the composition and content of the data to be analyzed based on a preset scoring model, and taking the key data extracted from the labeling data of the piece with the largest score as the marking data, wherein the more components the data to be analyzed has, the higher the score; the larger the total amount of data in the data to be analyzed, the higher the score; and the larger the total amount of extracted key data, the higher the score. The specific scoring rules are as follows:
Intercepting the data in the data to be analyzed according to its source;
Extracting features from the intercepted data according to the feature extraction rule corresponding to each source, the features including a parameter feature value corresponding to the total amount of data, parameter feature values corresponding to the keywords of a preset feature library contained in the intercepted data, and the like;
Determining the aspect score of each piece of intercepted data according to the scoring rule corresponding to its source and the data features of that piece;
Determining the weight corresponding to each aspect score according to the source of the data: the weight for classification feature data is a preset fixed value, the weight for labeling data corresponds to the authority value of the doctor terminal or professional doctor terminal that provided it, and the weight for medical data corresponds to the authority value of the terminal that uploaded it;
Calculating the score from the weights and the aspect scores: the score is the sum of the products of each aspect score and its corresponding weight.
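The final scoring rule is a weighted sum, which can be sketched as follows. The source names (`"labeling"`, `"medical"`, `"feature"`), the numeric scores and weights, and the function names are all hypothetical; the per-source feature extraction and scoring rules are not given in the text and are treated here as already applied.

```python
def score_item(aspect_scores, weights):
    """Score = sum over sources of (aspect score x weight for that source)."""
    return sum(aspect_scores[source] * weights[source] for source in aspect_scores)


def pick_marking_item(items):
    """Return the item with the largest score; its key data becomes the marking data."""
    return max(items, key=lambda item: score_item(item["aspect_scores"], item["weights"]))


# Hypothetical data to be analyzed: item "b" has fewer components but a
# higher-authority terminal behind its labeling data.
items = [
    {
        "id": "a",
        "aspect_scores": {"labeling": 4.0, "medical": 2.0, "feature": 1.0},
        "weights": {"labeling": 0.5, "medical": 0.25, "feature": 0.5},
    },
    {
        "id": "b",
        "aspect_scores": {"labeling": 2.0, "feature": 3.0},
        "weights": {"labeling": 1.0, "feature": 0.5},
    },
]
```

With these made-up numbers item "a" scores 3.0 and item "b" scores 3.5, so "b" would supply the marking data.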
According to this embodiment, data is marked in advance by means of the marking nodes to assist the clustering of medical data, so that the characteristics and similarity of patients' conditions are better reflected. This makes it easier for doctors to select medical services for different patients or to customize personalized treatment plans, which can improve patients' medical experience and treatment outcomes.
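The grouping step and the choice of a marking node's feature set described in this embodiment can be sketched as below. Several details are assumptions made only for illustration: Jaccard similarity over token sets stands in for the unspecified data similarity, pairs above the threshold are merged transitively with a union-find structure, and the field names `medical` and `features` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections (stand-in similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def pair_similarity(x, y):
    """Sum of the medical-data similarity and the feature-data similarity."""
    return jaccard(x["medical"], y["medical"]) + jaccard(x["features"], y["features"])


def group_items(items, threshold):
    """Put two items in the same group when their similarity sum exceeds the threshold."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if pair_similarity(items[i], items[j]) > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def representative(items, group):
    """Index of the item whose summed similarity to the rest of its group is largest."""
    return max(
        group,
        key=lambda i: sum(pair_similarity(items[i], items[j]) for j in group if j != i),
    )


# Hypothetical data to be analyzed.
items = [
    {"medical": ["fever", "cough"], "features": ["f1"]},
    {"medical": ["fever", "cough", "fatigue"], "features": ["f1"]},
    {"medical": ["fever", "fatigue"], "features": ["f1"]},
    {"medical": ["fracture"], "features": ["f9"]},
]
groups = group_items(items, threshold=1.0)
```

Here the first three items fall into one group and the fourth stands alone; within the first group, item 1 has the largest summed similarity, so its original medical data and feature data would form the feature set of the marking node.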
For the on-site application scenario of medical data processing, in one embodiment, the medical data used in the cluster weighted clustering integrated medical data processing method based on member selection is constructed as follows:
The valid upload time interval can be configured manually; for example, when the first terminal is a computer in a hospital outpatient clinic, the valid upload time interval can be configured to run from half an hour before the doctor starts work to half an hour after the doctor finishes work;
Determining whether the data amount of the first data meets a preset first data requirement;
When the requirement is not met, supplementing the first data with standard data stored by the system;
Wherein the first data requirement comprises the medical data containing data of a preset number of patients, for example 100, 200, or 300;
The association rule for first terminals is that first terminals of the same department in different hospitals are associated into one group, which facilitates clustering;
In addition, the valid upload time can be set in segments: with each half hour as one segment, each valid time segment corresponds to one analysis, so as to meet outpatient doctors' needs for data processing.
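The upload-window check and the supplementing of data up to the preset patient count could look like the following minimal sketch. The 30-minute margin around the doctor's shift is one reading of the example above, and the function names, record identifiers, and the choice to take standard records in stored order are hypothetical.

```python
from datetime import datetime, timedelta


def in_upload_window(uploaded_at, shift_start, shift_end, margin=timedelta(minutes=30)):
    """True when the upload falls within the shift, extended by the margin on both sides."""
    return shift_start - margin <= uploaded_at <= shift_end + margin


def complete_records(uploaded, standard, required):
    """Pad the uploaded patient records with stored standard records up to `required`."""
    missing = max(0, required - len(uploaded))
    return list(uploaded) + list(standard[:missing])


# Hypothetical shift from 08:00 to 17:00.
shift_start = datetime(2024, 5, 7, 8, 0)
shift_end = datetime(2024, 5, 7, 17, 0)
```

A segmented configuration, with one analysis per half-hour segment, could reuse the same check with each segment's start and end and a zero margin.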
In the scenario of post-hoc analysis and secondary analysis of medical data, in one embodiment, the medical data used in the cluster weighted clustering integrated medical data processing method based on member selection is constructed as follows:
Acquiring second data uploaded by a second terminal, the second terminal being a terminal device registered in the system by each hospital (for example, a computer located in an analysis room);
At upload time, integrating all of the second data and determining whether it meets a preset second data requirement;
When the preset second data requirement is not met, outputting a data list based on the second terminal's historical upload data, receiving the second terminal's selection of data from the list, and supplementing the second data with the selected data until the requirement is met, wherein the second data requirement comprises the medical data containing data of a preset number of patients, for example 100, 200, or 300.
Selecting from historical upload data ensures, on the one hand, that the data meets the requirements of clustering; on the other hand, it allows historical data to be included in the cluster analysis, so that the historical data is indirectly verified and analyzed.
The invention also provides a cluster weighted clustering integrated medical data processing system based on member selection which, as shown in Figure 6, comprises a construction unit 1, an input unit 2, a screening unit 3, a matrix determining unit 4, and an executing unit 5;
The construction unit 1 constructs a cluster member set; the input unit 2 inputs the cluster member set into a pre-trained decision tree model; the screening unit 3 screens out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generates a target cluster set from the screened cluster members; the matrix determining unit 4 determines a target CA matrix of the target cluster set according to cluster-layer weighting coefficients; and the executing unit 5 executes a hierarchical clustering algorithm based on the target CA matrix to obtain the final clustering result.
The construction unit 1 clusters the medical data using a K-Means algorithm, generating a plurality of cluster members.
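Generating several cluster members with K-Means can be sketched with a small pure-Python Lloyd's algorithm. This is only an illustration: it assumes the medical data has already been converted to numeric vectors, and it varies the number of clusters and the random seed to obtain diverse members, which is a common practice rather than something the text prescribes. The names `kmeans` and `build_cluster_members` are hypothetical.

```python
import random


def kmeans(points, k, seed, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each centre to the mean of its assigned points.
        for c in range(k):
            assigned = [p for p, label in zip(points, labels) if label == c]
            if assigned:
                centers[c] = [sum(col) / len(assigned) for col in zip(*assigned)]
    return labels


def build_cluster_members(points, ks, seeds):
    """One cluster member (a full labeling of the data) per (k, seed) pair."""
    return [kmeans(points, k, seed) for k in ks for seed in seeds]


# Hypothetical two-dimensional vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
members = build_cluster_members(points, ks=[2], seeds=[0, 1, 2])
```

Each entry of `members` is one base clustering; together they form the cluster member set passed to the decision tree model.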
The training steps of the decision tree model are as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
Comparing the Davies-Bouldin indexes of each sample cluster member with the average value respectively, marking the sample cluster members with the Davies-Bouldin indexes lower than the average value with a first label, and marking the sample cluster members with the Davies-Bouldin indexes higher than the average value with a second label;
training is carried out based on sample cluster members with labels as a training set, and a trained decision tree model is obtained.
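The labeling part of these training steps can be sketched as follows, using the standard Davies-Bouldin definition (average, over clusters, of the worst ratio of summed within-cluster scatter to between-centroid distance; lower is better). Training the decision tree itself is not shown. Assigning members whose index equals the average to the second label is an assumption, since the text covers only the lower and higher cases, and the function names are hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of one clustering (needs at least two distinct clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centroids = {c: [sum(col) / len(ps) for col in zip(*ps)] for c, ps in clusters.items()}
    scatter = {
        c: sum(math.dist(p, centroids[c]) for p in ps) / len(ps)
        for c, ps in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


def label_members(points, members):
    """First label below the mean index, second label otherwise (ties: assumption)."""
    scores = [davies_bouldin(points, m) for m in members]
    mean = sum(scores) / len(scores)
    return ["first" if s < mean else "second" for s in scores]


# Hypothetical data: one compact clustering and one that cuts across the groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good, bad = [0, 0, 1, 1], [0, 1, 0, 1]
```

For this toy data the compact clustering has an index of 0.1 and the poor one an index of 10, so the first receives the first label and the second the second label.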
The matrix determining unit 4 performs the following operations:
Constructing a CA matrix for the target cluster set;
weighting the CA matrix based on the cluster layer weighting coefficient to obtain processed data B;
capturing high confidence information from the CA matrix to obtain an HC matrix;
and determining a target CA matrix of the target cluster set according to the processing data B and the HC matrix.
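Once a target CA matrix is available, the executing unit runs hierarchical clustering on it. The text does not name a linkage criterion, so the sketch below assumes average linkage and treats the matrix entries directly as similarities; `agglomerative` and the example matrix are hypothetical.

```python
def agglomerative(similarity, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric similarity matrix."""
    clusters = [[i] for i in range(len(similarity))]

    def avg_sim(a, b):
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the highest average similarity.
        x, y = max(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: avg_sim(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(similarity)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical target CA matrix for four samples.
target_ca = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
final_labels = agglomerative(target_ca, n_clusters=2)
```

With this matrix, samples 0 and 1 merge first, then samples 2 and 3, giving two final clusters.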
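The matrix determining unit 4 constructs the CA matrix for the target cluster set as follows:

$$A = \{a_{ij}\}_{N \times N}, \qquad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \qquad \delta_{ij}^{m} = \begin{cases} 1, & C^{m}(x_i) = C^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$$

Wherein $A$ is the CA matrix; $\pi^{m}$ represents the $m$-th cluster member, from which $\delta_{ij}^{m}$ is computed; $M$ represents the total number of cluster members in the cluster set; $C^{m}(x_i)$ represents the cluster in which sample point $x_i$ is located; $C^{m}(x_j)$ represents the cluster in which sample point $x_j$ is located; $N$ is the number of sample points.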
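A co-association (CA) matrix in its usual form records, for every pair of sample points, the fraction of cluster members that place the two points in the same cluster. The sketch below builds it from a list of labelings; the function name and the toy members are hypothetical, and the cluster-layer weighting and HC-matrix steps are not shown.

```python
def co_association(members):
    """CA matrix: entry (i, j) is the share of members putting i and j in one cluster."""
    m = len(members)
    n = len(members[0])
    return [
        [sum(1 for labels in members if labels[i] == labels[j]) / m for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical cluster members over three sample points.
members = [
    [0, 0, 1],
    [0, 1, 1],
]
ca = co_association(members)
```

Here points 0 and 1 share a cluster in one of the two members, so their entry is 0.5; points 0 and 2 never share a cluster, so their entry is 0; the diagonal is always 1.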
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A cluster weighted clustering integrated medical text processing method based on member selection, characterized by comprising:
constructing a cluster member set and inputting it into a pre-trained decision tree model;
screening out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generating a target cluster set from the screened cluster members;
determining a target CA matrix of the target cluster set according to cluster-layer weighting coefficients;
executing a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result;
evaluating the final clustering result based on an external index, determining an evaluation value, and judging the validity of the final clustering result based on the evaluation value;
wherein, before the cluster member set is constructed, the medical text data is further processed as follows to mark data that interferes with the clustering result:
analyzing feedback data corresponding to the clustering result to determine marking nodes;
when the number of marking nodes is greater than or equal to a preset number, outputting the marking nodes and their related data together with the training data of the decision tree model, and, after receiving re-marking from professionals, retraining the decision tree model with the re-marked data;
when the number of marking nodes is smaller than the preset number, screening and marking the medical text data based on each marking node;
when the clustering result is output, outputting the corresponding marking information together with the marked medical text data;
wherein the feedback data is acquired as follows:
after the clustering result of a patient is sent to a preset doctor terminal, receiving annotation information from the doctor terminal;
and/or,
after the clustering result of the patient and the treatment plan of the patient are sent to the patient terminal, and a doubt-correction instruction concerning the received data is received from the patient, sending the clustering result, the treatment plan, and the doubt information to a preset professional doctor terminal, and then receiving annotation information from the professional doctor terminal;
using the annotation information, clustering result, treatment plan, and/or doubt information as the feedback data;
wherein the feedback data is analyzed as follows:
screening the feedback data;
acquiring the original medical text data corresponding to the clustering result that corresponds to the screened feedback data, and the feature data used during classification;
associating the original medical text data and the classification feature data with the feedback data to form data to be analyzed;
grouping the data to be analyzed according to the original medical text data and the classification feature data;
within each group, according to the sum of the similarities between each piece of data to be analyzed and the other pieces in terms of original medical text data and classification feature data, taking the feature set constructed from the original medical text data and classification feature data of the piece with the largest sum as the feature set of the marking node;
scoring the composition and content of the data to be analyzed based on a preset scoring model, and taking the key data extracted from the labeling data of the piece with the largest score as the marking data;
intercepting the data in the data to be analyzed according to its source;
extracting features from the intercepted data according to the feature extraction rule corresponding to each source;
determining the aspect score of each piece of intercepted data according to the scoring rule corresponding to its source and the data features of that piece;
determining the weight corresponding to each aspect score according to the source of the data;
calculating the score from the weights and the aspect scores;
wherein, in most cases, the class labels of the medical text data are known, and an external index is then used to evaluate clustering validity; the F value is a comprehensive index for evaluating the clustering quality of medical text data: the larger the F value, the higher the clustering quality, and when the clustering result is completely consistent with the classes of the medical text data the F value reaches its maximum of 1; the NMI value is a validity index for clustering results that quantifies how well the clustering result matches the true class labels of the medical text data.

2. The cluster weighted clustering integrated medical text processing method based on member selection according to claim 1, characterized in that constructing the cluster member set comprises: clustering the medical text data using the K-Means algorithm to generate a plurality of cluster members.

3. The cluster weighted clustering integrated medical text processing method based on member selection according to claim 1, characterized in that the decision tree model is trained as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
comparing the Davies-Bouldin index of each sample cluster member with the average value, giving sample cluster members whose Davies-Bouldin index is lower than the average value a first label, and giving sample cluster members whose Davies-Bouldin index is higher than the average value a second label;
training on the labeled sample cluster members as the training set to obtain the trained decision tree model.

4. The cluster weighted clustering integrated medical text processing method based on member selection according to claim 1, characterized in that determining the target CA matrix of the target cluster set according to the cluster-layer weighting coefficients comprises:
constructing a CA matrix for the target cluster set;
weighting the CA matrix based on the cluster-layer weighting coefficients to obtain processed data B;
capturing high-confidence information from the CA matrix to obtain an HC matrix;
determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix.

5. The cluster weighted clustering integrated medical text processing method based on member selection according to claim 4, characterized in that constructing the CA matrix for the target cluster set comprises:

$$A = \{a_{ij}\}_{N \times N}, \qquad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \qquad \delta_{ij}^{m} = \begin{cases} 1, & C^{m}(x_i) = C^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$$

wherein $A$ is the CA matrix; $\pi^{m}$ represents the $m$-th cluster member, from which $\delta_{ij}^{m}$ is computed; $M$ represents the total number of cluster members in the cluster set; $C^{m}(x_i)$ represents the cluster in which sample point $x_i$ is located; $C^{m}(x_j)$ represents the cluster in which sample point $x_j$ is located.

6. A cluster weighted clustering integrated medical text processing system based on member selection, characterized by comprising: a construction unit, an input unit, a screening unit, a matrix determining unit, and an executing unit;
wherein the construction unit constructs a cluster member set; the input unit inputs the cluster member set into a pre-trained decision tree model; the screening unit screens out, from the cluster member set output by the decision tree model, the cluster members whose label is the preset label, and generates a target cluster set from the screened cluster members; the matrix determining unit determines a target CA matrix of the target cluster set according to cluster-layer weighting coefficients; the executing unit executes a hierarchical clustering algorithm based on the target CA matrix to obtain a final clustering result;
wherein, before the cluster member set is constructed, the medical text data is further processed as follows to mark data that interferes with the clustering result:
analyzing feedback data corresponding to the clustering result to determine marking nodes;
when the number of marking nodes is greater than or equal to a preset number, outputting the marking nodes and their related data together with the training data of the decision tree model, and, after receiving re-marking from professionals, retraining the decision tree model with the re-marked data;
when the number of marking nodes is smaller than the preset number, screening and marking the medical text data based on each marking node;
when the clustering result is output, outputting the corresponding marking information together with the marked medical text data;
wherein the feedback data is acquired as follows:
after the clustering result of a patient is sent to a preset doctor terminal, receiving annotation information from the doctor terminal;
and/or,
after the clustering result of the patient and the treatment plan of the patient are sent to the patient terminal, and a doubt-correction instruction concerning the received data is received from the patient, sending the clustering result, the treatment plan, and the doubt information to a preset professional doctor terminal, and then receiving annotation information from the professional doctor terminal;
using the annotation information, clustering result, treatment plan, and/or doubt information as the feedback data;
wherein the feedback data is analyzed as follows:
screening the feedback data;
acquiring the original medical text data corresponding to the clustering result that corresponds to the screened feedback data, and the feature data used during classification;
associating the original medical text data and the classification feature data with the feedback data to form data to be analyzed;
grouping the data to be analyzed according to the original medical text data and the classification feature data;
within each group, according to the sum of the similarities between each piece of data to be analyzed and the other pieces in terms of original medical text data and classification feature data, taking the feature set constructed from the original medical text data and classification feature data of the piece with the largest sum as the feature set of the marking node;
scoring the composition and content of the data to be analyzed based on a preset scoring model, and taking the key data extracted from the labeling data of the piece with the largest score as the marking data;
intercepting the data in the data to be analyzed according to its source;
extracting features from the intercepted data according to the feature extraction rule corresponding to each source;
determining the aspect score of each piece of intercepted data according to the scoring rule corresponding to its source and the data features of that piece;
determining the weight corresponding to each aspect score according to the source of the data;
calculating the score from the weights and the aspect scores;
wherein the executing unit further evaluates the final clustering result based on an external index, determines an evaluation value, and judges the validity of the final clustering result based on the evaluation value;
wherein, in most cases, the class labels of the medical text data are known, and an external index is then used to evaluate clustering validity; the F value is a comprehensive index for evaluating the clustering quality of medical text data: the larger the F value, the higher the clustering quality, and when the clustering result is completely consistent with the classes of the medical text data the F value reaches its maximum of 1; the NMI value is a validity index for clustering results that quantifies how well the clustering result matches the true class labels of the medical text data.

7. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 6, characterized in that the construction unit clusters the medical text data using the K-Means algorithm to generate a plurality of cluster members.

8. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 6, characterized in that the decision tree model is trained as follows:
calculating the Davies-Bouldin index of each sample cluster member in the sample cluster member set, and calculating the overall average value;
comparing the Davies-Bouldin index of each sample cluster member with the average value, giving sample cluster members whose Davies-Bouldin index is lower than the average value a first label, and giving sample cluster members whose Davies-Bouldin index is higher than the average value a second label;
training on the labeled sample cluster members as the training set to obtain the trained decision tree model.

9. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 6, characterized in that the matrix determining unit performs the following operations:
constructing a CA matrix for the target cluster set;
weighting the CA matrix based on the cluster-layer weighting coefficients to obtain processed data B;
capturing high-confidence information from the CA matrix to obtain an HC matrix;
determining the target CA matrix of the target cluster set according to the processed data B and the HC matrix.

10. The cluster weighted clustering integrated medical text processing system based on member selection according to claim 9, characterized in that the matrix determining unit constructs the CA matrix for the target cluster set, comprising:

$$A = \{a_{ij}\}_{N \times N}, \qquad a_{ij} = \frac{1}{M} \sum_{m=1}^{M} \delta_{ij}^{m}, \qquad \delta_{ij}^{m} = \begin{cases} 1, & C^{m}(x_i) = C^{m}(x_j) \\ 0, & \text{otherwise} \end{cases}$$

wherein $A$ is the CA matrix; $\pi^{m}$ represents the $m$-th cluster member, from which $\delta_{ij}^{m}$ is computed; $M$ represents the total number of cluster members in the cluster set; $C^{m}(x_i)$ represents the cluster in which sample point $x_i$ is located; $C^{m}(x_j)$ represents the cluster in which sample point $x_j$ is located.
CN202410553936.0A 2023-09-08 2024-05-07 Cluster weighted clustering integrated medical text processing method and system based on member selection Active CN118312816B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2023111662103 2023-09-08
CN202311166210.3A CN117195027A (en) 2023-09-08 2023-09-08 Cluster weighted clustering integration method based on member selection

Publications (2)

Publication Number Publication Date
CN118312816A CN118312816A (en) 2024-07-09
CN118312816B true CN118312816B (en) 2025-04-08

Family

ID=89001143

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311166210.3A Pending CN117195027A (en) 2023-09-08 2023-09-08 Cluster weighted clustering integration method based on member selection
CN202410553936.0A Active CN118312816B (en) 2023-09-08 2024-05-07 Cluster weighted clustering integrated medical text processing method and system based on member selection

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202311166210.3A Pending CN117195027A (en) 2023-09-08 2023-09-08 Cluster weighted clustering integration method based on member selection

Country Status (1)

Country Link
CN (2) CN117195027A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688412B (en) * 2024-02-02 2024-05-07 中国人民解放军海军青岛特勤疗养中心 An intelligent data processing system for orthopedic care
CN120452661B (en) * 2025-07-10 2025-10-17 四川互慧软件有限公司 Respiratory failure treatment scheme recommendation method and system based on artificial intelligence

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599913A (en) * 2016-12-07 2017-04-26 重庆邮电大学 Cluster-based multi-label imbalance biomedical data classification method
CN117391456A (en) * 2023-11-27 2024-01-12 浙江南斗数智科技有限公司 Village management method and service platform system based on artificial intelligence

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
RU2603601C2 (en) * 2010-12-20 2016-11-27 Конинклейке Филипс Электроникс Н.В. Methods and systems for identifying patients with mild cognitive impairment with risk of transition to alzheimer's disease
US11587678B2 (en) * 2020-03-23 2023-02-21 Clover Health Machine learning models for diagnosis suspecting
US20230111999A1 (en) * 2021-10-08 2023-04-13 Microsoft Technology Licensing, Llc Method and system of creating clusters for feedback data
US11860824B2 (en) * 2022-04-27 2024-01-02 Truist Bank Graphical user interface for display of real-time feedback data changes
CN115910255A (en) * 2022-09-29 2023-04-04 海南星捷安科技集团股份有限公司 A diagnostic aid system
CN116072302A (en) * 2023-02-17 2023-05-05 西安电子科技大学 Medical unbalanced data classification method based on biased random forest model
CN116451100A (en) * 2023-04-13 2023-07-18 盐城工学院 Three-layer weighted clustering integration algorithm based on cluster member selection

Non-Patent Citations (2)

Title
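"Ensemble Clustering via Co-association Matrix Self-enhancement"; Yuheng Jia et al.; IEEE Transactions on Neural Networks and Learning Systems; 20230306; Vol. 35, No. 8; Abstract, Sections II-IV *
"A Cluster-Weighted Clustering Ensemble Algorithm Based on Member Selection"; Xu Sen et al.; Control and Decision (《控制与决策》); 20240411; pp. 1-5 *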

Also Published As

Publication number Publication date
CN117195027A (en) 2023-12-08
CN118312816A (en) 2024-07-09

Similar Documents

Publication Publication Date Title
CN118312816B (en) Cluster weighted clustering integrated medical text processing method and system based on member selection
WO2020181805A1 (en) Diabetes prediction method and apparatus, storage medium, and computer device
CN107358014B (en) Clinical pretreatment method and system of physiological data
CN110503155B (en) A method for information classification and related device and server
CN101297297A (en) Medical risk stratification method and system
JP2023532292A (en) Machine learning based medical data checker
CN119008013A (en) Viral pneumonia risk assessment prediction system
CN108206056B (en) An artificial intelligence-assisted diagnosis and treatment decision-making terminal for nasopharyngeal carcinoma
CN120636766A (en) Traditional Chinese Medicine Intelligent Health Diagnosis System and Method Based on Multi-source Data Fusion
CN116259415A (en) A machine learning-based prediction method for patient medication compliance
CN119092144B (en) Health management system and method based on multi-source data acquisition and analysis
CN115910326A (en) Auxiliary diagnosis method and system for bronchial asthma based on interpretable machine learning
CN110033862B (en) Traditional Chinese medicine quantitative diagnosis system based on weighted directed graph and storage medium
CN119560149A (en) A data processing method and system for comprehensive health assessment of the elderly based on digital twins
CN119339964A (en) Brain tumor multidimensional data analysis method and system
Klemm et al. Interactive visual analysis of lumbar back pain-what the lumbar spine tells about your life
Kumari et al. Machine Learning Classification Techniques to investigate Parkinson's disease
EP4609406A1 (en) Cancer progression assessment method and system thereof
Jha et al. Enhanced Predictive Modeling Techniques for Early Detection of COPD Utilizing 1D Convolutional Neural Networks
Wulandari et al. Application of the Random Forest Algorithm for Breast Cancer Analysis with Data Balancing: Case Study at Al-Ihsan Hospital
Raj Enhancing Thyroid Cancer Diagnostics Through Hybrid Machine Learning and Metabolomics Approaches.
Sadhu et al. Exploratory Modeling Strategies for Robust Medical Condition Classification Using Machine Learning
CN119851962B (en) Analysis methods, systems, equipment and media for nuclear medicine imaging radiology reports
CN119833077B (en) CKD full-flow information management system based on active learning
CN119889561B (en) Reference interval construction system for test items based on gender and age

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant