CN109801175A - Method and device for detecting medical insurance fraud

Info

Publication number: CN109801175A
Application number: CN201910054527.5A
Authority: CN
Inventors: 王红熳; 张东宁; 杨放春
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-01-21
Filing date: 2019-01-21
Publication date: 2019-05-24

Abstract

Embodiments of the present invention provide a method and device for detecting medical insurance fraud. The method includes: determining a plurality of medical bill data to be analyzed, each of which includes a plurality of evaluation indicators; dividing the plurality of medical bill data to be analyzed into a plurality of data subsets; for each data subset, determining a subset weight set corresponding to the data subset, and clustering the medical bill data to be analyzed in the data subset according to that subset weight set to obtain the cluster members corresponding to the data subset; determining a full-set weight set corresponding to the full set composed of all the medical bill data to be analyzed; fusing all cluster members according to the full-set weight set; and determining the isolated medical bill data to be analyzed obtained after the fusion and treating it as suspicious medical bill data. In this way, the manual workload can be reduced.

Description

Method and device for detecting medical insurance fraud

Technical Field

The invention relates to the field of computer technology, and in particular to a method and device for detecting medical insurance fraud.

Background

China's current medical subsidies are substantial, and the level of medical insurance benefits is rising steadily. At the same time, some problems in the medical insurance system are becoming increasingly prominent, and one of the central problems is medical insurance fraud.

Medical insurance fraud mainly refers to acts by citizens, legal persons or other organizations that, in the course of participating in medical insurance, paying medical insurance premiums or receiving medical insurance benefits, intentionally fabricate facts, falsify information or conceal the true situation, thereby causing losses to medical insurance funds.

At the micro level, medical insurance fraud arises because the uncertainty of health and disease risks, together with highly specialized medical services, leads to a serious information asymmetry between consumers and medical service providers. Because of this asymmetry, providers lack an inherent cost-restraint and incentive mechanism, which produces induced demand, so that the upward trend in medical expenses cannot be effectively controlled.

According to investigations of existing medical insurance fraud cases and analysis of actual data, medical insurance fraud generally takes the following three forms:

1) The price of a single bill is too high: high-priced bills caused by drug bills that do not match the doctor's order (a large quantity of expensive drugs prescribed for a minor illness), bulk purchases of the same drug, and similar behavior.

2) Small bills of the same type appear many times: doctors and patients collude to split one complete, continuous medical service item into several service items that are charged separately, or to split a bill with an excessive fraudulent amount into several small prescriptions, so that medicine is collected repeatedly within a short period.

3) Fraudulent use of another person's medical insurance card: using someone else's medical insurance card to handle one's own medical insurance business.

In recent years medical insurance, which should be a citizen welfare program, has been abused, causing huge losses to the state. Detecting fraud from the bills that insured persons pay after receiving treatment in hospital, so that those responsible can be held accountable promptly, later fraud can be prevented, and losses to medical insurance funds can be avoided, has therefore become an extremely important problem.

There are already many ways of processing medical insurance bill data for these purposes. One of them uses a trained model to examine medical insurance bill data and decide whether it is suspicious medical bill data, that is, medical bill data that may involve medical insurance fraud. In this existing approach, training the model requires a large amount of sample medical bill data to be labeled by hand. Specifically, multiple sample bills are obtained, each is manually labeled as suspicious or not suspicious medical bill data, and the detection model is then trained on the labeled samples. The labeling process therefore consumes a great deal of manpower, and the manual workload is large.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a method and device for detecting medical insurance fraud, so as to reduce the manual workload. The specific technical solutions are as follows:

In a first aspect, an embodiment of the present invention provides a method for detecting medical insurance fraud, including:

determining a plurality of medical bill data to be analyzed, the medical bill data to be analyzed including a plurality of evaluation indicators;

dividing the plurality of medical bill data to be analyzed to obtain a plurality of data subsets;

for each data subset, determining a subset weight set corresponding to the data subset, and clustering the medical bill data to be analyzed included in the data subset according to the subset weight set corresponding to the data subset, to obtain the cluster members corresponding to the data subset;

determining a full-set weight set corresponding to the full set composed of all the medical bill data to be analyzed;

fusing all cluster members according to the full-set weight set, wherein all the cluster members consist of the cluster members corresponding to the respective data subsets;

determining the isolated medical bill data to be analyzed obtained after the fusion, and treating the isolated medical bill data to be analyzed as suspicious medical bill data.

In a second aspect, an embodiment of the present invention provides a device for detecting medical insurance fraud, including:

a first determining module, configured to determine a plurality of medical bill data to be analyzed, the medical bill data to be analyzed including a plurality of evaluation indicators;

a dividing module, configured to divide the plurality of medical bill data to be analyzed to obtain a plurality of data subsets;

a second determining module, configured to determine, for each data subset, a subset weight set corresponding to the data subset;

a clustering module, configured to cluster the medical bill data to be analyzed included in the data subset according to the subset weight set corresponding to the data subset, to obtain the cluster members corresponding to the data subset;

a third determining module, configured to determine a full-set weight set corresponding to the full set composed of all the medical bill data to be analyzed;

a fusion module, configured to fuse all cluster members according to the full-set weight set, wherein all the cluster members consist of the cluster members corresponding to the respective data subsets;

a fourth determining module, configured to determine the isolated medical bill data to be analyzed obtained after the fusion, and to treat the isolated medical bill data to be analyzed as suspicious medical bill data.

The method and device for detecting medical insurance fraud provided by the embodiments of the present invention require no manual labeling during detection, which reduces the manual workload. They can also improve computational efficiency, and because subjective interference is reduced, they can further improve the accuracy of the calculation. Of course, a product or method implementing the present invention need not achieve all of the above advantages at the same time.

Brief Description of the Drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the method for detecting medical insurance fraud provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of the fusion process in an embodiment of the present invention;

FIG. 3(a) is a schematic diagram of the subsystems in an embodiment of the present invention;

FIG. 3(b) is a schematic diagram of the medical data collection and organization module in an embodiment of the present invention;

FIG. 3(c) is a schematic diagram of the medical data preprocessing module in an embodiment of the present invention;

FIG. 3(d) is a schematic diagram of the module for formulating medical insurance fraud evaluation indicator weights in an embodiment of the present invention;

FIG. 3(e) is a schematic diagram of the cluster member generation module in an embodiment of the present invention;

FIG. 3(f) is a schematic diagram of the cluster fusion module in an embodiment of the present invention;

FIG. 3(g) is a schematic diagram of the medical insurance fraud result output module in an embodiment of the present invention;

FIG. 3(h) is a schematic diagram of the cluster member storage module in an embodiment of the present invention;

FIG. 4(a) is a schematic flowchart of a specific embodiment provided by an embodiment of the present invention;

FIG. 4(b) is a schematic diagram of the data flow in a specific embodiment of the present invention;

FIG. 5 is a schematic structural diagram of the apparatus for detecting medical insurance fraud provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of the equipment for detecting medical insurance fraud provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The method for detecting medical insurance fraud provided by the embodiments of the present invention can be applied to an electronic device. Specifically, the electronic device may include a terminal, a server, a processor, and the like.

An embodiment of the present invention provides a method for detecting medical insurance fraud which, as shown in FIG. 1, may include:

S101: Determine a plurality of medical bill data to be analyzed.

The medical bill data to be analyzed includes multiple evaluation indicators. The evaluation indicators may include the total bill price, drug type, number of times medicine was collected, doctor's order type, ID of the doctor who issued the order, medical insurance booklet number, executing department, executing doctor ID, patient death flag and/or patient ID card number, and so on.

Specifically, determining a plurality of medical bill data to be analyzed may include:

obtaining a plurality of original medical bill data; and, for each original medical bill data, preprocessing it to obtain the corresponding medical bill data to be analyzed.

The original medical bill data may be medical bill data obtained from a hospital database. In one implementation, obtaining the original medical bill data may be followed by a data organization step, for example converting the original medical bill data obtained from the hospital database into a predetermined format to simplify subsequent processing.

Preprocessing may include a series of operations such as medical data cleaning, medical data integration, medical data dimensionality reduction, and medical data standardization.

S102: Divide the plurality of medical bill data to be analyzed to obtain a plurality of data subsets.

That is, the plurality of medical bill data to be analyzed is distributed across a plurality of data subsets.

In one implementation, random sampling is used to produce data subsets of equal size from the medical bill data to be analyzed. For example, if 150 medical bill data to be analyzed are obtained in total and every 50 of them are assigned to one data subset, 3 data subsets are obtained.
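
The 150-into-3 example can be sketched as follows. This is a minimal illustration, assuming that "random sampling" means shuffling the records once and cutting the result into equal chunks; the function name `split_into_subsets` and its parameters are hypothetical and do not come from the patent.

```python
import random


def split_into_subsets(records, subset_size, seed=None):
    """Shuffle the records and cut them into chunks of subset_size.

    Assumption: sampling is without replacement, so every record lands
    in exactly one subset. The last chunk is shorter if the record count
    is not a multiple of subset_size.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i:i + subset_size]
            for i in range(0, len(shuffled), subset_size)]


# 150 placeholder bill identifiers, 50 per subset, as in the text above.
subsets = split_into_subsets(range(150), 50, seed=0)
print(len(subsets))  # 3
```

Fixing the seed only makes the split reproducible; any other partitioning scheme that yields subsets of comparable size would fit the description equally well.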

S103: For each data subset, determine the subset weight set corresponding to the data subset, and cluster the medical bill data to be analyzed included in the data subset according to that subset weight set, to obtain the cluster members corresponding to the data subset.

Determining the subset weight set means determining a weight value for each evaluation indicator of the medical bill data to be analyzed in the data subset. In one implementation of the embodiment of the present invention, the weight value of each evaluation indicator is determined randomly; alternatively, manually determined weight values for the evaluation indicators can be obtained. In another implementation, the subset weight set corresponding to a data subset can be determined adaptively from the medical bill data to be analyzed included in that data subset.

Clustering a data subset can be understood as grouping the medical bill data to be analyzed in the subset according to their mutual similarity. For example, several medical bill data to be analyzed whose similarity is relatively high, such as above a similarity threshold, are clustered into one cluster member.

S104: Determine the full-set weight set corresponding to the full set composed of all the medical bill data to be analyzed.

The process of determining the full-set weight set is similar to the process of determining the subset weight set of a data subset, and can be carried out by reference to it. The difference is that the full-set weight set is determined from the full set composed of all the medical bill data to be analyzed. In an optional embodiment of the present invention, the full-set weight set can be determined adaptively from that full set.

S105: Fuse all cluster members according to the full-set weight set.

Here, all the cluster members consist of the cluster members corresponding to the respective data subsets.

To overcome the drawbacks of using a single clustering algorithm, a fusion process is introduced. Specifically, the similarity between cluster members can be calculated, and cluster members whose similarity exceeds a threshold are fused. The similarity between cluster members can be calculated from the similarity of their cluster centers or from the similarity of their edge data.
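
The center-based variant of this fusion step can be sketched as below. The sketch rests on several assumptions that the text does not fix: "similarity above a threshold" is expressed as "weighted distance between centers at most `max_center_distance`", a center is the per-indicator mean of a cluster, the weights enter the distance as per-indicator multipliers, and clusters are merged greedily in input order. All names are hypothetical.

```python
import math


def weighted_distance(x, y, weights):
    # Assumed form: Euclidean distance with each indicator scaled by its weight.
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights)))


def center(points):
    # Per-indicator mean of the records in one cluster.
    return [sum(col) / len(points) for col in zip(*points)]


def fuse_cluster_members(clusters, full_set_weights, max_center_distance):
    """Greedily merge clusters whose centers lie close together."""
    fused = []
    for cluster in clusters:
        for target in fused:
            gap = weighted_distance(center(cluster), center(target),
                                    full_set_weights)
            if gap <= max_center_distance:
                target.extend(cluster)  # close enough: fuse into this cluster
                break
        else:
            fused.append(list(cluster))  # no close cluster: keep it separate
    return fused


clusters = [[(0.0, 0.0), (0.0, 1.0)], [(0.1, 0.5)], [(10.0, 10.0)]]
fused = fuse_cluster_members(clusters, [1.0, 1.0], max_center_distance=1.0)
print([len(c) for c in fused])  # [3, 1]
```

A greedy pass depends on input order; an order-independent alternative would build a graph of close centers and merge its connected components.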

S106: Determine the isolated medical bill data to be analyzed obtained after the fusion, and treat the isolated medical bill data to be analyzed as suspicious medical bill data.

Because cases of medical insurance fraud are far rarer than cases of normal medical insurance use, the isolated points that appear in the result of clustering and fusion are taken as the records suspected of medical insurance fraud. That is, the isolated medical bill data to be analyzed obtained after the fusion is treated as suspicious medical bill data.
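
One way to read "isolated" is "belonging to a fused cluster that contains very few records". The sketch below uses that reading; the size cut-off `max_isolated_size` is an assumption introduced here for illustration, and the text does not specify how isolation is decided.

```python
def isolated_records(fused_clusters, max_isolated_size=1):
    """Return the records that sit in clusters no larger than the cut-off.

    Assumption: a cluster with at most max_isolated_size records counts
    as a set of isolated points and is flagged as suspicious.
    """
    return [record
            for cluster in fused_clusters
            if len(cluster) <= max_isolated_size
            for record in cluster]


fused_clusters = [["bill-1", "bill-2", "bill-3"], ["bill-4"]]
suspicious = isolated_records(fused_clusters)
print(suspicious)  # ['bill-4']
```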

After the isolated medical bill data to be analyzed has been determined and treated as suspicious medical bill data, the method may further include:

displaying the suspicious medical bill data.

In addition, the suspicious medical bill data can be reviewed manually, and the persons involved in cases where medical insurance fraud is confirmed can be held accountable, achieving intelligent detection of medical insurance fraud across the whole process and every stage.

In the embodiment of the present invention, no manual labeling is required during the detection of medical insurance fraud, which reduces the manual workload. Computational efficiency can also be improved, and because subjective interference is reduced, the accuracy of the calculation can be further improved.

In addition, for the medical insurance fraud scenario, the embodiment of the present invention divides the large data set into multiple small data subsets and processes them with multi-threaded parallel computing, which speeds up the computation.
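
The per-subset parallelism can be sketched with Python's standard `concurrent.futures` module. Here `process_subset` is a hypothetical placeholder for the real per-subset work (weight determination followed by clustering); it only summarizes its input so that the example stays runnable.

```python
from concurrent.futures import ThreadPoolExecutor


def process_subset(subset):
    # Placeholder for the per-subset work; returns a small summary instead.
    return {"size": len(subset), "total": sum(subset)}


def process_all(subsets, max_workers=4):
    # One task per data subset; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_subset, subsets))


results = process_all([[1, 2, 3], [4, 5], [6]])
print(results[0])  # {'size': 3, 'total': 6}
```

In CPython, threads do not speed up CPU-bound pure-Python code because of the global interpreter lock, so a real workload of this kind would more likely use `ProcessPoolExecutor`, which offers the same `map` interface.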

In an optional embodiment of the present invention, determining the subset weight set corresponding to the data subset in S103 may include:

constructing a first evaluation indicator weight function from the medical bill data to be analyzed included in the data subset; and determining the subset weight set corresponding to the data subset from the first evaluation indicator weight function by means of a particle swarm optimization algorithm.

The subset weight set includes a first weight value for each evaluation indicator.

Specifically, the data subset is written as $X_a = \{X_{a1}, X_{a2}, \ldots, X_{an}\}$, where $X_{a1}, X_{a2}, \ldots, X_{an}$ are the $n$ medical bill data to be analyzed included in the data subset. Each record is $X_{ai} = (x_{ai1}, x_{ai2}, \ldots, x_{aim})$ with $1 \le i \le n$, and $x_{aik}$ ($k \in [1, m]$) is the $k$-th evaluation indicator of $X_{ai}$. The evaluation indicator weights $\omega_a = (\omega_{a1}, \omega_{a2}, \ldots, \omega_{am})$ are introduced, where $\omega_{ak}$ ($k \in [1, m]$) is the weight of the $k$-th evaluation indicator in the $a$-th data subset. The corresponding Euclidean distance formula gives $d_{aij}$, which expresses how similar the medical bill data to be analyzed $X_{ai}$ and $X_{aj}$ are within the data subset.
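
The distance formula itself is not reproduced in this text. A minimal sketch of one common weighted Euclidean distance is given below, assuming the form $d_{aij} = \sqrt{\sum_{k=1}^{m} \omega_{ak}^2 (x_{aik} - x_{ajk})^2}$; whether the patent squares the weight or applies it in some other way is not stated here, so this form is an assumption, and the function name is hypothetical.

```python
import math


def weighted_euclidean(x_i, x_j, weights):
    """Distance between two records under per-indicator weights.

    Assumed form: sqrt(sum_k (w_k * (x_ik - x_jk))**2). A weight of 0
    removes an indicator from the comparison; a weight of 1 leaves it
    unchanged.
    """
    return math.sqrt(sum((w * (a - b)) ** 2
                         for a, b, w in zip(x_i, x_j, weights)))


print(weighted_euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 1.0]))  # 5.0
print(weighted_euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 0.0]))  # 3.0
```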

Next, a function $S_{aij}$ representing the degree of similarity is introduced, and the similarity relation function is defined as:

where the value of $\gamma$ is determined by:

When $S_{aij}$ approaches 1, the degree of similarity is high and the distance between the two points is small. When $S_{aij}$ approaches 0, the similarity of the data is low and the distance between the two points is large. When $S_{aij}$ is near 0.5, the fuzziness is high.

To give the clustering result relatively low fuzziness, the attribute weights are adjusted so that the distance between similar data shrinks and the distance between dissimilar data grows. In other words, an attribute evaluation function is sought that assesses the similarity between all pairs of points together and minimizes the overall fuzziness. The similarity relation matrix is optimized so that the similarity relation function $S_{aij}$ of highly similar data approaches 1 and that of weakly similar data approaches 0. To this end the first evaluation indicator weight function is introduced, defined as follows:

When $\mathrm{Weight}_a(\omega)$ approaches 0, the fuzziness is at its minimum.

A particle swarm optimization algorithm is used to minimize the $\mathrm{Weight}_a(\omega)$ function. For the $\mathrm{Weight}_a(\omega)$ functions generated from the data subsets, which hold relatively little data, the particle swarm optimization algorithm, which converges more slowly but brings out data features clearly, is run in parallel. The procedure is as follows:

(1) Initialization: first, the dimension of the target space is defined as the dimension $k$ of the medical insurance fraud evaluation indicator weights, and the initial positions and initial velocities of the particles take default values. Finally, the swarm size and the maximum number of iterations, which serves as the termination condition, are set according to the number of data subsets.

(2) Compute the individual extrema and the global optimum: the individual extremum is the best solution found by each particle. The best of these individual solutions is the global optimum of the current iteration; it is compared with the historical global optimum, which is updated accordingly.

(3) Update the particle velocities and positions.

(4) Check the termination condition: when the maximum number of iterations is reached, stop iterating and output the result.

This particle swarm optimization procedure yields the minimum of the $\mathrm{Weight}_a(\omega)$ function and hence the optimally assigned weights $\omega_a$ of the evaluation indicators.
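
The four steps above can be sketched as a basic particle swarm optimizer. This is a generic textbook variant rather than the patent's implementation: the inertia and acceleration coefficients, the swarm size, the initialization range $[0, 1]$ and the zero initial velocities are all assumptions, and the objective passed in below is a simple stand-in because the $\mathrm{Weight}_a(\omega)$ formula is not reproduced in this text.

```python
import random


def pso_minimize(objective, dim, n_particles=20, max_iter=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    # (1) Initialization: positions in [0, 1], zero velocities (assumed defaults).
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    # (4) Termination: stop once max_iter iterations have run.
    for _ in range(max_iter):
        for i in range(n_particles):
            # (3) Update velocity and position of particle i.
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][k] = (inertia * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                pos[i][k] += vel[i][k]
            # (2) Update the individual extremum and the global optimum.
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Stand-in objective with its minimum at weights (0.3, 0.3, 0.3).
def stand_in(w):
    return sum((x - 0.3) ** 2 for x in w)


best_weights, best_value = pso_minimize(stand_in, dim=3)
```

To use it as described, `objective` would be replaced by the first evaluation indicator weight function of one data subset, and one optimizer would be run per subset.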

In an optional embodiment of the present invention, determining the full-set weight set corresponding to the full set composed of all the medical bill data to be analyzed in S104 may include:

constructing a second evaluation indicator weight function from all the medical bill data to be analyzed; and determining the full-set weight set corresponding to the full set from the second evaluation indicator weight function by means of a differential evolution algorithm.

The full-set weight set includes a second weight value for each evaluation indicator.

Merging the data subsets gives the full set $X = \{X_1, X_2, \ldots, X_n\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ with $1 \le i \le n$ and $x_{ik}$ is the $k$-th evaluation indicator of $X_i$. The evaluation indicator weights $\Omega = (\Omega_1, \Omega_2, \ldots, \Omega_m)$ are introduced, where $\Omega_k$ ($k \in [1, m]$) is the weight of the $k$-th evaluation indicator in the full set of medical insurance bill data. The second evaluation indicator weight function is finally obtained as:

For the second evaluation indicator weight function, which is generated from the full data set and therefore covers a larger amount of data, the differential evolution algorithm is used, since it converges quickly and reaches the global optimum more readily. The procedure is as follows:

(1) Set the basic parameters: a population size of 100, a scaling factor of 0.5 and a crossover probability of 0.8.

(2) Initialize the population, setting the dimension to the dimension $k$ of the medical insurance fraud evaluation indicator weights and the initial generation number to 1.

(3) Compute the fitness values of the population.

(4) While the termination condition is not met, loop through the mutation, crossover and selection operations in turn until the computation terminates.

The result is the evaluation indicator weights $\Omega$ based on the full data set.
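
The procedure above can be sketched as a standard DE/rand/1/bin loop using the stated parameters (population 100, scaling factor 0.5, crossover probability 0.8). The mutation strategy, the initialization range $[0, 1]$, the fixed generation count used as the termination condition, and the stand-in objective are assumptions; the second evaluation indicator weight function is not reproduced in this text.

```python
import random


def differential_evolution(objective, dim, pop_size=100, scale=0.5,
                           crossover=0.8, max_gen=100, seed=0):
    rng = random.Random(seed)
    # (2) Initialize the population in [0, 1]^dim (assumed range).
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    # (3) Fitness of every individual (lower is better).
    fit = [objective(ind) for ind in pop]
    # (4) Loop: mutation, crossover, selection, until max_gen is reached.
    for _ in range(max_gen):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for k in range(dim):
                if k == forced or rng.random() < crossover:
                    # Mutation: base vector plus scaled difference vector.
                    trial.append(pop[a][k] + scale * (pop[b][k] - pop[c][k]))
                else:
                    trial.append(pop[i][k])
            trial_fit = objective(trial)
            # Selection: keep the trial only if it is at least as good.
            if trial_fit <= fit[i]:
                pop[i], fit[i] = trial, trial_fit
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]


# Stand-in objective with its minimum at weights (0.3, 0.3, 0.3).
def stand_in(w):
    return sum((x - 0.3) ** 2 for x in w)


full_set_weights, best_value = differential_evolution(stand_in, dim=3)
```

Because selection is greedy, the best fitness in the population never gets worse from one generation to the next.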

In the embodiment of the present invention, two purpose-matched algorithms are used: particle swarm optimization for the data subsets, which hold less data, and differential evolution for the full data set, which holds more. This lets the smaller data sets bring out data features and lets the larger data set converge quickly. The resulting evaluation indicator weight sets are applied to cluster member generation and cluster fusion respectively, improving the final cluster fusion performance.

In an optional embodiment of the present invention, clustering the medical bill data to be analyzed included in the data subset according to the corresponding subset weight set in S103, to obtain the cluster members corresponding to the data subset, may include:

A1: For each evaluation indicator, determine the pairwise sub-similarity between the medical bill data to be analyzed included in the data subset.

A2: Weight the sub-similarities determined for the individual evaluation indicators by the first weight values of those indicators in the subset weight set, to obtain the pairwise total similarity between the medical bill data to be analyzed.

A3: Cluster the medical bill data to be analyzed included in the data subset according to the total similarity, to obtain the cluster members corresponding to the data subset.
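
Steps A1 and A2 can be sketched as follows. The text does not say how a per-indicator sub-similarity is computed or how the weighted values are combined, so two assumptions are made here: the sub-similarity of two numeric values is $1 / (1 + |a - b|)$, and the total similarity is the weighted sum of the sub-similarities. All function names are hypothetical.

```python
def sub_similarity(a, b):
    # A1 (assumed form): 1.0 for equal values, decaying as the gap grows.
    return 1.0 / (1.0 + abs(a - b))


def total_similarity(x, y, weights):
    # A2 (assumed form): weighted sum of the per-indicator sub-similarities.
    return sum(w * sub_similarity(a, b) for a, b, w in zip(x, y, weights))


def similarity_matrix(records, weights):
    # Pairwise total similarity, the input to the clustering in A3.
    n = len(records)
    return [[total_similarity(records[i], records[j], weights)
             for j in range(n)]
            for i in range(n)]


records = [[1.0, 2.0], [1.0, 4.0], [9.0, 9.0]]
matrix = similarity_matrix(records, [0.5, 0.5])
```

With weights that sum to 1, each record has total similarity 1.0 with itself, and the matrix is symmetric.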

Specifically, the data subsets are processed in parallel: weighted Canopy coarse clustering is applied to each subset first, and a weighted agglomerative hierarchical algorithm is then run using the resulting cluster centers and number of clusters.

For Canopy coarse clustering, define two thresholds T₁ and T₂ with T₁ > T₂;

(1) Randomly select a point x from the medical bill data to be analyzed, and compute the distance d_aij from that point to the other medical bill data to be analyzed in data subset a using a weighted Euclidean distance.

(2) If d_aij < T₁, the points are weakly associated, and such points are grouped into one class;

(3) Continue checking: if d_aij < T₂, the points are strongly associated; remove them from the data subset so that they need not be processed again.

(4) Repeat the three steps above until the data subset is empty. The number of classes W_a and the cluster center C_aw of each class are then quickly obtained.
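
The four Canopy steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the weighted Euclidean distance is assumed to be sqrt(Σ_k w_k (x_ik - x_jk)²), the randomly chosen point is assumed to serve as the canopy center, and the function and parameter names are hypothetical.

```python
import math
import random


def weighted_euclidean(x, y, weights):
    """Assumed form of the weighted Euclidean distance between two bills."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def canopy(points, weights, t1, t2, seed=0):
    """Return a list of (center_index, member_indices) canopies."""
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    canopies = []
    while remaining:                      # step (4): repeat until the subset is empty
        center = rng.choice(remaining)    # step (1): pick a random point
        members, kept = [], []
        for idx in remaining:
            d = weighted_euclidean(points[center], points[idx], weights)
            if d < t1:                    # step (2): weak association, same class
                members.append(idx)
            if d >= t2:                   # step (3): only strongly associated points are removed
                kept.append(idx)
        canopies.append((center, members))
        remaining = kept
    return canopies


points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
canopies = canopy(points, weights=[0.5, 0.5], t1=0.5, t2=0.2)
```

The number of canopies plays the role of W_a and the chosen center points play the role of C_aw. Requiring T₂ > 0 guarantees that the selected point itself is always removed, so the loop terminates.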

Then run the weighted agglomerative hierarchical clustering algorithm on the a parallel computing units to cluster the data in depth.

(1) Treat each data item in the sample space as its own cluster, so that there are initially n classes, and take the average weighted distance between class p and class q as the distance between the two classes.

Here n_ap and n_aq denote the number of data items contained in class p and class q of data subset a, respectively.

(2) In each iteration, merge two classes into one. The two classes chosen are those with the smallest average linkage: under the chosen distance metric they have the smallest distance d(p, q), are therefore the most similar, and are merged.

(3) Repeat the above steps and stop clustering once the number of classes reaches the Canopy coarse-clustering class count W_a. The a parallel units finally yield a cluster members M_a.

(4) Match the W_a clusters in each of the a cluster members M_a obtained from deep clustering one-to-one with the cluster centers C_aw, taking each cluster center C_aw as the center of the cluster in which it lies. If a one-to-one match is impossible, for example because several centers C_aw fall in the same cluster or a center falls in no cluster, return the affected cluster data and cluster centers to the coarse clustering unit. That unit performs coarse clustering again to determine cluster centers c_aw and a cluster count w_a and passes them back to the current agglomerative hierarchical clustering unit for deep clustering. The resulting w_a clusters are again matched with the cluster centers c_aw, and this step is repeated until every cluster corresponds to exactly one cluster center. Finally, the results are combined to obtain, in the a cluster members M_a, W_a clusters with C_aw as their cluster centers.

(5) Each point in a cluster is denoted X_awi, the i-th data item in the w-th cluster of the a-th cluster member, with X_awi = (x_awi1, x_awi2, ..., x_awim), where x_awik (k ∈ [1, m]) is the k-th evaluation indicator of X_awi. Compute the distance d_awi from each point to the cluster center C_aw of its cluster, that is, the distance from the i-th data item in the w-th cluster of the a-th cluster member to the center of that cluster.

By comparing how close the edge distances within the same cluster are to one another, the few points whose edge distance is much larger are designated as b isolated points Slolitary_ab. For every class in every cluster member, a nearest-point distance d_awmin and a farthest-edge distance d_awmax are obtained in preparation for the subsequent fusion strategy.
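
The deep-clustering steps can be sketched as follows. This is a simplified illustration, not the disclosed implementation. It assumes the weighted Euclidean distance sqrt(Σ_k w_k (x_ik - x_jk)²) and the standard average-linkage distance d(p, q) = (1 / (n_ap · n_aq)) Σ_{i∈p} Σ_{j∈q} d_ij. The matching of clusters to Canopy centers and the rule for designating isolated points are omitted because the text does not specify them precisely. All names are hypothetical.

```python
import math


def weighted_euclidean(x, y, weights):
    """Assumed form of the weighted Euclidean distance."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def average_linkage(cluster_p, cluster_q, points, weights):
    """Mean weighted distance over all cross-cluster pairs (assumed formula)."""
    total = sum(
        weighted_euclidean(points[i], points[j], weights)
        for i in cluster_p
        for j in cluster_q
    )
    return total / (len(cluster_p) * len(cluster_q))


def agglomerate(points, weights, target_clusters):
    """Merge the closest pair of clusters until target_clusters remain."""
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]   # step (1): one cluster per item
    while len(clusters) > target_clusters:         # step (3): stop at W_a clusters
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = average_linkage(clusters[p], clusters[q], points, weights)
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best                             # step (2): merge the closest pair
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def edge_distances(cluster, center, points, weights):
    """Return (d_min, d_max): nearest and farthest member distance to the center."""
    distances = [weighted_euclidean(points[i], center, weights) for i in cluster]
    return min(distances), max(distances)


points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
weights = [0.5, 0.5]
clusters = agglomerate(points, weights, target_clusters=2)
d_min, d_max = edge_distances([0, 1], points[0], points, weights)
```

The exhaustive pair search is cubic in the subset size and is shown only for clarity; a production system would cache the pairwise distance matrix.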

In this embodiment of the present invention, Canopy coarse clustering and agglomerative hierarchical clustering are used to train the cluster members for the medical insurance fraud scenario. The cluster centers and the number of clusters are obtained quickly without presetting the number of clusters, and the step-by-step iteration produces a visibly optimal result. In addition, the cluster members generated for the data subsets have a hierarchical structure that can be recovered, so every iteration of the clustering can be traced back, which facilitates later analysis.

In an optional embodiment of the present invention, after S103 (for each data subset, determining the subset weight set corresponding to the data subset, and clustering the medical bill data to be analyzed in the data subset according to that subset weight set to obtain the cluster member corresponding to the data subset), the method further includes:

Saving the cluster member corresponding to each data subset.

When a newly generated small data set that needs to be screened for medical insurance fraud is added, the cluster members already trained on training sets of the same structure and type can be fused together with it. This avoids retraining on all data sets and gives good scalability.

In an optional embodiment of the present invention, S105, fusing all cluster members according to the full-set (ensemble) weight set, may include:

B1: determine the fusion strategy.

Determining the fusion strategy can also be understood as determining the consensus function. The embodiments of the present invention do not limit how the fusion strategy is determined; any approach that can fuse the cluster members falls within the protection scope of the embodiments of the present invention.

B2: based on each evaluation indicator, determine the similarity (the second similarity) between every pair of cluster centers, where each cluster center corresponds to one of the cluster members.

Specifically, in the course of determining the cluster members, the cluster center of each cluster member is determined.

B3: fuse all cluster members through the fusion strategy, according to the second weight value that the full-set weight set assigns to each evaluation indicator and the second similarity between every pair of cluster centers.

To overcome the drawbacks of relying on a single clustering algorithm, a cluster fusion algorithm is introduced, as shown in Figure 2. Specifically, the consensus function strategy is determined as follows:

(1) Given the a cluster members M_a passed from the cluster member generation module and the m medical insurance fraud evaluation-indicator weights Ω = (Ω₁, Ω₂, ..., Ω_m) based on the complete set of medical insurance bill data, where Ω_k (k ∈ [1, m]) is the weight of the k-th medical insurance fraud evaluation indicator, compute the distance d_egfh between cluster centers C_eg and C_fh (here a takes the values e and f to denote two cluster members, and w takes the values g and h to denote clusters within those two cluster members).

If the distance d_egfh between two cluster centers is smaller than one of the two nearest-point distances d_ewmin and d_fwmin, the two clusters are merged into one, one of the two cluster centers is taken as the center of the fused cluster, and the nearest-point distance and farthest-edge distance are updated with respect to the new center. Otherwise the clusters are not fused. This is repeated until all cluster merging is complete, yielding new cluster centers Cnew_w, nearest-point distances dnew_wmin, and farthest-edge distances dnew_wmax.

(2) For each isolated point Slolitary_ab appearing in the cluster members, compute its distance d_awb to each new cluster center Cnew_w, that is, the distance from isolated point b of the earlier cluster member a to the center of new cluster w.

An isolated point whose distance is smaller than the farthest-edge distance dnew_wmax is assigned to that cluster. Otherwise it remains an isolated point. This yields the final isolated points Snew_b, which can also be understood as isolated medical bill data. The final fusion result is thus obtained.
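
The consensus strategy described above can be sketched in Python. This is a simplified illustration under several assumptions rather than the disclosed implementation: distances use an assumed weighted Euclidean form with the full-set weights, the first cluster's center is kept as the fused center, any two clusters may merge (the sketch does not track which cluster member a cluster came from), and an isolated point that qualifies for several clusters goes to the nearest one. The `Cluster` class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


def weighted_euclidean(x, y, weights):
    """Assumed form of the weighted Euclidean distance."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


@dataclass
class Cluster:
    center: list
    points: list
    d_min: float = 0.0   # nearest-point distance to the center
    d_max: float = 0.0   # farthest-edge distance to the center


def refresh(cluster, weights):
    """Recompute d_min and d_max with respect to the cluster's center."""
    ds = [weighted_euclidean(p, cluster.center, weights) for p in cluster.points]
    cluster.d_min, cluster.d_max = min(ds), max(ds)


def fuse_clusters(clusters, weights):
    """Merge clusters whose center distance is below either nearest-point distance."""
    fused = [Cluster(list(c.center), [list(p) for p in c.points]) for c in clusters]
    for c in fused:
        refresh(c, weights)
    merged = True
    while merged:                        # repeat until no pair qualifies
        merged = False
        for i in range(len(fused)):
            for j in range(i + 1, len(fused)):
                a, b = fused[i], fused[j]
                d = weighted_euclidean(a.center, b.center, weights)
                if d < max(a.d_min, b.d_min):
                    a.points.extend(b.points)   # keep a's center as the fused center
                    refresh(a, weights)
                    del fused[j]
                    merged = True
                    break
            if merged:
                break
    return fused


def reassign_isolated(isolated, fused, weights):
    """Attach isolated points lying inside some cluster's d_max; return the rest."""
    still_isolated = []
    for p in isolated:
        hits = [(weighted_euclidean(p, c.center, weights), k) for k, c in enumerate(fused)]
        hits = [(d, k) for d, k in hits if d < fused[k].d_max]
        if hits:
            fused[min(hits)[1]].points.append(p)
        else:
            still_isolated.append(p)
    return still_isolated


full_set_weights = [1.0, 1.0]
members = [
    Cluster([0.0, 0.0], [[0.1, 0.0], [0.0, 0.2]]),
    Cluster([0.05, 0.0], [[0.15, 0.0], [0.05, 0.3]]),
    Cluster([1.0, 1.0], [[1.0, 0.9], [0.9, 1.0]]),
]
fused = fuse_clusters(members, full_set_weights)
suspicious = reassign_isolated([[0.25, 0.0], [0.5, 0.5]], fused, full_set_weights)
```

In this toy run the first two clusters merge, the point (0.25, 0.0) is absorbed by the fused cluster, and (0.5, 0.5) remains isolated, so it would be reported as suspicious medical bill data.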

In this embodiment of the present invention, a weighted cluster fusion algorithm is introduced into medical insurance fraud detection. Compared with the traditional approach of relying on a clustering algorithm alone, it re-analyzes isolated points and cluster-edge data and performs more stably on the data set.

In the embodiments of the present invention, the above medical insurance fraud detection process may be implemented by different subsystems, each of which may include modules.

Specifically, as shown in Figure 3(a), the system may include a data acquisition subsystem, a data analysis subsystem, a result display terminal, and a cluster member storage area.

The data acquisition subsystem contains one module, the medical data acquisition and arrangement module, which collects medical insurance documents and arranges them into a specific data format for subsequent processing and use.

The data analysis subsystem contains four modules. The medical data preprocessing module preprocesses the medical data. The medical insurance fraud indicator weight formulation module adaptively assigns weights to the preprocessed evaluation indicators. The cluster member generation module produces the multiple cluster members required by the fusion algorithm. The cluster fusion module performs the fusion operation of the cluster fusion algorithm on the cluster members and obtains the final data cluster fusion result.

The result display terminal contains one module, the medical insurance fraud result output module, which screens the cluster fusion result for isolated points and displays on the terminal the isolated-point information and the related medical insurance document entries, as a basis for holding the persons involved accountable after manual review.

The cluster member storage area contains one module, the cluster member storage module, which stores cluster members and interacts with the cluster member generation module so that cluster members can be reused.

Once the basic settings are complete, the system as a whole can perform intelligent medical insurance fraud detection across the entire process, and the four subsystems exchange data through data interfaces.

Each part is described below. For brevity, medical bill data is referred to simply as medical data or data.

As shown in Figure 3(b), the data acquisition subsystem contains one module, the medical data acquisition and arrangement module. This module extracts data from the medical bill database without manual intervention and converts it into a format suitable for input.

The module consists of four functional units: the hospital database interface, the data interaction control unit, the medical form data arrangement and format conversion unit, and the specified-format data interface.

Hospital database interface: encapsulates the hospital database and hides the differences in how different databases manage data. It connects to the hospital database and offers the data arrangement unit read and query functions through a simple interface, making it easy for service-requesting units to use business data.

Data interaction control unit: controls and coordinates the other units so that together they complete the data interaction. As the control core of the module, it directs the hospital database interface to send service requests and interaction data to the hospital database, forwards the data obtained from the hospital database to the data arrangement and format conversion unit, and directs that unit to arrange and convert the data format.

Medical data arrangement and format conversion unit: arranges and converts the data format according to the requirements of subsequent preprocessing, ensures a consistent data format, and outputs multiple data items in a fixed format. The unit accepts control instructions from the data interaction control unit and arranges and converts the data destined for the data preprocessing module.

Specified-format data interface: exchanges information with the data preprocessing module, accepts control instructions from the data interaction control unit, and sends data in the specified format to the preprocessing module of the data analysis subsystem.

Format conversion of the data can also be regarded as one step of data preprocessing.

The data analysis subsystem contains four modules: the medical data preprocessing module, the medical insurance fraud evaluation-indicator weight formulation module, the cluster member generation module, and the cluster fusion module.

As shown in Figure 3(c), the medical data preprocessing module preprocesses the medical data. Using machine learning preprocessing algorithms, it turns the raw data into a single standardized, low-dimensional medical form data table and splits that table into multiple data subsets for processing by the subsequent modules.

The module consists of seven functional units: the service and data interface, the data interaction control unit, the medical data cleaning unit, the medical data integration unit, the medical data standardization unit, the medical data dimensionality reduction unit, and the random sampling unit.

Service and data interface: handles the interaction between this module and the data acquisition and arrangement module of the data acquisition subsystem, and passes the fixed-format medical data it receives to the medical data cleaning unit for subsequent processing.

Data interaction control unit: directs the service and data interface to receive structured data from the data acquisition and arrangement module and forwards the structured data to the medical data cleaning unit for preprocessing. Preprocessing mainly comprises cleaning, integration, standardization, and dimensionality reduction of the medical data.

Medical data cleaning unit: cleans the data. It removes records in which a large number of evaluation indicators are missing; removes data indicators that are useless in this scenario, such as home address, place of origin, and age; removes records containing incorrectly formatted indicators; and removes redundancy from repeated records to reduce the data size. The result is multiple complete, non-redundant medical bill data tables, which are passed to the medical data integration unit.

Medical data integration unit: integrates the data. It merges the data of multiple related forms, combining the data for the same bill number and the same patient and using the bill number as the unique key, to obtain one data form containing all evaluation-indicator information, which is passed to the medical data standardization unit.

Medical data standardization unit: standardizes the data. It applies min-max (deviation) standardization, rescaling each indicator to the interval [0, 1], so that differing units of measurement do not cause differences in data weight. The result is one standardized medical data table, which is passed to the medical data dimensionality reduction unit.

Medical data dimensionality reduction unit: reduces the dimensionality of the data, mainly by principal component analysis. It computes the covariance matrix of the medical data, compares the magnitudes of the eigenvalues associated with each dimension, and retains the few components that carry the most information, yielding a low-dimensional data form. This single low-dimensional medical insurance data form is then passed to the random sampling unit.

Random sampling unit: generates data subsets of equal magnitude from the received low-dimensional medical insurance data form by random sampling, and passes the fully preprocessed data subsets to the medical insurance fraud evaluation-indicator weight formulation module and the cluster member generation module.
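
Two of the preprocessing units described above, standardization and random sampling, can be sketched as follows. This is a minimal illustration with hypothetical function names. The sampling scheme (shuffle once, then deal records into equally sized subsets without replacement) is an assumption, since the text only states that the subsets are of equal magnitude. The cleaning, integration, and principal component analysis steps are omitted.

```python
import random


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def random_subsets(rows, n_subsets, seed=0):
    """Shuffle the rows and deal them into n_subsets groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


# Hypothetical bills with two indicators on very different scales.
raw = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0], [40.0, 200.0]]
normalized = min_max_normalize(raw)
subsets = random_subsets(normalized, n_subsets=2)
```

After normalization both indicators span [0, 1], so neither dominates a distance computation merely because of its unit of measurement.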

As shown in Figure 3(d), the medical insurance fraud evaluation-indicator weight formulation module adaptively formulates the weight of each medical insurance fraud evaluation indicator for later use in cluster member generation and cluster fusion.

The module consists of three functional units: the medical insurance fraud evaluation-indicator weight function generation unit, the multi-subset particle swarm optimization unit for the evaluation-indicator weight functions, and the full-set differential evolution unit for the evaluation-indicator weight function.

Medical insurance fraud evaluation-indicator weight function generation unit: receives the standardized, low-dimensional medical bill data subsets produced by the medical data preprocessing module. Taking each dimension of the low-dimensional data form as an evaluation indicator, it generates, with multiple threads in parallel, one evaluation-indicator weight function per data subset, each of which gives the best indicator weight allocation at its minimum. It also merges the data subsets into one complete set and generates a full-set evaluation-indicator weight function over the same indicators. The weight functions generated from the data subsets are passed to the multi-subset particle swarm optimization unit, and the weight function generated from the complete set is passed to the full-set differential evolution unit, to be solved for their optima.

Multi-subset particle swarm optimization unit: after receiving the weight functions generated from the data subsets, applies the particle swarm optimization algorithm with multiple threads in parallel to obtain one evaluation-indicator weight set per data subset, and passes these subset-based weight sets to the cluster member generation module, where they serve as the per-indicator weight values in the weighted clustering algorithm.

Full-set differential evolution unit: after receiving the weight function generated from the complete set, applies the differential evolution algorithm to it, adaptively finds the global optimum of the weight function to obtain the optimal full-set weight of each medical insurance fraud evaluation indicator, and passes these weights to the cluster fusion module for computing weighted distances during cluster fusion.
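
To make the role of the full-set differential evolution unit concrete, the following Python sketch minimizes a weight function with a generic DE/rand/1/bin scheme. It is an illustration only. The patent's actual weight function is not reproduced in this passage, so `toy_weight_function` is a purely hypothetical stand-in whose minimum is known, and the population size, scale factor `f`, and crossover rate `cr` are arbitrary assumed values.

```python
import random


def differential_evolution(objective, dim, pop_size=20, generations=200,
                           f=0.6, cr=0.9, seed=0):
    """Minimize objective over [0, 1]^dim with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)   # guarantees at least one mutated gene
            trial = []
            for k in range(dim):
                if rng.random() < cr or k == forced:
                    value = population[a][k] + f * (population[b][k] - population[c][k])
                    value = min(1.0, max(0.0, value))   # keep weights inside [0, 1]
                else:
                    value = population[i][k]
                trial.append(value)
            # Selection: keep the trial vector only if it is no worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                population[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


def toy_weight_function(weights):
    """Hypothetical objective: squared distance to a known weight vector."""
    target = [0.5, 0.3, 0.2]
    return sum((w - t) ** 2 for w, t in zip(weights, target))


best_weights, best_score = differential_evolution(toy_weight_function, dim=3)
```

The per-subset particle swarm optimization unit would play the same role for each data subset, with a different search procedure applied to each subset's weight function.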

As shown in Figure 3(e), the cluster member generation module generates and stores the cluster members of the cluster fusion algorithm, and can also automatically read cluster members produced from earlier data for use in cluster fusion.

The module consists of four functional units: the coarse clustering unit, the agglomerative hierarchical clustering unit, the data interaction control unit, and the cluster member read-write interface.

Coarse clustering unit: receives the data subsets from the medical data preprocessing module and the per-subset evaluation-indicator weights from the medical insurance fraud evaluation-indicator weight formulation module, and pairs each weight set with its data subset. It then runs weighted Canopy coarse clustering with multiple threads in parallel to obtain the cluster centers and the number of clusters, and passes the coarse clustering results, together with the original information, to the agglomerative hierarchical clustering unit. It also receives from the agglomerative hierarchical clustering unit the data of clusters whose centers could not be matched with the clusters obtained by deep clustering, reruns weighted Canopy coarse clustering on that data to obtain new cluster centers and a new number of clusters, and returns them to the agglomerative hierarchical clustering unit.

Agglomerative hierarchical clustering unit: performs the cluster member generation step of the cluster fusion algorithm. It reads the original information and coarse clustering results from the coarse clustering unit and the optimal weight of each indicator from the indicator weight formulation module, and then, with the cluster centers and final number of clusters known, runs the agglomerative hierarchical clustering algorithm with multiple threads in parallel to obtain the cluster members. Through the cluster member read-write interface it can store and retrieve cluster members: generated cluster members are saved in the cluster member storage module as historical cluster members, which can be recalled later when cluster members need to be retrained or when a small amount of newly received data awaiting verification is to be trained. This improves clustering accuracy on small amounts of data and reduces the resources wasted on retraining. Finally, the cluster members obtained from all data subsets are sent to the cluster fusion module for the subsequent cluster fusion steps. In addition, if a cluster center received from the coarse clustering unit cannot be matched with a cluster in the cluster member, so that cluster member generation fails, the unit returns the affected cluster data and cluster centers to the coarse clustering unit for renewed coarse clustering, then regenerates the cluster members from the returned cluster centers and number of clusters, repeating until generation succeeds.

Data interaction control unit: directs the cluster member read-write interface to send service requests and interaction data to the cluster member storage module, and forwards the data exchanged between the data interaction control unit and the cluster member storage module.

Cluster member read-write interface: handles the interaction between the agglomerative hierarchical clustering unit and the cluster member storage module. It accepts control instructions from the data interaction control unit, carries data between the data interaction control unit and the cluster member storage module, and reports the result of the exchange back to the data interaction control unit for subsequent processing.

As shown in Figure 3(f), the cluster fusion module constructs the consensus function of the cluster fusion algorithm and then completes the last step of the algorithm to obtain the final data clustering result.

The module consists of three functional units: the consensus function construction unit, the data interaction control unit, and the data clustering result interface.

Consensus function construction unit: obtains the cluster members produced by the cluster member generation module and the full-set evaluation-indicator weights from the medical insurance fraud evaluation-indicator weight formulation module, constructs the corresponding consensus function to fuse the cluster members with weighting, and automatically reprocesses the isolated-point data, producing the final cluster fusion result for the medical form data. The result is passed to the data clustering result interface for subsequent processing.

Data interaction control unit: directs the data clustering result interface to send service requests and interaction data to the result output module, and forwards the data exchanged between the data clustering result interface and the result output module.

Data clustering result interface: handles the interaction between the consensus function construction unit and the result output module. It accepts control instructions from the data interaction control unit, carries data between the cluster member fusion unit and the result output module, and reports the result of the exchange back to the data interaction control unit for the subsequent isolated-point screening and display.

The result display terminal contains one module, the medical insurance fraud result output module, as shown in Figure 3(g).

The medical insurance fraud result output module searches the data clustering results for isolated points, which are treated as suspected medical insurance fraud, and automatically displays the final result.

The module consists of four functional units: the service and data interface, the data interaction control unit, the medical insurance fraud isolated-point finding unit, and the suspected medical insurance fraud entry display unit.

Service and data interface: handles the information exchange between the consensus function construction module and the result output module. It receives service requests from the data interaction control unit, obtains the final clustering result produced by the cluster fusion unit, and feeds the result of the exchange back to the data interaction control unit for subsequent processing.

Data interaction control unit: controls and coordinates the other units to complete the data interaction function. It directs the service and data interface to receive the clustering result data from the consensus function construction module, forwards that data to the isolated-point finding unit, and then directs the isolated-point finding unit to pick out the isolated points in the clustering result as suspected medical fraud.

Medical insurance fraud isolated-point finding unit: screens the received clustering result for isolated points, which are treated as suspected medical fraud, and passes the medical insurance document information contained in those points to the suspected medical insurance fraud entry display unit for subsequent processing.

Suspected medical insurance fraud entry display unit: receives the medical insurance document information from the isolated-point finding unit and displays it on the terminal for subsequent manual review and as a basis for holding the persons involved accountable.

The cluster member storage area contains one module, the cluster member storage module. As shown in Figure 3(h), the cluster member storage module automatically stores and retrieves cluster members, which provides the basis for extended use of the cluster fusion algorithm.

The module consists of three functional units: the service and data interface, the data interaction control unit, and the cluster member database unit.

Service and data interface: handles the exchange of cluster members between the cluster member generation module and the cluster member storage module. It accepts service requests from the data interaction control unit, stores and retrieves cluster members, and feeds the result of the exchange back to the data interaction control unit for subsequent processing.

Data interaction control unit: controls and coordinates the other units to complete the data interaction function. It directs the service and data interface to receive cluster member data from the cluster member generation module, can forward the results of multiple clustering runs to the cluster member database unit and control their storage, and can also, at the request of the cluster member generation module, read cluster member data from the cluster member database unit.

Cluster member database unit: stores the cluster member data. It accepts control instructions from the data interaction control unit and provides read, write, and query functions.
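The read, write, and query functions of the cluster member database unit could be sketched as follows. This is only an illustration under stated assumptions: the description does not name a database, so SQLite is used here, and the class name `ClusterMemberStore`, the table layout, and the `run_id` / `subset_id` keys are hypothetical.

```python
import json
import sqlite3


class ClusterMemberStore:
    """Hypothetical stand-in for the cluster member database unit."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cluster_members ("
            "run_id TEXT, subset_id INTEGER, labels TEXT, "
            "PRIMARY KEY (run_id, subset_id))"
        )

    def write(self, run_id, subset_id, labels):
        # One cluster member = the cluster label of every bill in one data subset.
        self.conn.execute(
            "INSERT OR REPLACE INTO cluster_members VALUES (?, ?, ?)",
            (run_id, subset_id, json.dumps(labels)),
        )
        self.conn.commit()

    def read(self, run_id):
        # Return every cluster member saved by one run, ordered by subset.
        rows = self.conn.execute(
            "SELECT labels FROM cluster_members "
            "WHERE run_id = ? ORDER BY subset_id",
            (run_id,),
        ).fetchall()
        return [json.loads(row[0]) for row in rows]

    def query_runs(self):
        # List the runs that have stored (historical) cluster members.
        rows = self.conn.execute(
            "SELECT DISTINCT run_id FROM cluster_members ORDER BY run_id"
        ).fetchall()
        return [row[0] for row in rows]
```

Keying on a run identifier is one way to let a later run read the historical cluster members alongside the ones it has just generated.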

Specifically, as shown in Figure 4(a), the flow of a specific embodiment of the present invention is as follows:

C1: read the medical insurance documents to be screened from the hospital's medical bill database and organize them into multiple medical data forms in a fixed data format for use in the subsequent preprocessing.

C2: apply data cleaning, data integration, data standardization, and dimensionality reduction to the data read, obtaining a single low-dimensional, standardized medical data form, and then obtain multiple medical insurance bill data subsets by random sampling for subsequent processing.

C3: from the preprocessed data subsets, construct multiple index weight functions, one per data subset, and merge the data into the complete set to construct an index weight function based on the complete data set. Then apply the particle swarm optimization algorithm in parallel to find the global optimum of each subset-based index weight function, yielding multiple optimally allocated medical insurance fraud evaluation index weight sets, one per data subset. In addition, apply the differential evolution algorithm to find the global optimum of the index weight function based on the complete data set, yielding the optimally allocated medical insurance fraud evaluation index weight set for the complete data set. The subset-based weight sets and the complete-set weight set are used, respectively, in the subsequent cluster member generation and cluster fusion.

C4: first, apply the weighted Canopy clustering algorithm in parallel to the medical insurance bill data subsets and their corresponding medical insurance fraud evaluation index weight sets for rough clustering, obtaining the cluster centers and the number of clusters. Then, in parallel, run an agglomerative hierarchical clustering algorithm that is weighted by the medical insurance fraud evaluation index weights and uses the known cluster centers and number of clusters, obtaining multiple cluster members. At the same time, query the cluster member database and read the cluster members saved by previous runs; together these form the basis for the cluster member fusion in the next step. The newly generated cluster members are written to the cluster member database, updating it for use the next time the scheme is run.

C5: after obtaining the cluster members and the medical insurance fraud evaluation index weight set based on the complete set of medical insurance bill data, construct a consensus function and fuse the cluster members according to the weighted distance between cluster centers and edge points, obtaining the final cluster fusion result for the medical insurance documents.

C6: find the isolated points in the cluster fusion result and treat them as suspected medical insurance fraud entries; display the related medical insurance document information on the terminal page for manual review, so that it can serve as a basis for holding the persons involved accountable and as reference data for subsequent medical insurance fraud prevention.

The whole flow requires no human intervention and achieves intelligent data processing across every stage of the process.
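The rough clustering in step C4 could be sketched as below. This is a minimal illustration of a Canopy pass with an index-weighted Euclidean distance, not the patented procedure itself: the thresholds `t1` and `t2`, the choice of the first remaining point as each center, and the function names are assumptions.

```python
import math


def weighted_distance(a, b, weights):
    # Euclidean distance in which each evaluation index is scaled by its weight.
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def weighted_canopy(points, weights, t1, t2):
    """Return (centers, canopies) for loose threshold t1 and tight threshold t2."""
    if not (t1 >= t2 >= 0):
        raise ValueError("expected t1 >= t2 >= 0")
    remaining = list(range(len(points)))
    centers, canopies = [], []
    while remaining:
        center = remaining[0]  # deterministic pick; a random pick is also common
        dist = {i: weighted_distance(points[center], points[i], weights)
                for i in remaining}
        # Every remaining point within t1 joins this canopy.
        canopies.append([i for i in remaining if dist[i] <= t1])
        centers.append(points[center])
        # Points within t2 are too close to seed another canopy and are dropped.
        remaining = [i for i in remaining if dist[i] > t2]
    return centers, canopies
```

The number of canopies and their centers would then seed the weighted agglomerative hierarchical clustering that produces one cluster member per data subset.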

As shown in Figure 4(b), the data flow is as follows:

The data flow can be divided into six stages: the medical insurance document collection and organization stage; the medical data preprocessing stage; the adaptive medical insurance fraud evaluation index weight formulation stage; the cluster member reading, generation, and storage stage; the consensus function construction and cluster member fusion stage; and the isolated-point finding stage, in which the isolated points in the result are found and treated as suspected medical insurance fraud entries.

Medical insurance document collection and organization stage: handled mainly by the medical document data collection and organization module. It reads medical form data, including the expense detail table, the doctor's order table, and the patient basic information registration form, from the hospital bill database, converts the data into multiple forms in a fixed format (for example, by fixed-format conversion), and passes the data stream to the medical data preprocessing stage.

Medical data preprocessing stage: handled mainly by the medical data preprocessing module. After receiving the fixed-format form data streams from the previous stage, it performs data cleaning, integration, standardization, and dimensionality reduction in that order, converting the data into a single standardized, low-dimensional medical form. It then splits that form into multiple equal-size data subsets by random sampling and passes the data stream to the adaptive medical insurance fraud evaluation index weight formulation stage and to the cluster member reading, generation, and storage stage.

Adaptive medical insurance fraud evaluation index weight formulation stage: handled mainly by the medical insurance fraud index weight formulation module. It receives the data subsets produced from the single standardized, low-dimensional medical form in the previous stage and constructs a medical index weight evaluation function for each data subset and for the complete set formed by merging the subsets. It then processes the subsets in parallel with the particle swarm optimization algorithm and the complete set with the differential evolution algorithm, obtaining the optimal weight of each medical insurance fraud evaluation index for each data subset and for the complete data set. The subset-based weights are passed to the cluster member reading, generation, and storage stage, and the complete-set weights to the consensus function construction and cluster member fusion stage.

Cluster member reading, generation, and storage stage: handled mainly by the cluster member generation module and the cluster member storage module. The cluster member generation module receives the evaluation index weights from the adaptive medical insurance fraud evaluation index weight formulation stage and the data subsets from the medical data preprocessing stage, and processes the subsets in parallel, applying Canopy rough clustering followed by agglomerative hierarchical clustering, to convert the data into multiple cluster members. It also interacts with the cluster member storage module to store the new cluster members and read the historical ones. Finally, the data stream is passed to the consensus function construction and cluster member fusion stage.

Consensus function construction and cluster member fusion stage: handled mainly by the cluster fusion module. Using the cluster members passed in from the cluster member reading, generation, and storage stage and the complete-set evaluation index weights passed in from the adaptive medical insurance fraud evaluation index weight formulation stage, it constructs a consensus function fusion strategy and performs cluster fusion, converting the data into the final clustering result for the medical bill data. The data stream is then passed to the isolated-point finding stage.

Isolated-point finding stage (finding the isolated points in the result and treating them as suspected medical insurance fraud entries): handled mainly by the medical insurance fraud result output module. It searches the final medical bill clustering result passed in from the previous stage for isolated points, converts the data into the isolated-point data suspected of medical insurance fraud, and passes the data stream to the display terminal for display, which completes the data flow.
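The isolated-point search of the last stage might look like the sketch below. The passage above does not state the criterion that makes a point isolated, so the sketch assumes, purely for illustration, that a bill is isolated when its fused cluster has at most `max_cluster_size` members; the function name and that parameter are hypothetical.

```python
from collections import Counter


def find_isolated_bills(bill_ids, labels, max_cluster_size=1):
    """Return ids of bills whose fused cluster has at most max_cluster_size members."""
    sizes = Counter(labels)  # number of bills in each fused cluster
    return [bill for bill, label in zip(bill_ids, labels)
            if sizes[label] <= max_cluster_size]
```

The returned identifiers correspond to the documents that would be shown on the display terminal for manual review.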

In the embodiments of the present invention, an adaptive weight clustering fusion method is applied to the hospital's medical bill data, which enables intelligent detection of medical insurance fraud across every stage of the process. It reduces subjective human intervention and thereby improves the accuracy of the final result.

An embodiment of the present invention provides a medical insurance fraud detection device which, as shown in FIG. 5, includes:

a first determining module 501, configured to determine a plurality of medical bill data to be analyzed, the medical bill data to be analyzed including a plurality of evaluation indexes;

a dividing module 502, configured to divide the plurality of medical bill data to be analyzed to obtain a plurality of data subsets;

a second determining module 503, configured to determine, for each data subset, a subset weight set corresponding to the data subset;

a clustering module 504, configured to cluster the medical bill data to be analyzed included in the data subset according to the subset weight set corresponding to the data subset, to obtain the cluster member corresponding to the data subset;

a third determining module 505, configured to determine a corpus weight set corresponding to the corpus composed of all the medical bill data to be analyzed;

a fusion module 506, configured to fuse all cluster members according to the corpus weight set, wherein all the cluster members are composed of the cluster members corresponding to the respective data subsets;

a fourth determining module 507, configured to determine the isolated medical bill data to be analyzed obtained after the fusion, and to use the isolated medical bill data to be analyzed as suspicious medical bill data.

Optionally, the second determining module 503 is specifically configured to construct a first evaluation index weight function according to the medical bill data to be analyzed included in the data subset, and to determine, from the first evaluation index weight function and by means of a particle swarm optimization algorithm, the subset weight set corresponding to the data subset, wherein the subset weight set includes a first weight value for each evaluation index.
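A particle swarm search for a subset weight set could be sketched as follows. The first evaluation index weight function is defined elsewhere in the specification, so `objective` here is a hypothetical placeholder to be minimized; the swarm parameters, the sum-to-one normalization of the weights, and all names are likewise assumptions made for illustration.

```python
import random


def normalize(w):
    # Keep weights positive and rescale them to sum to 1 (assumed constraint).
    w = [max(x, 1e-9) for x in w]
    total = sum(w)
    return [x / total for x in w]


def pso_weights(objective, dim, n_particles=20, iters=100, seed=0,
                inertia=0.7, c1=1.5, c2=1.5):
    """Minimize objective(weights) with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [normalize([rng.random() for _ in range(dim)])
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
            pos[i] = normalize([p + v for p, v in zip(pos[i], vel[i])])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Because each data subset has its own objective, one such search per subset can run in parallel, which matches the parallel use of the algorithm described for the subsets.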

Optionally, the clustering module 504 is specifically configured to: determine, for each evaluation index, the sub-similarity between each pair of medical bill data to be analyzed included in the data subset; weight the sub-similarities determined for the respective evaluation indexes by the first weight values corresponding to those evaluation indexes in the subset weight set, to obtain the total similarity between each pair of medical bill data to be analyzed; and cluster the medical bill data to be analyzed included in the data subset according to the total similarity, to obtain the cluster member corresponding to the data subset.
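The weighting of sub-similarities into a total similarity can be illustrated as below. The paragraph above does not give the formula for a sub-similarity, so the sketch assumes each evaluation index has been scaled to [0, 1] and takes `1 - |x - y|` as the per-index sub-similarity; the function names are hypothetical.

```python
def sub_similarities(a, b):
    # One sub-similarity per evaluation index (assumed form: 1 - |x - y|).
    return [1.0 - abs(x - y) for x, y in zip(a, b)]


def total_similarity(a, b, weights):
    # Weighted sum of the sub-similarities, using the first weight values.
    return sum(w * s for w, s in zip(weights, sub_similarities(a, b)))


def similarity_matrix(bills, weights):
    # Total similarity between every pair of bills in one data subset.
    n = len(bills)
    return [[total_similarity(bills[i], bills[j], weights) for j in range(n)]
            for i in range(n)]
```

With weights (0.75, 0.25), two bills that agree on the first index but differ by 0.6 on the second have total similarity 0.75 · 1 + 0.25 · 0.4 = 0.85, so the heavily weighted index dominates the grouping.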

Optionally, the third determining module 505 is specifically configured to construct a second evaluation index weight function according to all the medical bill data to be analyzed, and to determine, from the second evaluation index weight function and by means of a differential evolution algorithm, the corpus weight set corresponding to the corpus, wherein the corpus weight set includes a second weight value for each evaluation index.
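A differential evolution search for the corpus weight set could be sketched as follows. As with the subset weights, the second evaluation index weight function is defined elsewhere, so `objective` is a hypothetical placeholder to be minimized; the DE/rand/1/bin variant, the parameters `f` and `cr`, and the sum-to-one normalization are assumptions.

```python
import random


def de_weights(objective, dim, pop_size=20, iters=150, f=0.6, cr=0.9, seed=0):
    """Minimize objective(weights) with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)

    def normalize(w):
        # Keep weights positive and rescale them to sum to 1 (assumed constraint).
        w = [max(x, 1e-9) for x in w]
        total = sum(w)
        return [x / total for x in w]

    pop = [normalize([rng.random() for _ in range(dim)])
           for _ in range(pop_size)]
    fit = [objective(p) for p in pop]
    for _ in range(iters):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = [
                pop[a][d] + f * (pop[b][d] - pop[c][d])
                if (rng.random() < cr or d == j_rand) else pop[i][d]
                for d in range(dim)
            ]
            trial = normalize(trial)
            trial_fit = objective(trial)
            # Selection: keep the trial only if it is at least as good.
            if trial_fit <= fit[i]:
                pop[i], fit[i] = trial, trial_fit
    best = min(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]
```

Only one such search is needed per run, since the corpus weight set is computed once over all the medical bill data to be analyzed.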

Optionally, the fusion module 506 is specifically configured to: determine a fusion strategy; determine, for each evaluation index, the second similarity between each pair of cluster centers, wherein the cluster centers are those corresponding to the respective cluster members among all the cluster members; and fuse all the cluster members through the fusion strategy according to the second weight values corresponding to the evaluation indexes in the corpus weight set and the second similarity between each pair of cluster centers.
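One possible fusion strategy is sketched below. The paragraph above leaves the strategy open, so this sketch assumes a simple one: cluster centers pooled from all cluster members are merged whenever their weighted distance (used here as the inverse of similarity) is at most a `threshold`. The threshold, the distance form, and the function names are all hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    # Distance between two cluster centers, scaled by the second weight values.
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def fuse_cluster_centers(centers, weights, threshold):
    """Return one fused cluster label per center; close centers share a label."""
    parent = list(range(len(centers)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if weighted_distance(centers[i], centers[j], weights) <= threshold:
                parent[find(j)] = find(i)

    # Relabel the union-find roots as consecutive integers in order of appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(centers))]
```

Each bill then inherits the fused label of the center of the cluster it belonged to in its own cluster member.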

Optionally, the first determining module 501 is specifically configured to acquire a plurality of original medical bill data and, for each original medical bill data, to preprocess the original medical bill data to obtain the medical bill data to be analyzed corresponding to the original medical bill data.

Optionally, the device further includes a display module, configured to display the suspicious medical bill data.

Optionally, the device further includes a saving module, configured to save the cluster members corresponding to the respective data subsets.

It should be noted that the medical insurance fraud detection device provided by the embodiment of the present invention is a device that applies the above medical insurance fraud detection method; therefore, all embodiments of the method apply to the device and achieve the same or similar beneficial effects.

An embodiment of the present invention further provides medical insurance fraud detection equipment which, as shown in FIG. 6, includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, wherein the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604.

The memory 603 is configured to store a computer program.

The processor 601 is configured to implement the steps of the above medical insurance fraud detection method when executing the program stored in the memory 603.

The communication bus of the above medical insurance fraud detection equipment may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above medical insurance fraud detection equipment and other devices.

The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

An embodiment of the present invention further provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the steps of the above medical insurance fraud detection method are implemented.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner; for the parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device, equipment, and computer-readable storage medium embodiments are basically similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A medical insurance fraud detection method, characterized by comprising:

determining a plurality of medical bill data to be analyzed, the medical bill data to be analyzed including a plurality of evaluation indexes;

dividing the plurality of medical bill data to be analyzed to obtain a plurality of data subsets;

for each data subset, determining a subset weight set corresponding to the data subset, and clustering the medical bill data to be analyzed included in the data subset according to the subset weight set corresponding to the data subset, to obtain the cluster member corresponding to the data subset;

determining a corpus weight set corresponding to the corpus composed of all the medical bill data to be analyzed;

fusing all cluster members according to the corpus weight set, wherein all the cluster members are composed of the cluster members corresponding to the respective data subsets; and

determining the isolated medical bill data to be analyzed obtained after the fusion, and using the isolated medical bill data to be analyzed as suspicious medical bill data.

2. The method according to claim 1, wherein the determining a subset weight set corresponding to the data subset comprises:

constructing a first evaluation index weight function according to the medical bill data to be analyzed included in the data subset; and

determining, according to the first evaluation index weight function and by means of a particle swarm optimization algorithm, the subset weight set corresponding to the data subset, wherein the subset weight set includes a first weight value for each evaluation index.

3. The method according to claim 2, wherein the clustering the medical bill data to be analyzed included in the data subset according to the subset weight set corresponding to the data subset, to obtain the cluster member corresponding to the data subset, comprises:

determining, for each evaluation index, the sub-similarity between each pair of medical bill data to be analyzed included in the data subset;

weighting the sub-similarities determined for the respective evaluation indexes by the first weight values corresponding to those evaluation indexes in the subset weight set, to obtain the total similarity between each pair of medical bill data to be analyzed; and

clustering the medical bill data to be analyzed included in the data subset according to the total similarity, to obtain the cluster member corresponding to the data subset.

4. The method according to claim 1, wherein the determining a corpus weight set corresponding to the corpus composed of all the medical bill data to be analyzed comprises:

constructing a second evaluation index weight function according to all the medical bill data to be analyzed; and

determining, according to the second evaluation index weight function and by means of a differential evolution algorithm, the corpus weight set corresponding to the corpus, wherein the corpus weight set includes a second weight value for each evaluation index.

5. The method according to claim 1, wherein the fusing all cluster members according to the corpus weight set comprises:

determining a fusion strategy;

determining, for each evaluation index, the second similarity between each pair of cluster centers, wherein the cluster centers are those corresponding to the respective cluster members among all the cluster members; and

fusing all the cluster members through the fusion strategy according to the second weight values corresponding to the evaluation indexes in the corpus weight set and the second similarity between each pair of cluster centers.

6. The method according to any one of claims 1 to 5, wherein the determining a plurality of medical bill data to be analyzed comprises:

acquiring a plurality of original medical bill data; and

for each original medical bill data, preprocessing the original medical bill data to obtain the medical bill data to be analyzed corresponding to the original medical bill data.

7. The method according to any one of claims 1 to 5, wherein, after the determining the isolated medical bill data to be analyzed obtained after the fusion and using the isolated medical bill data to be analyzed as suspicious medical bill data, the method further comprises:

displaying the suspicious medical bill data.

8. The method according to any one of claims 1 to 5, wherein, after the determining, for each data subset, a subset weight set corresponding to the data subset and clustering the medical bill data to be analyzed included in the data subset according to the subset weight set corresponding to the data subset, to obtain the cluster member corresponding to the data subset, the method further comprises:

saving the cluster members corresponding to the respective data subsets.

9. A medical insurance fraud detection device, characterized by comprising:

a first determining module, configured to determine a plurality of medical bill data to be analyzed, the medical bill data to be analyzed including a plurality of evaluation indexes;

a dividing module, configured to divide the plurality of medical bill data to be analyzed to obtain a plurality of data subsets;

a second determining module, configured to determine, for each data subset, a subset weight set corresponding to the data subset;

a clustering module, configured to cluster the medical bill data to be analyzed included in the data subset according to the subset weight set corresponding to the data subset, to obtain the cluster member corresponding to the data subset;

a third determining module, configured to determine a corpus weight set corresponding to the corpus composed of all the medical bill data to be analyzed;

a fusion module, configured to fuse all cluster members according to the corpus weight set, wherein all the cluster members are composed of the cluster members corresponding to the respective data subsets; and

a fourth determining module, configured to determine the isolated medical bill data to be analyzed obtained after the fusion, and to use the isolated medical bill data to be analyzed as suspicious medical bill data.

10. The device according to claim 9, wherein the second determining module is specifically configured to construct a first evaluation index weight function according to the medical bill data to be analyzed included in the data subset, and to determine, according to the first evaluation index weight function and by means of a particle swarm optimization algorithm, the subset weight set corresponding to the data subset, wherein the subset weight set includes a first weight value for each evaluation index.