CN113919812A - Method and device for checking crowd queue research data - Google Patents
- Publication number: CN113919812A
- Application number: CN202111203015.4A
- Authority: CN (China)
- Prior art keywords: data, field, correlation, verification, checking
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/103—Workflow collaboration or project management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Abstract
Description
Technical Field
The present application relates to the technical field of medical informatization, and in particular to a method and device for checking population cohort study data.
Background
With the development of medical information technology, the medical field continuously generates large amounts of medical data, such as population cohort study data corresponding to different diseases. It is therefore necessary to check population cohort study data in order to exclude or correct abnormal data, thereby ensuring the accuracy and usability of the data.
In the prior art, abnormal-data checking is mostly performed separately on each field of the population cohort study data table, but the checking accuracy of this approach is poor.
Summary of the Invention
In view of this, the present application provides a method and device for checking population cohort study data, which can effectively solve the problem that prior-art methods for checking population cohort study data have poor accuracy.
A brief overview of the present application is given below in order to provide a basic understanding of certain aspects of the application. It should be understood that this overview is not exhaustive. It is not intended to identify key or essential parts of the application, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description that follows.
According to a first aspect of the present application, a method for checking population cohort study data is provided, including:
Step 1: perform a first check on the fields in the data table to be checked; if all the fields meet a preset first check condition, select any trusted field in the data table to be checked as the first field;
Step 2: query the associated fields of the first field, and generate an associated field set based on the first field and its associated fields;
Step 3: perform correlation verification on the data entries in the associated field set according to the correlation verification condition of the associated field set; if a data entry in the associated field set does not meet the correlation verification condition, mark that data entry as potentially abnormal data;
Step 4: determine whether the potentially abnormal data is real data; if so, determine whether it satisfies the association condition corresponding to the target disease; if it does, mark the potentially abnormal data as normal. At the same time, determine whether the number or proportion of potentially abnormal data marked as normal reaches a preset threshold; if so, update the correlation verification condition of the associated field set based on the potentially abnormal data marked as normal, and perform correlation verification on the data entries to be verified in the associated field set based on the updated correlation verification condition.
In some embodiments, the method for checking population cohort study data further includes collecting data corresponding to the target disease and generating an initial data table, and sampling multiple data entries from the initial data table to generate the data table to be checked.
Further, if any field in the data table to be checked does not meet the preset first check condition, the data of the corresponding field in the initial data table is re-collected, the initial data table is updated based on the re-collected data, and multiple data entries are re-sampled from the initial data table to generate the data table to be checked.
In some embodiments, performing the first check on the fields in the data table to be checked includes one or more of a garbled-text check, a data format check, an information consistency check, a null rate check, a value range check, and a field length check.
In some embodiments, querying the associated fields of the first field includes: among one or more preset correlation verification conditions, querying the correlation verification conditions that involve the first field, and determining the associated fields of the first field based on those conditions.
In some embodiments, after the potentially abnormal data is determined to be real data, if it is determined that the potentially abnormal data does not satisfy the association condition corresponding to the target disease, that potentially abnormal data is deleted.
In some embodiments, the correlation verification condition of the associated field set includes value-range matching among the related fields, which include the first field and its corresponding associated fields. Performing correlation verification on the data entries in the associated field set includes determining whether the values of the first field and its corresponding associated fields in a data entry fall within the corresponding value ranges.
Further, updating the correlation verification condition of the associated field set includes updating the value range of one or more associated fields.
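Step 4 and the range update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `RangeCondition` and `review_flagged`, the two callback parameters, the count-based threshold, and the rule of widening a range just enough to cover the entries confirmed as normal are all hypothetical. The text in this section does not say how the range is updated or what happens to entries judged not to be real data, so the sketch simply returns those entries unresolved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Entry = Dict[str, float]


@dataclass
class RangeCondition:
    """Hypothetical weight range for one age: lower <= weight <= upper."""
    lower: float
    upper: float


def review_flagged(
    flagged: List[Entry],
    is_real: Callable[[Entry], bool],
    meets_disease_condition: Callable[[Entry], bool],
    conditions: Dict[int, RangeCondition],
    threshold: int,
) -> Tuple[List[Entry], List[Entry], List[Entry], bool]:
    """Return (marked_normal, deleted, unresolved, conditions_updated)."""
    normal: List[Entry] = []
    deleted: List[Entry] = []
    unresolved: List[Entry] = []
    for entry in flagged:
        if not is_real(entry):
            unresolved.append(entry)      # handling not specified in this section
        elif meets_disease_condition(entry):
            normal.append(entry)          # real and relevant: mark as normal
        else:
            deleted.append(entry)         # real but fails the disease condition

    updated = False
    if len(normal) >= threshold:          # assumption: count-based threshold
        for entry in normal:
            cond = conditions.get(int(entry["age"]))
            if cond is None:
                continue
            # Assumption: widen the range just enough to cover the entry.
            cond.lower = min(cond.lower, entry["weight"])
            cond.upper = max(cond.upper, entry["weight"])
        updated = True
    return normal, deleted, unresolved, updated
```

After an update, the entries still awaiting verification would be re-checked against the widened ranges, as Step 4 describes.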
In some embodiments, the method for checking population cohort study data further includes repeating Steps 2 to 4 for each of the other fields in the data table to be checked besides the first field, until all trusted fields in the data table to be checked have been traversed.
According to a second aspect of the present application, a device for checking population cohort study data is provided, including:
a first checking unit, configured to perform a first check on the fields in the data table to be checked and, if all the fields meet a preset first check condition, select any trusted field in the data table to be checked as the first field;
an associated field set generating unit, configured to query the associated fields of the first field and generate an associated field set based on the first field and its associated fields;
a correlation verification unit, configured to perform correlation verification on the data entries in the associated field set according to the correlation verification condition of the associated field set and, if a data entry in the associated field set does not meet the correlation verification condition, mark that data entry as potentially abnormal data;
a correlation verification condition updating unit, configured to determine whether the potentially abnormal data is real data; if so, determine whether it satisfies the association condition corresponding to the target disease; if it does, mark the potentially abnormal data as normal; at the same time, determine whether the number or proportion of potentially abnormal data marked as normal reaches a preset threshold; if so, update the correlation verification condition of the associated field set based on the potentially abnormal data marked as normal, and perform correlation verification on the data entries to be verified in the associated field set based on the updated correlation verification condition.
According to a third aspect of the present application, an electronic device is provided, including:
one or more processors;
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to perform the method according to the first aspect.
According to a fourth aspect of the present application, a computer-readable medium is provided, on which executable instructions are stored; when executed by a processor, the instructions cause the processor to perform the method according to the first aspect.
The present application proposes a method and device for checking population cohort study data. By constructing an associated field set and performing correlation verification on the data in that set, it effectively solves the problem of poor checking accuracy in existing methods for checking population cohort study data.
Brief Description of the Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort. The above and other objects, features, and advantages of the present application will become clearer from the drawings. The same reference numerals indicate the same parts throughout the drawings. The drawings are not deliberately drawn to scale; the emphasis is on illustrating the gist of the present application.
FIG. 1 is a schematic flowchart of a method for checking population cohort study data provided according to an embodiment of the present application.
FIG. 2 is a two-dimensional coordinate diagram for a method for checking population cohort study data provided according to an embodiment of the present application.
FIG. 3 is a system structure diagram of a device for checking population cohort study data provided according to an embodiment of the present application.
FIG. 4 is a schematic structural diagram of an electronic device provided according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings. In the description of the present application, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Furthermore, the term "and/or" in the present application merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A alone, both A and B, or B alone.
Exemplary embodiments of the present application are described below with reference to the drawings. For clarity and conciseness, not all features of an actual embodiment are described in the specification. It should be understood, however, that many embodiment-specific decisions may be made while developing any such actual embodiment in order to achieve the developer's specific goals, and that these decisions may vary from one embodiment to another.
It should also be noted that, to avoid obscuring the present application with unnecessary detail, the drawings show only the device structures closely related to the solution of the present application and omit other details that have little to do with it.
It should be understood that the present application is not limited to the described embodiments by the following description with reference to the drawings. Where feasible, embodiments may be combined with one another, features may be substituted or borrowed between different embodiments, and one or more features may be omitted from an embodiment.
Prior-art methods for checking population cohort study data generally check whether the data in a given field of the data table is null and whether the data values fall within a certain range. Such methods can only check the values within a single field and do not perform correlation verification on fields that are related to each other, so their accuracy is poor.
To solve the above problem and improve accuracy, the embodiments of the present application provide a method and device for checking population cohort study data. The method for checking population cohort study data provided by the embodiments of the present application is described in detail first.
FIG. 1 shows a schematic flowchart 100 of a method for checking population cohort study data provided according to an embodiment of the present application. The method specifically includes:
Step 110: perform a first check on the fields in the data table to be checked; if all the fields meet a preset first check condition, select any trusted field in the data table to be checked as the first field.
In the embodiments of the present application, the data table to be checked may consist of multiple patient data entries corresponding to the target disease under study, for example multiple patient data entries corresponding to obesity. Table 1 provides an example of a data table to be checked. Note that the example shows only part of the data in the table; the other data entries are omitted.
Table 1 - Data table to be checked, generated from obesity patient data
In the embodiments of the present application, a trusted field is a field that can identify a patient and is obtained by machine parsing or extraction from relatively inherent information. Relatively inherent information may be, for example, the information obtained when a patient registers for a visit by swiping an ID card or medical card, such as the visit number, ID number, and gender. Trusted fields obtained from it by machine parsing or extraction include, for example, age and gender. Correspondingly, the fields in the data table to be checked other than trusted fields are defined as untrusted fields. Untrusted fields mainly represent parameters related to the patient's condition, such as weight, blood glucose, and waist circumference, and may be entered manually or imported in batches by computer. In the embodiments of the present application, the age field in Table 1 is a trusted field and the weight field is an untrusted field.
The embodiments of the present application may perform the first check on the fields in the data table to be checked from the following angles: whether garbled text is present, the data format, information consistency, whether null values exist, whether the value range of the data in the field is reasonable, and whether the data length meets requirements. Accordingly, the first check on the fields in the data table to be checked may include one or more of a garbled-text check, a data format check, an information consistency check, a null rate check, a value range check, and a field length check on the data in the field.
In theory, fields extracted by machine are highly accurate and are basically assumed to be error-free, which is why they are designated as trusted fields. However, machine extraction may still produce garbled data, data format errors, incomplete data, or extracted information that is inconsistent with the inherent information. Specifically, therefore, the data in the trusted fields of the data table to be checked may undergo one or more of the garbled-text check, data format check, and information consistency check, while the data in the untrusted fields may undergo one or more of the null rate check, value range check, and field length check.
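The split between trusted-field and untrusted-field checks can be sketched as follows. This is only an illustrative sketch: the function names, the use of the Unicode replacement character as a garbled-text signal, the regular-expression format check, and all thresholds are assumptions that do not appear in the application. The information consistency check is omitted because it depends on how the inherent information (for example, the ID number) is stored.

```python
import re
from typing import List, Optional


def check_trusted_field(values: List[str], pattern: str) -> bool:
    """Garbled-text and data format check for a trusted field.

    Assumption: U+FFFD (the replacement character) signals a decoding
    failure, and `pattern` describes the expected format of every value.
    """
    regex = re.compile(pattern)
    return all(
        "\ufffd" not in v and regex.fullmatch(v) is not None for v in values
    )


def check_untrusted_field(
    values: List[Optional[str]],
    low: float,
    high: float,
    max_len: int,
    max_null_rate: float,
) -> bool:
    """Null rate, value range, and field length check for an untrusted field."""
    if not values:
        return False
    nulls = sum(1 for v in values if v is None or v == "")
    if nulls / len(values) > max_null_rate:
        return False
    for v in values:
        if v is None or v == "":
            continue                      # already counted in the null rate
        if len(v) > max_len:
            return False                  # field length check
        try:
            number = float(v)
        except ValueError:
            return False                  # not numeric at all
        if not (low <= number <= high):
            return False                  # value range check
    return True
```

For an age column one might call `check_trusted_field(ages, r"\d{1,3}")`, and for a weight column `check_untrusted_field(weights, 120, 500, 5, 0.0)`; the bounds here are placeholders.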
In the embodiments of the present application, the first check on the fields in the data table to be checked is not limited to the above dimensions; other dimensions for checking fields that those skilled in the art can readily think of all fall within the scope of protection of the present application.
In the embodiments of the present application, a first check condition is preset for the first check on the fields in the data table to be checked. The first check condition may be that the data in the corresponding field passes all or some of the check items in the first check. For example, when the first check includes the three items of garbled-text check, data format check, and information consistency check, the first check condition may be that the data in the trusted field passes all three items, that is, the data in the field contains no garbled text, has a uniform format, and is consistent with the information extracted from inherent information such as the ID number. Alternatively, when the first check includes the three items of null rate check, value range check, and field length check, the first check condition may be that the data in the field passes all three items, that is, the data in the field contains no null values, its value range is reasonable, and its length meets requirements. The first check condition is not limited to these cases; those skilled in the art may set it according to their needs.
Further, if all the fields meet the preset first check condition, any trusted field in the data table to be checked is selected as the first field. For example, the age field in Table 1 may be selected as the first field.
Step 120: query the associated fields of the first field, and generate an associated field set based on the first field and its associated fields.
In the embodiments of the present application, one or more correlation verification conditions, which can verify the correlation between related fields in the data table to be checked, are preset in the server. Querying the associated fields of the first field may consist of querying, among the preset correlation verification conditions, the conditions that involve the first field, and determining the associated fields of the first field based on those conditions.
In the embodiments of the present application, a correlation verification condition may refer to value-range matching among the related fields, which include the first field and its corresponding associated fields. For example, when the collected data is obesity patient data, the criteria for obesity include weight ranges corresponding to different ages, so the correlation verification condition for obesity patient data may be value-range matching between the two related fields of age and weight. Assume that the specific correlation verification condition is as shown in Table 2, which gives the lower and upper weight limits for obesity patients of different ages. Note that the correlation verification conditions for obesity patient data may also include value-range matching among other related fields, such as gender, age, and waist circumference.
Table 2 - Correlation verification condition provided by an embodiment of the present application
In the embodiments of the present application, when the trusted field is age, querying the correlation verification conditions that involve the age field yields the condition covering the two fields of age and weight. The associated field of the age field is thereby determined to be the weight field, and an associated field set can be generated based on the age field and the weight field.
The first check on the fields in the data table to be checked can screen out obviously abnormal data, and the prior art usually checks data in this way. Used alone, however, it does not consider the relationships between different fields and therefore cannot screen out all abnormal data. For example, a value range check is often based on empirical data and uses a single general-purpose range. Take obesity as an example: if the value range corresponding to obesity is ≥120 jin (1 jin = 0.5 kg), every weight value ≥120 jin passes the check. For actual obesity patients, however, the weight range differs by age. As shown in Table 2, the obesity weight range for age 42 is ≥125 jin. For a 42-year-old patient weighing 123 jin, the weight value passes the value range check in the first check, but when judged together with age it does not fall within the obesity weight range and is abnormal data. Correlation verification therefore further improves the accuracy of abnormal-data screening.
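The 42-year-old example can be worked through in a few lines. The sketch below uses only the two figures quoted in the text (the general ≥120 jin range and the ≥125 jin bound for age 42); the function names are hypothetical, and treating an age with no listed bound as "not flagged" is an assumption made purely for illustration.

```python
from typing import Dict

GENERAL_LOWER_JIN = 120.0                        # general-purpose range from the text
AGE_LOWER_JIN: Dict[int, float] = {42: 125.0}    # only the age-42 bound is quoted


def passes_range_check(weight: float) -> bool:
    """Single-field value range check, as in the first check."""
    return weight >= GENERAL_LOWER_JIN


def passes_correlation_check(age: int, weight: float) -> bool:
    """Age-dependent check on the {age, weight} associated field set."""
    bound = AGE_LOWER_JIN.get(age)
    if bound is None:
        return True   # assumption: no condition for this age, so nothing to flag
    return weight >= bound


# The patient from the text: 42 years old, 123 jin.
entry = {"age": 42, "weight": 123.0}
range_ok = passes_range_check(entry["weight"])
correlation_ok = passes_correlation_check(entry["age"], entry["weight"])
potentially_abnormal = range_ok and not correlation_ok
```

Here the entry passes the single-field check but fails the correlation check, so it would be marked as potentially abnormal data.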
In the embodiments of the present application, the same trusted field may be involved in one or more correlation verification conditions, in which case one or more associated field sets are generated accordingly. For example, the correlation verification conditions for an obesity patient data table may include value-range matching between the two related fields of age and weight, and may also include value-range matching among the three related fields of age, gender, and waist circumference. In this case, two associated field sets are generated for the trusted field: {age, weight} and {age, gender, waist circumference}.
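The lookup of associated field sets can be sketched as a filter over the preset conditions. In this sketch each condition is reduced to the tuple of field names it involves; that representation, the function name, and the third condition (`gender`, `height`), which is added only to show that unrelated conditions are skipped, are assumptions rather than anything stated in the application.

```python
from typing import List, Sequence, Tuple

# Each preset correlation verification condition, reduced to its field names.
PRESET_CONDITIONS: List[Tuple[str, ...]] = [
    ("age", "weight"),
    ("age", "gender", "waist_circumference"),
    ("gender", "height"),   # hypothetical condition that does not involve age
]


def associated_field_sets(
    first_field: str, conditions: Sequence[Tuple[str, ...]]
) -> List[Tuple[str, ...]]:
    """Return one associated field set per condition that involves `first_field`."""
    return [fields for fields in conditions if first_field in fields]


age_sets = associated_field_sets("age", PRESET_CONDITIONS)
```

With `age` as the first field, two field sets come back, matching the {age, weight} and {age, gender, waist circumference} example in the text.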
Further, in the embodiments of the present application, the selected trusted field may be marked after the associated field set is generated. With the selected trusted field marked, once correlation verification of the associated field sets corresponding to that trusted field is complete, other trusted fields can be selected and their associated field sets verified in turn. This avoids selecting the same trusted field twice and thereby improves data processing efficiency.
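In the embodiments of the present application, before step 110 of the method for checking population cohort study data, the method may further include: collecting data corresponding to the target disease under study and generating an initial data table, where the columns of the initial data table correspond to the different fields of the data; and sampling multiple data records from the initial data table to generate the data table to be checked.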
In the embodiments of the present application, patient data corresponding to the target disease may be retrieved from a server by keyword search, for example by searching for obesity patient data with "obesity" and similar keywords, and the retrieved data is gathered into a single table to form the initial data table. The columns of the initial data table correspond to the different fields of the data, for example the age and weight fields of obesity patients, or fields such as age, gender, weight, and waist circumference. Further, to improve data processing efficiency, multiple data records are sampled from the initial data table to generate the data table to be checked; for example, multiple obesity patient records may be sampled from the initial data table. Note that the patient data for each disease may be stored on the server, or on other storage media that the server can query.
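The keyword search and sampling described above might look like the following sketch. The record layout (a `diagnosis` text field), the function names, and the fixed random seed are all assumptions made for illustration; the description does not specify how the search or the sampling is performed.

```python
import random


def build_initial_table(records, keywords):
    """Keep records whose (hypothetical) diagnosis text matches any keyword."""
    return [r for r in records if any(k in r["diagnosis"] for k in keywords)]


def sample_table(initial_table, k, seed=0):
    """Draw up to k records without replacement to form the table to be checked."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(initial_table, min(k, len(initial_table)))
```

Simple random sampling is used here because it is the most direct reading of "sampling multiple data records"; a stratified scheme would be an equally valid choice.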
In the embodiments of the present application, acquiring data corresponding to the target disease is not limited to the method above; other methods for acquiring such data that those skilled in the art can readily conceive of all fall within the protection scope of the present application.
In the embodiments of the present application, during the first check on the fields of the data table to be checked, if any field in that table fails the preset first check condition, the data for that field in the initial data table may be re-collected, the initial data table updated with the re-collected data, and multiple data records re-sampled from the initial data table to generate a new data table to be checked. Here, re-collecting the data for the field may mean collecting all data for that field, or collecting only the data in that field that failed a check item of the first check condition.
Step 130: Perform correlation verification on the data entries in the associated field set according to the correlation verification condition of the associated field set; if a data entry in the associated field set does not satisfy the correlation verification condition, mark that data entry as potentially abnormal data.
In the embodiments of the present application, a data entry may correspond to a data row in the data table to be checked. Because an associated field set is a collection of the data of several fields of the data table to be checked, once the associated field set is generated a data entry corresponds to a data row of the associated field set. In the embodiments of the present application, the data table to be checked shown in Table 1 contains only the two fields age and weight; when the trusted field is the age field and the generated associated field set is {age, weight}, the corresponding data is as shown in Table 1.
In the embodiments of the present application, performing correlation verification on the data entries in the associated field set includes determining whether the values of the first field of a data entry and of its corresponding associated fields fall within the corresponding value ranges. For example, when the first field is the age field and the generated associated field set is {age, weight}, correlation verification is performed on the data entries of that set according to the correlation verification condition of {age, weight}, that is, the condition in Table 2. If the values of the age field and the corresponding weight field of a data entry fall within the corresponding value ranges given in Table 2, the data entry passes correlation verification; if they do not, the data entry is marked as potentially abnormal data. For example, comparing the data in Table 1 with the correlation verification condition in Table 2 shows that the data entries for ages 4, 18, 46, and 10 fall within the value ranges given by the condition, so these entries pass correlation verification. The weight values of the data entries for age 42 and for ages 18-24 fall outside the value ranges given by the condition; these entries do not satisfy the correlation verification condition and are marked as potentially abnormal data.
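A minimal sketch of this per-entry verification follows. Table 2 is not reproduced in this section, so the age brackets and weight ranges passed to the functions are placeholders supplied by the caller; the function names and the decision to flag entries with no matching rule are likewise assumptions.

```python
def weight_range_for(age, rules):
    """Look up the weight range for an age in a list of
    ((min_age, max_age), (min_weight, max_weight)) pairs."""
    for (min_age, max_age), weight_range in rules:
        if min_age <= age <= max_age:
            return weight_range
    return None


def flag_potentially_abnormal(entries, rules):
    """Return the entries whose age/weight pair fails the correlation rule."""
    flagged = []
    for entry in entries:
        bounds = weight_range_for(entry["age"], rules)
        # Assumption: an entry with no applicable rule is flagged for review.
        if bounds is None or not (bounds[0] <= entry["weight"] <= bounds[1]):
            flagged.append(entry)
    return flagged
```

Entries that pass are simply left unflagged, so the returned list corresponds to the data entries that would be marked as potentially abnormal data.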
In the embodiments of the present application, the data entries in the associated field set may also be converted into coordinate points in a two-dimensional or three-dimensional coordinate system, and the value ranges in the correlation verification condition converted into a credible interval in the same coordinate system. For example, as shown in FIG. 2, a rectangular coordinate system can be constructed with the age field of Table 1 as the X axis and the weight field as the Y axis, and each data entry in Table 1 converted into a coordinate point according to its values. Likewise, the upper and lower limits of the weight values for obesity patients of different ages given in Table 2 can be plotted and connected to produce two fitted curves, one for the upper weight limit and one for the lower weight limit. The region between the two curves is taken as the credible interval, which converts the correlation verification condition of Table 2 into a credible interval in the coordinate system.
In this case, performing correlation verification on the data entries in the associated field set may consist of determining whether the coordinate point defined by the first field of a data entry and its corresponding associated fields lies within the credible interval; if the coordinate point is not within the credible interval, the data entry is marked as potentially abnormal data. For example, the coordinate point for the data entry with age 42 in FIG. 2 is not within the credible interval, so that data entry is marked as potentially abnormal data.
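The coordinate-based test can be sketched by treating each limit curve as a piecewise-linear line through the plotted points. This is an assumption: the description only says the points are plotted and connected into fitted curves, and a smoother fit would work equally well. The function names and the clamping behaviour outside the plotted age range are also illustrative choices.

```python
from bisect import bisect_right


def interpolate(points, x):
    """Piecewise-linear value at x through (x, y) points sorted by x.
    Outside the plotted range the nearest end value is used."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def in_credible_interval(age, weight, lower_points, upper_points):
    """True if the (age, weight) point lies between the two limit curves."""
    return interpolate(lower_points, age) <= weight <= interpolate(upper_points, age)
```

A data entry whose point falls outside the band returned by the two curves would then be marked as potentially abnormal data.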
Step 140: Determine whether the potentially abnormal data is real data. If so, determine whether it satisfies the association condition corresponding to the target disease; if it does, mark the potentially abnormal data as normal. At the same time, determine whether the number or proportion of potentially abnormal data marked as normal has reached a preset threshold; if so, update the correlation verification condition of the associated field set based on the potentially abnormal data marked as normal, and perform correlation verification on the data entries still to be verified in the associated field set based on the updated condition.
In the embodiments of the present application, for the potentially abnormal data in the example above, the weight value in the data entry is manually verified to be real data or not, that is, whether it matches the patient's actual condition. If the weight value does not match the patient's actual condition, it is not real data (possibly because of an error during manual entry), and the data for that data entry needs to be re-collected. If the weight value matches the patient's actual condition, the potentially abnormal data is real data.
In the embodiments of the present application, an association condition corresponding to the target disease is also preset. The association condition is that the data in the data table to be checked is caused only by the target disease and is not affected by other diseases. For example, when the target disease under study is obesity, the association condition for obesity is that the data in the data table to be checked is caused only by obesity and is not affected by other diseases. If a patient has both obesity and a malignant tumor, the data relevant to the obesity study can become abnormal; a malignant tumor, for example, may cause rapid weight loss. During patient data collection, however, the data of different hospital departments is not shared, so patient data can only be collected from the department corresponding to the target disease. Only after patient data collection is complete can the patient data of other departments be obtained through the server to determine whether the potentially abnormal data satisfies the association condition corresponding to the target disease.
Further, in the embodiments of the present application, when the potentially abnormal data is determined to be real data, it is then determined whether the potentially abnormal data satisfies the association condition corresponding to the target disease; if it does, the potentially abnormal data is marked as normal. This further improves the precision of data screening in the associated field set.
In the embodiments of the present application, for example, if the data entry with age 42 and weight 56 in the associated field set above is determined to be real data, it is further determined whether the data entry satisfies the association condition for obesity, that is, whether the data in the entry is caused only by obesity and is not affected by other diseases. If the data in the entry is indeed caused only by obesity, that is, the weight value falls below the value range given in the correlation verification condition for a reason other than another disease, the data entry is determined to satisfy the association condition for obesity and is marked as normal.
On the other hand, if the potentially abnormal data is determined not to satisfy the association condition corresponding to the target disease, that potentially abnormal data may be deleted. In the embodiments of the present application, for example, if the data entry with age 42 and weight 56 in the associated field set above is determined to be real data, but the corresponding patient has another disease in addition to obesity and that disease causes the weight value in the entry to fall below the weight range given in the correlation verification condition, the data entry is determined not to satisfy the association condition for obesity and is deleted from the current associated field set.
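The decision flow for a single potentially abnormal entry, as described in the preceding paragraphs, can be summarized in a short sketch. The enum and function names are hypothetical; the two boolean inputs stand for the manual authenticity check and the association-condition check, whose outcomes are determined outside this code.

```python
from enum import Enum


class Outcome(Enum):
    RECOLLECT = "recollect"      # not real data: collect the entry again
    MARK_NORMAL = "mark_normal"  # real data explained by the target disease alone
    DELETE = "delete"            # real data, but influenced by another disease


def triage(is_real_data, meets_association_condition):
    """Map the two checks of step 140 to the action taken on the entry."""
    if not is_real_data:
        return Outcome.RECOLLECT
    if meets_association_condition:
        return Outcome.MARK_NORMAL
    return Outcome.DELETE
```

Note that the association condition is only consulted once the entry has been confirmed as real data, matching the order of the checks in step 140.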
In the embodiments of the present application, when potentially abnormal data is marked as normal, it is also determined whether the number or proportion of potentially abnormal data marked as normal has reached a preset threshold; if so, the correlation verification condition of the associated field set is updated based on the potentially abnormal data marked as normal. The correlation verification condition of the associated field set is preset on the basis of experience, but as patient data accumulates and is updated, new data may appear to which the preset condition no longer applies. The correlation verification condition therefore needs to be updated iteratively to keep the patient data screening results accurate.
In the embodiments of the present application, for example, when the associated field set {age, weight} contains N data entries, the threshold for the number of potentially abnormal data marked as normal may be preset on the server to N/3, and a counter may be used to count the potentially abnormal data marked as normal in real time. When that count reaches the preset threshold N/3, the correlation verification condition of the associated field set is updated. The threshold for the number of potentially abnormal data marked as normal is not limited to this value; those skilled in the art may set the threshold according to actual needs.
In the embodiments of the present application, for example, when the associated field set {age, weight} contains N data entries, the threshold for the proportion of potentially abnormal data marked as normal may be preset on the server to 1/3. When that proportion reaches the preset threshold of 1/3, the correlation verification condition of the associated field set is updated. The threshold for the proportion of potentially abnormal data marked as normal is not limited to this value; those skilled in the art may set the threshold according to actual needs.
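Because a count threshold of N/3 and a proportion threshold of 1/3 describe the same trigger point, both examples can be sketched with one function. The function name and the use of exact fractions (to avoid floating-point rounding at the boundary) are choices made for this illustration only.

```python
from fractions import Fraction


def reached_threshold(marked_normal, total_entries, ratio=Fraction(1, 3)):
    """True once the share of entries marked as normal reaches the ratio."""
    if total_entries <= 0:
        return False
    return Fraction(marked_normal, total_entries) >= ratio
```

With N = 9 entries, for instance, the condition first holds when the counter of entries marked as normal reaches 3.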
In the embodiments of the present application, updating the correlation verification condition of the associated field set may consist of updating the value range of one or more associated fields based on the values of the potentially abnormal data marked as normal. For example, suppose the data entries for age 42 and for ages 18-24 in the associated field set above are all potentially abnormal data marked as normal. Because the weight values of the entries for ages 18-24 are relatively close to the correlation verification condition given in Table 2, the value range of the weight field for ages 18-24 in the correlation verification condition can be updated based on those weight values. Specifically, which of the potentially abnormal data marked as normal is close to the given correlation verification condition can be determined by checking whether its coordinate point lies near the credible interval. For example, as shown in FIG. 2, the coordinate points for the data entries with ages 18-24 are relatively close to the credible interval, whereas the coordinate point for the data entry with age 42 is far from it. The weight values of the entries for ages 18-24 are therefore selected to update the weight value range for ages 18-24 in the correlation verification condition.
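One possible reading of this update is sketched below: values that were marked as normal and lie within a chosen distance of the current range extend the range, while distant values are ignored. The description does not say how "near the credible interval" is measured or how the range is adjusted, so the `max_distance` parameter, the function names, and the widening rule are all assumptions.

```python
def distance_to_range(value, low, high):
    """Distance from a value to the closed range [low, high]; 0 if inside."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0


def widen_range(low, high, accepted_values, max_distance):
    """Extend [low, high] to cover accepted values that lie close to it."""
    near = [v for v in accepted_values
            if distance_to_range(v, low, high) <= max_distance]
    if not near:
        return low, high
    return min(low, *near), max(high, *near)
```

For a range of 60 to 150 and accepted values 58 and 20 with `max_distance=5`, only 58 is close enough, so the range becomes 58 to 150.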
In the embodiments of the present application, the correlation verification condition of the associated field set may be updated manually or automatically.
In the embodiments of the present application, updating the correlation verification condition of the associated field set may also consist of selecting one value of a trusted field from the potentially abnormal data marked as normal, performing statistical analysis on the values of the associated fields that correspond to that value in the data table to be checked, and automatically updating the correlation verification condition according to the result. For example, suppose the data entries for age 42 and for ages 18-24 in the associated field set above are all potentially abnormal data marked as normal. The value age 18 can be selected from these entries, and the weight values of all data entries with age 18 in the data table to be checked, together with their frequencies, can be analyzed statistically. The weight value range that occurs most frequently is selected, and the weight value range for age 18 in the correlation verification condition is updated based on it.
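The frequency analysis for one age value could be sketched as a simple histogram. The fixed bin width and the function name are assumptions; the description only states that the most frequent weight value range is selected, without saying how ranges are formed.

```python
from collections import Counter


def most_frequent_weight_range(entries, age, bin_width=5):
    """Return the (low, high) weight bin that occurs most often for an age,
    or None if no entry has that age. Ties go to the bin seen first."""
    weights = [e["weight"] for e in entries if e["age"] == age]
    if not weights:
        return None
    bins = Counter(int(w // bin_width) for w in weights)
    top_bin, _count = bins.most_common(1)[0]
    return top_bin * bin_width, (top_bin + 1) * bin_width
```

The returned bin would then serve as the basis for the updated weight value range for that age.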
Note that when the associated field set contains many fields, updating its correlation verification condition may also involve updating the value ranges of several associated fields; this case is not described further in the embodiments of the present application.
The present application builds an associated field set from the first field and its associated fields and verifies that set against a correlation verification condition, which effectively improves the checking precision for population cohort study data. In addition, during correlation verification of the associated field set, the correlation verification condition can be updated according to the potentially abnormal data marked as normal, and the updated condition is used to verify the data entries still to be verified in the set. This prevents data entries still to be verified from being unnecessarily identified as potentially abnormal data, and thus effectively improves data checking efficiency.
In the embodiments of the present application, steps 120 to 140 complete the correlation verification of the one or more associated field sets corresponding to the first-selected trusted field, or to the only trusted field, in the data table to be checked.
The embodiments of the present application also provide another method for checking population cohort study data, which takes into account the case in which the data table to be checked has multiple trusted fields. After steps 110 to 140 are completed, the method further includes repeating steps 120 to 140 for the other trusted fields in the data table to be checked until all trusted fields in the table have been traversed.
In addition, if an associated field set involves two or more trusted fields, the same associated field set will come up more than once during the data checking process above. In that case, to improve data processing efficiency, associated field sets that have already undergone correlation verification can be skipped, and correlation verification performed directly on the remaining associated field sets.
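The traversal with skipping can be sketched as a loop over trusted fields that remembers which field sets have already been verified. The function name, the `verify` callback, and the representation of field sets as `frozenset` objects are assumptions for illustration.

```python
def verify_all(trusted_fields, rules, verify):
    """Visit each trusted field in turn and verify each associated field
    set exactly once, even if several trusted fields share it."""
    verified = set()
    for field in trusted_fields:
        for fields in rules:
            if field in fields and fields not in verified:
                verify(fields)        # hypothetical per-set verification step
                verified.add(fields)
    return verified
```

If both age and gender are trusted fields, the {age, gender, waist circumference} set is verified when age is processed and skipped when gender is processed.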
Corresponding to the method for checking population cohort study data provided by the embodiments of the present application, the embodiments of the present application further provide a device for checking population cohort study data, whose system structure is shown in FIG. 3.
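The device 300 for checking population cohort study data provided by the embodiments of the present application includes a first check unit 301, an associated field set generation unit 302, a correlation verification unit 303, and a correlation verification condition update unit 304, wherein: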
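the first check unit 301 is configured to perform a first check on the fields of the data table to be checked and, if all the fields satisfy the preset first check condition, to select any field of the data table to be checked as the first field;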
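the associated field set generation unit 302 is configured to query the associated fields of the first field and to generate an associated field set based on the first field and its associated fields;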
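the correlation verification unit 303 is configured to perform correlation verification on the data entries in the associated field set according to the correlation verification condition of the associated field set and, if a data entry in the associated field set does not satisfy the correlation verification condition, to mark that data entry as potentially abnormal data;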
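the correlation verification condition update unit 304 is configured to determine whether the potentially abnormal data is real data; if so, to determine whether the potentially abnormal data satisfies the association condition corresponding to the target disease and, if it does, to mark the potentially abnormal data as normal; and at the same time to determine whether the number or proportion of potentially abnormal data marked as normal has reached a preset threshold and, if so, to update the correlation verification condition of the associated field set based on the potentially abnormal data marked as normal and to perform correlation verification on the data entries still to be verified in the associated field set based on the updated condition.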
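FIG. 4 shows a schematic structural diagram of an electronic device 400 provided according to an embodiment of the present application. As shown in FIG. 4, the electronic device 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores the various programs and data required for the operation of the electronic device. The CPU 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.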
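The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker, and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 408 as needed.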
In particular, according to the embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as a computer software program. For example, the embodiments of the present application include a computer program product that includes a computer-readable medium carrying instructions. In such embodiments, the instructions may be downloaded from a network and installed via the communication section 409, and/or installed from the removable medium 411. When the instructions are executed by the central processing unit (CPU) 401, the method steps described in the present invention are performed.
The embodiments described above are only specific implementations of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of variations, or make equivalent replacements of some of the technical features. Such modifications, variations, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111203015.4A CN113919812A (en) | 2021-10-15 | 2021-10-15 | Method and device for checking crowd queue research data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN113919812A true CN113919812A (en) | 2022-01-11 |
Family
ID=79240701
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111203015.4A Pending CN113919812A (en) | 2021-10-15 | 2021-10-15 | Method and device for checking crowd queue research data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113919812A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114912804A (en) * | 2022-05-17 | 2022-08-16 | 四川大学华西医院 | Scientific research data related property control method and system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109636061B (en) | | Training method, device and equipment for medical insurance fraud prediction network and storage medium |
| CN111309822B (en) | | User identification method and device |
| EP2936366B1 (en) | | Method and system for network validation of information |
| CN112365987A (en) | | Diagnostic data anomaly detection method and device, computer equipment and storage medium |
| CN109634941B (en) | | Medical data processing method, device, electronic device and storage medium |
| CN110197214A (en) | | Patient identity matching method based on multi-field similarity calculation |
| CN115794916A (en) | | Data processing method, device, equipment and storage medium for multi-source data fusion |
| JP2016149127A (en) | | Device and method for determining entity attribute value |
| CN113782195A (en) | | Medical examination package customization method and device |
| WO2024103765A1 (en) | | Sensitive data recognition model generation method and apparatus, and device and storage medium |
| CN110298371A (en) | | Method and apparatus for data clustering |
| CN113919812A (en) | | Method and device for checking crowd queue research data |
| CN115715418A (en) | | Disease risk prediction method, device, storage medium and electronic equipment |
| CN108228896A (en) | | Density-based missing data completion method and device |
| CN115391611A (en) | | Report comparison method and system and electronic equipment |
| WO2020199692A1 (en) | | Method and apparatus for screening predictive image features for cancer metastasis, and storage medium |
| CN106339401A (en) | | Method and equipment for confirming relationship between entities |
| CN111190902A (en) | | Structuring method, device, equipment and storage medium for medical data |
| CN107870913A (en) | | High expected weight itemset mining method with effective time, device and processing equipment |
| CN111261298A (en) | | Medical data quality pre-judging method and device, readable medium and electronic equipment |
| CN115729949A (en) | | Data verification method, device, server and medium |
| CN120723835B (en) | | Data conversion method, device, equipment and medium |
| EP3510507A1 (en) | | Patient healthcare record linking system |
| CN107273293A (en) | | Big data system performance testing method, device and electronic equipment |
| CN113488178A (en) | | Information generation method and device, storage medium and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |