CN119474232A

CN119474232A - A method for data source classification, tracing and recommendation for clinical research data

Info

Publication number: CN119474232A
Application number: CN202411513818.3A
Authority: CN
Inventors: 赖俊恺; 孙定华
Original assignee: Hangzhou Laimai Medical Information Technology Co ltd
Current assignee: Hangzhou Laimai Medical Information Technology Co ltd
Priority date: 2024-10-28
Filing date: 2024-10-28
Publication date: 2025-02-18

Abstract

The invention discloses a method for classifying, tracing and recommending data sources for clinical research data, and belongs to the field of medical data processing. The method comprises: acquiring the clinical research case report form questions; converting the form names and data fields of the electronic medical record report form into metadata text descriptions; classifying the metadata text descriptions into categories; generating an ETL data transmission program and code for transmitting source data; obtaining desensitized sample source data; searching and ranking source data categories for each research question; determining the tracing path through a manual review module; and generating a data interaction specification form.

Description

Data source classifying, tracing and recommending method for clinical research data

Technical Field

The invention belongs to the field of medical data processing, and particularly relates to a data source classifying, tracing and recommending method for clinical research data.

Background

In clinical research, data come from many different sources, such as the electronic medical record (EMR) system, the laboratory information system (LIS) and the medication management system. These data are usually stored in different systems and in different formats, which makes managing, analyzing and tracing them extremely complex.

Moreover, under the relevant regulations, researchers conducting clinical studies may only use patient clinical data within the necessary scope, in order to reduce the risk of data leakage and exposure of sensitive information. Because patient clinical data are generally stored in internal hospital systems, researchers cannot effectively determine which specific data a hospital can provide for a study.

Disclosure of Invention

The invention aims to solve the technical problem of providing a data source classifying, tracing and recommending method for clinical research data, which can effectively protect sensitive patient data used in clinical research and improve the efficiency of data transmission and processing.

The invention discloses a data source classifying, tracing and recommending method for clinical research data, which comprises the following steps:

S1, acquiring the clinical research case report form questions;

S2, recommending relevant hospital source data categories for answering the case report form questions, which specifically comprises:

S201, converting the table names and corresponding data fields of the electronic medical record report form acquired in S1 into metadata text descriptions;

S202, classifying the metadata text descriptions using semantic embedding technology and HNSW vector library retrieval technology, wherein the metadata are embedded into semantic vector representations using the embedding technology and a database is constructed based on the HNSW algorithm;

S203, constructing a data world model using the embedding technology and the HNSW algorithm, wherein the data world model processes unstructured metadata and structured metadata separately;

S3, generating an ETL data transmission program and code for transmitting source data, which specifically comprises:

S301, preprocessing the unstructured metadata and/or structured metadata by applying the algorithm of S2, while retaining the corresponding data sources;

S302, generating a data mapping document, wherein the data mapping document contains the example SQL code required for each type of metadata and a description of how the metadata are used in the case report form;

S4, obtaining anonymized sample source data;

S5, searching and ranking source data categories for each research question;

S6, manually determining the tracing path;

S7, generating a data interaction specification form.

As a further improvement of the present invention, in step S203 the unstructured metadata are structured into a schema comprising patient ID, visit ID, record time, source data name, text content, additional field name and additional field result; other content stored in the hospital data system is retrieved, linked and placed under the additional field result, and metadata corresponding to that additional content are added under the additional field name to describe the result.

As a further improvement of the present invention, in step S203 the structured metadata are structured into a schema comprising patient ID, visit ID, record time, source data name, specified data field name, specified field result, additional field name and additional field result; for structured data fields, common data fields are used as the specified fields, while the additional fields cover other data stored in the hospital data system.

As a further improvement of the present invention, in step S5 each research question in the case report form is traversed; using the technology of S202, the N text fragments in the sample library that are most relevant to the research question are matched together with their corresponding data sources; the occurrence frequency of each data source is counted and the data sources are sorted in descending order of frequency; and the three highest-ranked data sources are retained as the recommended data sources for the research question;

the research questions are the research fields in the case report form.

As a further improvement of the present invention, N is a custom value.

As a further improvement of the present invention, N is 10.

As a further improvement of the present invention, in step S6 a manual review module is provided for reviewing and selecting the data sources related to each research question in the case report form.

As a further improvement of the present invention, in step S7, the result form in S302 is updated using the data source after the manual review in S6 as the final data interaction specification form.

As a further improvement of the present invention, in step S4, the data mapping document in S302 is subjected to a desensitization operation, and anonymized sample source data after the desensitization is obtained.

As a further improvement of the present invention, each item of metadata has a corresponding data field structure.

Compared with the prior art, the invention has the following advantages. The method recommends the data source categories that match the specific data requirements of a clinical study (for example, laboratory results from an LIS system) and generates a data transmission requirement table, which improves both the auditability and the efficiency of data transmission. The transmitted data are de-identified patient sample data, which further protects patient privacy. Metadata data sources can be classified and ranked to find the data sources related to a research question and to bind them to that question. The relevance of the source data to the research question can be measured, and summary reports of the data collection scope and the corresponding relevance can be generated and provided to the relevant parties, so that the research requirements are met without exposing sensitive patient data.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples, which are illustrative of the present invention and are not intended to limit the present invention thereto.

Referring to FIG. 1, a method for classifying, tracing and recommending data sources for clinical research data includes the following steps:

S1, acquiring the clinical research case report form questions;

In step S1, the case report form is an electronic medical record report form, preferably in a spreadsheet format such as an Excel workbook. The report form comprises several worksheets, each of which serves a specific data collection purpose, such as collecting drug use records or demographic information. Each worksheet contains data fields; for example, the drug use worksheet contains the drug name, frequency of use and time of use.
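The conversion of worksheet names and data fields into metadata text descriptions (step S201) could be sketched roughly as follows. This is only an illustration: the function name `describe_fields`, the wording of the generated descriptions and the sample worksheet contents are assumptions, since the text does not specify a description format.

```python
from typing import Dict, List


def describe_fields(worksheets: Dict[str, List[str]]) -> List[str]:
    """Turn each (worksheet name, data field) pair into one text description."""
    descriptions = []
    for sheet_name, fields in worksheets.items():
        for field_name in fields:
            # The sentence template is an assumed format, not taken from the source.
            descriptions.append(
                f"Worksheet '{sheet_name}', data field '{field_name}'"
            )
    return descriptions


# Hypothetical form content, modelled on the drug use worksheet described above.
crf = {
    "Drug use": ["drug name", "frequency of use", "time of use"],
    "Demographics": ["age", "sex"],
}
descriptions = describe_fields(crf)
```

Each resulting string can then be passed to the embedding step, so that a research field is searched by meaning rather than by an exact column name.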

S202, classifying metadata text descriptions by using semantic Embedding technology and HNSW vector library retrieval technology;

Metadata are embedded into semantic vector representations using the embedding technology, and a database is constructed based on the HNSW algorithm. Semantic similarity retrieval then allows data fields to be matched quickly with metadata and their data sources; that is, given a specific data field, the related metadata and their data sources can be found quickly.

Classifying the metadata textual description into different categories, such as patient basic information, drug orders, laboratory examinations, ward reports, past and present history, surgical history, vital signs, and the like;
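The matching and classification idea can be illustrated with a deliberately simplified sketch. Two substitutions are made here and should be read as assumptions: the `embed` function is a toy bag-of-words vector rather than a semantic embedding model, and `MetadataIndex` performs an exact brute-force cosine search in place of an HNSW approximate nearest-neighbour index. The category and data source labels are hypothetical examples.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MetadataIndex:
    """Exact nearest-neighbour search; stands in for an HNSW vector index."""

    def __init__(self) -> None:
        self.entries: List[Tuple[Counter, str, str]] = []

    def add(self, description: str, category: str, source: str) -> None:
        self.entries.append((embed(description), category, source))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str, str]]:
        q = embed(query)
        scored = [(cosine(q, vec), cat, src) for vec, cat, src in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


index = MetadataIndex()
index.add("laboratory examination item name result lower limit upper limit",
          "laboratory examinations", "LIS")
index.add("drug order drug name dose frequency", "drug orders", "medical orders")
index.add("body temperature pulse blood pressure", "vital signs", "nursing records")

# Classify a case report form data field by its nearest metadata description.
best = index.search("drug name and frequency of use", k=1)[0]
```

Here `best` carries the similarity score, the category and the data source of the closest description. Replacing the brute-force scan with an HNSW index changes the search cost, not the interface.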

Unstructured metadata are structured into a schema comprising patient ID, visit ID, record time, source data name, text content, additional field name and additional field result. Other content stored in the hospital data system is retrieved, linked and placed under the additional field result, and metadata corresponding to that additional content are added under the additional field name to describe the result.

Structured metadata are structured into a schema comprising patient ID, visit ID, record time, source data name, specified data field name, specified field result, additional field name and additional field result. For structured data fields, common data fields are used as the specified fields, while the additional fields cover other data stored in the hospital data system.

Source data name: for example, laboratory examination;

Specified data field names: for example, laboratory item, result, and lower and upper limits of the normal range;

Specified field results: for example, HbA1c, 7.0, 6.0, 7.2, corresponding respectively to the laboratory item, the result, and the lower and upper limits of the normal range in the specified data field names.

Examples of individual metadata items that may be stored in the vector database after this structuring:

The source data name in the unstructured metadata includes discharge records, ward records, and the like.

Additional field names in unstructured metadata include doctor name, change status, etc.

The specified data field names include laboratory exam names, exam results, exam result lower limits, exam result upper limits, etc.
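The two schemas described above can be written down as simple record types. The sketch below is illustrative only: the class names, attribute names and the `as_text` helper are assumptions, the identifiers and timestamp are placeholders, and the laboratory item is taken to be HbA1c.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnstructuredMetadata:
    patient_id: str
    visit_id: str
    record_time: str
    source_data_name: str          # e.g. discharge record, ward record
    text_content: str
    additional_field_names: List[str] = field(default_factory=list)
    additional_field_results: List[str] = field(default_factory=list)


@dataclass
class StructuredMetadata:
    patient_id: str
    visit_id: str
    record_time: str
    source_data_name: str          # e.g. laboratory examination
    specified_field_names: List[str]
    specified_field_results: List[str]
    additional_field_names: List[str] = field(default_factory=list)
    additional_field_results: List[str] = field(default_factory=list)

    def as_text(self) -> str:
        """Flatten the specified fields into one string suitable for embedding."""
        pairs = zip(self.specified_field_names, self.specified_field_results)
        body = "; ".join(f"{name}: {value}" for name, value in pairs)
        return f"{self.source_data_name} | {body}"


# Placeholder identifiers; field values follow the laboratory example in the text.
lab = StructuredMetadata(
    patient_id="P-0001",
    visit_id="V-0001",
    record_time="2024-01-01T08:00:00",
    source_data_name="laboratory examination",
    specified_field_names=["laboratory item", "result", "lower limit", "upper limit"],
    specified_field_results=["HbA1c", "7.0", "6.0", "7.2"],
)
lab_text = lab.as_text()
```

Keeping the specified and additional fields as parallel name/result lists mirrors the schema in the text, where each field name describes the result stored alongside it.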

S301, preprocessing the unstructured metadata and/or structured metadata by applying the algorithm of S2, while retaining the corresponding data sources; for example, drug order metadata are mapped to data sources such as medical orders, ward-round records and discharge records;

S302, generating a data mapping document, wherein the data mapping document contains an example SQL (Structured Query Language) code required by each type of metadata and a description of the use of the metadata in a case report form.
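One entry of such a data mapping document might look like the sketch below. The table name `lab_results`, its columns, the entry layout and the usage text are all hypothetical; the source only states that the document holds example SQL code and a usage description for each type of metadata. The generated query is run against an in-memory SQLite database with a made-up row simply to check that it is valid SQL.

```python
import sqlite3
from typing import Dict, List


def build_mapping_entry(category: str, table: str, columns: List[str],
                        crf_usage: str) -> Dict[str, str]:
    """Build one mapping entry: metadata category, example SQL, usage note."""
    # Table and column names are assumed to come from trusted configuration;
    # the patient ID itself is passed as a bound parameter.
    column_list = ", ".join(columns)
    sql = f"SELECT {column_list} FROM {table} WHERE patient_id = ?"
    return {"category": category, "example_sql": sql, "crf_usage": crf_usage}


entry = build_mapping_entry(
    category="laboratory examinations",
    table="lab_results",
    columns=["patient_id", "item_name", "result"],
    crf_usage="answers the case report form field 'HbA1c value'",
)

# Sanity check of the example SQL on an in-memory database with a made-up row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_results (patient_id TEXT, item_name TEXT, result TEXT)")
conn.execute("INSERT INTO lab_results VALUES ('P-0001', 'HbA1c', '7.0')")
rows = conn.execute(entry["example_sql"], ("P-0001",)).fetchall()
conn.close()
```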

S4, obtaining anonymized sample source data;

A desensitization operation is performed on the data mapping document from step S302 to obtain the desensitized, anonymized sample source data.
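The text does not state how the desensitization is carried out. The sketch below shows one common approach purely as an assumption: direct identifiers are dropped, patient and visit IDs are replaced by keyed hashes (HMAC-SHA-256), and timestamps are coarsened to year and month. The field names and the list of direct identifiers are hypothetical.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"patient_name", "phone", "address"}


def desensitize(record: Dict[str, str], secret: bytes) -> Dict[str, str]:
    """Return a copy of the record with identifying values removed or masked."""
    cleaned = {}
    for key, value in record.items():
        if key in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if key in ("patient_id", "visit_id"):
            # Keyed hash: the same input maps to the same pseudonym,
            # but the original ID cannot be read back from it.
            digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
            cleaned[key] = digest.hexdigest()[:16]
        elif key == "record_time":
            cleaned[key] = value[:7]  # keep "YYYY-MM" only
        else:
            cleaned[key] = value
    return cleaned


sample = desensitize(
    {
        "patient_id": "P-0001",
        "patient_name": "placeholder name",
        "record_time": "2024-01-01T08:00:00",
        "text_content": "history of hypertension",
    },
    secret=b"demo-secret",
)
```

Free-text content may itself contain identifying details, so a real deployment would also need text-level scrubbing, which this sketch does not attempt.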

S5, searching and ranking source data categories for each research question;

Each research question in the case report form is traversed. Using the technology of S202, the N text fragments in the sample library that are most relevant to the research question are matched together with their corresponding data sources; the occurrence frequency of each data source is then counted, and the data sources are sorted in descending order of frequency.

The value of N can be customized; in the present invention it is preferably set to 10.

Preferably, the research questions are the research fields in the case report form, such as whether the patient has diabetes, hypertension or coronary heart disease.
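The frequency-based ranking in step S5 can be sketched as follows. The function name `recommend_sources` and the sample match list are illustrative assumptions; the list stands for the data sources attached to the N = 10 text fragments returned by the similarity search for one research question.

```python
from collections import Counter
from typing import List, Tuple


def recommend_sources(matched_sources: List[str], keep: int = 3) -> List[Tuple[str, int]]:
    """Count how often each data source occurs and keep the most frequent ones."""
    counts = Counter(matched_sources)
    # most_common sorts by count in descending order.
    return counts.most_common(keep)


# Hypothetical data sources of the 10 best-matching text fragments for the
# research question "Does the patient have coronary heart disease?".
matches = [
    "discharge record", "ward-round record", "discharge record", "medical order",
    "discharge record", "ward-round record", "laboratory examination",
    "medical order", "discharge record", "ward-round record",
]
recommended = recommend_sources(matches)
```

With this sample, the discharge record occurs four times, the ward-round record three times and the medical order twice, so these three are kept and the laboratory examination, which occurs once, is dropped.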

S6, manually determining a tracing path;

A manual review module is provided for reviewing and selecting the data sources related to each research question in the case report form.

Specifically, the manual review module is added on top of the recommended data sources obtained for each field in S5. It is designed so that a reviewer checks the data sources recommended by the system and selects the data source most relevant to each research question in the case report form. This avoids data source tracing errors caused by system errors and strikes a better balance between the efficiency and the accuracy of the system.

S7, generating a data interaction specification form;

The result form from step S302 is updated with the data sources confirmed by the manual review in step S6, and serves as the final data interaction specification form.

Updates in each system can be carried out on a fixed schedule, or manually when ad hoc needs arise.
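How the reviewed choices might be merged into the final form is sketched below. The function `finalize_form`, the row layout and the `was_recommended` flag are assumptions made for illustration; the source only says that the form is updated with the manually reviewed data sources.

```python
from typing import Dict, List


def finalize_form(recommended: Dict[str, List[str]],
                  reviewed: Dict[str, str]) -> List[dict]:
    """Build the final form rows from system recommendations and reviewer choices."""
    form = []
    for question, candidates in recommended.items():
        chosen = reviewed.get(question)
        if chosen is None:
            # Every research question must pass manual review before release.
            raise ValueError(f"no reviewed data source for: {question}")
        form.append({
            "research_question": question,
            "data_source": chosen,
            # Records whether the reviewer kept one of the system's suggestions.
            "was_recommended": chosen in candidates,
        })
    return form


question = "Does the patient have coronary heart disease?"
system_output = {question: ["discharge record", "ward-round record", "medical order"]}
reviewer_choice = {question: "discharge record"}
final_form = finalize_form(system_output, reviewer_choice)
```

Raising an error for unreviewed questions keeps the manual step mandatory, which matches its stated purpose of catching tracing errors made by the system.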

By adopting the method, the following operations can be performed to realize the acquisition of data:

1. Define the data structure, that is, determine the formats required for the source data and the research fields during data transmission;

2. Construct a data model using historical desensitized data;

3. For a specific research field in the clinical research case report form (such as whether the patient has coronary heart disease), match the Top K most likely data sources;

4. Review manually and retain only the most relevant data sources and their metadata;

5. Generate a data interaction form for the specific research field (such as whether the patient has coronary heart disease).

The above describes only preferred embodiments of the present application, and the scope of protection of the present application is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical solution and inventive concept of the present application, within the scope disclosed herein, shall fall within the scope of protection of the present application.

Claims

1. A method for data source classification, tracing and recommendation of clinical research data, characterized in that it comprises the following steps:

S1: Acquire the clinical research case report form questions;

S2: Recommend relevant hospital source data categories for answering the case report form questions, specifically including:

S201: converting the table name and corresponding data fields of the electronic medical record report form obtained in S1 into metadata text description;

S202: Classify the metadata text descriptions using semantic embedding technology and HNSW vector library retrieval technology; embed the metadata into semantic vector representations using the embedding technology, and construct a database based on the HNSW algorithm; use semantic similarity retrieval to rapidly match data fields with metadata and their data sources;

S203: Construct a data world model using the embedding technology and the HNSW algorithm, the data world model being used to process unstructured metadata and structured metadata separately;

S3: Generate ETL data transmission program and code for transmitting source data; specifically, it includes:

S301: applying the algorithm of S2 to pre-process the unstructured metadata and/or structured metadata while retaining the corresponding data source;

S302: Generate a data mapping document, which includes sample SQL codes required for each type of metadata and a description of the use of metadata in the case report form;

S4: Obtain anonymized sample source data;

S5: Search and sort source data categories for each research question;

S6: Manually determine the traceability path;

S7: Generate a data interaction specification form.

2. A method for data source classification, tracing and recommendation of clinical research data according to claim 1, characterized in that: in step S203, the unstructured metadata is structured into an architecture with patient ID, visit ID, record time, source data name, text content, additional field name and additional field result; other content stored in the hospital data system is retrieved, linked and placed under the additional field result, and metadata corresponding to the additional content is added to the additional field name to describe the result.

3. A method for data source classification, tracing and recommendation of clinical research data according to claim 1, characterized in that: in step S203, the structured metadata is structured into the following architecture: patient ID, visit ID, record time, source data name, specified data field name, specified field result, additional field name, additional field result; for structured data fields, common data fields are used as specified fields, and additional fields cover other data stored in the hospital data system.

4. A method for data source classification, tracing and recommendation of clinical research data according to claim 1, characterized in that: in step S5, each research question in the case report form is traversed, and the N text segments in the sample library that are most relevant to the research question and the data sources corresponding to the text segments are matched using the technology of S202; then the frequency of occurrence of each data source is counted and the data sources are sorted in descending order according to the frequency, and the three highest-ranked data sources are retained as the recommended data sources for the research question;

The research question is the research field in the case report form.

5. A method for data source classification, tracing and recommendation for clinical research data according to claim 4, characterized in that: N is a custom value.

6. A method for data source classification, tracing and recommendation for clinical research data according to claim 4, characterized in that the value of N is 10.

7. A method for classifying, tracing and recommending data sources for clinical research data according to claim 1, characterized in that: in step S6, a manual review module is provided, and the manual review module is used to review and select the data source related to each research question in the case report form.

8. A method for data source classification, tracing and recommendation of clinical research data according to claim 1, characterized in that: in step S7, the result form in S302 is updated using the data source after manual review in S6 as the final data interaction specification form.

9. A method for data source classification, tracing and recommendation of clinical research data according to claim 1, characterized in that: in step S4, the data mapping document in S302 is desensitized to obtain anonymized sample source data after desensitization.

10. A method for data source classification, tracing and recommendation of clinical research data according to claim 1, characterized in that each metadata has a corresponding data field structure.