CN119474232A - A method for data source classification, tracing and recommendation for clinical research data - Google Patents
- Publication number: CN119474232A
- Application number: CN202411513818.3A
- Authority: CN (China)
- Prior art keywords: data, metadata, source, tracing, clinical research
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/285: Clustering or classification
- G06F16/2457: Query processing with adaptation to user needs
- G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
- G06F16/258: Data format conversion from or to a database
- G06F16/38: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F21/6254: Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
- G06F40/30: Semantic analysis
- G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
- Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Bioethics (AREA)
- Computational Linguistics (AREA)
- Pathology (AREA)
- Software Systems (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Primary Health Care (AREA)
- Epidemiology (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for classifying, tracing and recommending data sources for clinical research data, and belongs to the field of medical data processing. The method includes the steps of: acquiring the questions of a clinical research case report form; converting the table names and data fields of the electronic medical record report form into metadata text descriptions; classifying the metadata text descriptions into categories; generating an ETL data transmission program and code for transmitting source data; obtaining desensitized sample source data; retrieving and ranking source data categories for each research question; determining the tracing path through a manual review module; and generating a data interaction specification form.
Description
Technical Field
The invention belongs to the field of medical data processing, and particularly relates to a data source classifying, tracing and recommending method for clinical research data.
Background
In clinical research, data comes from many different sources, such as the electronic medical record (EMR) system, the laboratory information system (LIS) and the medication management system. These data are usually stored in different systems and in different formats, which makes managing, analyzing and tracing them extremely complex.
Moreover, under the relevant regulations, researchers conducting clinical studies may use patient clinical data only within the necessary scope, in order to reduce the risk of data leakage and exposure of sensitive information. Because patient clinical data is generally stored in internal hospital systems, researchers cannot effectively determine which specific data a hospital can provide for a given research project.
Disclosure of Invention
The invention aims to solve the technical problem of providing a data source classifying, tracing and recommending method for clinical research data, which can effectively protect sensitive patient data used in clinical research and improve the efficiency of data transmission and processing.
The invention discloses a data source classifying, tracing and recommending method for clinical research data, which comprises the following steps:
S1, acquiring the questions of a clinical research case report form;
S2, recommending the relevant hospital source data categories for answering the case report form questions, which specifically comprises:
S201, converting the table names and corresponding data fields of the electronic medical record report form acquired in S1 into metadata text descriptions;
S202, classifying the metadata text descriptions using semantic embedding and HNSW vector library retrieval, wherein the metadata is embedded into semantic vector representations and the vector database is built on the HNSW algorithm;
S203, constructing a data world model using embedding and the HNSW algorithm, wherein the data world model processes unstructured metadata and structured metadata separately;
S3, generating an ETL data transmission program and code for transmitting source data, which specifically comprises:
S301, preprocessing the unstructured metadata and/or structured metadata by applying the algorithm of S2, and retaining the corresponding data sources;
S302, generating a data mapping document, wherein the data mapping document contains the example SQL code required for each type of metadata and a description of how that metadata is used in the case report form;
S4, obtaining anonymized sample source data;
S5, retrieving and ranking source data categories for each research question;
S6, manually determining the tracing path;
S7, generating a data interaction specification form.
As a further improvement of the present invention, in step S203 the unstructured metadata is structured into a schema with patient ID, visit ID, record time, source data name, text content, additional field name and additional field result; other content stored in the hospital data system is retrieved, linked and placed under the additional field result, and the metadata describing that content is added under the additional field name.
As a further improvement of the present invention, in step S203 the structured metadata is structured into a schema with patient ID, visit ID, record time, source data name, specified data field name, specified field result, additional field name and additional field result; for structured data, commonly used data fields serve as the specified fields, while the additional fields cover other data stored in the hospital data system.
As a further improvement of the present invention, in step S5 each research question in the case report form is traversed; using the technique of S202, the N text fragments in the sample library most relevant to the research question are matched together with their corresponding data sources; the frequency of occurrence of each data source is counted and the data sources are sorted in descending order of frequency; and the top three data sources are retained as the recommended data sources for the research question;
the research questions are the research fields in the case report form.
As a further improvement of the present invention, N is a user-defined value.
As a further improvement of the present invention, the value of N is 10.
As a further improvement of the present invention, in step S6 a manual review module is provided for reviewing and selecting the data sources related to each research question in the case report form.
As a further improvement of the present invention, in step S7 the result form from S302 is updated with the data sources confirmed by the manual review in S6 and serves as the final data interaction specification form.
As a further improvement of the present invention, in step S4 a desensitization operation is performed on the data mapping document from S302 to obtain the anonymized sample source data.
As a further improvement of the present invention, each piece of metadata has a corresponding data field structure.
Compared with the prior art, the method has the following advantages. It can recommend the data source categories that match the specific data requirements of a clinical study (such as laboratory results from an LIS system) and generate a data transmission requirement table, which improves the auditability and efficiency of data transmission. The transmitted data are de-identified patient sample data, which further protects patient privacy. The metadata data sources can be classified and ranked to find the data sources related to a research question and bind them to that question. The relevance between source data and research questions can be measured, and summary reports of the data collection scope and the corresponding relevance can be generated and provided to the relevant parties, so that research needs are met without exposing sensitive patient data.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples, which are illustrative of the present invention and are not intended to limit the present invention thereto.
Referring to fig. 1, a method for classifying, tracing and recommending data sources of clinical study data includes the following steps:
S1, acquiring the questions of a clinical research case report form;
In step S1, the case report form is an electronic medical record report form, preferably in a spreadsheet format such as an Excel workbook. The report form comprises a plurality of worksheets, each of which serves a specific data collection purpose, such as collecting medication use or demographic information. Each worksheet contains data fields; for example, the medication-use worksheet contains the medication name, frequency of use and time of use.
S2, recommending the relevant hospital source data categories for answering the case report form questions, which specifically comprises:
S201, converting the table names and corresponding data fields of the electronic medical record report form acquired in S1 into metadata text descriptions;
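The source does not give the wording of the metadata text description. As a minimal sketch of step S201, assuming each worksheet is available as a name plus a list of field names, one sentence per worksheet could be generated as follows. The function names, the sentence template and the "Demographics" fields are illustrative assumptions, not part of the patent.

```python
from typing import Dict, List


def describe_worksheet(sheet_name: str, fields: List[str]) -> str:
    """Render one worksheet and its data fields as a plain-text metadata description."""
    field_list = ", ".join(fields)
    return f"Worksheet '{sheet_name}' collects the following data fields: {field_list}."


def describe_form(form: Dict[str, List[str]]) -> List[str]:
    """Produce one description per worksheet, in the order the worksheets appear."""
    return [describe_worksheet(name, fields) for name, fields in form.items()]


# Hypothetical case report form with two worksheets, mirroring the example in S1.
case_report_form = {
    "Medication use": ["medication name", "frequency of use", "time of use"],
    "Demographics": ["age", "sex"],  # assumed field names, for illustration only
}

descriptions = describe_form(case_report_form)
for text in descriptions:
    print(text)
```

Keeping the worksheet name inside the sentence means the text that is later embedded carries both the collection purpose and the field names.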
S202, classifying the metadata text descriptions using semantic embedding and HNSW vector library retrieval;
The metadata is embedded into semantic vector representations, and the vector database is built on the HNSW algorithm. Semantic similarity retrieval then allows data fields to be matched quickly against metadata and its data sources; that is, given a specific data field, the related metadata and its data sources can be found quickly;
The metadata text descriptions are classified into different categories, such as patient basic information, drug orders, laboratory examinations, ward reports, past and present medical history, surgical history and vital signs;
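The matching idea in S202 can be illustrated with a small stand-in. The sketch below deliberately swaps both techniques for simpler ones so that it runs with the standard library alone: a bag-of-words count vector replaces the semantic embedding model, and an exhaustive cosine-similarity scan replaces the HNSW index. The category descriptions are invented for illustration; only the category names come from the text.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(description: str, categories: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank categories by similarity to the description (exhaustive scan, not HNSW)."""
    query = embed(description)
    scored = [(name, cosine(query, embed(text))) for name, text in categories.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical category descriptions used as the searchable "library".
categories = {
    "drug orders": "medication name dose frequency route order time",
    "laboratory examinations": "laboratory item result lower limit upper limit",
    "vital signs": "temperature pulse blood pressure respiration",
}

ranking = classify("medication name, frequency of use, time of use", categories)
print(ranking[0][0])
```

In a real deployment the exhaustive scan would be replaced by an approximate nearest-neighbour index, which is what makes retrieval fast once the library holds many metadata records.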
S203, constructing a data world model using embedding and the HNSW algorithm, wherein the data world model processes unstructured metadata and structured metadata separately;
Unstructured metadata is structured into a schema with patient ID, visit ID, record time, source data name, text content, additional field name and additional field result. Other content stored in the hospital data system is retrieved, linked and placed under the additional field result, and the metadata describing that content is added under the additional field name.
Structured metadata is structured into a schema with patient ID, visit ID, record time, source data name, specified data field name, specified field result, additional field name and additional field result. For structured data, commonly used data fields serve as the specified fields, while the additional fields cover other data stored in the hospital data system.
Source data name: for example, laboratory examination;
Specified data field names: for example, laboratory item, result, and the lower and upper limits of the normal range;
Specified field results: for example, HbA1c, 7.0, 6.0, 7.2, which correspond respectively to the laboratory item, the result, and the lower and upper limits of the normal range in the specified data field names.
An example of a single metadata record that may be stored in the vector database after structuring:
Source data names in unstructured metadata include discharge records, ward records and the like.
Additional field names in unstructured metadata include doctor name, change status and the like.
Specified data field names include laboratory examination name, examination result, lower limit of the examination result, upper limit of the examination result and the like.
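The two schemas described in S203 can be written down as record types. The following is a sketch only: the class names, attribute names, identifier values and timestamps are assumptions made for illustration, while the field list and the HbA1c values follow the text.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class StructuredRecord:
    """Schema for structured metadata as listed in S203."""
    patient_id: str
    visit_id: str
    record_time: str
    source_data_name: str
    specified_fields: Dict[str, str]  # specified data field name -> specified field result
    additional_fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class UnstructuredRecord:
    """Schema for unstructured metadata as listed in S203."""
    patient_id: str
    visit_id: str
    record_time: str
    source_data_name: str
    text_content: str
    additional_fields: Dict[str, str] = field(default_factory=dict)


# Placeholder identifiers and timestamps; the laboratory values follow the example above.
lab_record = StructuredRecord(
    patient_id="P0001",
    visit_id="V0001",
    record_time="2024-01-01T08:00:00",
    source_data_name="laboratory examination",
    specified_fields={
        "laboratory item": "HbA1c",
        "result": "7.0",
        "lower limit": "6.0",
        "upper limit": "7.2",
    },
)

note = UnstructuredRecord(
    patient_id="P0001",
    visit_id="V0001",
    record_time="2024-01-03T10:30:00",
    source_data_name="discharge record",
    text_content="(free-text discharge summary)",
    additional_fields={"doctor name": "(redacted)"},
)

print(asdict(lab_record))
print(asdict(note))
```

Keeping the additional fields as an open dictionary reflects the description that they cover whatever other data the hospital system stores, while the specified fields stay fixed per source data type.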
S3, generating an ETL data transmission program and code for transmitting source data, which specifically comprises:
S301, preprocessing the unstructured metadata and/or structured metadata by applying the algorithm of S2, and retaining the corresponding data sources; for example, medication order metadata is mapped to data sources such as orders, ward-round records and discharge records;
S302, generating a data mapping document, wherein the data mapping document contains the example SQL (Structured Query Language) code required for each type of metadata and a description of how that metadata is used in the case report form.
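A data mapping document of the kind described in S302 could be assembled as one row per metadata category. This sketch assumes a simple dictionary input; the table name, column names, SQL shape and usage text are hypothetical, and only the data source names are taken from S301.

```python
from typing import Dict, List


def example_sql(table: str, columns: List[str]) -> str:
    """Build an illustrative SELECT statement for one metadata category."""
    return f"SELECT {', '.join(columns)} FROM {table} WHERE visit_id = :visit_id;"


def build_mapping_document(mapping: Dict[str, Dict]) -> List[Dict[str, str]]:
    """Return one row per metadata category: sources, example SQL and usage note."""
    rows = []
    for category, spec in mapping.items():
        rows.append({
            "metadata_category": category,
            "data_sources": ", ".join(spec["sources"]),
            "example_sql": example_sql(spec["table"], spec["columns"]),
            "crf_usage": spec["usage"],
        })
    return rows


mapping = {
    "drug orders": {
        "sources": ["orders", "ward-round records", "discharge records"],
        "table": "medication_orders",  # hypothetical table name
        "columns": ["patient_id", "visit_id", "drug_name", "frequency"],  # hypothetical columns
        "usage": "Fills the medication-use worksheet of the case report form.",
    },
}

document = build_mapping_document(mapping)
for row in document:
    print(row["metadata_category"], "|", row["example_sql"])
```

The named placeholder `:visit_id` keeps the example SQL parameterized, so the document can be shared without embedding any real identifier.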
S4, obtaining anonymized sample source data;
A desensitization operation is performed on the data mapping document from step S302 to obtain the anonymized sample source data.
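The source does not say how the desensitization is performed. One possible sketch, under the assumption that identifiers are replaced by keyed-hash pseudonyms and direct identifiers are dropped, is shown below. The field names, the key and the choice of HMAC-SHA-256 are all illustrative assumptions.

```python
import hashlib
import hmac
from typing import Dict, Tuple


def pseudonym(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash so records stay linkable but not readable."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def desensitize(
    record: Dict[str, str],
    key: bytes,
    id_fields: Tuple[str, ...] = ("patient_id", "visit_id"),
    drop_fields: Tuple[str, ...] = ("patient_name",),
) -> Dict[str, str]:
    """Drop direct identifiers and pseudonymize ID fields; other fields pass through."""
    cleaned = {}
    for name, value in record.items():
        if name in drop_fields:
            continue
        cleaned[name] = pseudonym(value, key) if name in id_fields else value
    return cleaned


secret_key = b"demo-key-not-for-production"  # placeholder key for the example
raw = {
    "patient_id": "P0001",
    "visit_id": "V0001",
    "patient_name": "(name)",
    "laboratory item": "HbA1c",
    "result": "7.0",
}

sample = desensitize(raw, secret_key)
print(sample)
```

Because the same key always yields the same pseudonym, records belonging to one patient remain linkable within the sample library without revealing the original identifier.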
S5, retrieving and ranking source data categories for each research question;
Each research question in the case report form is traversed. Using the technique of S202, the N text fragments in the sample library most relevant to the research question are matched together with their corresponding data sources; the frequency of occurrence of each data source is counted, and the data sources are sorted in descending order of frequency;
The value of N can be customized; preferably, N is set to 10.
Preferably, the research questions are the research fields in the case report form, such as whether the patient has diabetes, hypertension, coronary heart disease and the like.
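The counting and ranking part of S5 is straightforward to sketch. The example below assumes the retrieval step has already returned the data source of each of the N = 10 best-matching fragments; the list contents are invented for illustration, and keeping the top three follows the summary of the invention.

```python
from collections import Counter
from typing import List, Tuple


def recommend_sources(matched_sources: List[str], keep: int = 3) -> List[Tuple[str, int]]:
    """Count how often each data source occurs among the matched fragments
    and return the `keep` most frequent ones in descending order of frequency."""
    return Counter(matched_sources).most_common(keep)


# Hypothetical data sources of the N = 10 fragments matched for one research question.
matched = (
    ["discharge records"] * 4
    + ["ward-round records"] * 3
    + ["orders"] * 2
    + ["laboratory examinations"] * 1
)

recommended = recommend_sources(matched)
print(recommended)
```

`Counter.most_common` already sorts by descending count, so no separate sorting step is needed.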
S6, manually determining a tracing path;
A manual review module is provided for reviewing and selecting the data sources related to each research question in the case report form;
Specifically, the manual review module is added on top of the recommended data sources obtained for each field in S5. It is designed so that a reviewer checks the data sources recommended by the system and selects the data source most relevant to each research question in the case report form. This avoids data source tracing errors caused by system errors and strikes a better balance between the efficiency and the accuracy of the system.
S7, generating a data interaction specification form;
The result form from step S302 is updated with the data sources confirmed by the manual review in step S6 and serves as the final data interaction specification form.
Updates in each system can be performed on a fixed schedule, or manually when ad hoc needs arise.
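Steps S6 and S7 can be pictured as merging reviewer decisions into the recommendations. This is a sketch under assumptions: the row layout, the status labels and the two example questions are hypothetical, and the source does not specify how unreviewed questions are handled.

```python
from typing import Dict, List


def finalize_form(
    recommended: Dict[str, List[str]],
    reviewed: Dict[str, str],
) -> List[Dict[str, str]]:
    """Merge reviewer decisions into one row per research question."""
    rows = []
    for question, candidates in recommended.items():
        confirmed = reviewed.get(question, "")
        rows.append({
            "research_question": question,
            "recommended_sources": "; ".join(candidates),
            "confirmed_source": confirmed,
            # Questions without a reviewer decision stay open rather than
            # silently falling back to the top recommendation.
            "status": "confirmed" if confirmed else "pending review",
        })
    return rows


recommended = {
    "Does the patient have diabetes?": ["laboratory examinations", "discharge records", "orders"],
    "Does the patient have hypertension?": ["vital signs", "discharge records", "ward-round records"],
}
reviewed = {"Does the patient have diabetes?": "laboratory examinations"}

form = finalize_form(recommended, reviewed)
for row in form:
    print(row["research_question"], "->", row["status"])
```

Leaving unreviewed rows marked as pending keeps the manual check as the gate for the final form, in line with the role the text gives the review module.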
With this method, data can be acquired through the following operations:
1. Define the data structure, that is, determine the formats required for the source data and the research fields during data transmission;
2. Construct a data model using historical desensitized data;
3. For a specific research field in the clinical research case report form (such as whether the patient has coronary heart disease), match the Top K most likely data sources;
4. Perform manual review and retain only the most relevant data sources and their metadata;
5. Generate a data interaction form for the specific research field (such as whether the patient has coronary heart disease).
The above describes only the preferred embodiments of the present application, and the scope of protection of the present application is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical solution and inventive concept of the present application shall fall within its scope of protection.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411513818.3A CN119474232A (en) | 2024-10-28 | 2024-10-28 | A method for data source classification, tracing and recommendation for clinical research data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411513818.3A CN119474232A (en) | 2024-10-28 | 2024-10-28 | A method for data source classification, tracing and recommendation for clinical research data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119474232A true CN119474232A (en) | 2025-02-18 |
Family
ID=94588259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411513818.3A Pending CN119474232A (en) | 2024-10-28 | 2024-10-28 | A method for data source classification, tracing and recommendation for clinical research data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119474232A (en) |
-
2024
- 2024-10-28 CN CN202411513818.3A patent/CN119474232A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||