
CN120544762A - Intelligent inclusion and exclusion method, device, medium and product based on clinical research - Google Patents

Intelligent inclusion and exclusion method, device, medium and product based on clinical research

Info

Publication number
CN120544762A
CN120544762A CN202511023709.8A
Authority
CN
China
Prior art keywords
research
study
data
elements
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511023709.8A
Other languages
Chinese (zh)
Inventor
张维拓
殷自强
朱浩铭
郭晓雨
沈姝怡
周念
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hejin Information Technology Co ltd
Original Assignee
Shanghai Hejin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hejin Information Technology Co ltd filed Critical Shanghai Hejin Information Technology Co ltd
Priority to CN202511023709.8A priority Critical patent/CN120544762A/en
Publication of CN120544762A publication Critical patent/CN120544762A/en
Pending legal-status Critical Current

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract


The embodiments of the present application relate to the field of computer technology and disclose an intelligent inclusion and exclusion method, device, medium, and product based on clinical research. The method is implemented with a large model and includes: determining study design elements according to the received study protocol; determining inference rules according to the study design elements; generating discriminant code for each study design element according to the inference rules; and determining, according to the discriminant code, the list of patients who meet the study protocol. It can at least be used to solve the technical problems in the related art of low automation, the need for manual intervention in links such as data mapping and code debugging, the inability to achieve true end-to-end intelligence, and the lack of generalized adaptability to different study types, multi-source heterogeneous data, and medical knowledge updates.

Description

Intelligent inclusion and exclusion method, device, medium and product based on clinical research
Technical Field
The application relates to the technical field of computers, and in particular to an intelligent inclusion and exclusion method, device, medium, and product based on clinical research.
Background
In clinical research, whether subjects who conform to the study protocol can be accurately screened out directly determines the quality and efficiency of the study. In traditional methods, the task of subject screening requires close collaboration between clinical researchers and technicians. The clinical researchers, who have a medical background, are responsible for defining the retrieval logic for the required patient data based on medical requirements, while the technicians, who have technical expertise, are responsible for the technical links of storing, retrieving, and analyzing the medical data; specifically, this includes writing SQL code according to the retrieval logic provided by the clinical researchers and acquiring the required patient data from medical databases such as electronic medical records and medical insurance data. This data extraction process, which relies on manual cooperation, is often inefficient and error-prone due to difficult cross-disciplinary communication, difficult adaptation to multi-source heterogeneous data, the limitations of manually designed rules, and similar reasons, and it struggles to meet the growing demand of digital clinical research for efficient and accurate screening.
In order to address these challenges, some solutions have begun to attempt to introduce a rule engine or standardized template solution to simplify the screening process, improve efficiency and reduce human error.
However, the inventors have found that these technical solutions have at least the following problems: the degree of automation is low, manual intervention is still needed in links such as data mapping and code debugging, true end-to-end intelligence cannot be realized, and generalized adaptability to different study types, multi-source heterogeneous data, and medical knowledge updates is lacking.
Disclosure of Invention
The application aims to provide an intelligent inclusion and exclusion method, device, medium, and product based on clinical research, which are at least used to solve the technical problems that the degree of automation is low, manual intervention is still needed in links such as data mapping and code debugging, true end-to-end intelligence cannot be realized, and generalized adaptability to different study types, multi-source heterogeneous data, and medical knowledge updates is lacking.
To achieve the above object, some embodiments of the present application provide the following aspects:
in a first aspect, some embodiments of the present application further provide an intelligent ranking method based on clinical studies, where the method is implemented based on a large model, and the method includes determining study design elements according to a received study plan, determining inference rules according to the study design elements, generating a discriminant code for each study design element according to the inference rules, and determining a patient list according to the study plan according to the discriminant code.
In a second aspect, some embodiments of the application also provide an electronic device comprising one or more processors and a memory storing computer program instructions that, when executed, cause the processors to perform the steps of the method as described above.
In a third aspect, some embodiments of the application also provide a computer readable medium having stored thereon computer program instructions executable by a processor to implement a method as described above.
In a fourth aspect, some embodiments of the application also provide a computer program product comprising a computer program/instruction which, when executed by a processor, implements the steps of the method as described above.
Compared with the related art, the embodiments of the application provide an intelligent, automatic, accurate, and efficient inclusion and exclusion solution. By dynamically deducing study design elements, inference rules, and discriminant code in an end-to-end flow based on the study protocol, it can fundamentally solve the problems that the traditional technology cannot realize true intelligence and lacks generalized adaptation. Specifically, the method can automatically identify the required study design elements according to the received study protocol; this process avoids the rigid type adaptation caused by the traditional reliance on fixed rules or manual experience, and therefore naturally adapts to different study types. Furthermore, the dynamic generation mechanism of the discriminant code enables the system to automatically adjust the study design elements and the inference rules as medical knowledge is updated, avoiding the knowledge-solidification problem of traditional static rule bases. The full-link automatic deduction and dynamic updating mechanism, from study protocol input to patient list output, realizes an end-to-end intelligent flow without manual intervention, and can comprehensively improve the generalization capability of the system in different scenarios through self-adaptive analysis of study types, unified mapping of multi-source data, and dynamic iteration of knowledge.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
FIG. 1 is an exemplary flow chart of an intelligent inclusion and exclusion method based on clinical research provided in some embodiments of the present application;
FIG. 2 is an exemplary schematic diagram of an intelligent inclusion and exclusion method based on clinical research according to some embodiments of the present application;
Fig. 3 is an exemplary block diagram of an electronic device according to some embodiments of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The following terms are used herein.
1. Large model: an artificial intelligence model with a large number of parameters (millions to trillions) that can handle complex natural language understanding and generation tasks. In the present application, a large model is used to extract key study design elements from the study protocol text and to assist in generating inference rules.
2. Clinical study: a scientific research method for assessing the safety and effectiveness of medical interventions, drugs, or treatments, including randomized controlled trials, simulated randomized controlled trials, cohort studies, cross-sectional studies, and the like.
3. Data dictionary: metadata recording the structural information of a database. In the application, the data dictionary at least comprises all table names in the database, the variable names of the variables in each table, the variable labels explaining the meaning of each variable, the data types of the variables, and the corresponding coding rules.
4. Study design element: the smallest unit in a study design on which data mapping can be performed. Elements fall into categories such as inclusion criteria, exclusion criteria, intervention measures, control methods, clinical outcomes, and covariates (confounding factors). Taking as an example a study protocol containing "inclusion criteria: (1) older than 18 years; (2) having type II diabetes; (3) not receiving hypoglycemic therapy", each of the inclusion criteria may be considered one or more study design elements. Generally, a clinical study protocol can be broken down through structuring into more than ten or even dozens of study design elements.
5. Inference rules: in the present application these refer to element-variable inference rules, presented as natural language text or pseudocode, and mainly used to describe how to infer study design elements from variables (variable labels) in a database. A study design element may have several possible inference rules. For example, when the database contains variables such as "disease diagnosis" and "fasting blood glucose level", the corresponding inference rule may be "the disease diagnosis contains the keyword <diabetes>" or "the fasting blood glucose level is greater than 7 mmol/L".
6. Discriminant code: a section of code that can actually be run, written in SQL or another language capable of operating on a database. Its function is to judge, according to a specified inference rule, whether the information of a specific patient in the database meets the requirement of a specified study design element.
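As an illustration of the relationship between an inference rule and its discriminant code, the sketch below runs the "fasting blood glucose level is greater than 7 mmol/L" rule from the terminology above as SQL against an in-memory SQLite database. The table name, column names, and sample rows are assumptions for demonstration, not taken from the patent.

```python
import sqlite3

# Illustrative schema and data; table/column names are assumptions, not the patent's.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_results (patient_id TEXT, variable_label TEXT, value REAL)")
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?)",
    [("P001", "fasting blood glucose", 7.8),
     ("P002", "fasting blood glucose", 5.6),
     ("P003", "fasting blood glucose", 9.1)],
)

# Discriminant code for the inference rule "fasting blood glucose level > 7 mmol/L".
DISCRIMINANT_SQL = """
SELECT patient_id FROM lab_results
WHERE variable_label = 'fasting blood glucose' AND value > 7.0
ORDER BY patient_id
"""

matched = [row[0] for row in conn.execute(DISCRIMINANT_SQL)]
print(matched)  # ['P001', 'P003']
```

The discriminant code answers a yes/no question per patient for one study design element; the final patient list is the intersection of such per-element results.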
7. Workflow: a logical flow comprising a plurality of nodes (such as steps and functions), used to control how each node processes user input, the function-calling logic, response decisions, and other operations.
8. Chain of thought: a technique for enhancing the reasoning capability of an AI system; by simulating the step-by-step thinking process people use when solving problems, it improves a model's understanding of, and efficiency in solving, complex problems and scenarios.
9. RAG knowledge base, i.e., a retrieval-augmented generation knowledge base: a mechanism integrating the advantages of retrieval technology and generative models, allowing an AI system to generate answers using its own training data while also acquiring related information from external knowledge sources to support the output.
10. Semantic embedding: a technical method for converting text into numerical vectors. Through this transformation, a machine can understand the meaning of terms and the semantic relationships between them.
11. Multi-source heterogeneous data: data sets from different data sources that have different structures and formats.
First embodiment
The first embodiment of the application relates to an intelligent inclusion and exclusion method based on clinical research. As shown in fig. 1 and 2, the method is implemented with a large model and may include the following steps:
step S101, determining study design elements according to the received study protocol;
step S102, determining inference rules according to the study design elements;
step S103, generating the discriminant code of each study design element according to the inference rules;
Step S104, determining a patient list conforming to the study protocol according to the discriminant code.
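The four steps can be sketched as a simple pipeline. The bodies below are placeholder stubs: in the actual system S101–S103 are delegated to a large model and S104 to a database, so the function names, return shapes, and stub logic here are illustrative assumptions only.

```python
# Minimal skeleton of the flow S101-S104; large-model calls are stubbed out.
def determine_design_elements(study_protocol: str) -> list[str]:
    # S101: stand-in for large-model extraction of study design elements.
    return [line.strip() for line in study_protocol.splitlines() if line.strip()]

def determine_inference_rules(elements: list[str]) -> dict[str, str]:
    # S102: one (or more) inference rules per study design element.
    return {e: f"rule for: {e}" for e in elements}

def generate_discriminant_code(rules: dict[str, str]) -> dict[str, str]:
    # S103: each rule is compiled into executable code (SQL, by assumption).
    return {e: f"SELECT patient_id FROM ehr WHERE /* {r} */ 1=1" for e, r in rules.items()}

def determine_patient_list(code: dict[str, str]) -> list[str]:
    # S104: would run every discriminant and intersect the per-element patient sets.
    return ["P001"]  # placeholder result

protocol = "older than 18 years\nhas type II diabetes"
elements = determine_design_elements(protocol)
patients = determine_patient_list(generate_discriminant_code(determine_inference_rules(elements)))
```

Chaining the four functions mirrors the end-to-end flow from protocol input to patient list output described below.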
Illustratively, the method may be applied to a large-model-based clinical study patient screening and data extraction system. In view of the operating mechanism of large models and the strict requirements of clinical research on the accuracy of results, the embodiments of the application can systematically realize the above steps by combining a workflow with a chain of thought. The workflow connects a plurality of functional nodes in series to ensure efficient data processing and rule execution, while the chain of thought simulates the human reasoning process and assists the large model in decomposing and verifying complex medical logic layer by layer, so that screening efficiency is improved and the medical accuracy and reliability of the results are ensured. The above steps are described in detail below.
For step S101, the study protocol may, for example, be embodied as electronic text. After the system receives the electronic text of the study protocol, the large model may extract all study design elements from its content. The format of the electronic text may be, but is not limited to, plain text. The study may be, but is not limited to, a clinical trial study, an observational study design, a task assignment, and the like. A study design element is a key framework item suitable for defining core content such as the target population, intervention measures, observation indicators, and influencing factors of a clinical study.
Illustratively, the study design elements can cover multiple dimensions. In terms of inclusion criteria, they define the target population meeting the study conditions through specific requirements such as age range, disease diagnosis conditions, and comorbidity restrictions. Intervention/exposure factors specify the concrete operations of the experimental and control groups, such as drug dosage, treatment period, and exposure environment. Clinical outcomes, including primary and secondary endpoints, specify the indicators the study needs to observe and measure, such as the degree of symptom improvement and survival time. Key covariates involve important factors that may affect study results, such as a patient's gender, BMI, and smoking history.
For step S102, illustratively, inference rules may be determined from the study design elements. For example, for the inclusion criterion "age ≥ 18 years and < 65 years", the inference rule requires defining an age calculation method (based on the difference between the enrollment date and the date of birth) and the treatment of boundary values (including 18 years and excluding 65 years); for "treatment with a certain drug for more than 3 months", the drug-name matching rule, the start and stop times for calculating treatment duration, and the like must be determined.
For step S103, the inference rules may be converted into discriminant code that the computer can understand and execute, for example using a programming language or a specific data processing tool. For example, "age ≥ 18 years and < 65 years" is written as an SQL query statement, R code, or Python code, and "use of a medication for more than 3 months" is converted into code logic that retrieves medication records from electronic medical records and calculates the medication time.
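The "use of a medication for more than 3 months" conversion mentioned above can be sketched as follows. The record layout and the decision to read "more than 3 months" as "more than 90 days" are assumptions for illustration; pinning down that boundary value is exactly the kind of detail the inference rule from step S102 must specify.

```python
from datetime import date

# Hypothetical medication records (patient_id, start date, end date).
records = [
    ("P001", date(2024, 1, 10), date(2024, 6, 1)),   # ~4.7 months of use
    ("P002", date(2024, 3, 1), date(2024, 4, 15)),   # ~1.5 months of use
]

def used_drug_over_three_months(start: date, end: date) -> bool:
    # "More than 3 months" is treated here as more than 90 days -- a
    # boundary-value choice the inference rule itself would have to fix.
    return (end - start).days > 90

qualified = [pid for pid, s, e in records if used_drug_over_three_months(s, e)]
print(qualified)  # ['P001']
```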
For step S104, illustratively, data retrieval and screening operations may be performed in a medical database (e.g., an electronic medical record system or a trial database) according to the discriminant code. The system can compare patient data against the conditions of the discriminant code one by one and screen out the patients who meet the requirements of all study design elements, forming a patient list that conforms to the study protocol. The patient list may be the direct object of subsequent clinical studies and statistical data analysis.
It is not difficult to see that, compared with the related art, the embodiments of the application provide an intelligent, automatic, accurate, and efficient inclusion and exclusion solution. By dynamically deducing study design elements, inference rules, and discriminant code in an end-to-end flow based on the study protocol, the scheme can fundamentally solve the problems that the traditional technology cannot achieve true intelligence and lacks generalized adaptation. Specifically, the method can automatically identify the required study design elements according to the received study protocol; this process avoids the rigid type adaptation caused by the traditional reliance on fixed rules or manual experience, so the method naturally adapts to different study types. Furthermore, based on the inference rules generated from the study design elements, multi-source heterogeneous data can be uniformly mapped into standardized discriminant logic, thereby solving the integration problem caused by structural differences among heterogeneous data. The full-link automatic deduction and dynamic updating mechanism, from study protocol input to patient list output, realizes an end-to-end intelligent flow without manual intervention and can comprehensively improve the generalization capability of the system in different scenarios through self-adaptive analysis of study types, unified mapping of multi-source data, and dynamic iteration of knowledge.
Second embodiment
The second embodiment of the application relates to an intelligent inclusion and exclusion method based on clinical research. The second embodiment is an improvement on the first embodiment; the specific improvement is that this embodiment provides a concrete implementation of determining the study design elements according to the received study protocol.
Specifically, in some embodiments, determining the study design elements according to the received study protocol, that is, step S101, may include the following steps:
step S1011, determining an element extraction prompt word according to the study protocol;
step S1012, according to the element extraction prompt word, outputting a structured list of study design elements through the large model.
For step S1011, illustratively, after the system receives the electronic text of the study protocol, determining the element extraction prompt word according to the study protocol is a process in which a large model converts unstructured text into structured instructions by combining study-type features and industry specifications. Specifically, the large model can perform semantic analysis on the electronic text of the study protocol, extract study design elements such as the study purpose, inclusion and exclusion criteria, and intervention measures, and, combined with knowledge of the clinical research field, convert the natural language description into more specific instructive sentences. For example, from "adult patients suffering from type II diabetes and not treated with insulin", detailed prompts such as "extract the type II diabetes diagnostic criteria" and "define the age definition of an adult patient" are derived, forming an element extraction prompt word that assists the large model in extracting elements.
For step S1012, illustratively, the element extraction prompt word may be taken as input, and, using the information extraction and structuring capability of the large model, key information (such as age range and intervention dose) is extracted from the study protocol and sorted into a structured list containing elements such as inclusion criteria, intervention measures, and clinical outcomes according to a preset standardized format (such as JSON or a table). The structured list has a unified data format and expression mode, can eliminate ambiguity in the original content of the study protocol, and helps provide a standard data basis for the study. Taking a clinical study protocol for type II diabetes as an example, after the prompt word "extract the diagnostic criteria for type II diabetes and the judgment basis for insulin not being used" is input, the large model can extract the key information from the study protocol and structure it: the "diagnostic criteria for type II diabetes" correspond to the international disease classification code "ICD-10 code E11.x", so the criterion can be searched directly through medical record codes, and the "judgment basis for insulin not being used" is converted into "no insulin administration record in the electronic physician-order system within approximately the last 3 months", making it convenient to screen eligible patients from real-world data. In the resulting structured list, the inclusion criteria can be clearly presented as "ICD-10 code is E11.x, age ≥ 18 years, and no insulin administration record within the last 3 months". In this way, fuzzy expressions such as "insulin not used" in the original text of the study protocol can be eliminated through the standardized format, helping to provide a quantifiable and verifiable data basis for screening subjects.
Optionally, in some embodiments, determining the element extraction prompt word according to the study protocol, that is, step S1011, may include:
step S10111, identifying the study type of the study protocol;
step S10112, acquiring the international specification guideline corresponding to the study type and/or a user-predefined prompt word template;
Step S10113, determining the element extraction prompt word according to the study protocol, the study type, and the international specification guideline and/or the user-predefined prompt word template.
For step S10111, illustratively, the large model may perform semantic analysis on the electronic text of the study protocol, extract key features from the core links of the study design, and determine the study type. For example, it may first be identified whether the study protocol refers to "random assignment", "control setting", or the like: if the study protocol refers to "random assignment of patients to different treatment groups", the study type is a randomized controlled trial (RCT); if "long-term follow-up of disease progression in a specific population" is emphasized, the study type is a cohort study; and if "investigation of disease status in a population at a single time point" is the core, the study type is a cross-sectional study. For example, when the study protocol describes "patients are randomly divided into drug group A and a placebo group at a 1:1 ratio, and blood glucose changes are observed over 6 months of follow-up", the large model can judge the study type to be an RCT by capturing keywords such as "random grouping" and "control intervention", ensuring that the identification of the study type accords with the standard classification logic of clinical research.
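The study-type identification described above is performed by the large model's semantic analysis; as a rough intuition for what it decides, a toy keyword heuristic is sketched below. The keyword lists are illustrative stand-ins only and not part of the patent's method.

```python
# Toy keyword heuristic for study-type identification; the real system uses a
# large model, so these keyword lists are illustrative assumptions only.
RULES = [
    ("randomized controlled trial", ["random assignment", "randomly divided", "placebo group"]),
    ("cohort study", ["long-term follow-up", "followed for"]),
    ("cross-sectional study", ["at a single time point", "prevalence survey"]),
]

def identify_study_type(protocol_text: str) -> str:
    text = protocol_text.lower()
    for study_type, keywords in RULES:
        if any(k in text for k in keywords):
            return study_type
    return "unknown"

desc = "Patients are randomly divided into drug group A and a placebo group (1:1)."
print(identify_study_type(desc))  # randomized controlled trial
```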
By way of example, the study types may include randomized controlled trials, cross-sectional studies, cohort studies, diagnostic studies, and the like. In some embodiments, the study type may also be specified by the user according to personalized needs.
For step S10112, illustratively, according to the identified study type, a built-in knowledge-graph association mechanism may be automatically triggered in the large model's prompt word generation logic, thereby automatically associating the international guideline specification corresponding to the study type and/or a user-predefined prompt word template. For example, for the RCT type the CONSORT statement may be automatically invoked, and for cohort studies the STROBE statement may be invoked. If a user-predefined template exists (such as a tumor study design template formulated within an enterprise), the large model can synchronously load the preset instructions in the template (such as "extract specific gene-mutation detection criteria") and integrate these specification requirements with the original content of the study protocol to form an element extraction prompt word containing both industry standards and custom requirements, ensuring that the subsequently extracted study design elements meet internationally accepted standards and the requirements of personalized research scenarios.
The international specification guidelines may include, but are not limited to, the SPIRIT statement, the CONSORT statement, the STROBE statement, and the like.
The user-predefined prompt word template is an instruction set preset by the user for a specific study type according to specific research requirements, industry specifications, or organizational standards. The template can clearly define, for different study types, the study design elements to be extracted and their expression requirements; for example, in a tumor clinical trial template, a user can customize instructions such as "extract the PD-L1 expression detection method" and "clarify the imaging evaluation criteria for disease progression".
For step S10113, the electronic text of the study protocol can be analyzed through the large model to extract core content such as inclusion criteria and intervention measures; the specification requirements on study design in the corresponding international guideline can be invoked based on the identified study type, and/or the personalized instructions for that type in the user-predefined template can be loaded. Then, the large model can semantically fuse the key information of the original study protocol, the specification requirements of the study type, and the user's custom instructions, converting them into a structured prompt sentence. For example, the protocol description "random grouping of diabetic patients" is integrated with the CONSORT specification for the RCT type and the requirement "extract sample-size details" in the user template, generating an element extraction prompt word such as "according to the CONSORT standard, extract the randomization method, the sample-size estimation basis, and the user-specified blood glucose monitoring frequency of this RCT study", ensuring that the prompt word reflects the protocol's original meaning, the industry standard, and the customization requirements.
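The assembly of the three inputs named in step S10113 can be sketched as a template function. The wording of the template and the function signature are assumptions; the actual fusion is performed semantically by the large model rather than by string concatenation.

```python
from typing import Optional

# Illustrative assembly of an element extraction prompt word from the three
# inputs of step S10113; the template wording is an assumption, not the patent's.
def build_extraction_prompt(protocol: str, study_type: str,
                            guideline: str, user_template: Optional[str]) -> str:
    parts = [
        f"Study type: {study_type}. Follow the {guideline} reporting standard.",
        f"Protocol excerpt: {protocol}",
    ]
    if user_template:
        parts.append(f"Additional user-defined requirements: {user_template}")
    parts.append("Extract the randomization method, sample-size rationale, "
                 "and all inclusion/exclusion criteria as structured items.")
    return "\n".join(parts)

prompt = build_extraction_prompt(
    protocol="Diabetic patients are randomly grouped ...",
    study_type="randomized controlled trial",
    guideline="CONSORT",
    user_template="extract sample-size details",
)
```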
Optionally, in some embodiments, outputting the structured list of study design elements through the large model according to the element extraction prompt word, that is, step S1012, may include:
Step S10121, determining target items of study design elements according to the PICO framework of the study type;
Step S10122, according to the element extraction prompt word, querying the large model item by item for the study design elements of the target items in the study protocol;
Step S10123, forming the structured list from the queried study design elements.
For step S10121, illustratively, the PICO framework, as a classical paradigm of clinical study design, divides the core study content into study population, intervention measures, comparison scheme, and outcome indicators, where the division of study design elements under the PICO framework needs to be dynamically adjusted in combination with the specific study type. In this step, the PICO framework is used as a guide to explicitly define the key areas the study design needs to focus on and to delimit the range of elements to extract.
For example, for a clinical study of a novel hypoglycemic agent, the target items determined through the PICO framework based on the study type are "type II diabetes patients (study population)", "usage of the novel hypoglycemic agent (intervention measure)", "a traditional hypoglycemic agent as control (comparison scheme)", and "change in glycemic control indicators (outcome indicator)". As another example, the target items determined through the PICO framework may be: "age ≥ 18 years" and "type II diabetes patients diagnosed according to WHO criteria" (study population); "combined severe hepatic and renal insufficiency" and "pregnant patients" (exclusion criteria); "experimental group uses drug A, control group uses placebo" (intervention/exposure factors); "the primary endpoint is the change in fasting blood glucose from baseline after 12 weeks of treatment" and "secondary endpoints include overall survival within 24 weeks" (clinical outcomes); and "body mass index (BMI)" and "past history of malignancy" (key covariates).
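The first PICO breakdown above can be written as a plain mapping; the field names follow the PICO framework, and the values are taken from the hypoglycemic-agent example in the text.

```python
# PICO breakdown of the hypoglycemic-agent example as a simple mapping.
pico = {
    "P (population)":   "type II diabetes patients diagnosed per WHO criteria, age >= 18",
    "I (intervention)": "experimental group receives drug A",
    "C (comparison)":   "control group receives placebo",
    "O (outcome)":      "change in fasting blood glucose from baseline after 12 weeks",
}
for item, content in pico.items():
    print(f"{item}: {content}")
```

Each mapping value then becomes one or more target items for the item-by-item queries of step S10122.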
For step S10122, illustratively, when extracting the study design elements based on the element extraction prompt word and the target items, the system may first semantically match the element extraction prompt word, which incorporates the study protocol, the study type, and the international guideline and/or the user-predefined prompt word template, against each target item (inclusion criteria, intervention measures, etc.). For example, for a type II diabetes drug clinical trial, after "type II diabetes must meet WHO diagnostic criteria" in the element extraction prompt word is successfully matched to the "disease diagnosis criteria" target item under the inclusion criteria, the system may generate a structured question based on the key information in the prompt word, such as "According to the WHO diagnostic criteria, what are the specific blood glucose indicators and detection requirements for defining type II diabetes in the study protocol?". By executing this "match-question" procedure for each successfully matched target item, the system can guide the large model to accurately locate the corresponding content in the electronic text of the study protocol and ensure that the element extraction is both compliant and targeted.
For step S10123, after the large model completes extraction of each target element, the system may standardize and integrate the scattered information. The large model can automatically check the consistency of the extracted content (for example, excluding expressions that contradict the primary diagnostic criteria), thereby ensuring the accuracy of the information. The system can then classify and organize all elements in a structured format such as JSON, grouping specific content under categories such as "inclusion criteria", "exclusion criteria", and "interventions" to form a complete structured list. Taking type II diabetes as an example, the final list can clearly show the diagnostic criteria of "fasting blood glucose not less than 7.0 mmol/L and glycosylated hemoglobin not less than 6.5%", intervention details such as "drug A taken once a day for two weeks", and so on. This structured integration makes it easy for researchers to grasp the core content intuitively, and provides a standardized data basis for the subsequent screening of patients based on real-world data and for research analysis.
For example, taking a clinical trial study protocol for type II diabetes, when the study design elements are extracted in combination with the secondary prompt words, the system first semantically matches the secondary prompt words — comprising the original protocol text, the "randomized controlled trial (RCT)" study type, the loaded CONSORT specification, and the user-defined template "explicit diagnostic and exclusion criteria required" — against target items such as "inclusion criteria" and "interventions". When the "disease diagnostic criteria" target element under the inclusion criteria is matched, the system can, based on the instruction "type II diabetes must meet the WHO diagnostic criteria" in the secondary prompt words, ask the large model: "according to the WHO diagnostic criteria, what are the specific blood glucose indicators and testing requirements that define type II diabetes in the study protocol?". After completing the "match-question-extract-check" procedure for all target elements (e.g., intervention drug dose, primary clinical outcome, etc.), the large model integrates the scattered information into a structured list in JSON format, for example:
{
  "inclusion_criteria": {
    "disease_diagnostic_criteria": "fasting blood glucose >= 7.0 mmol/L and glycosylated hemoglobin >= 6.5%",
    "age_requirement": "18-65 years"
  },
  "exclusion_criteria": ["eGFR < 60", "pregnancy"],
  "interventions": ["drug A taken orally once a day for two weeks"],
  ...
  "clinical_indicators": ["patient has tumor volume measurements from 3 months before and after treatment"],
  ...
}
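As an illustrative sketch only (the prompt wording, category names, and JSON keys below are assumptions, not fixed by the application), the "match-question-extract-check" procedure may end with a validation pass over the structured list returned by the large model:

```python
# Hypothetical sketch: build an element-extraction prompt and validate the
# structured list returned by a large model. All names are illustrative.
import json

REQUIRED_CATEGORIES = ["inclusion_criteria", "exclusion_criteria", "interventions"]

def build_extraction_prompt(protocol_text, study_type, guideline):
    # Fold protocol text, study type and reporting guideline into one instruction.
    return (
        f"Study type: {study_type}. Reporting guideline: {guideline}.\n"
        "Extract inclusion/exclusion criteria, interventions and endpoints "
        f"from the protocol below as JSON.\n---\n{protocol_text}"
    )

def validate_element_list(raw_json):
    # Consistency check: the output must parse and cover every required category.
    elements = json.loads(raw_json)
    missing = [c for c in REQUIRED_CATEGORIES if c not in elements]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    return elements

prompt = build_extraction_prompt("...protocol text...", "RCT", "CONSORT")
model_output = json.dumps({  # stand-in for the large model's reply
    "inclusion_criteria": {"diagnosis": "fasting glucose >= 7.0 mmol/L"},
    "exclusion_criteria": ["eGFR < 60", "pregnancy"],
    "interventions": ["drug A, orally once a day for two weeks"],
})
elements = validate_element_list(model_output)
```

In a real system the reply would come from the large model rather than being constructed inline; the validation step is what guards the downstream screening pipeline against malformed output.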
It is easy to see that in this embodiment of the application, information such as the original protocol text, the study type specification, and the user-defined template is integrated into a clear instruction by determining the element extraction prompt words from the study protocol, providing an accurate extraction direction for the large model. On this basis, the large model outputs a structured list of study design elements according to the element extraction prompt words, using its information processing capability to convert the unstructured study protocol into a standardized, systematic data structure. The combination of these two steps makes the extraction of study design elements more efficient and accurate, avoids the omissions and errors that may occur in manual extraction, and allows the core content of the study protocol to be combed out quickly and completely, thereby providing a reliable data basis for subsequent research analysis, data processing, and evaluation of research results, and significantly improving the processing efficiency and quality of study design elements.
Third embodiment
The third embodiment of the application relates to an intelligent inclusion/exclusion (enrollment screening) method based on clinical research. The third embodiment is an improvement on the second embodiment, specifically in that in the present embodiment, a specific implementation of forming the structured list according to the queried study design elements is provided.
Specifically, the step of forming the structured list according to the queried study design elements, that is, the step S10123 may include the steps of:
S3A, performing a conversion operation on the queried study design elements to generate specific conditions that can be evaluated against data;
And step S3B, forming the structured list according to the specific conditions.
For step S3A, for example, because study protocols are often described in natural language, they contain fuzzy expressions such as "stable condition" and unstructured medical terms such as "type II diabetes", which easily cause ambiguity when used directly to match real-world data (such as electronic medical records and test reports). Therefore, in this step, a conversion operation is performed on the queried study design elements: abstract concepts are quantified into specific numerical standards (for example, "stable condition" is refined into "vital signs within the normal range for 72 consecutive hours"), and medical terms are mapped to standard codes (for example, "type II diabetes" is replaced by ICD-10 code E11.9), so that the study design elements are converted into specific conditions that can be compared directly against data fields such as blood pressure values and medication records in electronic medical records. This process eliminates the ambiguity of natural language and provides an accurate basis for interpretation in subsequent data screening.
For step S3B, the specific conditions after normalization may be integrated to form the structured list. For example, specific conditions such as "age ≥ 18 years", "ICD-10 code E11.9", and "vital signs normal for 72 consecutive hours" are classified and organized under categories such as "inclusion criteria" and "exclusion criteria", and presented in a table or JSON format. The structured list not only establishes a clear mapping between study design elements and real-world data fields (for example, the "disease diagnosis code" field in the electronic medical record corresponds to ICD-10 code E11.9), but also ensures that eligible patients can subsequently be screened accurately by a data retrieval tool (such as an SQL query), improving the efficiency and reliability of research data collection.
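A minimal sketch of how such a structured list could drive screening (the field names such as "icd10_codes" and the condition keys are assumptions for illustration, not part of the protocol):

```python
# Hypothetical sketch: evaluate a structured condition list against one
# patient record. Field and condition names are invented for illustration.
CHECKS = {
    "age_gte": lambda patient, v: patient["age"] >= v,
    "icd10": lambda patient, v: v in patient["icd10_codes"],
}

def patient_matches(patient, conditions):
    # A patient is eligible only if every specific condition holds.
    return all(CHECKS[key](patient, value) for key, value in conditions.items())

structured_list = {"inclusion_criteria": {"age_gte": 18, "icd10": "E11.9"}}

eligible = patient_matches(
    {"age": 45, "icd10_codes": ["E11.9", "I10"]},
    structured_list["inclusion_criteria"],
)
```

In practice the same structured list could equally be compiled into an SQL WHERE clause over the electronic medical record tables.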
Optionally, in some embodiments, the converting operation may include at least one of:
disassembling composite inclusion conditions into atomic conditions;
converting protocol descriptions of interventions into operable medical actions;
and converting outcome-indicator calculation requirements into requirements for the existence of raw test data at preset time nodes.
Specifically, a composite inclusion condition is a comprehensive screening criterion formed by combining multiple related elements in the study protocol (such as "early breast cancer confirmed by pathological testing to be triple-negative breast cancer"); it contains multi-dimensional limiting factors and is complex in expression, so using it directly to screen electronic medical records easily causes verification difficulties due to semantic ambiguity. During disassembly, the logical structure is analyzed first and separated into basic judgment units that cannot be further divided. Taking the breast cancer case as an example, the basic judgment units may be "diagnosed with breast cancer" (matching the medical record diagnosis field or ICD code), "early-stage breast cancer" (judged by TNM stage or tumor size), and "postoperative pathology confirming triple-negative status" (verified by negative results for estrogen receptor, progesterone receptor, and HER2 in the pathology report). Each atomic condition corresponds to a specific data field in the electronic medical record or test report, avoiding the complexity of judging multiple superimposed factors and enabling accurate screening of study subjects.
Specifically, protocol descriptions of interventions in the study protocol (such as "the new therapy adopts regimen A") are expressed in abstract, generalized language, lacking specific details such as drug name, dose, and frequency, and cannot be directly matched against the medication records and physician orders in the electronic medical record. During conversion, such fuzzy concepts need to be refined into operable medical actions; for example, "treatment with regimen A" is made explicit as "drug A taken orally once a day, 50 mg each time, for two weeks", converting the intervention from an abstract statement into concrete actions by supplementing factors such as drug dosage form, dose, and course of treatment. This conversion can be matched accurately against specific records in real-world data, providing a clear basis for screening patients who meet the intervention conditions and improving the accuracy of data collection.
Specifically, some outcome indicators are calculated (e.g., "tumor volume change over 3 months") and depend on comparing data at different time points; they cannot be obtained directly from the raw data. During conversion, the calculation requirement is disassembled into a definition of data completeness. Taking tumor volume change as an example, it must be ensured that tumor volume measurements from 3 months before and after treatment exist in the raw data; by defining the time nodes and indicator types for data collection, an existence requirement is formed: "the patient must have tumor volume measurement records from 3 months before and after treatment". This conversion avoids calculation obstacles caused by missing data and ensures that subsequent analysis can accurately derive the outcome indicator results based on complete data.
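The disassembly of a composite inclusion condition can be sketched as follows for the breast-cancer example (the split rule, field names, and atomic mappings are illustrative assumptions standing in for large-model output):

```python
# Hypothetical sketch: disassemble a composite inclusion condition into
# atomic conditions, each bound to one EHR/report field. All mappings here
# are invented for illustration.
ATOMS = {
    "confirmed breast cancer": ("diagnosis_field", "ICD-10 C50"),
    "early stage": ("tnm_stage", "stage I or II"),
    "triple negative": ("pathology_report", "ER-, PR-, HER2- all negative"),
}

def decompose(composite):
    # Split on the explicit logical connector, then map each part to a field.
    return [ATOMS[part.strip()] for part in composite.split(" AND ")]

atomic = decompose("confirmed breast cancer AND early stage AND triple negative")
```

Each resulting tuple pairs one atomic condition with the single data field it is verified against, which is what makes the screening logic tractable.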
Optionally, in some embodiments, after determining the study design elements according to the received study protocol, the method may further include:
Step S201, a sample training step: taking historical study protocols and their corresponding study design elements as training samples, and optimizing the large model by supervised fine-tuning or low-rank fine-tuning;
Step S202, a dynamic calibration step, performed in response to identifying a fuzzy description item: retrieving a medical knowledge base to obtain relevant clinical guidelines, generating a quantifiable rule based on the clinical guidelines to replace the fuzzy description item, and outputting the quantifiable rule as a newly added study design element.
For step S201, illustratively, the sample training step aims to optimize the performance of the large model on the study design element extraction task using historical data. Specifically, historical study protocols and their corresponding confirmed study design elements are used as training samples, and supervised fine-tuning or low-rank fine-tuning techniques are applied to adjust the large model parameters in a targeted way. By constructing training tasks specific to the study design element extraction scenario, the model is guided to learn the semantic patterns, data characteristics, and logical relationships in medical text, thereby establishing a knowledge representation adapted to the field. After systematic training, when processing an actual study protocol, the large model can identify and parse the various study design elements more accurately based on the learned rules and patterns, significantly improving the accuracy and consistency of the extraction results and providing a reliable data basis for subsequent research.
For step S202, illustratively, the problem of semantic uncertainty caused by fuzzy expressions in the study protocol can be solved through the dynamic calibration step. When the large model identifies clinical descriptions lacking explicit quantitative standards, such as "stable condition" or "symptom improvement", retrieval-augmented generation (RAG) is triggered: first, authoritative documents stored in the local medical knowledge base, such as international medical guidelines and clinical consensus, provide a standardized basis for calibration; next, the fuzzy expression is vectorized and matched against the knowledge graph through semantic analysis, and relevant authoritative terms are retrieved from the knowledge base; finally, based on the retrieval results and combined with the research requirements, the fuzzy expression is converted into a quantifiable, operable rule — for example, "recent poor glycemic control" is defined as "glycosylated hemoglobin over the last 3 months greater than 8%". Through this process, the problem of inconsistent standards caused by differences in subjective interpretation can be eliminated, the scientific rigor and authority of the research conditions can be ensured, and the originally fuzzy expression is converted into an operable, verifiable study design element, effectively improving the standardization of the study design and the usability of the data.
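A toy sketch of the dynamic calibration step (the knowledge-base entries below are invented examples; a real system would retrieve them from guideline documents via RAG rather than from an in-memory dictionary):

```python
# Hypothetical sketch: replace a fuzzy description item with a quantifiable
# rule looked up from a local knowledge base. Entries are invented examples.
KNOWLEDGE_BASE = {
    "recent poor glycemic control": "HbA1c > 8% within the last 3 months",
    "stable condition": "vital signs within normal range for 72 consecutive hours",
}

def calibrate(fuzzy_term):
    rule = KNOWLEDGE_BASE.get(fuzzy_term.lower())
    if rule is None:
        raise LookupError(f"no guideline entry for: {fuzzy_term}")
    # The quantifiable rule becomes a newly added study design element.
    return {"original": fuzzy_term, "quantifiable_rule": rule}

new_element = calibrate("Recent poor glycemic control")
```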
It should be noted that the present embodiment may also be a modification of the first embodiment.
It is easy to see that in this embodiment of the application, by performing a conversion operation on the queried study design elements, abstract and fuzzy original expressions can be converted into specific conditions that can be interpreted directly against real-world data, providing a clear and accurate basis for data screening and matching and effectively avoiding data misjudgments caused by unclear semantics. The specific conditions are then integrated to form a structured list, achieving a systematic, normalized presentation of the study design elements. This ensures a seamless connection between study design elements and real-world data, improves the accuracy and efficiency of data extraction, and provides a clear, orderly data basis for subsequent work such as data analysis and evaluation of research results, significantly improving the reliability and operability of the research.
Fourth embodiment
The fourth embodiment of the application relates to an intelligent inclusion/exclusion (enrollment screening) method based on clinical research. The fourth embodiment is an improvement on the first embodiment, specifically in that in the present embodiment, a specific implementation of determining the inference rules based on the study design elements is provided.
Specifically, in some embodiments, the determining an inference rule according to the study design element, that is, step S102 may include:
Step S1021, carrying out semantic matching on the research design elements and a data dictionary to generate element-variable mapping relation;
Step S1022, determining an inference rule according to the element-variable mapping relation.
For step S1021, illustratively, this step aims to establish an association between the study design elements and the data dictionary, generating an element-variable mapping relationship by semantically matching the study design elements against the data dictionary. Because semantic differences exist between the expression of a study design element and the variables in real-world data, the large model is used to parse the element semantics, and the correspondence between study design elements and data dictionary variables can be found by combining two approaches: direct matching and indirect reasoning. For example, for the study design element "patient suffering from diabetes", by analyzing its medical meaning, the element-variable mapping relationship can be obtained either by directly associating an explicit diagnostic variable such as "diagnosis: past medical history", or by indirectly inferring an auxiliary diagnostic variable such as "laboratory test: fasting blood glucose level" through medical logic.
For step S1022, illustratively, the data dictionary may include detailed information such as form names, variable labels/descriptions, variable types, and coding rules; in this step, specific data filtering logic may be formulated according to the variables and attributes corresponding to each study design element. For example, when the study design element "age ≥ 18 years" is mapped to the "visit age" variable, and the data dictionary defines "visit age" as a numeric variable, the inference rule is determined as "screen patients whose visit age field value is 18 or more from the electronic medical record". Converting the element-variable mapping relationship into a clear judgment condition in this way makes the study design element operable and directly applicable to the screening and analysis of real-world data.
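A minimal sketch of turning an element-variable mapping into a screening rule (the dictionary entries and the rule output format are assumptions for illustration):

```python
# Hypothetical sketch: combine an element-variable mapping with data
# dictionary metadata to produce a concrete screening rule.
DATA_DICTIONARY = {
    "visit_age": {"form": "basic_info", "type": "numeric"},
    "past_history": {"form": "diagnosis_record", "type": "text"},
}

def make_rule(element, variable, operator, value):
    meta = DATA_DICTIONARY[variable]  # type/form metadata guides the rule shape
    return {
        "element": element,
        "form": meta["form"],
        "rule": f"{variable} {operator} {value}",
    }

rule = make_rule("age >= 18 years", "visit_age", ">=", 18)
```

The form name carried in the rule tells the downstream retrieval step which table of the electronic medical record to query.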
Optionally, in some embodiments, the semantically matching the study design element with the data dictionary to generate an element-variable mapping relationship, that is, step S1021 may further include:
step S10211, analyzing semantic information of the research design elements through a large model;
Step S10212, carrying out semantic matching on the semantic information and variable labels in a data dictionary, and outputting an element-variable name matching list containing variable names, variable labels and belonging forms.
For step S10211, illustratively, the semantic information of the study design elements is deeply parsed using the powerful semantic understanding capability of the large model. Study design elements in the medical field often contain complex technical terms and implicit logic; for example, "adopting an intensified hypoglycemic regimen" may involve multiple pieces of information about drug type, dose, and frequency. The large model identifies the key semantics in the elements through natural language processing, splits complex expressions, and mines latent medical associations. For example, for "recent poor glycemic control", the large model can resolve its dual semantics involving a blood glucose indicator and a time range, providing an accurate semantic basis for subsequent matching against the data dictionary.
For step S10212, illustratively, two strategies can be adopted in the matching process: direct matching and indirect reasoning. Direct matching applies to elements with explicit semantics and maps them directly to variable labels through string-similarity calculation or regular expressions; for example, "age ≥ 18 years" directly matches the variable label "visit age" (whose form is "basic information"). Indirect reasoning, based on a medical knowledge graph, performs semantically expanded matching to variables potentially associated with the element through medical knowledge; for example, the variable label "glycosylated hemoglobin" is inferred indirectly from the element "patient suffering from diabetes". The generated matching list can be presented in a structured form, clearly marking the source form (such as "laboratory test" or "diagnosis record") and the matching type of each variable, ensuring that the correspondence between study design elements and data dictionary variables is intuitive and traceable, and providing an accurate data association basis for the subsequent establishment of inference rules.
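The direct-matching strategy can be sketched with standard-library string similarity (the variable labels and the 0.6 threshold are assumptions, not values from the application):

```python
# Hypothetical sketch: direct element-variable matching by string
# similarity. Labels and threshold are illustrative.
from difflib import SequenceMatcher

VARIABLE_LABELS = {
    "visit_age": "age at visit",
    "hba1c": "glycosylated hemoglobin",
    "fpg": "fasting blood glucose",
}

def direct_match(element_text, threshold=0.6):
    # Keep every variable whose label is lexically close to the element text.
    return [
        name
        for name, label in VARIABLE_LABELS.items()
        if SequenceMatcher(None, element_text.lower(), label).ratio() >= threshold
    ]

hits = direct_match("glycosylated hemoglobin")
```

Elements that match no label here would fall through to the indirect, knowledge-graph-based reasoning path.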
Optionally, in some embodiments, the determining an inference rule according to the element-variable mapping relationship, that is, step S1022 may further include:
step S10221', constructing and populating a prompt word template for each generated element-variable mapping relationship, wherein the prompt word template is injected with the following information: the medical definition of the target element, the matched variable name, and the variable label, type, and coding rules from the data dictionary;
step S10222', inputting the populated prompt words into the large model, and outputting inference rules in pseudo-code form in combination with the data dictionary, wherein the inference rules include at least one of: threshold judgment for numeric variables, keyword matching for text variables, and code-value mapping for categorical variables.
For step S10221', illustratively, a prompt word template needs to be constructed and populated for each element-variable mapping relationship, to generate a semantic instruction that can direct the large model to output an inference rule. In a specific operation, the medical definition of the target element can first be extracted — for example, explicitly stating the specific diagnostic criteria for "patient suffering from diabetes" — while obtaining the matched variable name, variable label, and type (such as the text-type variable "past medical history") and calling up the coding rules in the data dictionary (such as the ICD-10 code for diabetes, E11.9). This information can then be injected into a preset prompt word template. If the mapping is univariate — for example, the mapping between "patient suffering from diabetes" and "past medical history" — a prompt word such as "how can [patient suffering from diabetes] be inferred from the variable [past medical history]?" is generated. If a multivariate composite inference is involved — for example, when both the "admission diagnosis" and "discharge diagnosis" variables are referenced — a prompt word containing multiple variable names may be constructed, such as "how can an inference rule in pseudo-code form for [patient suffering from diabetes] be derived from the variables [admission diagnosis] and [discharge diagnosis]?". By filling in this key information, the prompt words accurately convey the inference requirement, enabling the large model to deeply understand the logical relationship between the element and the variables and generate an accurate inference rule.
For step S10222', illustratively, the populated prompt words may be input into the large model and combined with the data dictionary to generate inference rules in pseudo-code form. The large model takes the data dictionary as a knowledge base and, based on the variable types (numeric, text, categorical, etc.) and coding rules in the prompt words, outputs the corresponding judgment logic. For example:
For text variables such as "admission diagnosis" and "discharge diagnosis", generating "'diabetes' in admission diagnosis | 'diabetes' in discharge diagnosis", judged by keyword matching;
For numeric variables (e.g., blood glucose value), combining the threshold requirements of the data dictionary to generate "fasting blood glucose >= 7.0 mmol/L";
For categorical variables (e.g., disease code), generating "ICD-10 code == E11.9" through coding dictionary mapping.
The finally output pseudo-code inference rules cover the variable judgment logic and the corresponding form information, so that they can be used directly for screening and verifying electronic medical record data; for example, "whether suffering from diabetes" is matched directly, as a logical type, against a Boolean variable in the data dictionary.
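The three rule shapes above can be sketched as a single dispatch function (the emitted strings are pseudo-code rules; variable names and specs are assumptions):

```python
# Hypothetical sketch: emit a pseudo-code inference rule per variable type.
def build_rule(variable, var_type, spec):
    if var_type == "numeric":        # threshold judgment
        return f"{variable} {spec['op']} {spec['value']}"
    if var_type == "text":           # keyword matching
        return " | ".join(f"'{k}' in {variable}" for k in spec["keywords"])
    if var_type == "categorical":    # code-value mapping
        return f"{variable} == '{spec['code']}'"
    raise ValueError(f"unknown variable type: {var_type}")

numeric_rule = build_rule("fasting_glucose", "numeric", {"op": ">=", "value": 7.0})
text_rule = build_rule("admission_diagnosis", "text", {"keywords": ["diabetes"]})
code_rule = build_rule("disease_code", "categorical", {"code": "E11.9"})
```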
Optionally, in some embodiments, during the element-variable label matching and/or inference rule generation process, the method may further include invoking a RAG knowledge base to perform:
Step S301, building a medical knowledge base storing disease guidelines, clinical literature, medical coding dictionaries, and historically confirmed element-variable mapping results;
Step S302, in response to an element description with fuzzy semantics, retrieving the knowledge base to obtain the relevant medical definitions and threshold standards, and correcting the variable matching logic;
Step S303, when generating an inference rule in pseudo-code form, retrieving the knowledge base to obtain clinical diagnostic criteria and historical rules, wherein if an identical generation requirement exists, the historically archived rule is output directly, and if a similar requirement exists, the historical rule is injected into the large model prompt words as reference context.
For step S301, illustratively, this operation builds an integrated medical knowledge storage system to store, in a structured manner, information such as disease guidelines (e.g., diabetes diagnosis and treatment standards), clinical literature, medical coding dictionaries (e.g., the ICD-10 code library), and historically confirmed element-variable mapping results. For example, the knowledge base records the medical definition "fasting blood glucose ≥ 7.0 mmol/L is a diagnostic criterion for diabetes" and historical matching cases where "patient suffering from diabetes" corresponds to variables such as "fasting blood glucose value" and "glycosylated hemoglobin". Building this knowledge base provides authoritative knowledge support and historical experience for element-variable matching and inference rule generation, ensuring the accuracy and consistency of subsequent operations.
For step S302, when a semantically ambiguous element description (such as "abnormal blood glucose control") is encountered, the system retrieves the medical knowledge base to strengthen the matching logic: first, the relevant medical definitions and threshold standards are obtained (such as "fasting blood glucose > 7 mmol/L is the diabetes diagnosis threshold"), converting the ambiguous expression into an explicit variable matching condition; and if the knowledge base contains a historical matching result for the same element (for example, "abnormal blood glucose control" previously corresponded to the "fasting blood glucose value" variable), the confirmed variable label is called directly to avoid repeated reasoning. For example, for "recent poor glycemic control", the matching logic can be corrected by retrieving the quantitative standard "glycosylated hemoglobin over the last 3 months > 8%" stored in the knowledge base, so that the element is accurately associated with the "glycosylated hemoglobin" variable, improving matching accuracy and efficiency.
For step S303, illustratively, when generating inference rules in pseudo-code form, the system may retrieve the knowledge base to obtain clinical diagnostic criteria and historical rules. If the rule to be generated is identical to a requirement archived in the knowledge base (e.g., inferring diabetes based on "admission diagnosis"), the historically confirmed rule (e.g., "'diabetes' in admission diagnosis") is output directly; if the requirement is similar (e.g., inferring diabetes based on "discharge diagnosis"), the historical rule is injected into the large model prompt words as reference context to assist in generating a more reliable rule. For example, when generating the rule "infer anemia from 'blood routine' variables", the clinical standard "hemoglobin < 110 g/L is the diagnostic criterion for anemia" and the historical rule "hemoglobin value < 110 g/L" are retrieved from the knowledge base, and a numeric-judgment pseudo-code can be generated directly from them, ensuring the rule conforms to medical specifications and is consistent with historical experience.
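The reuse-versus-context decision in step S303 can be sketched as follows (the archived entries and the crude word-based similarity key are invented stand-ins for a real retrieval index):

```python
# Hypothetical sketch: reuse archived rules from the RAG knowledge base.
# An identical requirement returns the archived rule directly; a similar one
# returns it as prompt context. All entries are invented examples.
ARCHIVED_RULES = {
    "infer diabetes from admission_diagnosis": "'diabetes' in admission_diagnosis",
}

def retrieve_rule(requirement):
    if requirement in ARCHIVED_RULES:                # identical requirement
        return {"mode": "reuse", "rule": ARCHIVED_RULES[requirement]}
    concept = requirement.split()[1]                 # crude similarity key
    similar = [k for k in ARCHIVED_RULES if concept in k]
    if similar:                                      # similar requirement
        return {"mode": "context", "context": [ARCHIVED_RULES[k] for k in similar]}
    return {"mode": "generate"}                      # fall back to the model

exact = retrieve_rule("infer diabetes from admission_diagnosis")
near = retrieve_rule("infer diabetes from discharge_diagnosis")
```

A production system would replace the word-based key with vector retrieval over the knowledge base, but the three outcomes (reuse, context, generate) stay the same.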
Optionally, in some embodiments, the element-variable tag matching process may further include a semantic embedding optimization step:
Step S401, pre-training semantic embedding is carried out on all variable labels in a data dictionary, and high-dimensional vector representation is generated;
step S402, inputting the target research element description into the same embedded model, and converting the target research element description into element semantic vectors;
Step S403, calculating the similarity between the element semantic vector and each variable label vector, screening variable labels with the similarity exceeding a preset threshold value, and adding the variable labels into a matching candidate set;
Step S404, executing subsequent element-variable mapping relation generation based on the matching candidate set.
For step S401, illustratively, this step performs semantic embedding processing on the variable labels in the data dictionary (e.g., "eGFR" and "serum creatinine") by means of a pre-trained model, converting each label into a vector representation in a high-dimensional space. These vectors not only contain the literal meaning of the label but also capture the medical semantic associations behind it (e.g., the clinical relevance of "eGFR" to "chronic kidney disease"). For example, after processing by a pre-trained model specialized for the medical field, the vector of "serum creatinine" lies closer in high-dimensional space to the category vector of "kidney function indicators", breaking through the limitations of character matching and laying the foundation for accurate matching at the semantic level.
For step S402, illustratively, a study design element (e.g., "chronic kidney disease") is input into the same embedding model as the variable labels, generating the corresponding element semantic vector. This process converts the study element from text form into a vector-space representation of the same dimensionality as the variable labels, ensuring the two are comparable within the same semantic space. For example, the semantic vector of "chronic kidney disease" may form a cluster in high-dimensional space with the vectors of related concepts such as "kidney function impairment" and "glomerular filtration rate", so that implicit medical associations between element descriptions and variable labels can be captured even when the two do not directly coincide in characters (e.g., "chronic kidney disease" and "eGFR").
For step S403, illustratively, variable labels whose similarity exceeds a preset threshold may be included in the matching candidate set. For example, when processing the "chronic kidney disease" element, the model calculates the similarity between its vector and the label vectors of variables such as "eGFR", "serum creatinine", and "urine protein"; since these variables are key indicators for diagnosing kidney disease, the similarity between their vectors and the element vector exceeds the threshold (e.g., 0.7), so they are selected into the candidate set. This semantics-based screening mechanism avoids the missed selections caused by character matching (such as missing a variable that has no character overlap but is semantically related), improving the comprehensiveness of matching.
For step S404, illustratively, after the semantically related variable-label candidate set is obtained, the system may further combine strategies such as direct matching and indirect reasoning to generate the final element-variable mapping relationship. For example, the "eGFR" and "serum creatinine" variables in the candidate set may be associated, through a medical knowledge graph, with the diagnostic criterion for "chronic kidney disease" (e.g., eGFR < 60 mL/min/1.73 m²), thereby forming the mapping "chronic kidney disease → laboratory test: eGFR". This step uses the candidate set optimized by semantic embedding to ensure that the mapping relationship is based on the structural information of the data dictionary and conforms to the semantic logic of the medical field, improving the accuracy and generalization capability of element-variable matching.
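Steps S401-S404 can be sketched end to end with toy vectors (the 3-dimensional numbers below are invented and merely stand in for a real pre-trained embedding model; the 0.7 threshold follows the example in step S403):

```python
# Hypothetical sketch of steps S401-S404: toy 3-dimensional vectors stand in
# for a real pretrained embedding model; all numbers are invented.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

LABEL_VECTORS = {                      # step S401: embedded variable labels
    "eGFR": [0.9, 0.1, 0.0],
    "serum_creatinine": [0.8, 0.2, 0.1],
    "body_temperature": [0.0, 0.1, 0.9],
}

def candidate_set(element_vector, threshold=0.7):
    # Step S403: keep labels whose similarity exceeds the preset threshold.
    return [
        name for name, vec in LABEL_VECTORS.items()
        if cosine(element_vector, vec) >= threshold
    ]

# Step S402: "chronic kidney disease" embedded into the same toy space.
candidates = candidate_set([0.85, 0.15, 0.05])
```

Step S404 would then hand this candidate set to the direct-matching and knowledge-graph reasoning stages to finalize the element-variable mapping.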
It should be noted that the present embodiment may also be a modification of the second and/or third embodiments.
It is not difficult to find that in the embodiment of the application, an element-variable mapping relationship is generated by semantically matching the research design elements against the data dictionary, building a bridge between the research design text and real-world data so that abstract research elements correspond accurately to specific variables in the data dictionary. The inference rules are then determined based on this mapping relationship, converting the research elements into executable data screening logic. The ambiguity of natural language description is thereby eliminated, the standardized structure of the data dictionary ensures the accuracy and operability of the inference rules, and an accurate mapping from the research scheme to clinical data is finally achieved, providing a reliable logical basis for research subject screening and data analysis.
Fifth embodiment
The fifth embodiment of the application relates to an intelligent entry and exit method based on clinical research. The fifth embodiment is an improvement on the first embodiment; specifically, in the present embodiment, a specific implementation of generating the discrimination code for each study design element according to the inference rules is provided.
Specifically, in some embodiments, the generating the discriminating code of each study design element according to the inference rule, that is, the step S103 may include the steps of:
Step S1031, extracting structural metadata of variables related to the inference rule from a data dictionary, wherein the metadata comprises a form name, a variable data type and a variable value range definition of the variables;
step S1032, generating code generation prompt words according to the inference rules, the structured metadata and the target language;
and step S1033, converting the code generation prompt word into a discrimination code through the large model.
For step S1031, illustratively, before generating the discrimination code, structured metadata of the variables related to the inference rule is extracted from the data dictionary; this structured metadata is the basis for code generation. In a specific operation, for each inference rule such as "fasting blood glucose > 7 mmol/L", the system may retrieve the detailed attributes of its corresponding variables from the data dictionary, including the name of the form to which the variable belongs (e.g., the "LABTEST" laboratory test table), the variable name (e.g., "FBG"), the data type (e.g., "numeric") and the value range definition (e.g., unit "mmol/L", normal reference range, etc.). For example, when the inference rule involves the "fasting blood glucose value" variable, the extracted metadata is "table name: LABTEST | variable name: FBG | data type: numeric | unit: mmol/L". This information accurately reflects the storage structure and semantic definition of the variable in the database, avoiding code generation errors caused by ambiguous variable attributes and ensuring that subsequently generated discrimination code such as SQL matches the database structure exactly.
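The metadata lookup of step S1031 can be sketched as follows. The in-memory dictionary and its field names are illustrative stand-ins for the actual data dictionary, which in a real system would live in a metadata table:

```python
# Hypothetical in-memory data dictionary; real systems would query a
# metadata store instead.
DATA_DICTIONARY = {
    "fasting blood glucose": {
        "form": "LABTEST", "variable": "FBG",
        "dtype": "numeric", "unit": "mmol/L",
    },
    "disease diagnosis": {
        "form": "DIAGNOSIS", "variable": "ICD10",
        "dtype": "text", "unit": None,
    },
}

def extract_metadata(rule_variables):
    # Collect the structured metadata of every variable a rule touches,
    # failing loudly when a variable is absent from the dictionary.
    missing = [v for v in rule_variables if v not in DATA_DICTIONARY]
    if missing:
        raise KeyError(f"variables absent from data dictionary: {missing}")
    return {v: DATA_DICTIONARY[v] for v in rule_variables}

meta = extract_metadata(["fasting blood glucose"])
print(meta["fasting blood glucose"])
```

Raising on unknown variables keeps ambiguous attributes from silently leaking into later code generation, matching the intent described above.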
For step S1032, illustratively, the code generation prompt words are generated based on the inference rules, the structured metadata and the target language. This step integrates the logic of the inference rule (e.g., "fasting blood glucose > 7 mmol/L"), the variable metadata (e.g., belonging form, data type) and the user-specified code language (e.g., SQL) into a preset template. For example, the prompt word may read: "Known database structure: [fasting blood glucose value | table name: LABTEST | variable name: FBG | data type: numeric | unit: mmol/L]. Using the variable [laboratory test: fasting blood glucose value], generate SQL code expressing the inference rule [fasting blood glucose value > 7 mmol/L]." In this way, the prompt words provide the large model with a complete code generation context, ensuring that the generated discrimination code conforms to the database structure and the grammar rules of the specified language.
For step S1033, illustratively, the code generation prompt words are input into the large model and converted by it into executable discrimination code. The large model generates code conforming to the target language grammar based on the variable metadata in the prompt (e.g., data type is numeric) and the inference rules (e.g., threshold judgment). For example, for the above prompt, the large model generates the SQL code "SELECT * FROM LABTEST WHERE FBG > 7.0", which can be executed directly against the database to screen out records with a fasting blood glucose value greater than 7 mmol/L. The generated discrimination code both satisfies the structural requirements of the data dictionary and accurately implements the logic of the inference rules, providing an automated tool for subsequently screening patients conforming to the research scheme from a medical database.
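Steps S1031-S1033 together can be sketched as a prompt-assembly helper. The template wording and the `llm.complete` call shown in the comment are hypothetical stand-ins for the actual template and large model interface:

```python
def build_code_prompt(rule, metadata, target_lang="SQL"):
    # Fold the inference rule, per-variable structured metadata and the
    # target language into one prompt string, mirroring the template above.
    meta_lines = "\n".join(
        f"- {name}: table {m['form']}, variable {m['variable']}, "
        f"type {m['dtype']}, unit {m['unit']}"
        for name, m in metadata.items()
    )
    return (
        f"Known database structure:\n{meta_lines}\n"
        f"Generate {target_lang} code expressing the inference rule: {rule}\n"
        f"Return only executable {target_lang}."
    )

prompt = build_code_prompt(
    rule="fasting blood glucose > 7 mmol/L",
    metadata={"fasting blood glucose": {
        "form": "LABTEST", "variable": "FBG",
        "dtype": "numeric", "unit": "mmol/L"}},
)
print(prompt)
# The prompt would then be sent to the large model, e.g. (hypothetical API):
#   code = llm.complete(prompt)   # expected shape: SELECT * FROM LABTEST WHERE FBG > 7.0
```

Keeping the metadata lines machine-generated from the data dictionary, rather than hand-written, is what ties the model's output to the real table and column names.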
Optionally, in some embodiments, after generating the inference rules, the method may further comprise a code validation feedback loop:
Step S501, generating a simulation data set according to the data dictionary structure, and filling in medically compliant sample data according to the variable types;
Step S502, executing the generated pseudo code rule on the simulation data set, and capturing execution errors or abnormal output in real time;
and step S503, when an execution error is detected, injecting the error log, the data snapshot and the execution environment into the code generation prompt words, re-triggering the inference rule generation step, and repeating the verification until the pseudo-code rules run stably in the simulation environment.
For step S501, an exemplary method includes generating a simulation data set from the data dictionary structure and filling in medically compliant sample data by variable type. In a specific operation, the system may generate simulation data conforming to medical logic based on metadata such as the form to which each variable belongs, its data type (e.g., numeric or text), and its value range definition (e.g., blood glucose value range, ICD coding rules) as recorded in the data dictionary. For example, numeric samples in the range 5.0-12.0 mmol/L are generated for the "fasting blood glucose" variable, and text records containing keywords such as "diabetes" and "hypertension" are filled in for the "disease diagnosis" variable.
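A minimal sketch of this simulation-data generation, assuming a small hand-written schema in place of the real data dictionary (the variable names and value ranges are illustrative):

```python
import random

def simulate_dataset(schema, n_rows=50, seed=42):
    # Fill each variable with plausible sample values based on its declared
    # type and value range, as read from the (here illustrative) schema.
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for var, spec in schema.items():
            if spec["dtype"] == "numeric":
                lo, hi = spec["range"]
                row[var] = round(rng.uniform(lo, hi), 1)
            else:  # text: draw from the declared vocabulary
                row[var] = rng.choice(spec["values"])
        rows.append(row)
    return rows

schema = {
    "FBG": {"dtype": "numeric", "range": (5.0, 12.0)},       # mmol/L
    "diagnosis": {"dtype": "text",
                  "values": ["diabetes", "hypertension", "healthy"]},
}
data = simulate_dataset(schema, n_rows=5)
print(data[0])
```

Seeding the generator makes each validation run reproducible, so a failing discrimination code can be re-tested against the exact same data snapshot.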
For step S502, illustratively, the generated pseudo-code rules are run on the simulation data set to capture execution errors or abnormal output in real time. By applying the pseudo-code transformed from the inference rules (e.g., an SQL query or a Python conditional statement) to the simulation data, the system can verify the logical correctness and compatibility of the code. For example, when the SQL code "SELECT * FROM PATIENTS WHERE FBG > 7.0" is run, if the unit of the "fasting blood glucose" variable in the simulation data is marked as "mg/dL" instead of "mmol/L", an execution error of mismatched units is triggered; and if the threshold judgment logic in the code is written backwards (such as "< 7.0"), a screening result that does not meet expectations is output.
For step S503, illustratively, when an execution error is detected, the error log, the data snapshot and the execution environment may be injected into the code generation prompt words to re-trigger the inference rule generation step. For example, when a "fasting blood glucose unit mismatch" error is found, the system may integrate information such as the error type (unit conversion exception), the actual unit in the simulation data (mg/dL) and the code execution environment (SQL syntax requirements) into the code generation prompt words, e.g., "The current code failed to execute due to a unit mismatch; the fasting blood glucose unit in the simulation data is mg/dL. Convert 7.0 mmol/L into 126 mg/dL according to the medical standard, then regenerate the SQL rule." By feeding the error information back to the large model, the system can automatically correct the inference rules and regenerate the code until the pseudo-code runs stably in the simulation environment, so that the finally generated discrimination code accurately fits the structure and logic of the real medical data.
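The execute-catch-regenerate loop of steps S502-S503 can be sketched with an in-memory SQLite database as the simulation environment. The `regenerate` callback below simply patches a known column-name mistake; in the described system it would stand in for injecting the error into the prompt words and re-querying the large model:

```python
import sqlite3

def validate_with_feedback(sql, rows, regenerate, max_rounds=3):
    # Run candidate SQL against an in-memory copy of the simulated data;
    # on failure, hand the error message to `regenerate` and retry.
    for _ in range(max_rounds):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE LABTEST (patient_id INTEGER, FBG REAL)")
        conn.executemany("INSERT INTO LABTEST VALUES (?, ?)", rows)
        try:
            result = conn.execute(sql).fetchall()
            return sql, result               # stable run: accept the code
        except sqlite3.Error as err:
            sql = regenerate(sql, str(err))  # feed the error back
        finally:
            conn.close()
    raise RuntimeError("code never stabilised in the simulation")

rows = [(1, 6.1), (2, 8.3), (3, 7.9)]
broken = "SELECT patient_id FROM LABTEST WHERE FPG > 7.0"   # wrong column

def regenerate(old_sql, error):
    # Stand-in for the prompt-injection + large-model round trip.
    return old_sql.replace("FPG", "FBG")

sql, hits = validate_with_feedback(broken, rows, regenerate)
print(sql, hits)
```

The first round fails with "no such column: FPG", the error is fed back, and the corrected code then screens out the two records above the 7.0 threshold.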
It should be noted that the present embodiment may also be a modification of any one or more of the second to fourth embodiments.
It is easy to find that in the embodiment of the application, by extracting the structured metadata of the relevant variables from the data dictionary, the form, data type and value range definition of each variable can be clarified, providing accurate basic information for code generation. The prompt words are then generated by combining the inference rules, the structured metadata and the target language, so that they contain the complete code generation elements. Therefore, when the prompt words are converted into discrimination code through the large model, the generated code is guaranteed to conform to the data dictionary structure, the inference logic and the target language grammar, achieving an accurate conversion from study design elements to executable code and enabling efficient screening of patient data conforming to the study scheme.
Sixth embodiment
The sixth embodiment of the application relates to an intelligent entry and exit method based on clinical research. The sixth embodiment is an improvement over the first embodiment in that in this embodiment, a specific implementation of determining the patient list conforming to the study plan based on the discrimination code is provided.
Specifically, in some embodiments, determining the patient list conforming to the study plan according to the discrimination code, that is, step S104, may include:
Step S1041, executing the discrimination code to obtain discrimination results of each research design element;
and step S1042, determining a patient list conforming to the research scheme according to the discrimination result.
For step S1041, illustratively, by executing the discrimination code, the patient data in the medical database may be automatically screened, obtaining the discrimination result for each study design element. The discrimination code is generated based on the research design elements and the data dictionary and contains precise logical judgment conditions (such as numeric thresholds and text matching rules). For example, when executing the SQL discrimination code "SELECT * FROM PATIENTS WHERE FBG > 7.0 AND age >= 18", the system will traverse the patient data table, judge the "fasting blood glucose" and "age" fields in each record, mark records meeting the conditions (blood glucose above the threshold and age requirement met) as "true" and records not meeting them as "false", thus generating the discrimination result set for each element and providing a quantitative basis for subsequent screening.
For step S1042, the patient list finally conforming to the study plan is determined by integrating the judgments of all the study design elements. This step logically integrates the discrimination results of the elements (e.g., by combining conditions in "and"/"or" relationships); for example, a patient can be included in the list only when all element conditions such as "fasting blood glucose value > 7.0", "age >= 18" and "no serious cardiovascular or cerebrovascular disease (corresponding to the relevant diagnostic variables in the data dictionary)" are satisfied. By screening the patient records whose discrimination result is "true", the system outputs a complete and accurate patient list conforming to the research scheme.
Optionally, in some embodiments, the executing the discriminating code obtains a discriminating result of each study design element, that is, step S1041 may further include:
Step S10411, executing the discrimination code in the target database to obtain an original discrimination result set;
Step S10412, generating a discrimination result matrix according to the original discrimination result set, wherein a row index of the discrimination result matrix corresponds to a patient identifier, a column index corresponds to a research design element, and a matrix unit value stores discrimination results of each patient-element combination;
and step S10413, taking the discrimination result matrix as the discrimination result of each research design element.
For step S10411, the system may connect to the target database through a secure channel and sequentially run the discrimination code (e.g., SQL query, Python script) corresponding to each study design element. For example, for the two elements "age ≥ 18 years" and "fasting blood glucose > 7.0 mmol/L", the corresponding code statements are executed, and the patient records are extracted from the database and evaluated against the conditions. For each patient, the code generates a Boolean value (TRUE/FALSE) or a fuzzy probability value (0-1) to indicate whether the element condition is met; if the data is missing, it is marked as NA. This finally forms the original patient-element discrimination result set.
For step S10412, the discrimination result matrix uses the patient identifier as its row index and the study design elements as its column index, and each matrix cell stores the discrimination result of the corresponding patient under that element. For example, the matrix row with patient_id = 1 sequentially records the judgment results (TRUE/FALSE/NA) under each element such as "age condition" and "blood glucose condition". When a certain element corresponds to multiple discrimination codes (such as the same element being inferred through different variables), the result may take the maximum value (OR logic for Boolean values, maximum for probability values), ensuring the comprehensiveness of the result.
For step S10413, illustratively, the discrimination result matrix is output as the discrimination result of each study design element. The discrimination result matrix may be as shown in Table 1, intuitively presenting the matching relationship between all patients and the study design elements, for example whether the patient with patient_id = 2 satisfies all the enrollment conditions (e.g., every column result is TRUE), or on which elements patient_id = i has missing data (e.g., NA is displayed). This structured output not only provides a clear judgment basis for determining the final patient list in step S1042, but also supports researchers in rapidly locating target groups meeting the conditions through the matrix, and facilitates subsequent statistical analysis of the judgment results (such as the compliance rate and data missing rate of each element), thereby improving the efficiency and accuracy of research data processing.
Table 1 discrimination result matrix
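Steps S10411-S10413 can be sketched as below, including the OR-merge of several discrimination codes that infer the same element. The code identifiers and element names are illustrative:

```python
NA = None  # marker for missing data, as described above

def build_matrix(code_results, element_of):
    # code_results: {code_id: {patient_id: True/False/NA}}.
    # element_of maps each code to its study design element; results of
    # codes sharing an element are merged with OR logic (a cell stays NA
    # only while every contributing code reports NA).
    matrix = {}
    for code_id, per_patient in code_results.items():
        element = element_of[code_id]
        for pid, res in per_patient.items():
            cell = matrix.setdefault(pid, {})
            prev = cell.get(element, NA)
            if res is NA:
                cell[element] = prev
            elif prev is NA:
                cell[element] = res
            else:
                cell[element] = prev or res   # Boolean OR merge
    return matrix

code_results = {
    "age_sql":   {1: True, 2: True},
    "fbg_sql":   {1: False, 2: True},
    "hba1c_sql": {1: True, 2: NA},   # second code for the same glucose element
}
element_of = {"age_sql": "age>=18", "fbg_sql": "glycemia", "hba1c_sql": "glycemia"}
m = build_matrix(code_results, element_of)
print(m)
```

Patient 1 fails the FBG code but passes the HbA1c code, so the merged "glycemia" cell is TRUE, matching the OR semantics described for multi-code elements.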
Optionally, in some embodiments, determining the patient list conforming to the study plan according to the discrimination result, that is, step S1042, may further include:
step S10421, determining a summary function according to a research scheme;
step S10422, inputting the discrimination result matrix into the summary function, and calculating the entry-and-exit conclusion for each patient;
And step S10423, determining the patient list conforming to the research scheme according to the entry-and-exit conclusion.
For step S10421, an exemplary summary function for synthesizing the discrimination results of each study design element may be determined according to the specific requirements of the study plan. Different types of elements in the study protocol (e.g., inclusion criteria, exclusion criteria, intervention conditions) may be summarized using different logic rules. For example, the inclusion criteria generally require that all conditions be satisfied simultaneously, which may be defined as the Boolean logic "RES1 & RES2 & … & RESj", while the exclusion criteria require that none of the exclusion conditions be satisfied, corresponding to "¬RES1 & ¬RES2 & … & ¬RESj". If the study involves probability-weighted judgment, a weight distribution rule may also be defined (such as a primary endpoint weight of 0.8 and a secondary endpoint weight of 0.5), so that the summary function conforms to the scientific logic and screening strictness of the study.
For step S10422, illustratively, the discrimination result matrix may be input into the summary function to calculate the entry-and-exit conclusion for each patient. Each row in the matrix corresponds to a patient identifier and each column to an element discrimination result (Boolean value or probability value), and the summary function operates on the row data according to preset logic. Taking Boolean logic as an example, the inclusion criteria integrate the element results by AND operations (e.g., RES_in = RES1 & RES2), the exclusion criteria integrate the element results as RES_out = ¬RES1 & ¬RES2, and the final conclusion is RES = RES_in & RES_out. If probability weighting is adopted, assuming the inclusion-criteria weight is 0.6 and the exclusion-criteria weight is 0.4, then RES = RES_in × 0.6 + RES_out × 0.4, and when RES is greater than or equal to a threshold L, the patient is judged to meet the conditions.
For step S10423, illustratively, a list of patient IDs meeting the enrollment conditions is output according to the entry-and-exit conclusion. The system may traverse the discrimination result matrix, screen the patient records whose conclusion is TRUE or whose RES ≥ L, and extract the corresponding patient identifiers. For example, when the RES value of patient i is 0.92 and the threshold L is set to 0.8, that patient's ID will be included in the list. The finally output patient ID list can be connected directly with an electronic medical record system or a research database, providing a precise target population for subsequent recruitment and data acquisition, while also supporting researchers in tracing the complete medical records of patients through the IDs to ensure that enrolled patients strictly meet all requirements of the research scheme.
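A minimal sketch of such a summary function, covering both the Boolean mode (RES = RES_in & RES_out) and the weighted mode (RES = RES_in × w_in + RES_out × w_out ≥ L) described above; element names and weights are illustrative:

```python
def summarize(row, inclusion, exclusion, weights=None, threshold=None):
    # Boolean mode: all inclusion elements True AND no exclusion element True.
    # Weighted mode (weights/threshold given): compare the weighted score
    # against the threshold L.
    res_in = all(row[e] for e in inclusion)
    res_out = not any(row[e] for e in exclusion)
    if weights is None:
        return res_in and res_out
    score = res_in * weights["in"] + res_out * weights["out"]
    return score >= threshold

row = {"age>=18": True, "FBG>7.0": True, "severe_CVD": False}
ok = summarize(row, inclusion=["age>=18", "FBG>7.0"], exclusion=["severe_CVD"])
print(ok)   # True: all inclusion criteria met, no exclusion criterion hit
```

Applying the function to every row of the discrimination result matrix and keeping the IDs where it returns TRUE (or where the score clears L) yields exactly the patient list of step S10423.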
It should be noted that the present embodiment may also be a modification of any one or more of the second to fifth embodiments.
It is not difficult to find that in the embodiment of the application, because executing the discrimination code automatically screens the medical data, the discrimination result of each research design element can be obtained quickly and accurately, clarifying each patient's compliance under each individual element. The patient list conforming to the research scheme is then determined through comprehensive analysis of these discrimination results by the summary function. Full-process automation from research design elements to the screening of specific patients can thus be realized, avoiding the inefficiency and subjective errors of manual screening; and since the screening is based on the standardized logic of the data dictionary and the inference rules, the screened patients are guaranteed to accurately meet the research requirements, providing reliable and efficient target population data support for clinical research.
Seventh embodiment
The seventh embodiment of the application relates to an intelligent entry and exit method based on clinical research. The seventh embodiment is an improvement over the first embodiment in that in this embodiment, an expert audit feedback loop is further included after the patient list is generated.
Specifically, in some embodiments, after the determination of the list of patients according to the study protocol based on the discriminant code, the method may further comprise:
step S601, according to the patient list, time-invariant information and time-series medical data are extracted from a database and are integrated according to a time axis to generate a structured personal narrative report;
step S602, receiving entry-and-exit annotation data generated by a medical expert based on the narrative report;
And step S603, updating the element-variable mapping relationship, the inference rules, the discrimination code and the large model according to the entry-and-exit annotation data.
For step S601, illustratively, time-invariant information (e.g., age, gender, underlying disease history) and time-series medical data (e.g., test results, medication records, medical procedures) may be extracted from the database according to the patient list and integrated along a time axis to generate a structured personal narrative report. In a specific operation, the system may traverse all forms in the electronic medical record system with the patient ID as an index and extract all data related to the study, such as the patient's blood glucose values and medication adjustments at different time points. These discrete data points are concatenated by timestamp into a continuous chain of medical events, forming a complete representation of each patient's medical course. For example, the report may present information such as "diagnosed with diabetes in January 2023 and started on metformin; switched to insulin therapy in March due to poor glycemic control; abnormal renal function index in June", providing a comprehensive, coherent clinical view for expert review.
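The report assembly of step S601 can be sketched as sorting the time-series events onto one axis and prepending the time-invariant facts. The field names and event strings are illustrative sample data:

```python
def narrative_report(static_info, events):
    # Prepend time-invariant facts, then append the time-series records in
    # chronological order, yielding the structured personal narrative.
    lines = [f"{k}: {v}" for k, v in static_info.items()]
    for date, text in sorted(events):      # chronological chain of events
        lines.append(f"{date}: {text}")
    return "\n".join(lines)

report = narrative_report(
    {"patient_id": 17, "sex": "F", "age": 54},
    [("2023-03-10", "switched to insulin therapy"),
     ("2023-01-05", "diagnosed with diabetes, metformin started"),
     ("2023-06-20", "abnormal renal function index")],
)
print(report)
```

Sorting ISO-formatted date strings gives chronological order directly, so events extracted from different forms interleave correctly on the shared time axis.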
For step S602, the medical expert determines whether each patient truly complies with the study protocol, illustratively by reviewing the structured report in combination with clinical experience. If a mis-inclusion or a screening miss is found, the specific error type can be located, such as a missing inference rule (e.g., excluding "chronic kidney disease" only by diagnosis names containing "renal insufficiency" while omitting the "eGFR < 55" criterion), a missed content match (e.g., failing to recognize that a drug's trade name corresponds to its generic name), or an omitted time requirement (e.g., not restricting "evidence of folic acid deficiency" to the half year before enrollment). The expert selects the erroneous element and adds an annotation through the interface (such as "eGFR < 55 needs to be added as an exclusion condition"), forming an entry-and-exit annotation data set containing patient IDs, erroneous elements and correction suggestions, which provides a clear direction for subsequent model optimization.
For step S603, illustratively, the element-variable mapping relationship, the inference rules, the discrimination code and the large model are updated according to the entry-and-exit annotation data. For example, when an expert feeds back that "eGFR < 55 shall be used as a chronic kidney disease exclusion indicator", the system will optimize each module in turn: the advice is fed back to the element-variable mapping module to supplement the mapping between "chronic kidney disease" and "eGFR"; the exclusion condition "eGFR < 55" is added in the inference rule generation module; SQL discrimination code containing the eGFR threshold judgment is automatically generated at the same time; and finally, the expert-confirmed enrollment list is used as ground truth for supervised fine-tuning or reinforcement learning of the large model, whose parameters are optimized through repeated training so that it masters the new screening logic. Through this closed-loop feedback mechanism, the precision of element-variable matching, inference rule construction and code generation is continuously improved, the automated patient screening gradually converges to a high fit with the expert annotation results, and accurate, intelligent execution of the study scheme is achieved.
It should be noted that the present embodiment may also be a modification of any one or more of the second to sixth embodiments.
It is not difficult to find that in the embodiment of the application, because a structured narrative report can be generated by extracting and integrating the medical data according to the patient list, comprehensive and chronological patient clinical information can be provided to medical experts. The experts are further supported in verifying the correctness of enrollment based on the report and annotating errors, clarifying the types of problems and the direction of correction; the annotation data are then synchronized to each module for iterative optimization, realizing a closed feedback loop from data generation and expert audit to model improvement.
Eighth embodiment
The eighth embodiment of the application relates to an intelligent entry and exit method based on clinical research. The eighth embodiment is an improvement over the first embodiment in that in the present embodiment, a history archiving and knowledge reuse mechanism is further included.
Specifically, in some embodiments, the method further comprises:
Step S701, storing the verified research design elements, the inference rules, the discrimination codes and the associated metadata as a structured history archive;
Step S702, in response to the input of a new research scheme, performing the following operations: encoding the current research elements and inference rules into semantic vectors; retrieving the Top-K similar archive entries in a historical knowledge base; directly reusing the inference rules and discrimination code if a complete matching entry exists; and performing an automatic adaptation operation to generate adapted inference rules and discrimination code if a partial matching entry exists.
For step S701, the validated study design elements, inference rules, discrimination codes and associated metadata may be stored as a structured history archive, forming a reusable knowledge base. The archive content covers the research design elements under the PICO framework (such as the inclusion and exclusion criteria), the inference rules in pseudo-code form, the expert-verified discrimination codes (such as SQL query statements), and metadata such as the research type, data dictionary and audit records, stored in a structured format such as JSON. For example, for a type II diabetes study, the archive would record the inclusion criterion "fasting blood glucose ≥ 7 mmol/L", the inference rule "FBG > 7.0 OR HbA1c > 6.5%", and the corresponding SQL code "SELECT patient_id FROM LABTEST WHERE FBG > 7.0;". This archiving mechanism provides historical experience for reference in new research and avoids repeated labor.
For step S702, when a new study plan is entered, the system may first encode the current study elements and inference rules into semantic vectors, and then retrieve the Top-K similar archive entries from the historical knowledge base. If there is a complete matching entry (for example, the same disease, the same research type and a consistent data dictionary), the verified inference rules and discrimination code can be reused directly; if there is a partial matching entry (for example, a variable naming difference or a threshold adjustment), the following automatic adaptation operation can be executed: extract the variable mapping relationships and threshold parameters from the historical rule, and inject difference information such as the variable naming changes of the new data dictionary and the threshold adjustments of the new scheme into the large model prompt words. For example, if the new research requires "fasting blood glucose ≥ 7.2 mmol/L" as an inclusion criterion while the archived historical rule is "fasting blood glucose ≥ 7 mmol/L", then after the partial matching entry is retrieved, the historical SQL code "FBG > 7.0" can be extracted and the threshold difference "FBG ≥ 7.2" injected into the prompt words, and the large model automatically generates the adapted code "FBG > 7.2" based on the historical context. This improves the efficiency and accuracy of rule generation and realizes efficient reuse of historical knowledge.
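The retrieve-then-reuse-or-adapt logic of step S702 can be sketched as below. The archive entries are illustrative, and the string-patching of the threshold is a stand-in for the prompt-injection / large-model adaptation step described above:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity used for semantic retrieval over the archive.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def reuse_or_adapt(query_vec, new_threshold, archive):
    # Pick the most similar archived rule; reuse its code verbatim when the
    # threshold also matches (complete match), otherwise patch the threshold
    # in (stand-in for re-prompting the large model on a partial match).
    best = max(archive, key=lambda e: cosine(query_vec, e["vec"]))
    if best["threshold"] == new_threshold:
        return best["sql"]                                  # full reuse
    return best["sql"].replace(str(best["threshold"]), str(new_threshold))

archive = [
    {"vec": [0.9, 0.1], "threshold": 7.0,
     "sql": "SELECT patient_id FROM LABTEST WHERE FBG > 7.0"},
    {"vec": [0.1, 0.9], "threshold": 140,
     "sql": "SELECT patient_id FROM VITALS WHERE SBP > 140"},
]
print(reuse_or_adapt([0.95, 0.05], 7.2, archive))
```

With Top-1 retrieval the blood-glucose entry wins, and since the new threshold 7.2 differs from the archived 7.0, the code is adapted rather than reused verbatim.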
It should be noted that the present embodiment may also be a modification of any one or more of the second to seventh embodiments.
It is not difficult to find that in the embodiment of the application, by storing the verified research design elements, inference rules, discrimination codes and associated metadata as a structured history archive, a traceable knowledge asset library is constructed, so that when a new research scheme is input, the historical knowledge base can be searched via semantic vectors to rapidly locate similar entries, and the verified discrimination codes or extracted historical rules can be reused as context. This not only avoids wasting resources on repeatedly developing discrimination codes, but also improves the accuracy and reliability of new rule generation by drawing on historical experience. Especially in partial-matching scenarios, the large model automatically adapts to variable differences and threshold updates, standardizing and intelligentizing the execution of research schemes, markedly shortening the research preparation period and reducing human error.
Ninth embodiment
The ninth embodiment of the application relates to an intelligent entry and exit method based on clinical research. The ninth embodiment is an improvement over the first embodiment in that in the present embodiment, a time logic strengthening step is further included for study plans with time constraints.
Specifically, in some embodiments, the time logic strengthening step may include:
Step S801, analyzing a study scheme to identify index events and determining the starting time of individual study of patients;
Step S802, expanding a hook time variable field in a data dictionary, and dynamically associating a medical variable with a timestamp;
Step S803, converting the time related description in the research scheme into a time window relative to the initial time, and generating an inference rule in the form of pseudo codes;
Step S804, outputting database query codes including time links to ensure that the screening conditions correspond to the starting time and the hooking time variables.
For step S801, illustratively, by parsing the study protocol to identify index events, the start time T0 of the patient individual study may be determined. In clinical studies, index events are typically critical medical events with a definite time stamp, such as first diagnosis date, first medication date, or operation date, etc. For example, for a time condition of "no insulin treatment within 6 months before the group entry", the index event of "group entry" is first identified as T0, and the time condition is converted into a relative time range based on T0.
For step S802, as shown in Table 2, an additional field (i.e., the hook time variable) may be added to the original data dictionary fields (e.g., table name, variable name) to specify the timestamp corresponding to each medical variable (e.g., the "fasting blood glucose" variable is associated with the "test_date" time field). If a variable has no explicit time field, it defaults to the standard timestamp associated with its table (e.g., diagnostic records associated with "visit_date"). Through this expansion, formerly independent medical variables acquire a time-dimension attribute; for example, the "ICD10 coding" variable can determine its diagnosis time through "visit_date", providing a data basis for the screening of subsequent time conditions.
Table 2: Example of the hook time variable
For step S803, illustratively, when processing a time-related description in the study protocol, the system may translate it into a relative time window referenced to the start time T0 and generate an inference rule in pseudo-code form. For example, "a record of fasting blood glucose ≥ 7 within 6 months prior to group entry" may be converted into the logical expression "T0-180 ≤ test_date < T0 AND fasting blood glucose ≥ 7", where T0 is the start time (e.g. the group entry date) determined in step S801. Complex time conditions (e.g. "the time of test A is earlier than the time of test B") may be handled by cross-table time variable comparisons, such as joining the timestamp fields of different tables. The generated pseudo-code rules combine the time constraint with the medical index condition, e.g. expressed in SQL pseudo-code as "WHERE test_date BETWEEN T0-180 AND T0 AND FBG >= 7.0". This ensures that the screening logic satisfies both the index threshold and the time window requirements, realizing automatic processing of time-sensitive study conditions.
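As a sketch, the conversion in step S803 can be rendered as a small helper that turns a relative time window into a SQL fragment. The field names, the 180-day offset, the `:T0` bind parameter, and the SQLite-style `DATE(..., '-N day')` modifier are illustrative assumptions, not part of the original scheme:

```python
def time_condition_to_sql(time_field: str, offset_days: int,
                          index_condition: str) -> str:
    """Render a relative time window (offset_days before T0) combined
    with a medical index condition as a SQL WHERE fragment.

    T0 is left as a named bind parameter to be supplied at execution
    time, matching the pseudo-code form described above.
    """
    return (f"WHERE {time_field} BETWEEN DATE(:T0, '-{offset_days} day') "
            f"AND :T0 AND {index_condition}")

# "Fasting blood glucose >= 7 within 6 months prior to group entry"
clause = time_condition_to_sql("test_date", 180, "FBG >= 7.0")
```

The generated fragment keeps both the time window and the index threshold in a single predicate, as the pseudo-code rule requires.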
For step S804, illustratively, the pseudo-code rules are converted into an executable database language (e.g. SQL), and a precise time-constrained query is generated by integrating the T0 time base with the extended hook time variables in the data dictionary. For example, for the condition "no insulin treatment within 6 months before group entry", the generated SQL code joins the "media_date" time field of the medication table with T0 and retains only records satisfying "media_date < T0-180", thereby excluding patients who received insulin within the time window. Code containing time joins can precisely match the time sensitivity requirements of clinical studies and prevents patients from being misclassified due to omitted time logic.
It should be noted that the present embodiment may also be a modification of any one or more of the second to eighth embodiments.
It is easy to see that, in this embodiment of the application, identifying the index event and determining the start time T0 by parsing the study protocol establishes a unified reference origin for the time logic, and extending the data dictionary with hook time variable fields gives the medical variables a time dimension attribute. Time descriptions can then be converted into relative time windows to generate pseudo-code rules, and query code containing time joins is output. This approach realizes a precise mapping from the time conditions of the study protocol to the database screening logic: it ensures that the screening conditions strictly comply with the time sensitivity requirements of clinical research, prevents patients from being wrongly included or excluded due to omitted time logic, and ultimately provides precise time-dimension screening capability for clinical studies requiring strict time control (such as simulated RCTs and longitudinal cohort studies).
Tenth embodiment
The tenth embodiment of the application relates to an intelligent entry and exit method based on clinical research. The tenth embodiment is an improvement over the first embodiment; specifically, in this embodiment the method adopts a distributed server architecture to implement privacy protection.
Specifically, in some embodiments, privacy protection is implemented using a distributed server architecture, comprising the steps of:
Step S901, deploying a first server on the public network, wherein the first server is used for executing large model reasoning and knowledge base management and outputting encrypted discrimination codes;
Step S902, deploying a second server on the intranet, wherein the second server is connected to the medical database and executes the discrimination codes to generate the group entry list;
Step S903, the first server unidirectionally transmits the discrimination codes to the second server through a compliant encryption channel;
Step S904, the second server responds only to authorization requests from the first server and does not transmit original medical data outwards.
For step S901, the first server may run an NLP model to analyze the study protocol, extract study design elements under the PICO framework, and combine them with a local knowledge base (such as the data dictionary and medical guidelines) to generate discrimination code (e.g. SQL). To ensure data privacy, the generated code is encrypted to prevent leakage of sensitive information during transmission. Public network deployment has the advantages of supporting elastic cloud scaling, dynamically allocating resources according to large model training requirements or the volume of concurrent requests, and facilitating remote maintenance of model versions and knowledge base updates.
For step S902, the second server interfaces with the hospital electronic medical record system, the medical insurance database, and the like through secure interfaces, and supports database software such as Oracle and MySQL. Upon receiving the encrypted code from the first server, the second server performs the query operation in the local medical database, such as screening patient records that meet the study criteria, and generates the group entry list. The intranet deployment adopts a strict security design, such as implementing access control (e.g. dynamic tokens) through a bastion host to ensure that only authorized requests are admitted, while automatically desensitizing any patient data that needs to be sent out (such as personal narrative reports) by hiding sensitive fields such as name and ID card number.
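The desensitization step can be sketched as follows; the sensitive field list and the masking rule are illustrative assumptions, since the actual policy would come from the deployment's compliance requirements:

```python
# Fields treated as sensitive in this sketch; the real field list would
# be defined by the deployment's compliance policy (assumption).
SENSITIVE_FIELDS = {"name", "id_card_number"}

def desensitize(record: dict) -> dict:
    """Mask sensitive fields before a report leaves the intranet."""
    cleaned = {}
    for key, value in record.items():
        cleaned[key] = "***" if key in SENSITIVE_FIELDS else value
    return cleaned

# Hypothetical narrative-report record with placeholder values.
report = {"name": "Zhang San", "id_card_number": "310101...", "diagnosis": "C50"}
safe = desensitize(report)
```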
For step S903, illustratively, unidirectional transmission of the discrimination code from the first server to the second server may be implemented through a compliant encryption channel. To meet the privacy requirements of medical data, the transmission channel adopts mutually authenticated encryption (such as TLS 1.3) to ensure that the code is not intercepted or tampered with while traversing the public network. The unidirectional transmission mechanism limits data flow to the direction from the first server to the second server, preventing sensitive information in the medical database from flowing back to the public network environment: the second server cannot return original data to the first server, and only feeds back the code execution result (such as the list of enrolled patient IDs), thereby cutting off the leakage path of private data.
For step S904, it may be specified that the second server responds only to authorization requests from the first server, reinforcing network access control. The firewall rules of the second server may strictly allow only encrypted requests from the first server's IP address, while access from other sources (such as external attacks or unauthorized internal requests) is denied. The authorization mechanism, combined with dynamic token or API key verification, ensures that every code execution request is authenticated; for example, the first server may trigger code execution on the second server by attaching a time-limited key, thereby preventing unauthorized parties from forging requests to access the medical database and ensuring the security and compliance of data access at the architecture level.
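The time-limited key check in step S904 might look like the following sketch, using an HMAC signature over the code payload and a timestamp; the shared key, the TTL, and the message format are assumptions for illustration:

```python
import hashlib
import hmac
import time

SHARED_KEY = b"demo-shared-secret"   # provisioned out of band (assumption)
TOKEN_TTL_SECONDS = 300              # token validity window (assumption)

def sign_request(code: str, timestamp: int) -> str:
    """First server: sign the code payload together with a timestamp."""
    msg = f"{timestamp}:{code}".encode()
    return hmac.new(SHARED_KEY, msg, hashlib.sha256).hexdigest()

def verify_request(code: str, timestamp: int, signature: str, now=None) -> bool:
    """Second server: accept only fresh, correctly signed requests."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > TOKEN_TTL_SECONDS:
        return False  # stale token: reject replayed requests
    expected = sign_request(code, timestamp)
    return hmac.compare_digest(expected, signature)

ts = int(time.time())
sig = sign_request("SELECT 1", ts)
ok = verify_request("SELECT 1", ts, sig)        # fresh, valid signature
bad = verify_request("SELECT 1", ts - 10_000, sig)  # outside the TTL window
```

`hmac.compare_digest` is used rather than `==` to avoid timing side channels when comparing signatures.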
It should be noted that the present embodiment may also be a modification of any one or more of the second to ninth embodiments.
It is not difficult to see that, in this embodiment of the application, the first server is deployed on the public network to perform large model reasoning and knowledge base management and to output encrypted discrimination codes, while the second server is deployed on the intranet to connect to the medical database, execute the codes, and generate the group entry list. Unidirectional code transmission from the first server to the second server is realized through a compliant encryption channel, and the second server is restricted to responding only to authorization requests from the first server. This distributed architecture exploits the elastic computing capacity of the public network server for large model reasoning and code generation, while the intranet server guarantees physical isolation of the medical data; encrypted transmission and the authorization mechanism prevent privacy leakage, achieving the dual goals of data processing and security protection while meeting the automation requirements of clinical research screening.
Eleventh embodiment
An eleventh embodiment of the present application relates to a specific application example of the intelligent entry and exit method based on clinical research.
The following is a walkthrough of the workflow for a breast cancer simulated RCT study protocol, presented in a structured manner according to the step logic and key content:
1. Study design element extraction
1.1 Study type identification
Input: the study protocol text (including the expression "conduct a simulated RCT study").
Operation: identifying the study type as "simulated RCT" through the large model, and loading the CONSORT guideline as prompt word context.
Output: a secondary prompt word carrying the study type label, explicitly specifying the guideline for study design compliance.
1.2 Study design element extraction
Input: prompt words containing the study type.
Operation: extracting elements such as inclusion criteria, exclusion criteria, intervention measures, and clinical outcomes item by item through the large model according to the simulated RCT requirements, for example:
Inclusion criteria: female, 18-70 years old, post-operative triple negative breast cancer, ECOG score 0-1, etc.
Exclusion criteria: received neoadjuvant therapy, bilateral breast cancer, metastatic lesions, etc.
Intervention measures: first-line treatment drugs of the intervention group (PCb protocol) and the control group (CEF-T protocol).
Output: a study design element list in standardized JSON format.
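A hypothetical instance of the standardized JSON output for the breast cancer example might look as follows; the exact schema (key names, nesting) is an assumption, since the protocol does not fix it:

```python
import json

# Hypothetical structured output for the breast cancer example above;
# the key names and nesting are assumed, not taken from the protocol.
study_design_elements = {
    "study_type": "simulated RCT",
    "inclusion_criteria": [
        "female, 18-70 years old",
        "post-operative triple negative breast cancer",
        "ECOG score 0-1",
    ],
    "exclusion_criteria": [
        "received neoadjuvant therapy",
        "bilateral breast cancer",
        "metastatic lesions",
    ],
    "intervention": {"arm": "PCb protocol", "control": "CEF-T protocol"},
}

serialized = json.dumps(study_design_elements, indent=2)
```

A machine-readable list of this form is what the downstream element-variable matching step consumes.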
2. Element-variable matching and inference rule generation
2.1 Element-variable tag matching
Input: the study design element list and the data dictionary.
Operation: the large model realizes direct matching (e.g. "female" → "sex") and indirect reasoning (e.g. "poor blood sugar control" → "fasting blood glucose") through semantic analysis.
Output: an element-variable name matching list, associating each medical element with a specific variable in the data dictionary.
2.2 Inference rule generation
Input: the matching list, the data dictionary, and the coding dictionary.
Operation: a prompt word is generated for each element-variable pair (e.g. "how to infer breast cancer from [diagnosis name]?"), and the large model generates pseudo-code rules in conjunction with the data dictionary, such as:
Text variable: "'breast cancer' in diagnosis name | 'breast carcinoma' in diagnosis name" (two synonymous diagnosis keywords).
Classification variable: "ICD10 code IN ('C50')".
Output: the inference rules in pseudo-code form and the corresponding variables.
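The pseudo-code rules above can be illustrated as executable predicates; the keyword list and the ICD10 prefix test are assumptions mirroring the examples in the text:

```python
# Hypothetical pseudo-code rules rendered as Python predicates; the
# keywords and codes mirror the element-variable examples above.
def rule_text_diagnosis(diagnosis_name: str) -> bool:
    """Text variable rule: match breast cancer by diagnosis keywords."""
    keywords = ("breast cancer", "breast carcinoma")
    return any(k in diagnosis_name.lower() for k in keywords)

def rule_icd10(code: str) -> bool:
    """Classification variable rule: match by ICD10 code prefix."""
    return code.startswith("C50")

hit_text = rule_text_diagnosis("Invasive breast cancer, left")
hit_code = rule_icd10("C50.9")
miss = rule_icd10("C34.1")
```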
3. Discrimination code generation
3.1 Code generation procedure
Input: the inference rules, the SQL language, and the data dictionary.
Operation:
Database structure: the table name, field name, type, etc. of each variable are extracted from the data dictionary (e.g. "cTNM stage" corresponds to the table name "rxa_zd_zkzd").
Prompt word construction: the rules, structure, and language are integrated, for example "generate SQL code to screen patients with cTNM stage I-IIC".
Code generation: the large model generates the code, e.g. judging the stage value through a CASE WHEN statement and screening the patient IDs for group entry.
Output: the SQL discrimination code corresponding to each element, such as the code for screening early breast cancer:
SELECT patient_id, MAX(CASE WHEN "ZD-03-004" IN ('I', 'IA', ..., 'IIC') THEN 1 ELSE 0 END) AS res4 FROM rxa_zd_zkzd GROUP BY patient_id;
4. Discrimination code execution and result collection
Input: the discrimination code set and the database port.
Operation: the code is executed in the database to generate Boolean discrimination results (TRUE/FALSE/NA) per patient and per element.
Output: a discrimination result matrix, with patients as rows and element results as columns (e.g. patient_1 is FALSE under the early breast cancer element).
5. Group entry patient list determination
Input: the discrimination result matrix.
Operation:
Define summary functions: integrate the elements with Boolean logic, for example:
Inclusion criteria: resPin = res1 & res2 & ... & res11 (all satisfied).
Exclusion criteria: resPout = !(res12 | ... | res20) (none satisfied).
Calculate the inclusion conclusion: res = resPin & resPout & resIC & resO, and screen the patients with res = TRUE.
Output: a list of eligible patient IDs (e.g. [2, 3, 5, ..., 1000]).
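The Boolean aggregation in step 5 can be sketched as follows, assuming 11 inclusion elements (res1-res11), 9 exclusion elements (res12-res20), and intervention/outcome checks resIC and resO as in the example above; the meanings of resIC and resO are assumed, since the text does not define them:

```python
# Hypothetical aggregation over the discrimination result matrix; the
# element counts follow the example above, and resIC / resO are assumed
# to be the intervention and outcome checks.
def enroll(patient: dict) -> bool:
    res_pin = all(patient[f"res{i}"] for i in range(1, 12))        # inclusion: all satisfied
    res_pout = not any(patient[f"res{i}"] for i in range(12, 21))  # exclusion: none satisfied
    return res_pin and res_pout and patient["resIC"] and patient["resO"]

matrix = [
    {"patient_id": 1, **{f"res{i}": True for i in range(1, 21)},
     "resIC": True, "resO": True},   # fails: an exclusion element is TRUE
    {"patient_id": 2, **{f"res{i}": i < 12 for i in range(1, 21)},
     "resIC": True, "resO": True},   # passes: inclusion TRUE, exclusion FALSE
]
eligible = [p["patient_id"] for p in matrix if enroll(p)]
```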
6. Expert audit and feedback optimization (optional)
6.1 Patient narrative data generation
Input: the group entry list and the database port.
Operation: each patient's demographic, diagnostic, pathology, etc. data are integrated along a timeline to generate a structured report (e.g. the diagnosis time, examination results, etc. of patient 1).
Output: personal narrative data containing timestamps.
6.2 Expert auditing and error labeling
Operation: the expert reviews the narrative data and labels the error types (e.g. insufficient inference rules, keyword matching omissions), for example:
Patient 1 was not identified because the diagnosis was recorded as "breast carcinoma", which the keyword "breast cancer" did not match; the keyword "breast carcinoma" is therefore added.
Output: a labeled dataset (patient ID, erroneous element, correction proposal).
6.3 Model iteration
Operation:
Rule modification: the element-variable matches are updated (e.g. the "breast carcinoma" keyword is added to the breast cancer diagnosis rule).
Code updating: the discrimination code containing the corrected rules is regenerated.
Large model fine-tuning: the expert labels are used as ground truth to optimize the reasoning capability of the model.
7. History archiving and knowledge reuse (optional)
7.1 Historical archive construction
Content: the study elements, inference rules, codes, audit records, etc. are stored; an example JSON archive contains the item ID, study type, and revision notes (e.g. "diagnosis: added the 'breast carcinoma' keyword").
7.2 RAG retrieval and reuse
Operation: a new study (such as a cervical cancer simulated RCT) retrieves the historical archive through semantic vectors; fully matching elements (such as "18-70 years old") reuse the code directly, while similar elements (such as "cervical cancer diagnosis") are matched after adjusting the keywords.
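Semantic retrieval over the historical archive might be sketched as below; the bag-of-letters embedding is a toy stand-in for a real sentence-embedding model, and the archived rules are hypothetical:

```python
import math

# Toy semantic retrieval over archived rules; a production system would
# use a sentence-embedding model instead of this bag-of-letters vector.
def embed(text: str) -> list:
    """Hypothetical 26-dimensional letter-count embedding."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical archive mapping element descriptions to reusable code.
archive = {
    "aged 18-70 years": "SELECT patient_id FROM demo WHERE age BETWEEN 18 AND 70",
    "breast cancer diagnosis": "SELECT patient_id FROM dx WHERE icd10 LIKE 'C50%'",
}
query = "cervical cancer diagnosis"
best = max(archive, key=lambda k: cosine(embed(k), embed(query)))
```

The most similar archived element is retrieved, and its code is reused after adjusting keywords (e.g. swapping the diagnosis keyword and ICD10 prefix).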
8. Time logic enhancement (optional)
8.1 Initial event definition
Input: the study protocol (including "post-operative pathology detection").
Operation: the initial event is identified as "breast cancer surgery", defining the start time T0.
Output: the extended study elements (containing T0).
8.2 Data dictionary extension
Operation: a hook time field is added to each variable (e.g. "pathology report date" as the timestamp of "histological typing").
8.3 Time condition embedding
Operation: the time descriptions are converted into relative windows (e.g. "blood routine within 14 days after surgery" → "T0 ≤ test date ≤ T0+14"), and SQL code containing the time constraints is generated (e.g. screening patients by joining with the T0 table).
In conclusion, the application realizes a fully automatic closed loop from study protocol text to patient screening, driven by the large model and combined with the data dictionary, historical archives, and expert feedback, ensuring the accuracy and reusability of the screening logic; it is especially suitable for clinical study scenarios requiring strict time control and privacy protection.
It can be seen that the application has at least the following technical effects:
1. Fully automatic patient screening: with the help of the large language model, the study protocol text is parsed intelligently, structured study design elements are extracted automatically, and discrimination code that can run directly in the database is generated, without requiring a programming background. Since medical researchers do not need programming skills, they only need to submit a study protocol described in natural language to obtain the list of patients meeting the conditions, effectively removing the barrier to cross-domain collaboration.
2. Unified processing of multi-source heterogeneous data: semantic-level matching is realized through the data dictionary and RAG technology, and the variable names, table structures, and coding rules of different databases can be mapped automatically. Heterogeneous database integration for cross-institution and multi-center studies is supported, greatly reducing the labor cost of manually adapting to databases and significantly improving the utilization efficiency of multi-source data.
3. Dynamic medical knowledge fusion and continuous optimization: the local knowledge base can integrate medical guidelines, literature, and historical archives, the latest knowledge is retrieved in real time with RAG technology, and the expert audit feedback mechanism can drive iterative model updates. This ensures that the entry and exit rules always stay synchronized with the medical frontier and the specific requirements of researchers, avoiding rule degradation or misinterpretation of the study objectives.
4. Improved efficiency and accuracy: technologies such as broad-spectrum medical knowledge retrieval, automatic code generation, the feedback verification mechanism, and the precise handling of time-sensitive conditions are combined. Patient screening over large multi-center databases shifts from the traditional mode of manual coordination and repeated communication to an intelligent human-machine interaction mode, achieving both efficiency and accuracy.
5. Strict data privacy and security protection: a distributed deployment architecture separates the networked large model from the intranet database, and data security is guaranteed through encrypted transmission. The large model never touches real patient data, sensitive information is completely isolated within the intranet, and data privacy and security requirements are strictly met.
The above division of the method into steps is for clarity of description; when implemented, steps may be combined into one step or split into multiple steps, and all such variants fall within the scope of the present application as long as they contain the same logical relationship. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, is also within the scope of the application.
In addition, some embodiments of the application also provide an electronic device. The electronic device may be a digital computer in various forms, such as a laptop computer, a desktop computer, a workstation, a personal digital assistant, a server, a blade server, a mainframe computer, and the like. The electronic device may also be various forms of mobile equipment, such as personal digital processing, cellular telephones, smart phones, wearable devices, and other similar computing devices.
The electronic device comprises one or more processors and a memory storing computer program instructions that, when executed, cause the processors to perform the steps of a method as provided in any one or more of the embodiments described above. Fig. 3 discloses an exemplary structural diagram of the electronic device. The electronic device includes one or more processors 1101, memory 1102, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In some other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, a plurality of electronic devices may be connected, each providing a part of the necessary operations. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
The electronic device may further comprise input means 1103 and output means 1104. The processor 1101, memory 1102, input device 1103 and output device 1104 may be connected by a bus or other means, as illustrated by a bus connection.
The input device 1103 may receive input digital or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, joystick, and like input devices. The output device 1104 may include a display device, auxiliary lighting (e.g., LEDs), and haptic feedback (e.g., a vibration motor), among others. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some embodiments, the display device may be a touch screen.
To provide for interaction with a user, the electronic device may be a computer. The computer has a display device (e.g., a cathode ray tube or LCD monitor) for displaying information to a user, and a keyboard and pointing device (e.g., a mouse) through which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback), and input from the user may be received in any form (e.g., voice input or tactile input).
In an embodiment of the present application, a computer readable medium has stored thereon a computer program/instruction which, when executed by a processor, implements the steps of the method provided by any one or more of the embodiments described above. The computer readable medium may be contained in the electronic device described in the above embodiment or may exist alone without being incorporated in the device. The computer-readable medium carries one or more computer-readable instructions.
Memory 1102 may be used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules. The processor 1101 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 1102 to implement program instructions/modules corresponding to the methods provided by any one or more of the embodiments of the present application.
The memory 1102 may include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the electronic device, etc. In addition, memory 1102 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1102 optionally includes memory remotely located relative to processor 1101, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The computer readable medium according to the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical fiber, a portable compact disc read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory, static random access memory, dynamic random access memory, other types of random access memory, read only memory, electrically erasable programmable read only memory, flash memory or other memory technology, compact disc read only memory, digital versatile disks or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the C language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network or a wide area network, or may be connected to an external computer (e.g., through the internet using an internet service provider).
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. For example, an application specific integrated circuit, a general purpose computer, or any other similar hardware device may be employed. In some embodiments, the software program of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Embodiments of the present application provide a computer program product comprising one or more computer programs/instructions which, when executed by a processor, produce, in whole or in part, a process or function in accordance with embodiments of the present application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The scope of the application is indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The words "first," "second," and the like are used merely to distinguish between descriptions and do not indicate any particular order, nor are they to be construed as indicating or implying relative importance.
The above embodiments are merely illustrative examples, but the scope of the present application is not limited thereto, and any person skilled in the art can easily mention variations or substitutions within the scope of the present application. The present application is therefore to be considered in all respects as illustrative and not restrictive, and the scope of the application is indicated by the appended claims.

Claims (18)

1. An intelligent entry and exit (patient inclusion/exclusion) method based on clinical research, characterized in that the method is implemented based on a large model, and the method comprises:
determining research design elements according to the received research scheme;
determining an inference rule according to the study design element;
generating a discrimination code of each research design element according to the inference rule;
and determining a patient list conforming to the research scheme according to the discrimination code.
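The four steps of claim 1 can be sketched end to end as a toy pipeline. This is purely illustrative and not part of the claimed subject matter: the function names, the stubbed "large model" outputs, and the in-memory database are all the editor's assumptions.

```python
# Toy end-to-end sketch of the four steps of claim 1. The "large model"
# is stubbed out; in the claimed method it would extract the elements,
# derive the rules, and generate the discrimination code.

def determine_design_elements(protocol_text):
    # Step 1: derive study-design elements from the research scheme (stubbed).
    return ["age >= 18", "diagnosis contains T2DM"]

def determine_inference_rules(elements):
    # Step 2: map each element onto a database variable and a predicate.
    return {"age >= 18": ("age", lambda v: v >= 18),
            "diagnosis contains T2DM": ("dx", lambda v: "T2DM" in v)}

def generate_discrimination_code(rules):
    # Step 3: here the rule predicates themselves stand in for generated code.
    return rules

def screen_patients(code, database):
    # Step 4: keep patients for whom every element's predicate holds.
    return [pid for pid, rec in database.items()
            if all(rule(rec[var]) for var, rule in code.values())]

database = {"P001": {"age": 54, "dx": "T2DM"}, "P002": {"age": 16, "dx": "T2DM"}}
elements = determine_design_elements("...protocol text...")
code = generate_discrimination_code(determine_inference_rules(elements))
print(screen_patients(code, database))  # ['P001']
```

P002 fails the age threshold, so only P001 appears in the resulting list.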
2. The method of claim 1, wherein the determining research design elements according to the received research scheme comprises:
identifying a study type of the research scheme;
acquiring international standard guidelines corresponding to the study type and/or prompt word templates predefined by a user;
determining an element-extraction prompt word according to the research scheme, the study type, and/or the international standard guidelines and/or the user-predefined prompt word template;
and outputting a structured list of research design elements through the large model according to the element-extraction prompt word.
3. The method of claim 2, wherein the outputting a structured list of research design elements through the large model according to the element-extraction prompt word comprises:
determining target items of research design elements according to the PICO framework of the study type;
querying the large model item by item, according to the element-extraction prompt word, for the research design elements of the target items in the research scheme;
and forming the structured list according to the queried research design elements.
4. The method of claim 3, wherein the forming the structured list according to the queried research design elements comprises:
executing a conversion operation on the queried research design elements to generate specific conditions against which data can be judged, wherein the conversion operation comprises at least one of: disassembling a compound condition into atomic conditions; converting the protocol description of an intervention measure into operable medical behaviors; and converting an endpoint-index calculation requirement into an existence requirement for original test data at preset time nodes;
and forming the structured list according to the specific conditions.
5. The method of claim 1, wherein the determining an inference rule according to the research design elements comprises:
carrying out semantic matching on the research design elements and a data dictionary to generate an element-variable mapping relation;
and determining the inference rule according to the element-variable mapping relation.
6. The method of claim 5, wherein the carrying out semantic matching on the research design elements and a data dictionary to generate an element-variable mapping relation comprises:
analyzing semantic information of the research design elements through the large model;
and carrying out semantic matching between the semantic information and the variable labels in the data dictionary, and outputting an element-variable name matching list containing variable names, variable labels and the forms to which the variables belong.
7. The method of claim 5, wherein the determining an inference rule according to the element-variable mapping relation comprises:
constructing and filling a prompt word template for each generated element-variable mapping relation, wherein the prompt word template is injected with the following information: the medical definition of the target element, the matched variable name, and the variable label, type and coding rule from the data dictionary;
and inputting the filled prompt word into the large model, and outputting an inference rule in pseudo-code form in combination with the data dictionary, wherein the inference rule comprises at least one of threshold judgment of a numerical variable, keyword matching of a text variable, and code-value mapping of a categorical variable.
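The three rule forms enumerated in claim 7 (numeric threshold, text keyword match, categorical code mapping) can be illustrated as simple predicate builders. This is a hypothetical sketch, not the claimed implementation; the variable names, keywords and code values are the editor's assumptions.

```python
# Hypothetical sketch of the three inference-rule forms in claim 7.
# Each builder returns a predicate over a single patient record (a dict).

def threshold_rule(variable, op, bound):
    """Threshold judgment for a numerical variable, e.g. age >= 18."""
    ops = {">=": lambda v: v >= bound, "<=": lambda v: v <= bound,
           ">": lambda v: v > bound, "<": lambda v: v < bound}
    check = ops[op]
    return lambda record: record.get(variable) is not None and check(record[variable])

def keyword_rule(variable, keywords):
    """Keyword matching for a free-text variable, e.g. a diagnosis string."""
    return lambda record: any(k in (record.get(variable) or "") for k in keywords)

def code_map_rule(variable, allowed_codes):
    """Code-value mapping for a categorical variable, e.g. a coded sex field."""
    return lambda record: record.get(variable) in allowed_codes

# Three atomic conditions as they might be derived from a protocol:
rules = {
    "age_ok": threshold_rule("age", ">=", 18),
    "dx_ok": keyword_rule("diagnosis_text", ["type 2 diabetes", "T2DM"]),
    "sex_ok": code_map_rule("sex_code", {"1", "2"}),
}

patient = {"age": 54, "diagnosis_text": "T2DM, hypertension", "sex_code": "1"}
results = {name: rule(patient) for name, rule in rules.items()}
print(results)  # every condition holds for this record
```

In the claimed method the large model would emit such rules as pseudo-code from the filled prompt word; here they are written directly for clarity.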
8. The method of claim 1, wherein the generating a discrimination code of each research design element according to the inference rule comprises:
extracting structural metadata of the variables related to the inference rule from the data dictionary, wherein the metadata comprises the form name, data type and value-range definition of each variable;
generating a code-generation prompt word according to the inference rule, the structural metadata and the target language;
and converting the code-generation prompt word into a discrimination code through the large model.
9. The method of claim 1, wherein the determining a patient list conforming to the research scheme according to the discrimination code comprises:
executing the discrimination code to obtain discrimination results of the research design elements;
and determining the patient list conforming to the research scheme according to the discrimination results.
10. The method of claim 9, wherein the executing the discrimination code to obtain discrimination results of each research design element comprises:
executing the discrimination code in a target database to obtain an original discrimination result set;
generating a discrimination result matrix according to the original discrimination result set, wherein the row indices of the discrimination result matrix correspond to patient identifiers, the column indices correspond to research design elements, and each matrix cell stores the discrimination result of one patient-element combination;
and taking the discrimination result matrix as the discrimination results of the research design elements.
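The patient-by-element matrix of claim 10 can be sketched as follows. This is an illustrative toy, not the patented implementation; the tuple-based raw result set and the nested-dict matrix shape are assumptions.

```python
# Illustrative sketch of claim 10: building the discrimination-result
# matrix from a raw result set. Rows are patient identifiers, columns
# are study-design elements, and each cell stores one patient-element
# discrimination result.

raw_results = [  # (patient_id, element, passed) from executing the codes
    ("P001", "age_ok", True), ("P001", "dx_ok", True),
    ("P002", "age_ok", False), ("P002", "dx_ok", True),
]

elements = sorted({e for _, e, _ in raw_results})
patients = sorted({p for p, _, _ in raw_results})

# Initialise every cell to None, then fill from the raw result set.
matrix = {p: {e: None for e in elements} for p in patients}
for patient_id, element, passed in raw_results:
    matrix[patient_id][element] = passed

print(matrix["P002"]["age_ok"])  # False
```

Cells left as None would mark patient-element combinations the codes never evaluated, which the summarizing step can treat as missing rather than failed.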
11. The method of claim 9, wherein the determining a patient list conforming to the research scheme according to the discrimination results comprises:
determining a summary function according to the research scheme;
inputting the discrimination results into the summary function, and calculating an enrollment (inclusion/exclusion) conclusion for each patient;
and determining the patient list conforming to the research scheme according to the enrollment conclusions.
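A minimal form of the summary function in claim 11 combines the per-element results into one enrollment conclusion per patient. This sketch assumes the common convention that all inclusion elements must hold and no exclusion element may hold; the element names and matrix shape are hypothetical.

```python
# Hypothetical summary function for claim 11: fold the per-element
# discrimination results of one patient into a single conclusion.

def summarize(row, inclusion, exclusion):
    """True if every inclusion element passed and no exclusion element did."""
    include_ok = all(row.get(e) is True for e in inclusion)
    exclude_ok = all(row.get(e) is not True for e in exclusion)
    return include_ok and exclude_ok

matrix = {
    "P001": {"age_ok": True, "dx_ok": True, "pregnant": False},
    "P002": {"age_ok": True, "dx_ok": False, "pregnant": False},
}
eligible = [p for p, row in matrix.items()
            if summarize(row, inclusion=["age_ok", "dx_ok"], exclusion=["pregnant"])]
print(eligible)  # ['P001']
```

A protocol-specific summary function could weight elements differently or flag borderline patients for manual review instead of returning a plain boolean.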
12. The method of claim 1, wherein after the determining a patient list conforming to the research scheme according to the discrimination code, the method further comprises:
extracting, according to the patient list, time-invariant information and time-series medical data from a database, and integrating the time-invariant information and the time-series medical data along a time axis to generate a structured personal narrative report;
receiving enrollment annotation data generated by a medical expert based on the narrative report;
and updating the element-variable mapping relation, the inference rules, the discrimination codes and the large model according to the enrollment annotation data.
13. The method according to claim 1, wherein the method further comprises:
storing verified research design elements, inference rules, discrimination codes and associated metadata as structured historical archive entries;
and in response to input of a new research scheme, performing the following operations: encoding the current research elements and inference rules into semantic vectors; retrieving the top-K similar archive entries from the historical knowledge base; if a completely matching entry exists, directly reusing its inference rule and discrimination code; and if a partially matching entry exists, performing an automatic adaptation operation to generate an adapted inference rule and discrimination code.
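The archive-reuse step of claim 13 can be sketched with a toy top-K similarity search. The 3-dimensional vectors below stand in for real semantic embeddings, and the archive entries and stored query strings are invented for illustration only.

```python
# Sketch of the historical-archive reuse in claim 13: encode the current
# study elements as a vector, retrieve the top-K most similar archive
# entries by cosine similarity, and reuse the best match's stored code.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

archive = {  # archive entry -> (embedding, stored discrimination code)
    "diabetes_rct_2024": ([1.0, 0.0, 0.2], "SELECT ... WHERE hba1c >= 6.5"),
    "oncology_cohort":   ([0.0, 1.0, 0.1], "SELECT ... WHERE stage IN (...)"),
}

query = [0.9, 0.1, 0.2]  # embedding of the new study's elements (toy values)
top_k = sorted(archive, key=lambda name: cosine(query, archive[name][0]),
               reverse=True)[:1]
print(top_k)  # the nearest archive entry, whose code could be reused
```

A production system would add a similarity threshold to distinguish the "complete match" (direct reuse) and "partial match" (automatic adaptation) branches of the claim.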
14. The method according to claim 1, wherein the method further comprises:
analyzing the research scheme to identify an index event, and determining the start time of each patient's individual study;
extending the data dictionary with a hook-time variable field, and dynamically associating each medical variable with its timestamp;
converting the time-related descriptions in the research scheme into time windows relative to the start time, and generating inference rules in pseudo-code form;
and outputting a database query code containing a time join, ensuring that the screening conditions are aligned with the start time and the hook-time variable.
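The window-conversion step of claim 14 can be illustrated with a small helper that tests whether an event falls inside a window relative to a patient's individual index date. The example criterion ("within 90 days before the start time") and the dates are invented for illustration.

```python
# Hypothetical sketch of claim 14: convert a time-related protocol
# description into a window relative to each patient's index (start)
# date, then test individual event timestamps against that window.
from datetime import date, timedelta

def in_window(event_date, index_date, days_before=90, days_after=0):
    """True if the event falls inside [index - days_before, index + days_after]."""
    lower = index_date - timedelta(days=days_before)
    upper = index_date + timedelta(days=days_after)
    return lower <= event_date <= upper

index_date = date(2025, 3, 1)  # index event defining this patient's start time
print(in_window(date(2025, 1, 15), index_date))  # True: 45 days before
print(in_window(date(2024, 10, 1), index_date))  # False: outside the 90-day window
```

In the generated database query, the same logic would appear as a time join between the hook-time variable and the per-patient start time rather than a Python function.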
15. The method of claim 1, wherein the method employs a distributed server architecture to achieve privacy protection:
a first server, deployed on a public network, configured to execute large model reasoning and knowledge base management and to output encrypted discrimination codes;
a second server, connected to the medical database, configured to execute the discrimination codes and generate the patient list;
wherein the first server unidirectionally transmits the discrimination codes to the second server through a compliant encrypted channel;
and the second server responds only to authorization requests from the first server and does not transmit original medical data outward.
16. An electronic device, the electronic device comprising:
one or more processors, and
A memory storing computer program instructions that, when executed, cause the processor to perform the steps of the method of any one of claims 1 to 15.
17. A computer readable medium having stored thereon a computer program/instruction, which when executed by a processor, implements the steps of the method of any of claims 1 to 15.
18. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 15.
CN202511023709.8A 2025-07-24 2025-07-24 Intelligent entry and exit methods, equipment, media and products based on clinical research Pending CN120544762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511023709.8A CN120544762A (en) 2025-07-24 2025-07-24 Intelligent entry and exit methods, equipment, media and products based on clinical research


Publications (1)

Publication Number Publication Date
CN120544762A true CN120544762A (en) 2025-08-26

Family

ID=96783428



Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294110A1 (en) * 2006-06-14 2007-12-20 General Electric Company Systems and methods for refining identification of clinical study candidates
CN105940427A (en) * 2013-12-09 2016-09-14 特里内特斯公司 Identification of candidates for clinical trials
CN106815360A (en) * 2017-01-22 2017-06-09 嘉兴太美医疗科技有限公司 Clinical investigation subject recruits condition investigation method
CN110223784A (en) * 2019-06-17 2019-09-10 无码科技(杭州)有限公司 Clinical test patient's matching process
CN115878893A (en) * 2022-12-07 2023-03-31 浙江太美医疗科技股份有限公司 Recommended methods, devices, electronic equipment and storage media for clinical trial projects
CN118888069A (en) * 2024-08-13 2024-11-01 上海艾莎医学科技有限公司 Method, device, electronic device and storage medium for automatically formulating entry and exit standards
CN119169646A (en) * 2024-09-06 2024-12-20 上海艾莎医学科技有限公司 Automatic entry and exit method, device, electronic equipment and storage medium
CN119361053A (en) * 2024-10-23 2025-01-24 万达信息股份有限公司 A method and system for clinical trial patient matching based on a large model


Similar Documents

Publication Publication Date Title
US11538560B2 (en) Imaging related clinical context apparatus and associated methods
US11087878B2 (en) Methods and systems for improving connections within a healthcare ecosystem
CN111863267B (en) Data information acquisition method, data analysis method, device and storage medium
US8898798B2 (en) Systems and methods for medical information analysis with deidentification and reidentification
US20190362821A1 (en) Medical information navigation engine (mine) system
US10176541B2 (en) Medical information navigation engine (MINE) system
AU2019240633A1 (en) System for automated analysis of clinical text for pharmacovigilance
US10614913B2 (en) Systems and methods for coding health records using weighted belief networks
US20140200916A1 (en) System and method for optimizing and routing health information
CN110428910A (en) Clinical application indication analysis system, method, computer equipment and storage medium
US20130238349A1 (en) Systems and methods for event stream platforms which enable applications
US20100131283A1 (en) Method and apparatus for clinical widget distribution
US11715569B2 (en) Intent-based clustering of medical information
WO2014063118A1 (en) Systems and methods for medical information analysis with deidentification and reidentification
Klann et al. Web services for data warehouses: OMOP and PCORnet on i2b2
US20240062885A1 (en) Systems and methods for generating an interactive patient dashboard
Hazlehurst et al. CER Hub: An informatics platform for conducting comparative effectiveness research using multi-institutional, heterogeneous, electronic clinical data
US11581097B2 (en) Systems and methods for patient retention in network through referral analytics
CN120544762A (en) Intelligent entry and exit methods, equipment, media and products based on clinical research
WO2014113730A1 (en) Systems and methods for patient retention in network through referral analytics
CN119648436B (en) Medical insurance early warning review method and system
US20230335298A1 (en) Intent-based clustering of medical information
US12198820B2 (en) Systems and methods for patient retention in network through referral analytics
US20250022259A1 (en) Systems and methods for object identification and analysis
Barata et al. Big Data in Healthcare: Possibilities and Challenges-A Systematic Literature Review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination