WO2026003128A1 - Method and system for generating information on a health condition of a patient - Google Patents
Method and system for generating information on a health condition of a patient
- Publication number
- WO2026003128A1 (PCT/EP2025/067992)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- patient
- computer
- report
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H15/00—ICT specially adapted for medical reports, e.g. generation or transmission thereof
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/20—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Definitions
- the invention relates to a computer-implemented method of generating information on a health condition of a patient, a training method of training a language model, a method of generating a proposal for a medical treatment of a patient, a computer system for generating information on a health condition of a patient, a computer program and a computer-readable storage medium.
- the methods and the computer system according to the present invention may mainly be used for evaluating cancer data of a patient. Other applications are feasible.
- US 2024029848 A1 discloses a computer-implemented method for generating a text report, including receiving input data including at least one table, using a first generative model to identify one or more variables in the input data and generate a table extract comprising the identified variables in a specified order, and using a second generative model to generate a text report based on the table extract, the text report including each of the extracted variables in the specified order.
- a system for simulating the healthcare journey of one or more patients is also disclosed. The system receives patient data; creates a simulation model of the patient; executes the simulation model to predict health variables; generates treatment variables; provides the predicted health variables, the treatment variables and clinician inputs to the simulation model for continuous learning of the simulation model.
- US 2021118559 A1 discloses a system and method, the method comprising receiving a laboratory diagnostic testing result associated with a specimen of a subject, receiving a clinomic profile of the subject, identifying a cohort of similar subjects based at least in part on the clinomic profile of the subject, providing the diagnostic testing results, clinomic profile, and the cohort of similar subjects to a smart output module to generate a personalized, precision medicine based laboratory diagnostic testing result as a smart output, and displaying the smart output to a user.
- US 2019108898 A1 discloses a system and method for enhancing the efficiency and accuracy of analysis and interpretation of medical diagnostic laboratory test data for real-time clinical decision support, utilizing artificial intelligence techniques to automatically improve analytical performance and enhance provider and patient communications.
- a method for automatically generating a field of a radiology report includes: receiving a radiologist identifier (radiologist ID); receiving a set of finding inputs; determining a context of each of the set of finding inputs; determining text associated with a portion or all of the radiology report based on the context and the radiologist style; and inserting the text into the report.
- US 2024095455 A1 relates to a multi-modal end-to-end learning system configured to answer questions about clinical documents like patient notes, medical reports, and lab results.
- Documents are polled from an electronic medical record system, converted to text, and scrubbed for protected health information before processing.
- Sanitized text data is then fed as context to a language model that has been fine-tuned for question-answering (QA).
- the other input to the model is a prompt or a question that is either provided on-the-fly by a clinician as part of a search or pre-determined for specific needs.
- the model outputs an answer highlighting part of the text/image where it found the answer and a confidence score quantifying the likelihood of the answer being correct.
- a clinician can optionally correct the answer if needed. This feedback by the clinician is fed back to a fine-tuner module and used to improve the model over time.
- Zhu Libing et al describe in "Testing and Validation of a Custom Retrained Large Language Model for the Supportive Care of HN Patients with External Knowledge Base", CANCERS, vol. 16, no. 13, 24 June 2024, page 2311, a study aimed to develop a retrained large language model (LLM) tailored to the needs of HN cancer patients treated with radiotherapy, with emphasis on symptom management and survivorship care.
- LLM large language model
- a comprehensive external database was curated for training ChatGPT-4, integrating expert-identified consensus guidelines on supportive care for HN patients and correspondences from physicians and nurses within our institution's electronic medical records for 90 HN patients.
- the performance of the model was evaluated using 20 patient post-treatment inquiries that were then assessed by three Board certified radiation oncologists (RadOncs).
- the custom-trained model demonstrates high accuracy in providing support to HN patients offering evidence-based information and guidance on their symptom management and survivorship care.
- answers to specific questions on cancer reports are generated that may comprise reduced hallucination content introduced by a language model used to generate the answers.
- the terms “have”, “comprise” or “include” or any arbitrary grammatical variations thereof are used in a non-exclusive way. Thus, these terms may both refer to a situation in which, besides the feature introduced by these terms, no further features are present in the entity described in this context and to a situation in which one or more further features are present.
- the expressions “A has B”, “A comprises B” and “A includes B” may both refer to a situation in which, besides B, no other element is present in A (i.e. a situation in which A solely and exclusively consists of B) and to a situation in which, besides B, one or more further elements are present in entity A, such as element C, elements C and D or even further elements.
- the terms “at least one”, “one or more” or similar expressions indicating that a feature or element may be present once or more than once typically will be used only once when introducing the respective feature or element.
- the expressions “at least one” or “one or more” will not be repeated, notwithstanding the fact that the respective feature or element may be present once or more than once.
- the terms “preferably”, “more preferably”, “particularly”, “more particularly”, “specifically”, “more specifically” or similar terms are used in conjunction with optional features, without restricting alternative possibilities. Thus, features introduced by these terms are optional features and are not intended to restrict the scope of the claims in any way.
- a computer-implemented method of generating information, specifically context-related information, on a health condition of a patient comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly. The method may also comprise further method steps which are not listed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
- the method comprises the following steps: i. retrieving, specifically by at least one of a computer, a computer system or a computer network, a computer-readable report on cancer data of the patient; ii. creating, specifically by at least one of a computer, a computer system or a computer network, a vector store on the report on cancer data, by
- the term “computer implemented method” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a method involving at least one computer and/or at least one computer network.
- the computer and/or computer network may comprise at least one processor which is configured for performing at least one of the method steps of the method according to the present invention.
- each of the method steps is performed and/or supported by the computer and/or computer network.
- all of the method steps may be performed by or on one and the same computer or computer network.
- two or more of the method steps may also be performed by separate but interacting computers or computer networks.
- all of the method steps may be performed on a local computer or server, such as by using the RAM of the local computer or server.
- the method may fully or partially be embodied by using delocalized computer resources, such as cloud computing.
- the method may be performed completely automatically, specifically without user interaction.
- one or more method steps may imply user interaction.
- the term “computer” may refer to a device, system or network being configured for data processing, specifically by having at least one processor and optionally further elements, such as, for example, one or more interfaces, one or more data storage devices, one or more display devices and the like.
- the computer may optionally be cloud-based.
- the term “processor”, also referred to as a “processing unit” or as a “processing device”, as generally used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to an arbitrary logic circuitry configured for performing basic operations of a computer or system, and/or, generally, to a device or system which is configured for performing calculations or logic operations.
- the processing unit may be configured for processing basic instructions that drive the computer or system.
- the processing unit may be or may comprise one or more application-specific integrated circuits (ASICs) and/or one or more field-programmable gate arrays (FPGAs) or the like.
- ASICs application-specific integrated circuits
- FPGAs field-programmable gate arrays
- the processing unit specifically may be configured, such as by software programming, for performing one or more evaluation operations.
- the information on the health condition may comprise one or more items of qualitative and/or quantitative information on the person’s body condition, such as including one or more items of information on at least one of: the presence or absence of a specific disease of the person in the person’s body; possible treatments of one or more diseases present in the person’s body; a physiological state condition of the person’s body; possible means of disease prevention to be practiced by the person and/or to be practiced on the person’s body; the physical fitness of the person; a probability of a benefit from performing at least one specific medical intervention; a probability of a benefit from not performing and/or ceasing to perform at least one specific medical intervention, such as by watchful waiting and/or in a de-escalation of a specific medical intervention.
- the information on the health condition may include one or more items of information on a likely lack of benefit from at least one specific medical intervention.
- a specific medical intervention may be a pharmacotherapy, a nutritional change and/or an exercise.
- the term “patient” is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a human or animal being or person subject to medical examination and/or treatment.
- the term may be used both for persons having diseases and for persons not having diseases or not having the relevant diseases, wherein the latter, still, may be subject to medical examination and/or observation.
- the term “retrieve”, as well as any grammatical variations thereof, as used herein in the context of data processing, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to the process, specifically performed by a system, more specifically by a computer system, of generating data and/or obtaining data from an arbitrary data source, such as from a data storage, via a data network or interface, from a further computer or computer system, or by user input.
- the retrieving specifically may take place via at least one computer interface, such as via a port such as a serial or parallel port, and/or by a user interface.
- the retrieving may comprise several sub-steps, such as the sub-step of obtaining one or more items of primary information and generating secondary information by making use of the primary information, such as by applying one or more algorithms to the primary information, e.g. by using a processor.
- the retrieving step specifically may be performed by at least one computer.
- the method may imply uploading and/or downloading the computer-readable report from an external source, such as from a computer readable storage medium, from a database, specifically a web-accessible database, from the Internet or from one or more other sources.
- the retrieving step may comprise the computer prompting a user to define or select the source of the retrieving, specifically an address from which the report may be downloaded, such as a storage address or a URL.
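The retrieval from a user-selected source described above could be sketched as follows. This is a minimal illustration, not the method of the application: the function name `retrieve_report` is hypothetical, and the sketch assumes the source is given either as an http(s) URL or as a local storage path.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def retrieve_report(source: str) -> bytes:
    """Return the raw bytes of a report from a URL or a local storage address.

    Hypothetical helper: the source text only says that the report may be
    downloaded from an address such as a storage address or a URL.
    """
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Web-accessible source: download the document.
        with urlopen(source) as response:
            return response.read()
    # Anything else is treated as a path on a local or mounted storage medium.
    return Path(source).read_bytes()
```

The returned bytes would still need to be transformed into a computer-readable text format before further processing.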
- the term “report” is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to an arbitrary document containing information on at least one subject.
- the term “cancer data” is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to medical or biological information on the status and/or the functioning of the body of a patient having cancer or suspected of having cancer, wherein the information, specifically, may assist one or more of the diagnosis of the cancer and/or the stage of the cancer, a prognosis of the cancer and/or the stage of the cancer, and possible treatments of the cancer, preferably comprising the probability of a response and/or a lack of a response to at least one possible treatment and/or other interventions, such as a diet, an exercise and/or the withholding of at least one treatment.
- the term “computer-readable”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to an arbitrary object which may be processed by a computer, such as a computer file which may be read and processed by at least one processor and/or at least one computer program running on the at least one processor.
- the reading of the object, specifically the file, by the computer, specifically the at least one processor may also comprise one or more steps of transforming the object, such as one or more preprocessing steps such as steps transforming data from one readable format into another one, such as one or more steps of optical character recognition, as will be outlined in further detail below.
- Step i. may comprise retrieving a document containing the cancer data of the patient, specifically a .pdf document, wherein step i. further comprises transforming the document into a computer-readable format, specifically a computer-readable format usable in step ii., more specifically using a .pdf parsing library or optical character recognition (OCR), preferably by using at least one machine learning model configured for performing OCR, such as a Visual Language Model (VLM).
- OCR optical character recognition
- the term “document containing the cancer data”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to any arbitrary document comprising the cancer data.
- Said document may comprise the cancer data in an arbitrary form and/or an arbitrary format.
- the document containing the cancer data may be provided from at least one laboratory, at least one patient medical record, specifically in a structured and/or an unstructured form, e.g. in the form of at least one hospital electronic medical records system and/or insurance claims system, at least one handwritten note, at least one spoken dictation, or any other reliable source.
- the term “transform”, or any grammatical variation thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a process of converting data from one format, structure and/or value set to another.
- the document format may be converted into computer-readable text, specifically continuous text or running text, wherein the structure of the document may be conserved during the transformation process. Thereby, sections comprised by the document may still be identifiable.
- Step ii. may comprise vectorizing the entire computer-readable report on the cancer data of the patient, or at least a part thereof.
- the term “vectorize”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to the transformation of arbitrary data into vectors, thereby representing the data by corresponding vectors.
- the term “vector” is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a mathematical representation of data or at least one data object, such as a text, image or audio data object, in a high-dimensional space, such as in a space having a dimensionality of at least 3, more specifically of at least 10 or even at least 100.
- the number of dimensions, in the context of language models, may typically range from a few hundred to tens of thousands or even beyond, depending on the data to be represented. In this vector space, each dimension may correspond to a feature of the data.
- a vector's position in this space may represent its characteristics. Thereby, data including words, phrases, entire documents, images, audio, and other types of data can be vectorized.
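As a toy illustration of vectorization in which each dimension corresponds to one feature, the sketch below counts vocabulary words in a text. The vocabulary and the sample sentence are invented for illustration; language models typically use learned representations with far more dimensions.

```python
from collections import Counter


def vectorize(text: str, vocabulary: list[str]) -> list[float]:
    """Represent a text as a count vector with one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


# Hypothetical feature vocabulary: one dimension per word.
vocabulary = ["tumor", "mutation", "egfr", "stage"]
vec = vectorize("EGFR mutation detected EGFR positive", vocabulary)
print(vec)  # [0.0, 1.0, 2.0, 0.0]
```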
- the term “vector store” is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a computerized and/or computer-accessible collection of vectors, such as on at least one computer-readable storage medium comprising the vectors.
- the vector store specifically may comprise at least one of a vector database, a vector library and a vector plugin.
- the vector store may comprise a vector database, having database properties, such as Create-Read-Update-Delete (CRUD). Thereby, a short query time and a high scalability may be achieved, and query search for the closest vectors may be sped up. Other options, however, are also feasible.
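A minimal in-memory sketch of a vector store with Create-Read-Update-Delete operations and a closest-vector query is given below. The class and method names are hypothetical, and the linear scan with cosine similarity is an assumption made for brevity; production vector databases usually rely on dedicated index structures to reach short query times.

```python
import math


class VectorStore:
    """Toy vector store: maps an id to a (vector, text) pair."""

    def __init__(self):
        self._items = {}

    def create(self, item_id, vector, text):
        self._items[item_id] = (list(vector), text)

    def read(self, item_id):
        return self._items[item_id]

    def update(self, item_id, vector, text):
        if item_id not in self._items:
            raise KeyError(item_id)
        self._items[item_id] = (list(vector), text)

    def delete(self, item_id):
        del self._items[item_id]

    def query(self, vector, k=1):
        """Return the k stored entries closest to `vector` by cosine similarity."""

        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(y * y for y in b))
            return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

        ranked = sorted(
            self._items.items(),
            key=lambda kv: cosine(vector, kv[1][0]),
            reverse=True,
        )
        return [(item_id, text) for item_id, (_, text) in ranked[:k]]


store = VectorStore()
store.create("s1", [1.0, 0.0], "Histology section")
store.create("s2", [0.0, 1.0], "Genomic findings section")
print(store.query([0.1, 0.9], k=1))  # [('s2', 'Genomic findings section')]
```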
- CRUD Create-Read-Update-Delete
- the term “tokenize”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to the process of breaking down a data object, such as a text data object, an image data object or an audio data object, into a plurality of smaller units or elements, called “tokens”.
- the term “token”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a unit or element of a data object, such as of a text data object, an image data object or an audio data object.
- a text data object, such as the computer-readable report on the cancer data of the patient, may be broken down, in the process of tokenizing, into tokens in the form of one or more of passages, paragraphs, phrases, sentences, words, sub-words, punctuation marks and characters.
- the specific choice of the tokens may depend on the specific tokenization strategy.
- Tokenization may refer to a process of splitting a text into tokens, which may be helpful in completing machine learning tasks including text summarization, text classification, sentiment analysis, machine translation, and named entity recognition.
- by breaking down the text into tokens, i.e. by tokenization, it may be possible to apply statistical and machine learning techniques to analyze and process natural language data.
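How the choice of tokens depends on the tokenization strategy can be illustrated with a small sketch. The `tokenize` function and its `level` parameter are hypothetical, the sample text is invented, and the regular expressions are simple assumptions; language models typically use trained sub-word tokenizers instead.

```python
import re


def tokenize(text: str, level: str = "word") -> list[str]:
    """Split text into tokens at the requested granularity."""
    if level == "sentence":
        # Split after sentence-ending punctuation followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if level == "word":
        # Words and punctuation marks become separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)
    if level == "character":
        return list(text)
    raise ValueError(f"unknown level: {level}")


report = "TP53 mutation detected. No EGFR alteration."
print(tokenize(report, "sentence"))
# ['TP53 mutation detected.', 'No EGFR alteration.']
print(tokenize(report, "word"))
# ['TP53', 'mutation', 'detected', '.', 'No', 'EGFR', 'alteration', '.']
```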
- the term “embed”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to the process of generating, from an input, such as a text input, e.g. typed and/or prespecified, an image input or an audio input, specifically from tokens as defined above, one or more of a numeric, vector, or spatial representation of the input, specifically by using at least one encoder.
- the embedding may imply a vectorization.
- the embedding may comprise mapping the input, such as the tokens, to a multidimensional vector representation.
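The mapping of tokens to a multidimensional vector can be sketched with a feature-hashing stand-in for an encoder. This is an assumption made for illustration only: the text refers to an encoder, which in practice would typically be a trained model, whereas the hypothetical `embed` function below merely hashes each token into one of `dim` dimensions.

```python
import hashlib
import math


def embed(tokens: list[str], dim: int = 16) -> list[float]:
    """Map tokens to a unit-length vector of fixed dimensionality (hashing trick)."""
    vec = [0.0] * dim
    for tok in tokens:
        # A stable hash keeps the mapping reproducible across runs.
        digest = hashlib.sha256(tok.lower().encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


embedding = embed(["EGFR", "mutation", "detected"])
print(len(embedding))  # 16
```

Normalizing to unit length makes cosine similarity between two embeddings equal to their dot product.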
- Creating a vector store on the report on cancer data in step ii. further may comprise segmenting the report on cancer data.
- the term “segment”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to dividing and/or splitting the data into smaller, manageable units or segments before processing and storing them as vectors. Segmenting may comprise splitting said data into at least two deviating portions.
- the data may be divided and/or split in accordance with splitting criteria, such as a document structure, e.g. paragraphs, number of sentences; a data type, e.g. text, image; or specific features, e.g. time intervals, categories.
- Each segment may be stored individually in the vector store.
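Two of the splitting criteria named above, paragraphs and number of sentences, can be sketched as follows. The function names and the sample report text are invented for illustration; the sentence splitter is a simple regular-expression assumption.

```python
import re


def segment_by_paragraph(text: str) -> list[str]:
    """Split on blank lines so that each paragraph becomes one segment."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def segment_by_sentences(text: str, max_sentences: int = 2) -> list[str]:
    """Group consecutive sentences into segments of at most max_sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


# Invented sample text, not taken from any real report.
report = (
    "Diagnosis: adenocarcinoma.\n\n"
    "Findings: KRAS G12C detected. TMB is low. MSI is stable."
)
paragraphs = segment_by_paragraph(report)
chunks = segment_by_sentences(paragraphs[1], max_sentences=2)
print(paragraphs)
print(chunks)
```

Each resulting segment could then be embedded and stored as its own entry in the vector store.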
- Step ii. specifically may be performed automatically, e.g. when a computer program implementing the computer-implemented method is started and/or once the computer-readable report on the cancer data of the patient has been retrieved.
- Step ii. may be performed by using at least one machine learning model configured for creating the vector store, preferably wherein the at least one machine learning model configured for creating a vector store is distinct from the trained language model used in step iv.
- the at least one machine learning model configured for creating the vector store may be a model and the trained language model may be a further model, wherein the model and the further model may differ from each other.
- the machine learning model configured for creating the vector store may be a trained machine learning model.
- the at least one machine learning model configured for creating the vector store may comprise a model selected from the group consisting of:
- in step iii., computer-readable question data comprising information on a user question, specifically a context-related user question, is retrieved.
- the retrieval process may comprise one or more of an input, an upload, a reading or any other type of providing the computer-readable question data, specifically to at least one processor.
- the term “user question”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a request for at least one specific item of information and/or a command for generating at least one specific item of information, initiated by a user.
- the user question may be a question, a request or a command. Consequently, the user question may be or may be also referred to as a user query.
- the user question may be an input to a computer or a computer system, such as via at least one user interface.
- the user may be identical to the patient as defined above or may be distinct from the patient.
- the user may be a user of a software implementing the method.
- the user specifically may be a healthcare practitioner, such as a doctor, or a laboratory specialist tasked with writing clinical summaries of test results.
- the retrieval process may comprise prompting the user to input the user question, such as in a text format or in an audio format.
- the method specifically may comprise providing a chat window to the user, e.g. on a user interface such as a graphical user interface, allowing the user to input the user question, e.g. as text data or audio data.
- the user question may then be transformed in one or more steps.
- the term “computer-readable question data” is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to computer-readable data representing the user question.
- this computer-readable representation may be the result of a vectorization or encoding of the user question.
- the user question may optionally be tokenized, such as by using the tokenization strategy or tokenization scheme as used in step ii., and the user question and/or the tokens generated therefrom may be embedded, thereby mapping the user question into a multidimensional space.
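The point of reusing the tokenization scheme of step ii. for the user question is that question and report segments end up in the same vector space and can be compared. A minimal sketch under stated assumptions: the segments and the question are invented, a shared bag-of-words vocabulary stands in for a trained encoder, and cosine similarity is assumed as the comparison measure.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Single tokenization scheme shared by report segments and question."""
    return re.findall(r"\w+", text.lower())


def embed(text: str, vocabulary: list[str]) -> list[float]:
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented report segments; the vocabulary is built from the segments only.
segments = [
    "KRAS G12C mutation detected in tumor sample",
    "Patient age and sample collection date",
]
vocabulary = sorted({tok for seg in segments for tok in tokenize(seg)})
segment_vectors = [embed(seg, vocabulary) for seg in segments]

question = "Which mutation was detected?"
question_vector = embed(question, vocabulary)
best = max(
    range(len(segments)),
    key=lambda i: cosine(question_vector, segment_vectors[i]),
)
print(segments[best])  # KRAS G12C mutation detected in tumor sample
```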
- the term “context” is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to the surrounding information or text that the model considers when processing language. By considering more information at once, the model can better understand the comprehensive perspective and generate more relevant responses.
- the term “comprehensive perspective” may refer to an understanding of a situation by knowing at least one relevant factor, at least one influence and/or at least one relationship between a plurality of details.
- the context may comprise a preceding text, a user intent, a topic and/or a conversation history.
- the context specifically may be the healthcare situation of the patient, as reflected e.g. by the computer-readable report on cancer data of the patient.
- the context may apply to other fields in healthcare or outside healthcare where expertise is scarce.
- the context may further be customized by selecting at least one of
- the computer-readable question data of step iii. may be provided by a user input into an input module.
- the term “input module”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a hardware component configured to provide the question data.
- the input module may be selected from at least one of a storage device, specifically an external storage device, a scanner, a prompt on a computer device, specifically a mobile device.
- the input may be typed and/or transcribed from an audio-to-text prompt, such as without typing. Alternatively or in addition, the input may be pre-specified.
- pre-specified is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to at least one predefined item, such as an item that may be selected from a predefined list of items.
- Such a pre-specified input or user question may be indicated to the user.
- the user may select said pre-specified input that is forwarded or input into the language model.
- a pre-specified input may be at least one standard user question that may be processed before requesting user input. Said pre-specified input may be forwarded or input into the language model automatically.
- a report on cancer data may first be analyzed by using a standard set of pre-specified user questions, such as 5 to 10 pre-specified user questions. This may ensure that the user, such as an oncologist, receives the most relevant results. Then the input module may allow the user to ask at least one further user question on the report on cancer data and/or other related topics.
- the method may comprise providing a plurality of pre-defined user questions, wherein, in step iii., the user selects the user question from the plurality of pre-defined user questions.
- pre-defined user questions as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a plurality of questions that are selected in advance of at least one step of the method or of the method being performed.
- the pre-defined user questions may be indicated to the user. The user may select at least one of the pre-defined user questions.
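The handling of pre-specified and pre-defined user questions may be sketched as follows. The question texts, the `run_session` function and the `echo_model` stand-in are hypothetical and serve illustration only; the description merely mentions a standard set of, for example, 5 to 10 questions that may be processed before the user asks further questions.

```python
from typing import Callable

# Hypothetical standard questions; the actual set is not specified in the text.
PRE_SPECIFIED_QUESTIONS = [
    "Which driver mutations are reported?",
    "Are there genomic clues to therapy resistance?",
    "Do any alterations suggest an inherited germline alteration?",
]


def run_session(
    ask_model: Callable[[str], str], user_questions: list[str]
) -> list[tuple[str, str]]:
    """Process the pre-specified questions first, then the user's own questions."""
    transcript = []
    for question in PRE_SPECIFIED_QUESTIONS:  # forwarded automatically
        transcript.append((question, ask_model(question)))
    for question in user_questions:  # typed, transcribed or selected by the user
        transcript.append((question, ask_model(question)))
    return transcript


def echo_model(question: str) -> str:
    """Stand-in for the trained language model of step iv."""
    return f"[answer to: {question}]"


transcript = run_session(echo_model, ["What does the TP53 finding mean?"])
```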
- the report on cancer data may comprise genomic data of the patient.
- genomic data is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to information on at least a portion of a deoxyribonucleic acid of a biological sample, preferably isolated from a cell, more preferably from a cancer cell.
- the genomic data may include at least one gene or biomarker associated with the cancer.
- biological sample refers to a sample of a body fluid, to a sample of separated cells or to a sample from a tissue or an organ comprising or suspected to comprise a cancer-driving cell population.
- Samples of body fluids can be obtained by well-known techniques and include, preferably, samples of blood.
- Tissue or organ samples, such as bone marrow samples, may be obtained by, e.g., biopsy.
- Separated cells may be obtained from the body fluids or the tissues or organs by separating techniques such as centrifugation or cell sorting.
- the answer to the user question may comprise at least one of information on at least one biomarker or a plurality of biomarkers, such as a driver mutation indicating that the patient may benefit from at least one specific therapy; information indicating genomic clues to resistance that may imply the patient’s lack of benefit from at least one specific therapy; information on potential pathogenic or likely pathogenic DNA alterations in the patient’s genome that may raise the possibility of inherited germline alterations; information on an indication of a disease recurrence or a probability of a lack of the disease recurrence, preferably which may imply a need of a patient for a further therapy or a lack of a need for further therapy or cessation of therapy.
- the term “therapy” may refer to at least one potential intervention in order to fight a disease, specifically the cancer, such as a pharmacotherapy, a diet, a nutrition, an exercise, a surgery.
- predetermined therapy is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a medical treatment being at least one attempted remediation of at least one health problem, typically following a medical diagnosis.
- the predetermined therapy may in the case of cancer be selected to kill and/or inhibit cancer cells.
- driver mutation is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a somatic mutation in at least one signal transduction pathway, thereby giving the tumour cell a growth advantage and promoting the proliferation of the cancer cell.
- resistance is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to an ability of the cancer cells to survive, adapt and/or continue growing despite the predetermined therapy.
- pathogenic DNA alterations is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a genetic alteration that increases a susceptibility and/or predisposition of an organism to a certain disease or disorder.
- the term pathogenic DNA alterations may also refer to likely pathogenic DNA alterations.
- the user may further ask for information concerning at least one further alteration.
- inherited germline alterations is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to an alteration in a reproductive cell that is propagated to the living organism and may confer susceptibility and/or predisposition of an organism to a certain disease or disorder.
- In step iv., at least one trained language model is used, specifically at least one trained large language model, for automatically generating, by using the vector store, an answer to the user question, specifically a context-related answer to the user question.
- Step iv. may comprise using the trained language model for tokenizing and embedding the computer-readable question data of step iii.
- the trained language model may be a trainable model pre-trained, trained or fine-tuned for natural language processing.
- the term “trained model”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a mathematical model which was trained on at least one training data set and which is configured for predicting at least one target variable for at least one input variable.
- the trained model may comprise, e.g., at least one model selected from the group consisting of: a linear regression model; a non-linear Artificial Neural Network (ANN); a Support Vector Machine (SVM); a kernel based method; a tree regression; a Random Forest.
- the term “trained language model”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a trained model configured for natural language processing.
- the trained language model may be a or may comprise at least one probabilistic model for language processing, such as a probabilistic model trained for one or more tasks selected from the group consisting of natural language generation, natural language classification, machine translation, optical character recognition, handwriting recognition, and information retrieval. Other tasks in the field of language processing, including language generation, may be possible in addition or alternatively.
- An input of the trained language model may be, at least in part, in natural language.
- An output of the trained language model may be, at least in part, a vector, specifically when requesting data from the vector store.
- An output of the trained language model may be, at least in part, in natural language, specifically when providing data to the internet agent.
- the trained language model specifically may be or may comprise at least one large language model.
- large language model (LLM), as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a mathematical, i.e. a computational, model configured and trained for general-purpose language processing tasks, such as general-purpose language generation, text classification, text summarization, and the like.
- Large language models are trained to perform these language processing tasks by using huge amounts of text documents and a dedicated training process, such as a self-supervised and/or semi-supervised training process.
- the large language model specifically may be configured for generating text and performing other tasks of generative artificial intelligence.
- the large language model may be configured for using text input or tokens as defined above, and for processing the input in order to generate text output, specifically by prediction based on statistical means.
- the text may be natural text.
- the large language model specifically may be or may comprise at least one artificial neural network.
- the large language model may be configured to verify the generated answer in order to minimize hallucinations.
- the large language model specifically may comprise at least one transformer and/or may comprise the at least one transformer architecture.
- the trained language model may be selected from the group consisting of:
- the trained language model may be a model such as Medicine-chat, BioGPT-Large-PubMedQA and/or OpenBioLLM-70B.
- the term “transformer”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a mathematical or computational model having a deep learning or artificial neural network architecture.
- the transformer is configured for converting text into a numerical representation and/or for using text converted into numerical representations, i.e. for using tokenized text.
- the transformer further has a layer architecture for processing these tokens. In each layer, the tokens are contextualized by using a so-called attention mechanism amplifying signals for key tokens and diminishing signals for less important tokens.
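The attention mechanism referred to above may be illustrated by the following sketch. The description only speaks of a "so-called attention mechanism"; scaled dot-product attention is assumed here as one commonly used form, and the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the maximum for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product scores of one query against all keys, then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query: list[float], keys: list[list[float]], values: list[list[float]]) -> list[float]:
    """Contextualized token: weighted sum of the value vectors."""
    weights = attention_weights(query, keys)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]


# Toy example: the first key matches the query, so its token receives more weight.
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```

The softmax step is what amplifies the contribution of tokens whose keys align with the query and diminishes the contribution of the remaining tokens.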
- the large language model using at least one transformer and/or transformer architecture may comprise at least one pretrained model or architecture, such as at least one generative pre-trained transformer (GPT) and/or at least one Bidirectional Encoder Representations from Transformers (BERT).
- the large language model specifically may comprise at least one transformer having an architecture selected from the group consisting of a decoder-only model, an encoder-only model, and an encoder-decoder model. Examples of models which may be used in the trained language model of step iv. will be given in further detail below.
- the at least one trained language model is used for automatically generating, by using the vector store, an answer to the user question, specifically on the basis of the vector store, i.e. the tokenized and embedded computer-readable report on cancer data of the patient.
- the term “automatically” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a process which is performed completely by means of at least one computer and/or computer network and/or machine, in particular without manual action and/or interaction with a user.
- the trained language model may, specifically without user interaction, generate the answer to the user question.
- the answer, generally, may be a single answer or may comprise a plurality of answer components.
- the vector store may comprise information in form of vectors, natural language and/or metadata, preferably comprising information about a text passage to which a specific entry refers.
- Using the vector store may comprise performing a semantic search within the vector store based on the user question, particularly in order to find segments of the cancer report relevant to answering the user question. In the semantic search, a plurality of relevant entries in the vector store may be determined.
- the natural language will be provided to the trained language model.
- the metadata may be provided to an outputting module, such as a frontend.
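The semantic search within the vector store may be sketched as follows. The `Entry` structure, the cosine similarity ranking and the toy entries are assumptions made for illustration; the description does not prescribe a similarity measure or a storage layout, and the report passages shown are invented placeholders.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entry:
    vector: list[float]  # embedded text passage
    text: str  # natural language, passed on to the trained language model
    metadata: dict = field(default_factory=dict)  # e.g. location of the passage


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(store: list[Entry], query_vector: list[float], k: int = 2) -> list[Entry]:
    """Return the k entries whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda e: cosine(e.vector, query_vector), reverse=True)
    return ranked[:k]


# Invented placeholder entries, not taken from any real report.
store = [
    Entry([1.0, 0.0, 0.0], "Passage on a detected alteration.", {"page": 2}),
    Entry([0.0, 1.0, 0.0], "Passage on the specimen type.", {"page": 1}),
    Entry([0.9, 0.1, 0.0], "Passage on variant allele frequency.", {"page": 2}),
]
hits = semantic_search(store, [1.0, 0.0, 0.0], k=2)
passages_for_model = [hit.text for hit in hits]
metadata_for_frontend = [hit.metadata for hit in hits]
```

Keeping the text and the metadata in the same entry allows the passages to be handed to the trained language model while the metadata is forwarded separately to the outputting module.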
- Step iv. may comprise context-customizing the answer to the user.
- the term “context-customizing” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to an adaption of the answer to a context that determines at least one characteristic of the answer, preferably at least one characteristic of the answer related to a style of the answer.
- the at least one characteristic of the answer may be selected from at least one of: a level of detail of the answer; a use of technical terms; a length of the answer; a user role; a number of data input points; a type of data input.
- the context-customizing may comprise taking into account at least one item of information on a role of the user, specifically a categorization of the user, more specifically a categorization into a category selected from the group consisting of: a layman patient or caregiver; a general medical practitioner; an oncology nurse or oncologist; a cancer genomicist; a laboratory director, a clinical specialist, a pathologist or a medical director; a clinical trialist, a monitor, a medical science liaison, a medical manager or another role involved in clinical trial management.
- the context-customizing may comprise taking into account at least one item of information on the intended purpose of the answer, specifically information categorizing the purpose into a category selected from the group consisting of: a patient-doctor consultation; a preparation of a tumor board; evaluating patient or subject suitability for a clinical trial, assigning subjects to clinical trial treatment arms.
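The context-customizing by user role and intended purpose may, as one possibility, be realized by adapting the prompt given to the trained language model. The following sketch assumes such a prompt-based approach; the `ROLE_STYLE` mapping, the style texts and the `build_prompt` function are hypothetical and not part of the disclosure.

```python
# Hypothetical mapping from a user category to style instructions.
ROLE_STYLE = {
    "layman patient or caregiver": "Use plain language, avoid technical terms, keep the answer short.",
    "oncologist": "Use technical terms and give a high level of detail.",
}
DEFAULT_STYLE = "Use a moderate level of detail."


def build_prompt(question: str, passages: list[str], role: str, purpose: str) -> str:
    """Assemble a prompt whose style instructions depend on role and purpose."""
    style = ROLE_STYLE.get(role, DEFAULT_STYLE)
    excerpts = "\n".join(f"- {passage}" for passage in passages)
    return (
        f"User role: {role}\n"
        f"Intended purpose: {purpose}\n"
        f"Style: {style}\n"
        f"Report excerpts:\n{excerpts}\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    question="What does this finding mean for me?",
    passages=["Passage on a detected alteration."],
    role="layman patient or caregiver",
    purpose="patient-doctor consultation",
)
```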
- a literature search is automatically performed in at least one external data source on at least one subject related to the user question.
- the external data source may comprise at least one of: the internet; a publication server, specifically a medical publication server and/or pharmaceutical publication server; a data base comprising results of cancer genomics research, a data base comprising medical studies and/or pharmaceutical studies; a data base comprising associations between DNA alterations and sensitivity or resistance to at least one specific therapy.
- literature search is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a process of identifying and analyzing information through a systematic research process.
- a literature search may comprise gathering and collecting information relevant to a specific topic, such as the subject related to the user question. The purpose is to summarize existing findings, identify gaps, and provide a foundation for new research.
- a well-conducted literature search ensures a comprehensive understanding of the topic and supports evidence-based practice and decision-making.
- a literature search may be a search of at least one external resource, such as a published literature and/or a preknown database.
- OncoKb cf. https://www.oncokb.org/ - retrieved June 20, 2025
- CKB cf. https://ckbhome.genomenon.com/ - retrieved June 20, 2025
- ClinVar cf. https://www.ncbi.nlm.nih.gov/clinvar/ - retrieved June 20, 2025
- the term “external data source” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to an organized collection of structured information or data.
- the external database may, typically, be stored in an external computer system.
- the external computer system may be external to the computer system running the at least one trained language model. Thereby, the external database may be accessed via a computer network, specifically the internet.
- the external data source may be or may comprise, consequently, an online data source.
- the external database may comprise information from academic journals, books, and/or papers.
- the external data source may be a trusted external data source that is preselected in order to provide trusted information.
- the automatically performed literature search in step v. may comprise using at least one internet agent, wherein the internet agent is a machine learning model configured for performing internet searches.
- internet agent as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a software entity configured to interact with external online resources.
- Interacting may comprise performing at least one task, gathering information and/or disseminating information. Interacting may be performed by using the internet.
- the internet agent may be configured to understand and/or generate natural text.
- the internet agent may provide generated data to the outputting module.
- the at least one subject on which the literature search may be performed in step v. is automatically chosen by the trained language model.
- the at least one subject on which the literature search may be performed in step v. may comprise recent medical and/or pharmaceutical studies, data sets or publications on at least one topic related to at least one aspect of the cancer data of the patient.
- the answer to the user question may comprise a summary of at least one portion of the report on cancer data of the patient having relevance to the user question.
- relevant as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to the degree to which at least one portion of information, data and/or content is applicable, useful and/or pertinent to at least one specific context, question and/or purpose.
- the at least one portion of the report may be considered relevant when it addresses at least one aspect of a user question, in particular wherein the portion provides information that contributes to answering the question. Additionally, a portion may be relevant if it contains information that is consistent with and/or supports the findings or conclusions derived from a further portion of the report or external data sources. A portion may also be relevant if it contains information that is inconsistent with and/or contradicts one or more findings and/or one or more conclusions derived from a further portion of the report and/or the external data source.
- the trained language model may be configured for retrieval-augmented generation (RAG), such that, in step iv., a connection to at least one portion of the report on cancer data of the patient is made on which the context-related answer to the user question is based.
- the trained language model and/or the internet agent may be further configured, such that a connection to at least one portion of at least one document of the external data source is made on which the item of literature information is based, specifically a portion of at least one internet publication or reference to a data set.
- In step vi., as outlined above, the information on the health condition of a patient is created by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v.
- Step vi. may also be performed by the trained language model used in step iv.
- the term “item of literature information” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to data generated in a literature search, preferably on the at least one subject related to the user question.
- the item of literature information may comprise information on the at least one subject related to the user question.
- the item of literature information may further comprise information on the source, such as the external data source, from which the information on the at least one subject related to the user question is derived.
- enriching is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a process of enhancing and/or improving something by adding value, quality and/or additional elements.
- Enriching may comprise analyzing the answer obtained in step iv. by comparing the answer obtained in step iv. to the at least one item of literature information. Comparing may comprise identifying at least one deviation between the answer obtained in step iv. and the at least one item of literature information.
- Enriching may comprise correcting said at least one identified deviation, preferably in a manner that the answer obtained in step iv. is adapted to the at least one item of literature information in a manner that said identified deviation is compensated, preferably by using the trained language model and/or the internet agent.
- enriching may comprise providing an indication indicating that said identified at least one deviation is identified.
- the indication may comprise information on details of said identified at least one deviation, such as a portion of the answer obtained in step iv. that is related to said identified at least one deviation.
- enriching may comprise discarding at least the portion of the answer obtained in step iv. that is related to said identified at least one deviation. Consequently, the external literature search may add value and thereby enrich the answer by adding information to address the identified deviation; discarding information to address the identified deviation, and/or at least calling attention to the fact that there is an identified deviation.
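The three ways of handling an identified deviation described above, namely correcting, indicating and discarding, may be sketched as follows. The representation of the answer and of the literature information as topic-to-statement mappings, the exact string comparison and the names `Policy` and `enrich` are simplifying assumptions; in practice the comparison may be performed by the trained language model and/or the internet agent.

```python
from enum import Enum


class Policy(Enum):
    CORRECT = "correct"  # adapt the answer to the literature information
    FLAG = "flag"  # keep the answer but indicate the deviation
    DISCARD = "discard"  # drop the deviating portion of the answer


def enrich(
    answer: dict[str, str], literature: dict[str, str], policy: Policy
) -> tuple[dict[str, str], list[str]]:
    """Compare answer portions with literature items and handle deviations."""
    enriched: dict[str, str] = {}
    indications: list[str] = []
    for topic, statement in answer.items():
        reference = literature.get(topic)
        if reference is None or reference == statement:
            enriched[topic] = statement  # no deviation identified
        elif policy is Policy.CORRECT:
            enriched[topic] = reference
        elif policy is Policy.FLAG:
            enriched[topic] = statement
            indications.append(
                f"Deviation on '{topic}': answer states '{statement}', "
                f"literature states '{reference}'."
            )
        # Policy.DISCARD: the deviating portion is omitted
    return enriched, indications


# Placeholder statements, not medical facts.
answer = {"alteration X": "responds to therapy A", "alteration Y": "no known therapy"}
literature = {"alteration X": "responds to therapy A and therapy B"}
flagged, notes = enrich(answer, literature, Policy.FLAG)
```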
- the information on the health condition of a patient may further comprise information, generated by the trained language model, on the source of the item of literature information.
- the information on the source of the item of literature information may contain information on at least one of: a peer-reviewed status of a publication forming the basis of the literature information; published guidelines from at least one expert, preferably in the field of cancer; a number of subjects in a trial forming the basis of the literature information; a phase of a clinical trial forming the basis of the literature information; a molecular biological plausibility of a connection between a laboratory finding and the potential association with mechanisms or drugs that may or may not be of benefit.
- the information on the health condition of a patient may be enriched by the portion of the report on cancer data of the patient on which the context-related answer to the user question is based. Consequently, the information on the health condition provides an indication of the source of the answer obtained in step iv. Particularly thereby, at least one hallucination, such as a false statement, that may be comprised by the information on the health condition may be identified, preferably by the user. The hallucinations may be introduced into the information on the health condition by the trained language model and/or the literature search.
- the user can compare the answer provided by the language model to the source of the answer to identify discrepancies or hallucinations.
- the user may further iterate from said general result and ask for more details on certain aspects of the results. For example, in the case of an identified driver mutation that implies sensitivity to a certain therapy, the user may ask for the molecular biology of the mutation, e.g. the user may ask for sequence and/or structure alterations that convey altered function and/or sensitivity to certain drugs. Alternatively, the user may ask for clinical trial results that prove patients with those mutations benefit from specific drugs.
- the user may ask for more information on the occurrence of that alteration in inherited cancer syndromes.
- the user may ask for more information on the influence of one or more further alterations identified in the test results, which may reduce the likelihood of a patient or the user responding to cancer immunotherapies, to provide more patient or user-specific context to the treating physician.
- the information on the health condition of a patient may be enriched by the portion of the document on which the item of literature information is based.
- Steps iii. and iv., and optionally also step vi., may be performed by using a chat bot based on the trained language model used in step iii.
- chat bot as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a software application configured to simulate human conversation through at least one text and/or voice interaction.
- a chat bot may be integrated into at least one of a website, a mobile app, a messaging platform, an interface. It has an area for text input and output, which can be used to communicate with the system in natural language.
- a chat bot may comprise at least one text input area and/or at least one text output area, wherein the respective area may be configured to communicate with the user in natural language.
- the method may further comprise: vii. outputting the information on the health condition of a patient enriched with the at least one item of literature information.
- the outputting may be performed by at least one of: providing the information on the health condition of a patient enriched with the at least one item of literature information to a frontend; displaying the information on the health condition of a patient enriched with the at least one item of literature information on a display; transmitting the information on the health condition of a patient enriched with the at least one item of literature information via at least one interface; storing the information on the health condition of a patient enriched with the at least one item of literature information in at least one data storage device.
- the information on the health condition of a patient further may comprise quality information characterizing at least one of the quality of the computer-readable report on the cancer data; a specimen quality of at least one specimen of the patient forming basis of the report; a specimen quantity of at least one specimen of the patient forming the basis of the report; quality information characterizing the completeness of the report or limitations of the report; quality information identifying anomalous results in the report; quality information on the reputation of a laboratory generating the report; quality information identifying certification of a laboratory generating the report; quality information identifying one or more of validation and at least one ISO standard for laboratories.
- quality is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a characteristic or attribute of at least one object, process and/or result that indicates its degree of excellence, reliability and/or suitability for a specific purpose.
- Quality may exemplarily relate to the accuracy, completeness and/or consistency of data.
- the quality of the report on cancer data, which typically originates from a laboratory, may be represented or signified in the report on cancer data itself, such as by a College of American Pathologists (CAP) certification and/or a Clinical Laboratory Improvement Amendments (CLIA) certification of the laboratory, as well as by a tumor DNA input content along with one or more metrics of sequencing depth, uniformity and the like.
- the method may comprise evaluating one or more of said quality criteria and considering them in the context of the report on cancer data. This may be used to highlight to the user whether the laboratory's own quality metrics have or have not been met, which has implications for the interpretation of the report on cancer data.
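The evaluation of quality criteria against the laboratory's own quality metrics may be sketched as follows. The metric names, the threshold values and the `check_quality` function are hypothetical placeholders; the description names sequencing depth, uniformity and tumor DNA input content only as examples and gives no numerical limits.

```python
def check_quality(metrics: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """List every laboratory quality metric that is missing or below its minimum."""
    warnings = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            warnings.append(f"{name}: not reported")
        elif value < minimum:
            warnings.append(f"{name}: {value} is below the laboratory minimum of {minimum}")
    return warnings


# Invented values for illustration only.
reported = {"mean_sequencing_depth": 350.0, "tumor_dna_input_ng": 40.0}
laboratory_minimums = {
    "mean_sequencing_depth": 500.0,
    "tumor_dna_input_ng": 20.0,
    "coverage_uniformity": 0.9,
}
warnings = check_quality(reported, laboratory_minimums)
```

A non-empty list of warnings may then be highlighted to the user together with the answer.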
- the quality of the generated answer may be considered. This quality may be associated with an accuracy of the answer, such that, preferably, the answer is correct with respect to the report on cancer data, even if the report on cancer data is out of date and the answer is based on one or more external sources.
- the quality may further be associated with a relevance of the answer, such that only clinically relevant information is provided and no clinically irrelevant information is included.
- the quality may further be associated with a completeness, such that all clinically relevant information pertaining to the user’s question is provided in the response.
- a training method of training a language model for use in step iv. of the method as elsewhere disclosed herein.
- the method of training a language model may be a computer-implemented method.
- the method comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly.
- the method may also comprise further method steps which are not listed herein.
- the method may comprise:
- test report set comprising at least one computer-readable report on cancer data of a patient
- test question set comprising a plurality of training questions relating to the test report of the test question set, wherein the answers to the test questions are known;
- step i. of the method by using the test question set in step iii. of the method, and by comparing the answers generated in step iv. with the known answers to the test questions and adapting parameters of the foundation model.
- the term “foundation model” or “base model” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a pre-trained language model.
- the pre-trained language model may be trained, typically, on broad data sets in a manner that it may be applied across a wide range of use cases.
- the foundation model may be or may comprise a language model, preferably a large language model.
- a foundation model may further be trained and/or fine-tuned for a specific use case.
- the fine-tuned foundation model may be the trained language model as elsewhere described herein.
- the foundation model may comprise a transformers model, specifically selected from the group consisting of:
- the foundation model may be an already fine-tuned model, such as
- test report set is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a preselected report set that is evaluated in order to fine-tune the foundation model.
- the test report set comprising at least one computer-readable report on cancer data of a patient may comprise a plurality of computer-readable reports on cancer data, such as, typically, 10, 100, 200 or 1000 computer-readable reports on cancer data.
- the plurality of computer-readable reports on cancer data may be reports related to the same patient and/or to differing patients.
- a further test report set comprising even further data may be used to fine-tune an already fine-tuned foundation model.
- test question set is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a preselected question set with which the cancer data reports are evaluated, in order to fine-tune the foundation model.
- the answers to this "test question set" may comprise the ground truth to which the foundation models are fine-tuned or trained.
- the test questions may comprise at least one positive test question, wherein the positive test question is answerable on the basis of the report on cancer data of the patient.
- a test question that is answerable on the basis of the report on cancer data of the patient may be a test question of which the answer is comprised by the report on cancer data of the patient.
- the foundation model may, consequently, generate a substantive answer to the positive test question without having to hallucinate.
- the test questions may comprise at least one negative test question, wherein the negative test question is not answerable on the basis of the report on cancer data of the patient.
- a test question that is not answerable on the basis of the report on cancer data of the patient may be a test question of which the answer is not comprised by the report on cancer data of the patient.
- the foundation model may, consequently, only generate a substantive answer to the negative test question by speculating or hallucinating. In order to not hallucinate, the foundation model may have to indicate that it cannot provide an answer to the negative test question.
- the foundation model may perform a self-check by generating a plurality of answers, such as 3 or more answers, assessing the consistency of said plurality of answers by using BERT and/or n-gram methods, and judging the confidence in giving a correct answer based on said consistency. Inconsistent results may be flagged as likely hallucinations, inducing the model to answer "I don't know" or to ask for more details.
- hallucinate or any grammatical variation thereof, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a phenomenon where a language model generates information that is not accurate and/or grounded in reality.
- External hallucination may occur when the language model generates text that comprises false and/or fabricated information that appears to be grounded in external reality but is not.
- External hallucination may comprise making up at least one fact, detail or source that does not exist.
- Internal hallucination may occur when the language model generates text that is inconsistent and/or incoherent within the context of the conversation or text itself. Internal hallucination may comprise generating text contradicting itself and/or generating text that comprises irrelevant and/or nonsensical content.
- fine-tuning or any grammatical variation thereof, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning.
- the term specifically may refer, without limitation, to a process in which a pre-trained model, such as the foundation model, is further trained on a specific dataset in order to adapt the foundation model to a specific task.
- Fine-tuning may commonly be used on language models in order to improve the performance of the language model on a specific task.
- Fine-tuning may comprise at least one optimization or tuning process, wherein a best parameter combination of the foundation model is determined.
- the test report set may be retrieved by the foundation model.
- the foundation model may further retrieve the test questions.
- the foundation model may generate answers to the test questions. These answers may be compared with the known answers to the test questions, preferably in a training environment.
- fine-tuning the model may comprise selecting a loss function.
- the loss function may determine a cross-entropy loss, preferably when fine-tuning text generation or classification.
- fine-tuning the model may comprise selecting a metric.
- a typical metric used to fine-tune for text generation may be Perplexity.
- typical metrics used to fine-tune for classification may be Accuracy, Precision, Recall, and F1 Score.
- fine-tuning the model may comprise selecting an optimizer, typically AdamW, SGD with Momentum, Lion, Lomo, Amos, Sophia or Confidence-guided Adaptive Memory Efficient Optimization (CAME) and/or variants thereof.
- a method of generating a proposal for a medical treatment of a patient comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly. The method may also comprise further method steps which are not listed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
- the method comprises using the method of generating information on a health condition of a patient as elsewhere disclosed herein.
- the method comprises, in step i., providing a computer-readable report on cancer data of the patient to the method of generating context-related information on a health condition of a patient, wherein the method further comprises providing, in step iii., a question relating to at least one possible treatment of the patient.
- an alternative computer-implemented method of generating information, specifically context-related information, on a health condition of a patient comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly. The method may also comprise further method steps which are not listed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
- the alternative method comprises the following steps: i. retrieving, specifically by at least one of a computer, a computer system or a computer network, a computer-readable report on cancer data of the patient; ii. retrieving, specifically by at least one of a computer, a computer system or a computer network, at least one item of literature information obtained by a literature search; iii. retrieving, specifically by at least one of a computer, a computer system or a computer network, computer-readable question data comprising information on a context-related user question; iv.
- using, specifically by at least one of a computer, a computer system or a computer network, at least one trained language model, specifically at least one trained large language model, for automatically generating a context-related answer to the user question based on the computer-readable report on cancer data of the patient and the at least one item of literature information.
- a retrieving module configured for retrieving a computer-readable report on cancer data of the patient
- an input module configured for providing computer-readable question data comprising information on a context-related user question
- a processing module comprising at least one trained language model, specifically at least one trained large language model, the processing module being configured for automatically generating, by using the vector store, a context-related answer to the user question, and preferably for self-checking that answer for possible hallucination;
- V. a searching module configured for automatically performing a literature search in at least one external data source on at least one subject related to the user question; and VI. an evaluation module configured for creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v., and preferably for displaying, by using a display, the source from which the information on the health condition is derived, preferably in a manner that the user can see where the information on the health condition is obtained from.
- a computer program comprising instructions which, when the program is executed by a computer or a computer network, cause the computer or computer system to perform at least one of the method of generating information on a health condition of a patient as elsewhere disclosed herein; the training method as elsewhere disclosed herein; the method of generating a proposal for a medical treatment of a patient as elsewhere disclosed herein.
- a computer-readable storage medium comprising instructions which, when the instructions are executed by a computer or a computer system, cause the computer or computer network to perform at least one of the method of generating information on a health condition of a patient as elsewhere disclosed herein; the training method as elsewhere disclosed herein; the method of generating a proposal for a medical treatment of a patient as elsewhere disclosed herein.
- the “computer-readable storage medium” specifically may refer to non-transitory data storage means, such as a hardware storage medium having stored thereon computer-executable instructions.
- the stored computer-executable instructions may be associated with the computer program.
- the computer-readable data carrier or storage medium specifically may be or may comprise a storage medium such as a random-access memory (RAM) and/or a read-only memory (ROM).
- the proposed computer-implemented method of generating information on a health condition of a patient, training method of training a language model, method of generating a proposal for a medical treatment of a patient, computer system for generating information on a health condition of a patient, computer program and computer-readable storage medium provide many advantages over known devices and methods.
- Specific answers on cancer reports may be generated that help in understanding the cancer report, preferably wherein an experience of the user is considered when generating the answer such that the answer can be understood by the user. This may particularly be achieved by providing context-related answers and/or using a chat bot.
- Embodiment 1 A computer-implemented method of generating information on a health condition of a patient, comprising: i. retrieving a computer-readable report on cancer data of the patient; ii. creating a vector store on the report on cancer data, by
- Embodiment 31 A training method of training a language model, specifically a large language model, for use in step iv. of the method according to any one of the preceding Embodiments, the method comprising: - providing a foundation model;
- test question set comprising a plurality of training questions relating to the test report of the test question set, wherein the answers to the test questions are known;
- the answer to the user question may comprise at least one of information on at least one biomarker or a plurality of biomarkers, such as a driver mutation indicating that the patient may benefit from at least one specific therapy; information indicating genomic clues to resistance that may imply the patient’s lack of benefit from at least one specific therapy; information on potential pathogenic or likely pathogenic DNA alterations in the patient’s genome that may raise the possibility of inherited germline alterations; information on an indication of a disease recurrence or a probability of a lack of the disease recurrence, preferably which may imply a need of a patient for a further therapy or a lack of a need for further therapy or cessation of therapy.
- the answer to the user question may comprise a summary of at least one portion of the report on cancer data of the patient having relevance to the user question.
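The self-check described above, in which several answers are generated and their mutual consistency is used to judge confidence, can be illustrated with a minimal sketch. It assumes n-gram overlap (Jaccard similarity of word bigrams) as the consistency measure; the function names, the threshold of 0.5 and the fallback text are hypothetical choices made for illustration and are not taken from the disclosure. A BERT-based similarity could be substituted for the bigram overlap.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Return the set of word n-grams of a text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(answers, n=2):
    """Mean pairwise n-gram overlap across all generated answers."""
    grams = [ngrams(answer, n) for answer in answers]
    pairs = list(combinations(grams, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def self_check(answers, threshold=0.5, fallback="I don't know"):
    """Return (answer, score); fall back when the answers disagree.

    The threshold is an assumed value and would need tuning in practice.
    """
    if len(answers) < 2:
        raise ValueError("at least two answers are needed to assess consistency")
    score = consistency(answers)
    if score >= threshold:
        return answers[0], score
    return fallback, score
```

In this sketch, a low score stands for the case in which inconsistent results are flagged as likely hallucinations, so that the fallback answer is returned instead of a substantive one.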
Landscapes
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The present invention relates to a computer-implemented method of generating information on a health condition of a patient, comprising: i. retrieving a computer-readable report on cancer data of the patient; ii. creating a vector store on the report on cancer data, by tokenizing the report and creating tokens; embedding the tokens, thereby creating vectors; and storing the vectors in the vector store; iii. retrieving computer-readable question data comprising information on a user question, specifically a context-related user question; iv. using at least one trained language model, specifically at least one large language model, for automatically generating, by using the vector store, an answer to the user question, specifically a context-related answer to the user question; v. automatically performing a literature search in at least one external data source on at least one subject related to the user question; and vi. creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v.
Description
Method and system for generating information on a health condition of a patient
Technical Field
The invention relates to a computer-implemented method of generating information on a health condition of a patient, a training method of training a language model, a method of generating a proposal for a medical treatment of a patient, a computer system for generating information on a health condition of a patient, a computer program and a computer-readable storage medium.
The methods and the computer system according to the present invention may mainly be used for evaluating cancer data of a patient. Other applications are feasible.
Background art
US 2024029848 A1 discloses a computer implemented method for generating a text report including receiving input data including at least one table, using a first generative model to identify one or more variables in the input data and generate a table extract comprising the identified variables in a specified order, and using a second generative model to generate a text report based on the table extract, the text report including each of the extracted variables in the specified order. A system for simulating the healthcare journey of one or more patients is also disclosed. The system receives patient data; creates a simulation model of the patient; executes the simulation model to predict health variables; generates treatment variables; provides the predicted health variables, the treatment variables and clinician inputs to the simulation model for continuous learning of the simulation model.
US 2021118559 A1 discloses a system and method, the method comprising the steps of receiving a laboratory diagnostic testing result associated with a specimen of a subject, receiving a clinomic profile of the subject, identifying a cohort of similar subjects based at least in part on the clinomic profile of the subject, providing the diagnostic testing results, clinomic profile, and the cohort of similar subjects to a smart output module to generate a personalized, precision medicine based laboratory diagnostic testing result as a smart output and displaying the smart output to a user.
US 2019108898 A1 discloses a system and method for enhancing the efficiency and accuracy of analysis and interpretation of medical diagnostic laboratory test data for real-time clinical decision support, utilizing artificial intelligence techniques to automatically improve analytical performance and enhance provider and patient communications.
Further information may be derived from one or more of the following references:
- Text Embeddings by Weakly-Supervised Contrastive Pre-training, by Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, 2024, 2212.03533, https://arxiv.org/pdf/2212.03533
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, by Gu, Yu and Tinn, Robert and Cheng, Hao and Lucas, Michael and Usuyama, Naoto and Liu, Xiaodong and Naumann, Tristan and Gao, Jianfeng and Poon, Hoifung, ACM Transactions on Computing for Healthcare, vol. 3, 2637-8051, 2021, https://arxiv.org/abs/2007.15779
- Towards General Text Embeddings with Multi-stage Contrastive Learning, by Zehan Li and Xin Zhang and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang, 2023, 2308.03281, https://arxiv.org/abs/2308.03281
- Scaling Instruction-Finetuned Language Models, by Hyung Won Chung et al, 2022, 2210.11416, https://arxiv.org/abs/2210.11416
- Mixtral of Experts, Albert Q. Jiang et al, 2024, 2401.04088, https://arxiv.org/abs/2401.04088
- meta-llama/llama3, https://github.com/meta-llama/llama3/tree/main (June 20, 2024)
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Marah Abdin et al, 2024, 2404.14219, https://arxiv.org/abs/2404.14219
- Adapting Large Language Models via Reading Comprehension, Daixuan Cheng, Shaohan Huang, Furu Wei, The twelfth International Conference on Learning Representations, 2024, https://openreview.net/forum?id=y886UXPEZ0
- BioGPT: generative pre-trained transformer for biomedical text generation and mining, Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, Tie-Yan Liu, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac409, https://doi.org/10.1093/bib/bbac409
- OpenBioLLMs: Advancing Open-Source Large Language Models for Healthcare and Life Sciences, https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B (June 20, 2024)
US 2021082561 A1 relates to a system for automatically generating a field of a radiology report including a set of one or more models. A method for automatically generating a field of a radiology report includes: receiving a radiologist identifier (radiologist ID); receiving a set of finding inputs; determining a context of each of the set of finding inputs; determining text associated with a portion or all of the radiology report based on the context and the radiologist style; and inserting the text into the report.
US 2024095455 A1 relates to a multi-modal end to end learning system configured to answer questions about clinical documents like patient notes, medical reports, and lab results. Documents are polled from an electronic medical record system, converted to text, and scrubbed for protected health information before processing. Sanitized text data is then fed as context to a language model that has been fine-tuned for question-answering (QA). The other input to the model is a prompt or a question that is either provided on-the-fly by a clinician as part of a search or pre-determined for specific needs. In return, the model outputs an answer highlighting part of the text/image where it found the answer and a confidence score quantifying the likelihood of the answer being correct. A clinician can optionally correct the answer if needed. This feedback by the clinician is fed back to a fine-tuner module and used to improve the model over time.
Zhu Libing et al describe in "Testing and Validation of a Custom Retrained Large Language Model for the Supportive Care of HN Patients with External Knowledge Base", CANCERS, vol. 16, no. 13, 24 June 2024, page 2311, a study aimed to develop a retrained large language model (LLM) tailored to the needs of HN cancer patients treated with radiotherapy, with emphasis on symptom management and survivorship care. A comprehensive external database was curated for training ChatGPT-4, integrating expert-identified consensus guidelines on supportive care for HN patients and correspondences from physicians and nurses within our institution's electronic medical records for 90 HN patients. The performance of the model was evaluated using 20 patient post-treatment inquiries that were then assessed by three Board certified radiation oncologists (RadOncs). The custom-trained model demonstrates high accuracy in providing support to HN patients offering evidence-based information and guidance on their symptom management and survivorship care.
Problem to be solved
It is therefore desirable to provide a computer-implemented method of generating information on a health condition of a patient, a training method of training a language model, a method of generating a proposal for a medical treatment of a patient, a computer system for generating information on a health condition of a patient, a computer program and a computer-readable storage medium, which solve at least one of the problems of the prior art.
It is further desirable that answers to specific questions on cancer reports are generated that help in understanding the cancer report, preferably wherein an experience of the user is considered when generating the answer such that the answer can be understood by the user.
It is further desirable that answers to specific questions on cancer reports are generated that may comprise reduced hallucination content introduced by a language model used to generate the answers.
It is further desirable that answers to specific questions on cancer reports are generated that comprise up-to-date information.
Summary
This problem is addressed by a computer-implemented method of generating information on a health condition of a patient, a training method of training a language model, a method of generating a proposal for a medical treatment of a patient, a computer system for generating information on a health condition of a patient, a computer program and a computer-readable storage medium with the features of the independent claims. Advantageous embodiments which might be realized in an isolated fashion or in any arbitrary combinations are listed in the dependent claims as well as throughout the specification.
As used in the following, the terms “have”, “comprise” or “include” or any arbitrary grammatical variations thereof are used in a non-exclusive way. Thus, these terms may both refer to a situation in which, besides the feature introduced by these terms, no further features are present in the entity described in this context and to a situation in which one or more further features are present. As an example, the expressions “A has B”, “A comprises B” and “A includes B” may both refer to a situation in which, besides B, no other element is present in A (i.e. a situation in which A solely and exclusively consists of B) and to a situation in which, besides B, one or more further elements are present in entity A, such as element C, elements C and D or even further elements.
Further, it shall be noted that the terms “at least one”, “one or more” or similar expressions indicating that a feature or element may be present once or more than once typically will be used only once when introducing the respective feature or element. In the following, in most cases, when referring to the respective feature or element, the expressions “at least one” or “one or more” will not be repeated, notwithstanding the fact that the respective feature or element may be present once or more than once.
Further, as used in the following, the terms "preferably", "more preferably", "particularly", "more particularly", "specifically", "more specifically" or similar terms are used in conjunction with optional features, without restricting alternative possibilities. Thus, features introduced by these terms are optional features and are not intended to restrict the scope of the claims in any way. The invention may, as the skilled person will recognize, be performed by using alternative features. Similarly, features introduced by "in an embodiment of the invention" or similar expressions are intended to be optional features, without any restriction regarding alternative embodiments of the invention, without any restrictions regarding the scope of the invention and without any restriction regarding the possibility of combining the features introduced in such way with other optional or non-optional features of the invention.
In a first aspect of the present invention, a computer-implemented method of generating information, specifically context-related information, on a health condition of a patient is proposed. The method comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly. The method may also comprise further method steps which are not listed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
The method comprises the following steps:
i. retrieving, specifically by at least one of a computer, a computer system or a computer network, a computer-readable report on cancer data of the patient;
ii. creating, specifically by at least one of a computer, a computer system or a computer network, a vector store on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store;
iii. retrieving, specifically by at least one of a computer, a computer system or a computer network, computer-readable question data comprising information on a context-related user question;
iv. using, specifically by at least one of a computer, a computer system or a computer network, at least one trained language model, specifically at least one trained large language model, for automatically generating, by using the vector store, a context-related answer to the user question;
v. automatically performing, specifically by at least one of a computer, a computer system or a computer network, a literature search in an external data source on at least one subject related to the user question; and
vi. creating, specifically by at least one of a computer, a computer system or a computer network, the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v.
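Steps ii., iv. and vi. above may be illustrated by a minimal sketch under simplifying assumptions. A word-level tokenizer and sparse word-count vectors stand in for a real tokenizer and a learned embedding model, the report is split into sentence-like chunks, and the language model of step iv. is not implemented; only the retrieval of context from the vector store is shown. All names (`tokenize`, `embed`, `VectorStore`, `build_store`, `enrich`) are hypothetical and do not stem from the disclosure.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(tokens):
    """Map tokens to a sparse count vector (stand-in for a learned embedding)."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class VectorStore:
    """Keeps (vector, chunk) pairs and returns the chunks closest to a question."""

    def __init__(self):
        self.entries = []

    def add(self, chunk):
        self.entries.append((embed(tokenize(chunk)), chunk))

    def search(self, question, k=1):
        query = embed(tokenize(question))
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_store(report):
    """Step ii.: tokenize, embed and store each sentence-like chunk of the report."""
    store = VectorStore()
    for chunk in re.split(r"(?<=[.;])\s+", report.strip()):
        if chunk:
            store.add(chunk)
    return store


def enrich(answer, literature_items):
    """Step vi.: append (source, text) literature items to the answer."""
    return "\n".join([answer] + [f"[{source}] {text}" for source, text in literature_items])


if __name__ == "__main__":
    report = "Tumor mutational burden is 12 mutations per megabase. EGFR exon 19 deletion detected."
    store = build_store(report)
    print(store.search("Was an EGFR deletion detected?"))
```

In a fuller implementation, the chunks returned by `search` would be passed as context to the trained language model, and each literature item would keep its source so that the user can see where the information was obtained from.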
The term “computer implemented method” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a method involving at least one computer and/or at least one computer network. The computer and/or computer network may comprise at least one processor which is configured for performing at least one of the method steps of the method according to the present invention. Preferably each of the method steps is performed and/or supported by the computer and/or computer network. Therein, all of the method steps may be performed by or on one and the same computer or computer network. Alternatively, two or more of the method steps may also be performed by separate but interacting computers or computer networks. As an example, all of the method steps may be performed on a local computer or server, such as by using the RAM of the local computer or server. Alternatively, however, the method may fully or partially be embodied by using delocalized computer resources, such as cloud computing.
The method may be performed completely automatically, specifically without user interaction. Alternatively, as will be outlined in further detail below, one or more method steps may imply user interaction.
As generally used herein, the term “computer” may refer to a device, system or network being configured for data processing, specifically by having at least one processor and optionally further elements, such as, for example, one or more interfaces, one or more data storage devices, one or more display devices and the like. The computer may optionally be cloud-based.
The term “processor”, also referred to as a “processing unit” or as a “processing device”, as generally used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to an arbitrary logic circuitry configured for performing basic operations of a computer or system, and/or, generally, to a device or system which is configured for performing calculations or logic operations. In particular, the processing unit may be configured for processing basic instructions that drive the computer or system. As an example, the processing unit may comprise at least one arithmetic logic unit (ALU), at least one floating-point unit (FPU), such as a math co-processor
or a numeric coprocessor, a plurality of registers, specifically registers configured for supplying operands to the ALU and storing results of operations, and a memory, such as an L1 and L2 cache memory. In particular, the processing unit may be a multi-core processor. Specifically, the processing unit may be or may comprise a central processing unit (CPU). Additionally or alternatively, the processing unit may be or may comprise a microprocessor, thus specifically the processing unit’s elements may be contained in one single integrated circuitry (IC) chip. Additionally or alternatively, the processing unit may be or may comprise one or more application-specific integrated circuits (ASICs) and/or one or more field-programmable gate arrays (FPGAs) or the like. The processing unit specifically may be configured, such as by software programming, for performing one or more evaluation operations.
The term “information on a health condition”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to arbitrary information relating to the overall well-being and/or functioning of one or more of the body, the mind, and the spirit of a person. It may encompass one or more of physical, mental, and social aspects. Specifically, the information on the health condition may comprise one or more items of qualitative and/or quantitative information on the person’s body condition, such as including one or more items of information on at least one of: the presence or absence of a specific disease of the person in the person’s body; possible treatments of one or more diseases present in the person’s body; a physiological state or condition of the person’s body; possible means of disease prevention to be practiced by the person and/or to be practiced on the person’s body; the physical fitness of the person; a probability of a benefit from performing at least one specific medical intervention; a probability of a benefit from not performing and/or ceasing to perform at least one specific medical intervention, such as by watchful waiting and/or by a de-escalation of a specific medical intervention. Alternatively or in addition, the information on the health condition may include one or more items of information on a likely lack of benefit from at least one specific medical intervention. A specific medical intervention may be a pharmacotherapy, a nutritional change and/or an exercise.
The term “patient”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a human or animal being or person subject to medical examination and/or treatment. The term may be used both for persons having diseases and for persons not having diseases or not having the relevant diseases, wherein the latter, still, may be subject to medical examination and/or observation.
The term “retrieve”, as well as any grammatical variations thereof, as used herein in the context of data processing, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to the process, specifically performed by a system, more specifically by a computer system, of generating data and/or obtaining data from an arbitrary data source, such as from a data storage, via a data network or interface, from a further computer or computer system, or by user input. The retrieving specifically may take place via at least one computer interface, such as via a port such as a serial or parallel port, and/or by a user interface. The retrieving may comprise several sub-steps, such as the sub-step of obtaining one or more items of primary information and generating secondary information by making use of the primary information, such as by applying one or more algorithms to the primary information, e.g. by using a processor.
The retrieving step specifically may be performed by at least one computer. As an example, the method may imply uploading and/or downloading the computer-readable report from an external source, such as from a computer readable storage medium, from a database, specifically a web-accessible database, from the Internet or from one or more other sources. The retrieving step may comprise the computer prompting a user to define or select the source of the retrieving, specifically an address from which the report may be downloaded, such as a storage address or a URL.
The term “report”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to an arbitrary document containing information on at least one subject.
The term “cancer data”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to medical or biological information on the status and/or the functioning of the body of a patient having cancer or suspected of having cancer, wherein the information, specifically, may assist one or more of the diagnosis of the cancer and/or the stage of the cancer, a prognosis of the cancer and/or the stage of the cancer, and possible treatments of the cancer, preferably comprising the probability of a response and/or a lack of a response to at least one possible treatment and/or other intervention, such as a diet, an exercise and/or the withholding of at least one treatment.
The term “computer-readable”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a
special or customized meaning. The term specifically may refer, without limitation, to an arbitrary object which may be processed by a computer, such as a computer file which may be read and processed by at least one processor and/or at least one computer program running on the at least one processor. The reading of the object, specifically the file, by the computer, specifically the at least one processor, may also comprise one or more steps of transforming the object, such as one or more preprocessing steps such as steps transforming data from one readable format into another one, such as one or more steps of optical character recognition, as will be outlined in further detail below.
Step i. may comprise retrieving a document containing the cancer data of the patient, specifically a .pdf document, wherein step i. further comprises transforming the document into a computer-readable format, specifically a computer-readable format usable in step ii., more specifically using a .pdf parsing library or an optical character recognition (OCR), preferably by using at least one machine learning model configured for performing OCR, such as a Visual Language Model (VLM).
The term “document containing the cancer data”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to any arbitrary document comprising the cancer data. Said document may comprise the cancer data in an arbitrary form and/or an arbitrary format. The document containing the cancer data may be provided from at least one laboratory, at least one patient medical record, specifically in a structured and/or an unstructured form, e.g. in the form of at least one hospital electronic medical records system and/or insurance claims system, at least one handwritten note, at least one spoken dictation, or any other reliable source.
The term “transform”, or any grammatical variation thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a process of converting data from one format, structure and/or value set to another. Of particular interest for the present invention may be that the document format may be converted into computer-readable text, specifically continuous text or running text, wherein the structure of the document may be conserved during the transformation process. Thereby, sections comprised by the document may still be identifiable.
Step ii. may comprise vectorizing the entire computer-readable report on the cancer data of the patient, or at least a part thereof. The term “vectorize”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of language models, is
a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to the transformation of arbitrary data into vectors, thereby representing the data by corresponding vectors. The term “vector”, as used herein and as specifically used in the context of language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a mathematical representation of data or at least one data object, such as a text, image or audio data object, in a high-dimensional space, such as in a space having a dimensionality of at least 3, more specifically of at least 10 or even at least 100. The number of dimensions typically, in the context of language models, may even range from a few hundreds to tens of thousands or even beyond, depending on the data to be represented. In this vector space, each dimension may correspond to a feature of the data. A vector's position in this space may represent its characteristics. Thereby, data including words, phrases, entire documents, images, audio, and other types of data can be vectorized.
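The vectorization described above may be illustrated by the following minimal sketch. It is not the embedding model of the method; it assumes a simple hashing scheme in which each word is counted in one of `DIM` buckets, and the names `vectorize`, `cosine` and `DIM` are hypothetical.

```python
import hashlib
import math
import re

DIM = 64  # illustrative dimensionality (assumption, not taken from the text)


def vectorize(text: str, dim: int = DIM) -> list[float]:
    """Map a text to a unit-length vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash keeps the mapping reproducible between runs.
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors that are already unit-length."""
    return sum(x * y for x, y in zip(a, b))
```

Texts sharing words receive a positive similarity in this sketch, whereas a trained embedding model may additionally capture semantic relatedness between different words.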
The term “vector store”, as used herein and as specifically used in the context of language models, more specifically in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a computerized and/or computer-accessible collection of vectors, such as on at least one computer-readable storage medium comprising the vectors. The vector store specifically may comprise at least one of a vector database, a vector library and a vector plugin. Specifically, the vector store may comprise a vector database, having database properties, such as Create-Read-Update-Delete (CRUD). Thereby, a short query time and a high scalability may be achieved, and query search for the closest vectors may be sped up. Other options, however, are also feasible.
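As a minimal sketch of a vector store having Create-Read-Update-Delete properties and a search for the closest vectors, the following in-memory example may be considered. The class `VectorStore` and its method names are hypothetical, and a linear scan is assumed in place of the indexing a vector database would typically use.

```python
import math
from dataclasses import dataclass, field


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class VectorStore:
    """Toy in-memory vector store: key -> (vector, original text)."""
    _items: dict = field(default_factory=dict)

    def create(self, key: str, vector: list[float], text: str) -> None:
        if key in self._items:
            raise KeyError(f"{key} already exists")
        self._items[key] = (vector, text)

    def read(self, key: str) -> tuple:
        return self._items[key]

    def update(self, key: str, vector: list[float], text: str) -> None:
        if key not in self._items:
            raise KeyError(key)
        self._items[key] = (vector, text)

    def delete(self, key: str) -> None:
        del self._items[key]

    def query(self, vector: list[float], k: int = 1) -> list[str]:
        """Return the keys of the k stored vectors closest to `vector`."""
        ranked = sorted(
            self._items,
            key=lambda key: _cosine(vector, self._items[key][0]),
            reverse=True,
        )
        return ranked[:k]
```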
The term “tokenize”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to the process of breaking down a data object, such as a text data object, an image data object or an audio data object, into a plurality of smaller units or elements, called “tokens”. Consequently, the term “token”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to unit or element of a data object, such as of a text data object, an image data object or an audio data object. As an
example, a text data object, such as the computer readable report on the cancer data of the patient, may be broken, in the process of tokenizing, down into tokens in the form of one or more of passages, paragraphs, phrases, sentences, words, sub-words, punctuation marks, characters. The specific choice of the tokens may depend on the specific tokenization strategy. Tokenization, thus, may refer to a process of splitting a text into tokens, which may be helpful in completing machine learning tasks including text summarization, text classification, sentiment analysis, machine translation, and named entity recognition. By breaking down the text into tokens, i.e., by tokenization, it may be possible to apply statistical and machine learning techniques to analyze and process natural language data.
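Two of the tokenization strategies named above, splitting into words and punctuation marks and splitting into sentences, may be sketched as follows. The function names and the regular expressions are illustrative assumptions; language models typically use learned sub-word tokenizers instead.

```python
import re


def tokenize_words(text: str) -> list[str]:
    """Split a text into word tokens and single punctuation-mark tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_sentences(text: str) -> list[str]:
    """Split a text into sentence tokens at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```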
The term “embed”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to the process of generating, from an input, such as a text input, e.g. typed and/or prespecified, an image input or an audio input, specifically from tokens as defined above, one or more of a numeric, vector, or spatial representation of the input, specifically by using at least one encoder. Thus, specifically, the embedding may imply a vectorization. The embedding may comprise mapping the input, such as the tokens, to a multidimensional vector representation.
Creating a vector store on the report on cancer data in step ii. further may comprise segmenting the report on cancer data. The term “segment”, as well as any grammatical variations thereof, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to dividing and/or splitting the data into smaller, manageable units or segments before processing and storing them as vectors. Segmenting may comprise splitting said data into at least two distinct portions. The data may be divided and/or split in accordance with splitting criteria, such as a document structure, e.g. paragraphs, number of sentences; a data type, e.g. text, image; or specific features, e.g. time intervals, categories. Each segment may be stored individually in the vector store.
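A minimal sketch of segmenting by document structure is given below, assuming that paragraphs are separated by blank lines and that each segment holds at most `max_sentences` sentences. The function name and the default value are hypothetical.

```python
import re


def segment_report(text: str, max_sentences: int = 2) -> list[str]:
    """Split a report into segments that never cross a paragraph boundary."""
    segments = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        # Group consecutive sentences of the paragraph into one segment each.
        for i in range(0, len(sentences), max_sentences):
            segments.append(" ".join(sentences[i:i + max_sentences]))
    return segments
```

Each returned segment could then be embedded and stored individually in the vector store.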
Step ii. specifically may be performed automatically, e.g. when a computer program implementing the computer-implemented method is started and/or once the computer-readable report on the cancer data of the patient has been retrieved.
Step ii. may be performed by using at least one machine learning model configured for creating the vector store, preferably wherein the at least one machine learning model configured
for creating a vector store is distinct from the trained language model used in step iv.. The at least one machine learning model configured for creating the vector store may be a model and the trained language model may be a further model, wherein the model and the further model may differ from each other. The machine learning model configured for creating the vector store may be a trained machine learning model.
The at least one machine learning model configured for creating the vector store may comprise a model selected from the group consisting of:
- E5-large-v2 Embeddings;
- PubMedBERT Embeddings;
- GTE-large-en-v1.5;
- FLAN-T5-base.
As outlined above, in step iii., computer-readable question data comprising information on a user question, specifically a context-related user question, is retrieved. The retrieval process may comprise one or more of an input, an upload, a reading or any other type of providing the computer-readable question data, specifically to at least one processor.
The term “user question”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a request for at least one specific item of information and/or a command for generating at least one specific item of information, initiated by a user. The user question may be a question, a request or a command. Consequently, the user question may be or may be also referred to as a user query. The user question may be an input to a computer or a computer system, such as via at least one user interface. The user may be identical to the patient as defined above or may be distinct from the patient. Specifically, the user may be a user of a software implementing the method. The user specifically may be a healthcare practitioner, such as a doctor, or a laboratory specialist tasked with writing clinical summaries of test results. The retrieval process, as an example, may comprise prompting the user to input the user question, such as in a text format or in an audio format. The method specifically may comprise providing a chat window to the user, e.g. on a user interface such as a graphical user interface, allowing the user to input the user question, e.g. as text data or audio data. The user question may then be transformed in one or more steps.
The term “computer-readable question data”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to computer-readable data representing the user question. Again, as will be outlined in further detail below, this computer-readable representation may be the result of a vectorization or encoding of the user question. Thus, again, the user question may optionally be tokenized, such as by using the tokenization strategy or tokenization scheme as used in step ii., and the user question and/or the tokens generated thereof may be embedded, thereby mapping the user question into a multidimensional space.
The term “context”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to the surrounding information or text that the model considers when processing language. By considering more information at once, the model can better understand the comprehensive perspective and generate more relevant responses. The term “comprehensive perspective” may refer to an understanding of a situation by knowing at least one relevant factor, at least one influence and/or at least one relationship between a plurality of details. For a language model, context may comprise a preceding text, a user intent, a topic and/or a conversation history. In the method as presently proposed, the context specifically may be the healthcare situation of the patient, as reflected e.g. by the computer-readable report on cancer data of the patient. Alternatively or in addition, the context may apply to other fields in healthcare or outside healthcare where expertise is scarce. The context may further be customized by selecting at least one of
- a level of detail of the answer;
- a use of technical terms;
- a length of the answer;
- a user role;
- a number of data input points;
- a type of data input.
The computer-readable question data of step iii may be provided by a user input into an input module. The term “input module”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a hardware component configured to provide the question data. The input module may be selected from at least one of a storage device, specifically an external storage device, a scanner, a prompt on a computer
device, specifically a mobile device. The input may be typed and/or transcribed from an audio-to-text prompt, such as without typing. Alternatively or in addition, the input may be pre-specified.
The term “pre-specified”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to at least one predefined item, such as an item that may be selected from a predefined list of items. Such a pre-specified input or user question may be indicated to the user. The user may select said pre-specified input that is forwarded or input into the language model. Alternatively, a pre-specified input may be at least one standard user question that may be processed before requesting user input. Said pre-specified input may be forwarded or input into the language model automatically. For example, a report on cancer data may be first analyzed by using a standard set of pre-specified user questions, such as 5 to 10 pre-specified user questions. This may ensure that the user, such as an oncologist, receives the most relevant results. Then the input module may allow the user to ask at least one further user question on the report on cancer data and/or other related topics.
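The automatic processing of a standard set of pre-specified user questions before requesting user input may be sketched as follows. The question texts, the function `run_prespecified` and the callable `answer_fn`, which stands in for the language model pipeline, are hypothetical and serve illustration only.

```python
from typing import Callable

# Hypothetical standard questions; a real set might contain 5 to 10 entries.
STANDARD_QUESTIONS = [
    "Which biomarkers are reported?",
    "Are there genomic clues to resistance?",
    "Are there potentially pathogenic germline alterations?",
]


def run_prespecified(
    report: str,
    answer_fn: Callable[[str, str], str],
    questions: list[str] = STANDARD_QUESTIONS,
) -> dict[str, str]:
    """Answer every pre-specified question on the report, in list order."""
    return {question: answer_fn(report, question) for question in questions}
```

After these answers have been presented, the input module could accept further free-text questions and pass them to the same `answer_fn`.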
The method may comprise providing a plurality of pre-defined user questions, wherein, in step iii., the user selects the user question from the plurality of pre-defined user questions. The term “pre-defined user questions”, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a plurality of questions that are selected in advance of at least one step of the method or the method being performed. The pre-defined user questions may be indicated to the user. The user may select at least one of the pre-defined user questions.
The report on cancer data may comprise genomic data of the patient. The term “genomic data”, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to information on at least a portion of a deoxyribonucleic acid of a biological sample, preferably isolated from a cell, more preferably from a cancer cell. The genomic data may include at least one gene or biomarker associated with the cancer. The term “biological sample” refers to a sample of a body fluid, to a sample of separated cells or to a sample from a tissue or an organ comprising or suspected to comprise a cancer-driving cell population. Samples of body fluids can be obtained by well-known techniques and include, preferably, samples of blood. Tissue or organ samples, such as bone marrow samples, may be obtained by, e.g., biopsy. Separated cells
may be obtained from the body fluids or the tissues or organs by separating techniques such as centrifugation or cell sorting.
The answer to the user question may comprise at least one of information on at least one biomarker or a plurality of biomarkers, such as a driver mutation indicating that the patient may benefit from at least one specific therapy; information indicating genomic clues to resistance that may imply the patient’s lack of benefit from at least one specific therapy; information on potential pathogenic or likely pathogenic DNA alterations in the patient’s genome that may raise the possibility of inherited germline alterations; information on an indication of a disease recurrence or a probability of a lack of the disease recurrence, preferably which may imply a need of a patient for a further therapy or a lack of a need for further therapy or cessation of therapy. The term “therapy” may refer to at least one potential intervention in order to fight a disease, specifically the cancer, such as a pharmacotherapy, a diet, a nutrition, an exercise, a surgery.
The term “predetermined therapy”, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a medical treatment being at least one attempted remediation of at least one health problem, typically following a medical diagnosis. The predetermined therapy may in the case of cancer be selected to kill and/or inhibit cancer cells.
The term “driver mutation”, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a somatic mutation in at least one signal transduction pathway, thereby giving the tumour cell a growth advantage and promoting the proliferation of the cancer cell.
The term “resistance”, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to an ability of the cancer cells to survive, adapt and/or continue growing despite the predetermined therapy.
The term “pathogenic DNA alterations”, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a genetic alteration that increases a susceptibility and/or predisposition of an organism to a certain disease or disorder. The term “pathogenic DNA alterations” may also refer to likely pathogenic DNA alterations. The user may further ask for information concerning at least one further alteration.
The term “inherited germline alterations”, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to an alteration in a reproductive cell that is propagated to the living organism and may confer susceptibility and/or predisposition of an organism to a certain disease or disorder.
In step iv., at least one trained language model is used, specifically at least one trained large language model, for automatically generating, by using the vector store, an answer to the user question, specifically a context-related answer to the user question. In order to define the terms as used herein and in order to describe the options for the trained language model, the following definitions and descriptions are given. Step iv. may comprise using the trained language model for tokenizing and embedding the computer-readable question data of step iii..
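One possible way of generating an answer by using the vector store is a retrieval step followed by a generation step, sketched below under stated assumptions: word-count vectors replace a trained embedding model, a plain list of segments replaces the vector store, and the callable `generate` is a hypothetical stand-in for the trained language model. The prompt layout is likewise an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding: word counts (assumption; not a trained encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def answer_question(
    question: str,
    segments: list[str],
    generate: Callable[[str], str],
    k: int = 2,
) -> str:
    """Retrieve the k segments closest to the question and pass them as context."""
    q_vec = embed(question)
    ranked = sorted(segments, key=lambda s: similarity(q_vec, embed(s)), reverse=True)
    context = "\n".join(ranked[:k])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```

Restricting the prompt to the closest segments keeps the answer related to the context of the report rather than to the general knowledge of the model alone.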
Thus, generally, the trained language model may be a trainable model pre-trained, trained or fine-tuned for natural language processing. Generally, the term “trained model”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a mathematical model which was trained on at least one training data set and which is configured for predicting at least one target variable for at least one input variable. The trained model may comprise, e.g., at least one model selected from the group consisting of: a linear regression model, e.g. comprising transformed features, such as log-transformed or polynomial; at least one non-linear Artificial Neural Network (ANN), in particular at least one deep learning architecture such as Convolutional NN, Recurrent NN, Long Short Term Memory NN, and the like; at least one Support Vector Machine (SVM); at least one kernel based method; Tree regression; Random Forest.
Consequently, the term “trained language model”, as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a trained model configured for natural language processing. More specifically, the trained language model may be a or may comprise at least one probabilistic model for language processing, such as a probabilistic model trained for one or more tasks selected from the group consisting of natural language generation, natural language classification,
machine translation, optical character recognition, handwriting recognition, and information retrieval. Other tasks in the field of language processing, including language generation, may be possible in addition or alternatively.
An input of the trained language model may be, at least in part, in natural language. An output of the trained language model may be, at least in part, a vector, specifically when requesting data from the vector store. An output of the trained language model may be, at least in part, in natural language, specifically when providing data to the internet agent.
The trained language model specifically may be or may comprise at least one large language model. The term “large language model” (LLM), as used herein, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a mathematical, i.e. a computational, model configured and trained for general-purpose language processing tasks, such as general-purpose language generation, text classification, text summarization, and the like. Large language models are trained to perform these language processing tasks by using huge amounts of text documents and a dedicated training process, such as a self-supervised and semi-supervised training process. The large language model specifically may be configured for generating text and performing other tasks of generative artificial intelligence. The large language model may be configured for using text input or tokens as defined above, and for processing the input in order to generate text output, specifically by prediction based on statistical means. The text may be natural text. The large language model specifically may be or may comprise at least one artificial neural network. The large language model may be configured to verify the generated answer in order to minimize hallucinations.
The large language model specifically may comprise at least one transformer and/or may comprise at least one transformer architecture. The trained language model may be selected from the group consisting of:
- Mistral Small 3.1;
- Qwen2.5 32B VL;
- Qwen3 32B;
- Gemma3 27B;
- Mixtral-8x7B;
- Llama 3 70B;
- Phi-3-medium.
Further, the trained language model may be a model such as:
- Medicine-chat;
- BioGPT-Large-PubMedQA; and/or
- OpenBioLLM-70B.
The term “transformer”, as used herein and as specifically used in the context of large language models, is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a mathematical or computational model having a deep learning or artificial neural network architecture. The transformer is configured for converting text into a numerical representation and/or for using text converted into numerical representations, i.e. for using tokenized text. The transformer further has a layer architecture for processing these tokens. In each layer, the tokens are contextualized by using a so-called attention mechanism amplifying signals for key tokens and diminishing signals for less important tokens. As an example, the large language model using at least one transformer and/or transformer architecture may comprise at least one pretrained model or architecture, such as at least one generative pre-trained transformer (GPT) and/or at least one Bidirectional Encoder Representations from Transformers (BERT) model. Generally, as the skilled person will know, the large language model specifically may comprise at least one transformer having an architecture selected from the group consisting of a decoder-only model, an encoder-only model, and an encoder-decoder model. Examples of models which may be used in the trained language model of step iv. will be given in further detail below.
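As a purely illustrative, non-limiting sketch, the attention mechanism mentioned above may be pictured as scaled dot-product attention over token vectors. The function names `attention_weights` and `attend` and the toy vectors are assumptions made for this illustration only and do not describe the internals of any of the models named herein.

```python
import math


def attention_weights(query, keys):
    """Return softmax-normalized, scaled dot-product scores of one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Contextualize one token: weighted sum of the value vectors."""
    weights = attention_weights(query, keys)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
```

A token whose key aligns with the query receives a large weight (its signal is amplified), while the weights of the remaining tokens shrink accordingly.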
In step iv., as outlined above, the at least one trained language model, specifically the at least one large language model, is used for automatically generating, by using the vector store, an answer to the user question, specifically on the basis of the vector store, i.e. the tokenized and embedded computer-readable report on cancer data of the patient. The term “automatically” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a process which is performed completely by means of at least one computer and/or computer network and/or machine, in particular without manual action and/or interaction with a user. Thus, herein, the trained language model may, specifically without user interaction, generate the answer to the user question. The answer, generally, may be a single answer or may comprise a plurality of answer components.
The vector store may comprise information in the form of vectors, natural language and/or metadata, preferably comprising information about a text passage to which a specific entry refers. Using the vector store may comprise performing a semantic search within the vector store based on the user question, particularly in order to find segments of the cancer report relevant to answering the user question. In the semantic search, a plurality of relevant entries in the vector store may be determined. The natural language may be provided to the trained language model. The metadata may be provided to an outputting module, such as a frontend.
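As a minimal sketch under stated assumptions, a semantic search of this kind may be pictured as ranking the entries of the vector store by cosine similarity to an embedded user question. The `Entry` class, the function names and the use of cosine similarity are assumptions made for illustration; the embedding of the question is taken as given.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entry:
    vector: List[float]            # embedded tokens of one report segment
    text: str                      # natural language of the segment
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. passage location


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(store: List[Entry], question_vector: List[float],
                    top_k: int = 2) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return the texts and metadata of the top_k most similar entries."""
    ranked = sorted(store, key=lambda e: cosine(e.vector, question_vector),
                    reverse=True)
    hits = ranked[:top_k]
    context = [e.text for e in hits]       # would go to the language model
    sources = [e.metadata for e in hits]   # would go to the outputting module
    return context, sources
```

Returning the texts and the metadata separately mirrors the split described above, in which the natural language is passed to the trained language model and the metadata to the frontend.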
Step iv. may comprise context-customizing the answer to the user. The term “context-customizing” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to an adaption of the answer to a context that determines at least one characteristic of the answer, preferably at least one characteristic of the answer related to a style of the answer. The at least one characteristic of the answer may be selected from at least one of: a level of detail of the answer; a use of technical terms; a length of the answer; a user role; a number of data input points; a type of data input.
The context-customizing may comprise taking into account at least one item of information on a role of the user, specifically a categorization of the user, more specifically a categorization into a category selected from the group consisting of: a layman patient or caregiver; a general medical practitioner; an oncology nurse or oncologist; a cancer genomicist; a laboratory director, a clinical specialist, a pathologist or a medical director; a clinical trialist, a monitor, a medical science liaison, a medical manager or another role involved in clinical trial management.
The context-customizing may comprise taking into account at least one item of information on the intended purpose of the answer, specifically information categorizing the purpose into a category selected from the group consisting of: a patient-doctor consultation; a preparation of a tumor board; evaluating patient or subject suitability for a clinical trial; assigning subjects to clinical trial treatment arms.
In step v., as outlined above, a literature search is automatically performed in at least one external data source on at least one subject related to the user question. The external data source may comprise at least one of: the internet; a publication server, specifically a medical publication server and/or pharmaceutical publication server; a data base comprising results of cancer genomics research; a data base comprising medical studies and/or pharmaceutical studies; a data base comprising associations between DNA alterations and sensitivity or resistance to at least one specific therapy.
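By way of a hedged example, the role and purpose information described above may be passed to the trained language model as part of a prompt. The names `ROLE_STYLES`, `DEFAULT_STYLE` and `build_prompt`, as well as the style texts, are hypothetical and chosen for illustration only; the role names follow the categories listed above.

```python
# Hypothetical style presets per user role (illustrative assumptions).
ROLE_STYLES = {
    "layman patient or caregiver":
        "Use plain language, avoid technical terms, keep the answer short.",
    "general medical practitioner":
        "Use standard medical terms and a moderate level of detail.",
    "oncologist":
        "Use technical terms freely and give a high level of detail.",
}
DEFAULT_STYLE = "Use a moderate level of detail."


def build_prompt(question, role, purpose, passages):
    """Assemble a prompt carrying role, purpose and report passages."""
    style = ROLE_STYLES.get(role, DEFAULT_STYLE)  # fall back for unknown roles
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"User role: {role}\n"
        f"Intended purpose: {purpose}\n"
        f"Style: {style}\n"
        f"Report passages:\n{context}\n"
        f"Question: {question}"
    )
```

Keeping the style presets in a lookup table is one possible design choice; it allows a level of detail, a use of technical terms and a length of the answer to be adjusted per role without changing the prompt assembly.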
The term “literature search” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a process of identifying and analyzing information through a systematic research process. A literature search may comprise gathering and collecting information relevant to a specific topic, such as the subject related to the user question. The purpose is to summarize existing findings, identify gaps, and provide a foundation for new research. A well-conducted literature search ensures a comprehensive understanding of the topic and supports evidence-based practice and decision-making. A literature search may be a search of at least one external resource, such as published literature and/or a preknown database. Examples of a preknown database may be found in OncoKB (cf. https://www.oncokb.org/ - retrieved June 20, 2025), CKB (cf. https://ckbhome.genomenon.com/ - retrieved June 20, 2025) and/or ClinVar (cf. https://www.ncbi.nlm.nih.gov/clinvar/ - retrieved June 20, 2025) among others.
The term “external data source” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to an organized collection of structured information or data. The external database may, typically, be stored in an external computer system. The external computer system may be external to the computer system running the at least one trained language model. Thereby, the external database may be accessed via a computer network, specifically the internet. The external data source may be or may comprise, consequently, an online data source. The external database may comprise information from academic journals, books, and/or papers. The external data source may be a trusted external data source that is preselected in order to provide trusted information.
The automatically performed literature search in step v. may comprise using at least one internet agent, wherein the internet agent is a machine learning model configured for performing internet searches. The term “internet agent” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a software entity configured to interact with external online resources. Interacting may comprise performing at least one task, gathering information and/or disseminating information. Interacting may be performed by using the internet. The internet agent may be configured to understand and/or generate natural text. The internet agent may provide generated data to the outputting module.
The at least one subject on which the literature search may be performed in step v. is automatically chosen by the trained language model. The at least one subject on which the literature search may be performed in step v. may comprise recent medical and/or pharmaceutical studies, data sets or publications on at least one topic related to at least one aspect of the cancer data of the patient.
The answer to the user question may comprise a summary of at least one portion of the report on cancer data of the patient having relevance to the user question. The term "relevant" as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to the degree to which at least one portion of information, data and/or content is applicable, useful and/or pertinent to at least one specific context, question and/or purpose. The at least one portion of the report may be considered relevant when it addresses at least one aspect of a user question, in particular wherein the portion provides information that contributes to answering the question. Additionally, a portion may be relevant if it contains information that is consistent with and/or supports the findings or conclusions derived from a further portion of the report or external data sources. A portion may also be relevant if it contains information that is inconsistent with and/or contradicts one or more findings and/or one or more conclusions derived from a further portion of the report and/or the external data source. The trained language model may be configured for retrieval-augmented generation (RAG), such that, in step iv., a connection to at least one portion of the report on cancer data of the patient is made on which the context-related answer to the user question is based.
The trained language model and/or the internet agent may be further configured, such that a connection to at least one portion of at least one document of the external data source is made on which the item of literature information is based, specifically a portion of at least one internet publication or reference to a data set.
In step vi., as outlined above, the information on the health condition of a patient is created, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v.. Step vi. may also be performed by the trained language model used in step iv..
The term “item of literature information” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to data generated in a literature search, preferably on the at least one subject related to the user question. The item of literature information may comprise information on the at least one subject related to the user question. The item of literature information may further comprise information on the source, such as the external data source, from which the information on the at least one subject related to the user question is derived.
The term “enriching” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a process of enhancing and/or improving something by adding value, quality and/or additional elements. Enriching may comprise analyzing the answer obtained in step iv. by comparing the answer obtained in step iv. to the at least one item of literature information. Comparing may comprise identifying at least one deviation between the answer obtained in step iv. and the at least one item of literature information.
Enriching may comprise correcting said at least one identified deviation, preferably in a manner that the answer obtained in step iv. is adapted to the at least one item of literature information in a manner that said identified deviation is compensated, preferably by using the trained language model and/or the internet agent. Alternatively or in addition, enriching may comprise providing an indication indicating that said identified at least one deviation is identified. The indication may comprise information on details of said identified at least one deviation, such as a portion of the answer obtained in step iv. that is related to said identified at least one deviation. Alternatively or in addition, enriching may comprise discarding at least the portion of the answer obtained in step iv. that is related to said identified at least one deviation. Consequently, the external literature search may add value and thereby enrich the answer by adding information to address the identified deviation; discarding information to address the identified deviation, and/or at least calling attention to the fact that there is an identified deviation.
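The three ways of handling an identified deviation described above (correcting, indicating, discarding) may be sketched as follows. This is a simplified illustration under stated assumptions: the function name `enrich`, the policy names and the data layout are hypothetical, and the exact string comparison merely stands in for a comparison that would, in practice, be performed by the trained language model and/or the internet agent.

```python
def enrich(answer_claims, literature_items, policy="flag"):
    """Compare answer claims with literature items and handle deviations.

    answer_claims: dict mapping a topic to the statement made in the answer.
    literature_items: dict mapping a topic to
        {"statement": ..., "source": ...}.
    policy: "correct", "flag" or "discard" (hypothetical names).
    """
    if policy not in ("correct", "flag", "discard"):
        raise ValueError(f"unknown policy: {policy}")
    enriched = []
    for topic, statement in answer_claims.items():
        item = literature_items.get(topic)
        # Placeholder comparison; a real system would compare semantically.
        deviates = item is not None and item["statement"] != statement
        if not deviates:
            enriched.append({"topic": topic, "statement": statement,
                             "source": item["source"] if item else None,
                             "deviation": False})
        elif policy == "correct":
            # Adapt the answer to the item of literature information.
            enriched.append({"topic": topic, "statement": item["statement"],
                             "source": item["source"], "deviation": True})
        elif policy == "flag":
            # Keep the answer but indicate the identified deviation.
            enriched.append({"topic": topic, "statement": statement,
                             "source": item["source"], "deviation": True})
        # policy == "discard": the deviating portion is dropped.
    return enriched
```

In each case the source of the item of literature information is carried along, so that the user can trace where a correction or an indication originates.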
The information on the health condition of a patient may further comprise information, generated by the trained language model, on the source of the item of literature information.
The information on the source of the item of literature information may contain information on at least one of: a peer-reviewed status of a publication forming basis of the literature information; published guidelines from at least one expert, preferably in the field of cancer; a number of subjects in a trial forming basis of the literature information; a phase of a clinical trial forming basis of the literature information; a molecular biological plausibility of a connection between a laboratory finding and the potential association to mechanisms or drugs that may benefit or not benefit.
In step vi., the information on the health condition of a patient may be enriched by the portion of the report on cancer data of the patient on which the context-related answer to the user question is based. Consequently, the information on the health condition provides an indication of the source of the answer obtained in step iv. Particularly thereby, at least one hallucination, such as a false statement, that may be comprised by the information on the health condition may be identified, preferably by the user. The hallucinations may be introduced into the information on the health condition by the trained language model and/or the literature search. Because the source of the report or literature source used by the model to generate its response is displayed, the user can compare the answer provided by the language model to the source of the answer to identify discrepancies or hallucinations. The user may further iterate from said general result and ask for more details on certain aspects of the results. For example, in the case of an identified driver mutation that implies sensitivity to a certain therapy, the user may ask for the molecular biology of the mutation, e.g. the user may ask for sequence and/or structure alterations that convey altered function and/or sensitivity to certain drugs. Alternatively, the user may ask for clinical trial results that prove patients with those mutations benefit from specific drugs. As another example, in the case of an identified DNA alteration in tumor suppressor genes, the user may ask for more information on the occurrence of that alteration in inherited cancer syndromes. In the case of a genomic signature that would predict response to one or more cancer immunotherapies, the user may ask for more information on the influence of one or more further alterations identified in the test results, which may reduce the likelihood of a patient or the user responding to cancer immunotherapies, to provide more patient or user-specific context to the treating physician. In step vi., the information on the health condition of a patient may be enriched by the portion of the document report on cancer data of the patient on which the item of literature information is based.
Steps iii. and iv., and optionally also step vi., may be performed by using a chat bot based on the trained language model used in step iii.. The term “chat bot” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a software application configured to simulate human conversation through at least one text and/or voice interaction. A chat bot may be integrated into at least one of a website, a mobile app, a messaging platform, an interface. It has an area for text input and output, which can be used to communicate with the system in natural language. A chat bot may comprise at least one text input area and/or at least one text output area, wherein the respective area may be configured to communicate with the user in natural language.
The method further may comprise: vii. outputting the information on the health condition of a patient enriched with the at least one item of literature information.
The outputting may be performed by at least one of: providing the information on the health condition of a patient enriched with the at least one item of literature information to a frontend; displaying the information on the health condition of a patient enriched with the at least one item of literature information on a display; transmitting the information on the health condition of a patient enriched with the at least one item of literature information via at least one interface; storing the information on the health condition of a patient enriched with the at least one item of literature information in at least one data storage device.
The information on the health condition of a patient further may comprise quality information characterizing at least one of the quality of the computer-readable report on the cancer data; a specimen quality of at least one specimen of the patient forming basis of the report; a specimen quantity of at least one specimen of the patient forming the basis of the report; quality information characterizing the completeness of the report or limitations of the report; quality information identifying anomalous results in the report; quality information on the reputation of a laboratory generating the report; quality information identifying certification of a laboratory generating the report; quality information identifying one or more of validation and at least one ISO standard for laboratories.
The term "quality" as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a characteristic or attribute of at least one object, process and/or result that indicates its degree of excellence, reliability and/or suitability for a specific purpose. Quality may exemplarily relate to the accuracy, completeness and/or consistency of data.
The quality of the report on cancer data, which is typically from a laboratory, may be represented or signified in the report on cancer data itself, such as by a College of American Pathologists, CAP, certification and/or a Clinical Laboratory Improvement Amendments, CLIA, certification of the laboratory, as well as by a tumor DNA input content along one or more metrics of the sequencing depth and uniformity and so on. The method may comprise evaluating one or more of said quality criteria and considering them in the context of the report on cancer data. This may be used to highlight to the user whether the laboratory's own quality metrics have or have not been met, which has implications for the interpretation of the report on cancer data.
Further, the quality of the generated answer may be considered. This quality may be associated with an accuracy of the answer, such that, preferably, the answer is correct with respect to the report on cancer data, even if the report on cancer data is out of date and the answer is based on one or more external sources. The quality may further be associated with a relevance of the answer such that only clinically relevant information is provided and no non-clinically-relevant information is comprised. The quality may further be associated with a completeness, such that all clinically relevant information pertaining to the user’s question is provided in the response.
In a further aspect, a training method of training a language model, specifically a large language model, for use in step iv. of the method as elsewhere disclosed herein is disclosed. The method of training a language model may be a computer-implemented method. The method comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly. The method may also comprise further method steps which are not listed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
The method may comprise:
- providing a foundation model;
- providing a test report set, the test report set comprising at least one computer-readable report on cancer data of a patient;
- defining a test question set, the test question set comprising a plurality of training questions relating to the test report of the test question set, wherein the answers to the test questions are known;
- fine-tuning the foundation model by using the test report set in step i. of the method, by using the test question set in step iii. of the method, and by comparing the answers generated in step iv. with the known answers to the test questions and adapting parameters of the foundation model.
The term “foundation model” or “base model” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a pre-trained language model. The pre-trained language model may be trained, typically, on broad data sets in a manner that it may be applied across a wide range of use cases. The foundation model may be or may comprise a language model, preferably a large language model. A foundation model may further be trained and/or fine-tuned for a specific use case. The fine-tuned foundation model may be the trained language model as elsewhere described herein.
The foundation model may comprise a transformer model, specifically selected from the group consisting of:
- Mixtral-8x7B;
- Llama 3 70B;
- Phi-3-medium.
The foundation model may be an already fine-tuned model, such as
- Medicine-chat;
- BioGPT-Large-PubMedQA; and/or
- OpenBioLLM-70B.
The term “test report set” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a preselected report set that is evaluated in order to fine-tune the foundation model. The test report set comprising at least one computer-readable report on cancer data of a patient may comprise a plurality of computer-readable reports on cancer data, such as, typically, 10, 100, 200 or 1000 computer-readable reports on cancer data. The plurality of computer-readable reports on cancer data may be reports related to the same patient and/or to differing patients. A further test report set comprising even further data may be used to fine-tune an already fine-tuned foundation model.
The term “test question set” as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a preselected question set with which the cancer data reports are evaluated, in order to fine-tune the foundation model. The answers to this "test question set" may comprise the ground truth to which the foundation models are fine-tuned or trained.
The test questions may comprise at least one positive test question, wherein the positive test question is answerable on the basis of the report on cancer data of the patient. A test question that is answerable on the basis of the report on cancer data of the patient may be a test question of which the answer is comprised by the report on cancer data of the patient. The foundation model may, consequently, generate a substantive answer to the positive test question without having to hallucinate.
The test questions may comprise at least one negative test question, wherein the negative test question is not answerable on the basis of the report on cancer data of the patient. A test question that is not answerable on the basis of the report on cancer data of the patient may be a test question of which the answer is not comprised by the report on cancer data of the patient. The foundation model may, consequently, only generate a substantive answer to the negative test question by speculating or hallucinating. In order to not hallucinate, the foundation model may have to indicate that it cannot provide an answer to the negative test question. The foundation model may perform a self-check by generating a plurality of answers, such as 3 or more answers, assessing the consistency of said plurality of answers by using BERT and/or n-gram methods, and judging the confidence in giving a correct answer based on said consistency. Inconsistent results may be flagged as likely hallucinations, inducing the model to answer "I don't know" or to ask for more details.
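A minimal sketch of such a self-check, assuming an n-gram based consistency measure, is given below. The mean pairwise Jaccard overlap of word bigrams, the threshold of 0.5 and the function names are assumptions made for illustration; the generation of the plurality of answers by the model is taken as given, and a BERT-based similarity could be substituted for the n-gram overlap.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Return the set of word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Overlap of two n-gram sets, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(answers, n=2):
    """Mean pairwise n-gram overlap of a plurality of answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("at least two answers are required")
    return sum(jaccard(ngrams(x, n), ngrams(y, n)) for x, y in pairs) / len(pairs)


def self_check(answers, threshold=0.5):
    """Return the first answer if the answers agree, else decline."""
    if consistency(answers) >= threshold:
        return answers[0]
    return "I don't know"
```

For a negative test question, independently sampled answers would be expected to diverge, so the overlap falls below the threshold and the sketch declines to answer.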
The term “hallucinate”, or any grammatical variation thereof, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a phenomenon where a language model generates information that is not accurate and/or grounded in reality. External hallucination may occur when the language model generates text that comprises false and/or fabricated information that appears to be grounded in external reality but is not. External hallucination may comprise making up at least one fact, detail or source that does not exist. Internal hallucination may occur when the language model generates text that is inconsistent and/or incoherent within the context of the conversation or text itself. Internal hallucination may comprise generating text contradicting itself and/or generating text that comprises irrelevant and/or nonsensical content.
The term “fine-tuning”, or any grammatical variation thereof, as used herein is a broad term and is to be given its ordinary and customary meaning to a person of ordinary skill in the art and is not to be limited to a special or customized meaning. The term specifically may refer, without limitation, to a process in which a pre-trained model, such as the foundation model, is further trained on a specific dataset in order to adapt the foundation model to a specific task. Fine-tuning may commonly be used on language models in order to improve the performance of the language model on a specific task. Fine-tuning may comprise at least one optimization or tuning process, wherein a best parameter combination of the foundation model is determined.
In order to fine-tune the foundation model, the test report set may be retrieved by the foundation model. The foundation model may further retrieve the test questions. The foundation model may generate answers to the test questions. These answers may be compared with the known answers to the test questions, preferably in a training environment. Fine-tuning the model may comprise selecting a loss function. The loss function may determine a cross-entropy loss, preferably when fine-tuning text generation or classification. Alternatively or in addition, fine-tuning the model may comprise selecting a metric. A typical metric used to fine-tune for text generation may be Perplexity. Typical metrics used to fine-tune for classification may be Accuracy, Precision, Recall, and F1 Score. Alternatively or in addition, fine-tuning the model may comprise selecting an optimizer, typically AdamW, SGD with Momentum, Lion, Lomo, Amos, Sophia or Confidence-guided Adaptive Memory Efficient Optimization (CAME) and/or variants thereof.
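For illustration, the loss and metrics named above may be computed as in the following sketch. It uses the standard definitions of cross-entropy, perplexity, precision, recall and F1 score; the function names and the representation of model outputs as plain lists are assumptions made for this example.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_probs))


def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics; labels are 1 (positive) or 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

As one possible use, the classification metrics could be applied to whether the model correctly distinguishes positive test questions (label 1) from negative test questions (label 0).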
In a further aspect, a method of generating a proposal for a medical treatment of a patient is disclosed. The method comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly. The method may also comprise further method steps which are not listed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
The method comprises using the method of generating information on a health condition of a patient as elsewhere disclosed herein. The method comprises, in step i., providing a computer-readable report on cancer data of the patient to the method of generating context-related information on a health condition of a patient, wherein the method further comprises providing, in step iii., a question relating to at least one possible treatment of the patient.
In a further aspect of the present invention, an alternative computer-implemented method of generating information, specifically context-related information, on a health condition of a patient is proposed. The method comprises the following steps, which specifically may be performed in the given order. However, two or more or even all of the method steps may be performed at least partially simultaneously. Further, one or more or even all of the method steps may be performed once or repeatedly. The method may also comprise further method steps which are not listed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
The alternative method comprises the following steps:
i. retrieving, specifically by at least one of a computer, a computer system or a computer network, a computer-readable report on cancer data of the patient;
ii. retrieving, specifically by at least one of a computer, a computer system or a computer network, at least one item of literature information obtained by a literature search;
iii. retrieving, specifically by at least one of a computer, a computer system or a computer network, computer-readable question data comprising information on a context-related user question;
iv. using, specifically by at least one of a computer, a computer system or a computer network, at least one trained language model, specifically at least one trained large language model, for automatically generating a context-related answer to the user question based on the computer-readable report on cancer data of the patient and the at least one item of literature information.
The at least one trained language model may request further context-related data for enriching the answer obtained in step iv., such as by performing a further literature search in order to generate at least one further item of literature information or by requesting further information from the user.
In a further aspect, a computer system for generating information on a health condition of a patient, specifically by performing the method of generating information on a health condition of a patient as elsewhere disclosed herein is disclosed. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
The computer system comprises:
I. a retrieving module configured for retrieving a computer-readable report on cancer data of the patient;
II. an embedding module configured for creating a vector store on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store;
III. an input module configured for providing computer-readable question data comprising information on a context-related user question;
IV. a processing module comprising at least one trained language model, specifically at least one trained large language model, the processing module being configured for automatically generating, by using the vector store, a context-related answer to the user question, and preferably for self-checking that answer for possible hallucination;
V. a searching module configured for automatically performing a literature search in at least one external data source on at least one subject related to the user question; and
VI. an evaluation module configured for creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v., and preferably for displaying, by using a display, the source from which the information on the health condition is derived, preferably in a manner such that the user can see where the information on the health condition is obtained from.
In a further aspect, a computer program comprising instructions is disclosed which, when the program is executed by a computer or a computer network, cause the computer or computer system to perform at least one of the method of generating information on a health condition of a patient as elsewhere disclosed herein; the training method as elsewhere disclosed herein; the method of generating a proposal for a medical treatment of a patient as elsewhere disclosed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
In a further aspect, a computer-readable storage medium is disclosed, specifically a non-transient computer-readable medium, comprising instructions which, when the instructions are executed by a computer or a computer system, cause the computer or computer network to perform at least one of the method of generating information on a health condition of a patient as elsewhere disclosed herein; the training method as elsewhere disclosed herein; the method of generating a proposal for a medical treatment of a patient as elsewhere disclosed herein. For said aspect, reference may be made to any definition, Embodiment, claim and/or further aspect herein.
As used herein, the “computer-readable storage medium” specifically may refer to non-transitory data storage means, such as a hardware storage medium having stored thereon computer-executable instructions. The stored computer-executable instructions may be associated with the computer program. The computer-readable data carrier or storage medium specifically may be or may comprise a storage medium such as a random-access memory (RAM) and/or a read-only memory (ROM).
The proposed computer-implemented method of generating information on a health condition of a patient, training method of training a language model, method of generating a proposal for a medical treatment of a patient, computer system for generating information on a health condition of a patient, computer program and computer-readable storage medium provide many advantages over known devices and methods.
Specific answers on cancer reports may be generated that help in understanding the cancer report, preferably wherein an experience of the user is considered when generating the answer such that the answer can be understood by the user. This may particularly be achieved by providing context-related answers and/or using a chat bot.
Answers to specific questions on cancer reports may be generated that may comprise reduced hallucination content introduced by a language model used to generate the answers. This may particularly be achieved by employing self-check methods and a literature search.
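The idea of a self-check can be illustrated with a minimal sketch. The disclosure does not fix a particular self-check method, so the token-overlap heuristic below is purely an assumed stand-in: it flags answer sentences that have little lexical support in the report context. The function names, the threshold value and the example strings are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-case alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer: str, context: str, threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose share of tokens found in the context
    is below the threshold (assumed heuristic, not the disclosed self-check)."""
    ctx = tokens(context)
    flagged = []
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        toks = tokens(sentence)
        if not toks:
            continue
        support = len(toks & ctx) / len(toks)
        if support < threshold:
            flagged.append(sentence)
    return flagged


# Invented example strings, for illustration only.
CONTEXT = "EGFR L858R mutation detected in exon 21."
ANSWER = "An EGFR L858R mutation was detected. The patient also carries a BRAF fusion."
flagged = unsupported_sentences(ANSWER, CONTEXT)
```

In this toy example the second sentence of the answer has no support in the context and is flagged. A flagged sentence could, for instance, be removed or trigger a regeneration of the answer.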
It is further desirable that answers to specific questions on cancer reports are generated that comprise up-to-date information. This may particularly be achieved by employing a literature search.
It is further desirable that answers to specific questions about the cancer reports provide context about the quality of the test that was used to generate the result. This information is typically included in the report from the laboratory, but often not used by the user. The disclosed method includes this information when formulating its answer, so the user receives context as to "why" some results are not reported, limited or why care should be taken when interpreting some results.
It is further desirable that answers to specific questions on cancer reports can be expanded upon through natural language processing methods to allow the user to ask more general questions, or more specific questions about aspects of previous answers, in a way that mimics interactions between users and cancer testing experts.
It is further desirable to retrain foundation models on specialized topics, because retraining a foundation model can use smaller data sets than the extremely large data sets required for foundation model training. The addition of quantum computing approaches may allow even smaller data sets to be used.
The disclosed approach may be applied to other fields where complex testing, such as in rare, inherited and pediatric diseases, among others, requires scarce expertise that is difficult to scale in order to support many patients outside leading academic centers. The disclosed approach to training language models may also be applied to fields outside healthcare where scarce specialist expertise is required.
It may further be an advantage that the language model may be accessible from any computer and does not require integration into any lab and/or hospital IT-system. Thereby, time and money may be saved.
Summarizing and without excluding further possible embodiments, the following embodiments may be envisaged:
Embodiment 1: A computer-implemented method of generating information on a health condition of a patient, comprising: i. retrieving a computer-readable report on cancer data of the patient; ii. creating a vector store on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store; iii. retrieving computer-readable question data comprising information on a user question, specifically a context-related user question; iv. using at least one trained language model, specifically at least one large language model, for automatically generating, by using the vector store, an answer to the user question, specifically a context-related answer to the user question; v. automatically performing a literature search in at least one external data source on at least one subject related to the user question; and vi. creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v..
Embodiment 2: The method according to the preceding Embodiment, wherein step i. comprises retrieving a document containing the cancer data of the patient, specifically a .pdf document, wherein step i. further comprises transforming the document into a computer-readable format, specifically a computer-readable format usable in step ii., more specifically by optical character recognition (OCR), preferably by using at least one machine learning model configured for performing OCR.
Embodiment 3: The method according to any one of the preceding Embodiments, wherein step iv. comprises using the trained language model for tokenizing and embedding the computer-readable question data of step iii..
Embodiment 4: The method according to any one of the preceding Embodiments, wherein creating a vector store on the report on cancer data in step ii. further comprises
• segmenting the report on cancer data.
Embodiment 5: The method according to any one of the preceding Embodiments, wherein the trained language model comprises a transformer architecture, specifically wherein the trained language model is selected from the group consisting of:
- Mixtral-8x7B [6];
- Llama 3 70B [7];
- Phi-3-medium [8].
Embodiment 6: The method according to any one of the preceding Embodiments, wherein the automatically performing the literature search in step v. comprises using at least one internet agent, wherein the internet agent is a machine learning model configured for performing internet searches.
Embodiment 7: The method according to any one of the preceding Embodiments, further comprising: vii. outputting the information on the health condition of a patient enriched with the at least one item of literature information.
Embodiment 8: The method according to any one of the preceding Embodiments, wherein the external data source comprises at least one of: the internet; a publication server, specifically a medical publication server and/or pharmaceutical publication server; a data base comprising results of cancer genomics research, a data base comprising medical studies and/or pharmaceutical studies; a data base comprising associations between DNA alterations and sensitivity or resistance to at least one specific therapy.
Embodiment 9: The method according to the preceding Embodiment, wherein the outputting is performed by at least one of: providing the information on the health condition of a patient enriched with the at least one item of literature information to a frontend; displaying the information on the health condition of a patient enriched with the at least one item of literature information on a display; transmitting the information on the health condition of a patient enriched with the at least one item of literature information via at least one interface; storing the information on the health condition of a patient enriched with the at least one item of literature information in at least one data storage device.
Embodiment 10: The method according to any one of the preceding Embodiments, wherein the computer-readable question data of step iii is provided by a user input into an input module.
Embodiment 11: The method according to any one of the preceding Embodiments, wherein steps iii. and iv., and optionally also step vi., are performed by using a chat bot based on the trained language model used in step iv..
Embodiment 12: The method according to any one of the preceding Embodiments, wherein step vi. is also performed by the trained language model used in step iv..
Embodiment 13: The method according to any one of the preceding Embodiments, wherein step ii. is performed by using at least one machine learning model configured for creating the vector store, preferably wherein the at least one machine learning model configured for creating a vector store is distinct from the trained language model used in step iv..
Embodiment 14: The method according to the preceding Embodiment, wherein the at least one machine learning model configured for creating the vector store comprises a model selected from the group consisting of:
- E5-large-v2 Embeddings;
- PubMedBERT Embeddings;
- GTE-large-en-v1.5;
- FLAN-T5-base.
Embodiment 15: The method according to any one of the preceding Embodiments, wherein the method comprises providing a plurality of pre-defined user questions, wherein, in step iii., the user selects the user question from the plurality of pre-defined user questions.
Embodiment 16: The method according to any one of the preceding Embodiments, wherein the report on cancer data comprises genomic data of the patient.
Embodiment 17: The method according to the preceding Embodiment, wherein the answer to the user question comprises at least one of: information on at least one biomarker or a plurality of biomarkers, such as a driver mutation indicating that the patient may benefit from at least one specific therapy; information indicating genomic clues to resistance that imply the patient’s lack of benefit from at least one specific therapy; information on potential pathogenic or likely pathogenic DNA alterations in the patient’s genome that raises the possibility of inherited germline alterations; information on an indication of a disease recurrence or a probability of a lack of the disease recurrence, preferably which may imply a need of a patient for a further therapy or a lack of a need for further therapy or cessation of therapy.
Embodiment 18: The method according to any one of the preceding Embodiments, wherein step iv. comprises context-customizing the answer to the user.
Embodiment 19: The method according to the preceding Embodiment, wherein the context-customizing comprises taking into account at least one item of information on a role of the user, specifically a categorization of the user, more specifically a categorization into a category selected from the group consisting of: a layman patient or caregiver; a general medical practitioner; an oncology nurse or oncologist; a cancer genomicist; a laboratory director, a clinical specialist, a pathologist or a medical director; a clinical trialist, a monitor, a medical science liaison, a medical manager or another role involved in clinical trial management.
Embodiment 20: The method according to any one of the two preceding Embodiments, wherein the context-customizing comprises taking into account at least one item of information on the intended purpose of the answer, specifically information categorizing the purpose into a category selected from the group consisting of: a patient-doctor consultation; a preparation of a tumor board; evaluating patient or subject suitability for a clinical trial, or assigning subjects to clinical trial treatment arms.
Embodiment 21: The method according to any one of the preceding Embodiments, wherein the information on the health condition of a patient further comprises quality information characterizing at least one of: the quality of the computer-readable report on the cancer data; a specimen quality of at least one specimen of the patient forming the basis of the report; a specimen quantity of at least one specimen of the patient forming the basis of the report; quality information characterizing the completeness of the report; quality information identifying anomalous results in the report; quality information on the reputation of a laboratory generating the report; quality information identifying certification of a laboratory generating the report; quality information identifying one or more of validation and at least one ISO standard for laboratories.
Embodiment 22: The method according to any one of the preceding Embodiments, wherein the at least one subject on which the literature search is performed in step v. is automatically chosen by the trained language model.
Embodiment 23: The method according to any one of the preceding Embodiments, wherein the at least one subject on which the literature search is performed in step v. comprises recent medical and/or pharmaceutical studies on at least one topic related to at least one aspect of the cancer data of the patient.
Embodiment 24: The method according to any one of the preceding Embodiments, wherein the information on the health condition of a patient further comprises information, generated by the trained language model, on the source of the item of literature information.
Embodiment 25: The method according to the preceding Embodiment, wherein the information on the source of the item of literature information contains information on at least one of: a peer-reviewed status of a publication forming basis of the literature information; published guidelines from at least one expert; a number of subjects in a trial forming basis of the literature information; a phase of a clinical trial forming basis of the literature information; a molecular biological plausibility of a connection between a laboratory finding and the potential association to mechanism or drugs that may benefit or not benefit.
Embodiment 26: The method according to any one of the preceding Embodiments, wherein the answer to the user question comprises a summary of at least one portion of the report on cancer data of the patient having relevance to the user question.
Embodiment 27: The method according to any one of the preceding Embodiments, wherein the trained language model is configured for retrieval-augmented generation (RAG), such that, in step iv., a connection to at least one portion of the report on cancer data of the patient is made on which the context-related answer to the user question is based.
Embodiment 28: The method according to the preceding Embodiment, wherein, in step vi., the information on the health condition of a patient is enriched by the portion of the report on cancer data of the patient on which the context-related answer to the user question is based.
Embodiment 29: The method according to the preceding Embodiment, wherein the trained language model is further configured, such that a connection to at least one portion of at least one document of the external data source is made on which the item of literature information is based, specifically a portion of at least one internet publication.
Embodiment 30: The method according to the preceding Embodiment, wherein, in step vi., the information on the health condition of a patient is enriched by the portion of the document of the external data source on which the item of literature information is based.
Embodiment 31: A training method of training a language model, specifically a large language model, for use in step iv. of the method according to any one of the preceding Embodiments, the method comprising:
- providing a foundation model;
- providing a test report set, the test report set comprising at least one computer-readable report on cancer data of a patient;
- defining a test question set, the test question set comprising a plurality of test questions relating to the test report set, wherein the answers to the test questions are known;
- fine-tuning the foundation model by using the test report set in step i. of the method, by using the test question set in step iii. of the method, and by comparing the answers generated in step iv. with the known answers to the test questions and adapting parameters of the foundation model.
Embodiment 32: The training method according to the preceding Embodiment, wherein the foundation model comprises a transformer model, specifically selected from the group consisting of:
- Mixtral-8x7B;
- Llama 3 70B;
- Phi-3-medium.
Embodiment 33: The training method according to any one of the preceding Embodiments referring to a training method, wherein the test questions comprise at least one positive test question, wherein the positive test question is answerable on the basis of the report on cancer data of the patient.
Embodiment 34: The training method according to any one of the preceding Embodiments referring to a training method, wherein the test questions comprise at least one negative test question, wherein the negative test question is not answerable on the basis of the report on cancer data of the patient.
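The comparison of generated answers with known answers, including positive and negative test questions, may be illustrated by the following minimal sketch. It only computes the comparison signal; the actual adaptation of the parameters of the foundation model is not shown. The abstention marker, the class and function names, the exact-match criterion and the stub model are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

NOT_ANSWERABLE = "NOT_IN_REPORT"  # assumed abstention marker for negative questions


@dataclass
class ReportQuestion:
    question: str
    known_answer: str  # NOT_ANSWERABLE for negative test questions

    @property
    def is_negative(self) -> bool:
        return self.known_answer == NOT_ANSWERABLE


def score(
    model: Callable[[str, str], str],
    report: str,
    questions: list[ReportQuestion],
) -> dict[str, float]:
    """Compare generated answers with known answers, separately for positive
    and negative test questions (exact match after normalisation)."""
    hits = {"positive": 0, "negative": 0}
    totals = {"positive": 0, "negative": 0}
    for q in questions:
        kind = "negative" if q.is_negative else "positive"
        totals[kind] += 1
        generated = model(report, q.question).strip().lower()
        hits[kind] += generated == q.known_answer.strip().lower()
    return {k: hits[k] / totals[k] if totals[k] else 0.0 for k in totals}


def stub_model(report: str, question: str) -> str:
    """Toy stand-in for the language model under training."""
    if "alteration" in question.lower() and "L858R" in report:
        return "L858R"
    return NOT_ANSWERABLE


# Invented test report and questions, for illustration only.
QUESTIONS = [
    ReportQuestion("Which EGFR alteration is reported?", "L858R"),
    ReportQuestion("What is the patient's blood pressure?", NOT_ANSWERABLE),
]
result = score(stub_model, "EGFR alteration: L858R", QUESTIONS)
```

Scoring negative test questions separately makes it visible whether the model abstains when the report does not contain the requested information, rather than inventing an answer.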
Embodiment 35: A method of generating a proposal for a medical treatment of a patient, the method comprising using the method of generating information on a health condition of a patient according to any one of the preceding Embodiments referring to a method of generating information on a health condition of a patient, wherein the method comprises, in step i., providing a computer-readable report on cancer data of the patient to the method of generating context-related information on a health condition of a patient, wherein the method further comprises providing, in step iii., a question relating to at least one possible treatment of the patient.
Embodiment 36: A computer system for generating information on a health condition of a patient, specifically by performing the method according to any one of the preceding Embodiments referring to a computer-implemented method of generating information on a health condition of a patient, the computer system comprising:
I. a retrieving module configured for retrieving a computer-readable report on cancer data of the patient;
II. an embedding module configured for creating a vector store on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store;
III. an input module configured for providing computer-readable question data comprising information on a context-related user question;
IV. a processing module comprising at least one trained language model, specifically at least one trained large language model, the processing module being configured for automatically generating, by using the vector store, a context-related answer to the user question;
V. a searching module configured for automatically performing a literature search in at least one external data source on at least one subject related to the user question; and
VI. an evaluation module configured for creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v..
Embodiment 37: A computer program comprising instructions which, when the program is executed by a computer or a computer network, cause the computer or computer system to perform at least one of: the method according to any one of the preceding Embodiments referring to a computer-implemented method of generating information on a health condition of a patient; the training method according to any one of the preceding Embodiments referring to a training method of training a language model; the method according to any one of the preceding Embodiments referring to a method of generating a proposal for a medical treatment of a patient.
Embodiment 38: A computer-readable storage medium, specifically a non-transient computer-readable medium, comprising instructions which, when the instructions are executed by a computer or a computer system, cause the computer or computer network to perform at least one of: the method according to any one of the preceding Embodiments referring to a computer-implemented method of generating information on a health condition of a patient; the training method according to any one of the preceding Embodiments referring to a training method of training a language model; the method according to any one of the preceding Embodiments referring to a method of generating a proposal for a medical treatment of a patient.
Short description of the Figures
Further optional features and embodiments will be disclosed in more detail in the subsequent description of embodiments, preferably in conjunction with the dependent claims. Therein, the respective optional features may be realized in an isolated fashion as well as in any arbitrary feasible combination, as the skilled person will realize. The scope of the invention is not restricted by the preferred embodiments. The embodiments are schematically depicted in the Figures. Therein, identical reference numbers in these Figures refer to identical or functionally comparable elements.
In the Figures:
Figure 1 shows an exemplary computer-implemented method of generating information on a health condition of a patient;
Figure 2 shows an exemplary training method of training a language model;
Figure 3 shows an exemplary method of generating a proposal for a medical treatment of a patient; and
Figure 4 shows an exemplary computer system for generating information on a health condition of a patient.
Detailed description of the embodiments
In Figure 1, an exemplary computer-implemented method of generating information on a health condition of a patient 110 is shown. The exemplary computer-implemented method of generating information on a health condition of a patient 110 comprises: i. (denoted by reference number 112) retrieving a computer-readable report on cancer data of the patient; ii. (denoted by reference number 114) creating a vector store on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store; iii. (denoted by reference number 116) retrieving computer-readable question data comprising information on a user question, specifically a context-related user question; iv. (denoted by reference number 118) using at least one trained language model, specifically at least one large language model, for automatically generating, by using the vector store, an answer to the user question, specifically a context-related answer to the user question; v. (denoted by reference number 120) automatically performing a literature search in at least one external data source on at least one subject related to the user question; and vi. (denoted by reference number 122) creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v..
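The order of steps i. to vi. may be illustrated by the following minimal sketch, in which every step is passed in as a callable so that the orchestration itself stays independent of any particular model or data source. All names are hypothetical, and the lambdas in the usage part are trivial placeholders rather than real implementations.

```python
from typing import Callable


def generate_information(
    retrieve_report: Callable[[], str],
    build_vector_store: Callable[[str], object],
    retrieve_question: Callable[[], str],
    answer_with_model: Callable[[object, str], str],
    search_literature: Callable[[str], list[str]],
) -> dict:
    """Run steps i. to vi. in the given order and return the enriched answer."""
    report = retrieve_report()                    # step i.
    store = build_vector_store(report)            # step ii.
    question = retrieve_question()                # step iii.
    answer = answer_with_model(store, question)   # step iv.
    literature = search_literature(question)      # step v.
    # Step vi.: enrich the answer with the literature items.
    return {"answer": answer, "literature": literature}


# Placeholder callables, for illustration only.
result = generate_information(
    retrieve_report=lambda: "EGFR alteration: L858R",
    build_vector_store=lambda report: report.split(),
    retrieve_question=lambda: "Which alteration is reported?",
    answer_with_model=lambda store, q: f"Found {len(store)} tokens for: {q}",
    search_literature=lambda subject: [f"<literature item on: {subject}>"],
)
```

Passing the steps in as callables is one possible design choice; it allows, for example, the embedding model used in step ii. to differ from the language model used in step iv.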
Step i. (denoted by reference number 112) may comprise retrieving a document containing the cancer data of the patient, specifically a .pdf document, wherein step i. (denoted by reference number 112) further comprises transforming the document into a computer-readable format, specifically a computer-readable format usable in step ii., more specifically by optical character recognition (OCR), preferably by using at least one machine learning model configured for performing OCR. The report on cancer data may comprise genomic data of the patient.
Step ii. (denoted by reference number 114) may be performed by using at least one machine learning model configured for creating the vector store, preferably wherein the at least one machine learning model configured for creating a vector store is distinct from the trained language model used in step iv. (denoted by reference number 118). Creating a vector store on the report on cancer data in step ii. (denoted by reference number 114) may further comprise segmenting the report on cancer data. The at least one machine learning model configured for creating the vector store may comprise a model selected from the group consisting of:
- E5-large-v2 Embeddings;
- PubMedBERT Embeddings;
- GTE-large-en-v1.5;
- FLAN-T5-base.
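Creating the vector store in step ii. may be sketched as follows. The sketch assumes a blank-line segmentation rule, a simple word tokenizer and a hashing-based embedding, chosen only so that the example runs without downloading a model; they are stand-ins and not the embedding models listed above. It also embeds each segment as a whole rather than each token. All names and the example report are hypothetical.

```python
import hashlib
import math
import re
from dataclasses import dataclass, field

DIM = 64  # assumed toy embedding dimension


def segment(report: str) -> list[str]:
    """Split the report into segments at blank lines (assumed segmentation rule)."""
    return [s.strip() for s in re.split(r"\n\s*\n", report) if s.strip()]


def tokenize(text: str) -> list[str]:
    """Lower-case word tokenizer; a real system would use the model's tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(tokens: list[str]) -> list[float]:
    """Hash each token into one of DIM buckets and L2-normalise the counts."""
    vec = [0.0] * DIM
    for tok in tokens:
        bucket = int(hashlib.sha256(tok.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class VectorStore:
    """In-memory store keeping each vector next to the segment it came from."""
    entries: list[tuple[str, list[float]]] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.entries.append((text, embed(tokenize(text))))


def build_store(report: str) -> VectorStore:
    """Segment the report, then tokenize, embed and store each segment."""
    store = VectorStore()
    for seg in segment(report):
        store.add(seg)
    return store


# Invented report text, for illustration only.
REPORT = "Gene: EGFR\nAlteration: L858R\n\nSpecimen quality: adequate"
store = build_store(REPORT)
```

Keeping the original segment text next to its vector is what later allows an answer to be traced back to the portion of the report it is based on.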
The computer-readable question data of step iii. (denoted by reference number 116) may be provided by a user input into an input module. The method may comprise providing a plurality of pre-defined user questions, wherein, in step iii. (denoted by reference number 116), the user selects the user question from the plurality of pre-defined user questions.
Step iv. (denoted by reference number 118) may comprise using the trained language model for tokenizing and embedding the computer-readable question data of step iii. (denoted by reference number 116). The trained language model may comprise a transformer architecture, specifically wherein the trained language model is selected from the group consisting of:
- Mixtral-8x7B;
- Llama 3 70B;
- Phi-3-medium.
The answer to the user question may comprise at least one of: information on at least one biomarker or a plurality of biomarkers, such as a driver mutation indicating that the patient may benefit from at least one specific therapy; information indicating genomic clues to resistance that may imply the patient’s lack of benefit from at least one specific therapy; information on potential pathogenic or likely pathogenic DNA alterations in the patient’s genome that may raise the possibility of inherited germline alterations; information on an indication of a disease recurrence or a probability of a lack of the disease recurrence, preferably which may imply a need of a patient for a further therapy or a lack of a need for further therapy or cessation of therapy. The answer to the user question may comprise a summary of at least one portion of the report on cancer data of the patient having relevance to the user question.
Step iv. (denoted by reference number 118) may comprise context-customizing the answer to the user. The context-customizing may comprise taking into account at least one item of information on a role of the user, specifically a categorization of the user, more specifically a categorization into a category selected from the group consisting of: a layman patient or caregiver; a general medical practitioner; an oncology nurse or oncologist; a cancer genomicist; a laboratory director, a clinical specialist, a pathologist or a medical director; a clinical trialist, a monitor, a medical science liaison, a medical manager or another role involved in clinical trial management. The context-customizing may comprise taking into account at least one item of information on the intended purpose of the answer, specifically information categorizing the purpose into a category selected from the group consisting of: a patient-doctor consultation; a preparation of a tumor board; evaluating patient or subject suitability for a clinical trial, or assigning subjects to clinical trial treatment arms.
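One conceivable way of context-customizing is to prefix the user question with instructions derived from the role of the user and the intended purpose of the answer. The following minimal sketch assumes this prompt-based approach; it covers only a subset of the roles and purposes named above, and the style hints, enum names and function name are hypothetical.

```python
from enum import Enum


class Role(Enum):
    PATIENT_OR_CAREGIVER = "a layman patient or caregiver"
    GENERAL_PRACTITIONER = "a general medical practitioner"
    ONCOLOGIST = "an oncology nurse or oncologist"
    GENOMICIST = "a cancer genomicist"


class Purpose(Enum):
    CONSULTATION = "a patient-doctor consultation"
    TUMOR_BOARD = "a preparation of a tumor board"
    TRIAL_SUITABILITY = "evaluating suitability for a clinical trial"


# Assumed style hints per role; the wording is illustrative only.
STYLE = {
    Role.PATIENT_OR_CAREGIVER: "Use plain language and explain technical terms.",
    Role.GENERAL_PRACTITIONER: "Use clinical language; explain genomic terms briefly.",
    Role.ONCOLOGIST: "Use clinical oncology terminology.",
    Role.GENOMICIST: "Use full genomic nomenclature.",
}


def customize_prompt(question: str, role: Role, purpose: Purpose) -> str:
    """Prefix the user question with role- and purpose-specific instructions."""
    return (
        f"The reader is {role.value}. "
        f"The answer is intended for {purpose.value}. "
        f"{STYLE[role]}\n"
        f"Question: {question}"
    )


prompt = customize_prompt(
    "What does the EGFR result mean?",
    Role.PATIENT_OR_CAREGIVER,
    Purpose.CONSULTATION,
)
```

The resulting prompt would then be handed to the trained language model together with the retrieved report context; the model call itself is not part of this sketch.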
The trained language model may be configured for retrieval-augmented generation (RAG), such that, in step iv. (denoted by reference number 118), a connection to at least one portion of the report on cancer data of the patient is made on which the context-related answer to the user question is based.
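The retrieval part of such a retrieval-augmented generation may be sketched as follows. The sketch uses a bag-of-words cosine similarity as an assumed stand-in for a learned embedding, returns the index of each retrieved segment so that the answer can be connected to a portion of the report, and stops at building the prompt; the language model call is not shown. Function names and example segments are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question: str, segments: list[str], k: int = 1) -> list[tuple[int, str, float]]:
    """Return the k segments most similar to the question, with their index."""
    q = bow(question)
    scored = [(i, seg, cosine(q, bow(seg))) for i, seg in enumerate(segments)]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:k]


def build_prompt(question: str, hits: list[tuple[int, str, float]]) -> str:
    """Assemble the prompt handed to the language model (model call not shown)."""
    context = "\n".join(f"[segment {i}] {seg}" for i, seg, _ in hits)
    return f"Answer using only the context.\n{context}\nQuestion: {question}"


# Invented report segments, for illustration only.
SEGMENTS = [
    "Specimen quality: adequate tumor content",
    "EGFR L858R mutation detected in exon 21",
    "No ALK rearrangement detected",
]
hits = retrieve("Which EGFR mutation was detected?", SEGMENTS)
prompt = build_prompt("Which EGFR mutation was detected?", hits)
```

Because each hit carries its segment index, the portion of the report on which the answer is based can later be shown to the user alongside the answer.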
The automatically performing the literature search in step v. (denoted by reference number 120) may comprise using at least one internet agent, wherein the internet agent is a machine learning model configured for performing internet searches. The at least one subject on which the literature search is performed in step v. (denoted by reference number 120) may be automatically chosen by the trained language model. The trained language model may be further configured, such that a connection to at least one portion of at least one document of the external data source is made on which the item of literature information is based, specifically a portion of at least one internet publication.
The at least one subject on which the literature search is performed in step v. may comprise recent medical and/or pharmaceutical studies on at least one topic related to at least one aspect of the cancer data of the patient. The external data source may comprise at least one of: the internet; a publication server, specifically a medical publication server and/or pharmaceutical publication server; a data base comprising results of cancer genomics research, a data base comprising medical studies and/or pharmaceutical studies; a data base comprising associations between DNA alterations and sensitivity or resistance to at least one specific therapy.
In step vi. (denoted by reference number 122), the information on the health condition of a patient may be enriched by the portion of the report on cancer data of the patient on which the context-related answer to the user question is based. In step vi. (denoted by reference number 122), the information on the health condition of a patient may further be enriched by the portion of the document of the external data source on which the item of literature information is based.
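The enrichment in step vi. may be pictured as attaching, to the answer, both the report portion it is based on and the literature items together with the document portions they are based on. The following minimal sketch assumes a simple data structure for this purpose; the class names, field names and the text layout produced by `render` are hypothetical, and the strings in the usage part are placeholders rather than real findings or sources.

```python
from dataclasses import dataclass, field


@dataclass
class LiteratureItem:
    summary: str  # item of literature information
    source: str   # where the item was found (hypothetical identifier)
    portion: str  # portion of the external document the item is based on


@dataclass
class Answer:
    text: str            # answer generated in step iv.
    report_portion: str  # portion of the report the answer is based on


@dataclass
class HealthInformation:
    answer: Answer
    literature: list[LiteratureItem] = field(default_factory=list)

    def render(self) -> str:
        """Plain-text view showing the answer together with its provenance."""
        lines = [self.answer.text, f'  Report basis: "{self.answer.report_portion}"']
        for n, item in enumerate(self.literature, start=1):
            lines.append(f"  [{n}] {item.summary} (source: {item.source})")
            lines.append(f'      Basis: "{item.portion}"')
        return "\n".join(lines)


def enrich(answer: Answer, items: list[LiteratureItem]) -> HealthInformation:
    """Step vi.: attach literature items and their provenance to the answer."""
    return HealthInformation(answer=answer, literature=list(items))


# Placeholder content, for illustration only.
info = enrich(
    Answer("<answer text>", "<report portion>"),
    [LiteratureItem("<literature summary>", "<source id>", "<document portion>")],
)
text = info.render()
```

Keeping the provenance in structured fields, rather than only in the rendered text, leaves the choice of how to display the sources to the outputting step.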
Step vi. (denoted by reference number 122) may also be performed by the trained language model used in step iv. and/or the internet agent used in step v. Steps iii. and iv., and optionally also step vi., may be performed by using a chat bot based on the trained language model used in step iv.
The method 110 may further comprise: vii. (denoted by reference number 124) outputting the information on the health condition of a patient enriched with the at least one item of literature information.
The outputting may be performed by at least one of: providing the information on the health condition of a patient enriched with the at least one item of literature information to a frontend; displaying the information on the health condition of a patient enriched with the at least one item of literature information on a display; transmitting the information on the health condition of a patient enriched with the at least one item of literature information via at least one interface; storing the information on the health condition of a patient enriched with the at least one item of literature information in at least one data storage device.
The information on the health condition of a patient further may comprise quality information characterizing at least one of the quality of the computer-readable report on the cancer data; a specimen quality of at least one specimen of the patient forming basis of the report; a specimen quantity of at least one specimen of the patient forming the basis of the report; quality information characterizing the completeness of the report; quality information identifying anomalous results in the report; quality information on the reputation of a laboratory generating the report; quality information identifying certification of a laboratory generating the report; quality information identifying one or more of validation and at least one ISO standard for laboratories.
The information on the health condition of a patient may further comprise information, generated by the trained language model, on the source of the item of literature information. The information on the source of the item of literature information may contain information on at least one of: a peer-reviewed status of a publication forming basis of the literature information; published guidelines from at least one expert; a number of subjects in a trial forming basis of the literature information; a phase of a clinical trial forming basis of the literature information; a molecular biological plausibility of a connection between a laboratory finding and the potential association to mechanisms or drugs that may benefit or not benefit a particular patient's medical care.
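The source information could, for instance, be carried in a small record like the one sketched below. The field names and the `describe_source` helper are assumptions chosen for this example; fields left as `None` are simply omitted from the description rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceInfo:
    peer_reviewed: Optional[bool] = None
    from_published_guideline: Optional[bool] = None
    trial_phase: Optional[str] = None
    trial_subjects: Optional[int] = None


def describe_source(info: SourceInfo) -> str:
    """Build a short provenance note from whichever fields are known."""
    parts = []
    if info.peer_reviewed is not None:
        parts.append("peer-reviewed" if info.peer_reviewed else "not peer-reviewed")
    if info.from_published_guideline:
        parts.append("published guideline")
    if info.trial_phase is not None:
        parts.append(f"phase {info.trial_phase} trial")
    if info.trial_subjects is not None:
        parts.append(f"{info.trial_subjects} subjects")
    return "; ".join(parts) if parts else "source details unknown"


# Placeholder values, invented for illustration.
note = describe_source(SourceInfo(peer_reviewed=True, trial_phase="II", trial_subjects=120))
```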
In Figure 2, an exemplary training method 126 of training a language model, specifically a large language model, for use in step iv. (denoted by reference number 118) of the method 110 as elsewhere disclosed herein is shown. The method 126 may comprise:
- (denoted by reference number 128) providing a foundation model;
- (denoted by reference number 130) providing a test report set, the test report set comprising at least one computer-readable report on cancer data of a patient;
- (denoted by reference number 132) defining a test question set, the test question set comprising a plurality of training questions relating to the test report of the test question set, wherein the answers to the test questions are known;
- (denoted by reference number 134) fine-tuning the foundation model by using the test report set in step i. of the method, by using the test question set in step iii. of the method, and by comparing the answers generated in step iv. with the known answers to the test questions and adapting parameters of the foundation model.
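The comparison of generated answers with known answers could be sketched as follows. The text does not prescribe a comparison metric, so the token-level F1 score used here is an assumption, as are the names `token_f1` and `evaluate`; the parameter adaptation itself is outside the scope of this sketch, and the stub model merely echoes the report.

```python
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(generated, known):
    """Token-overlap F1 between a generated answer and the known answer."""
    g, k = Counter(tokens(generated)), Counter(tokens(known))
    overlap = sum((g & k).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(g.values())
    recall = overlap / sum(k.values())
    return 2 * precision * recall / (precision + recall)


def evaluate(generate, test_set):
    """Mean F1 over (report, question, known_answer) triples for a given answer generator."""
    scores = [token_f1(generate(report, question), known) for report, question, known in test_set]
    return sum(scores) / len(scores)


# Stub generator standing in for steps i. to iv.; it just returns the report text.
def echo_model(report, question):
    return report


# Placeholder test data, invented for illustration.
test_set = [("EGFR exon 19 deletion", "Which alteration was found?", "egfr exon 19 deletion")]
score = evaluate(echo_model, test_set)
```

A score of this kind could serve as the signal that drives the adaptation of the foundation model's parameters, for example by selecting or weighting training examples.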
The foundation model may comprise a transformer model, specifically selected from the group consisting of:
- Mixtral-8x7B;
- Llama 3 70B;
- Phi-3-medium.
The test questions may comprise at least one positive test question, wherein the positive test question is answerable on the basis of the report on cancer data of the patient. The test questions may comprise at least one negative test question, wherein the negative test question is not answerable on the basis of the report on cancer data of the patient.
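A small sketch of how positive and negative test questions might be represented and scored. It assumes that the desired behaviour on a negative test question is an explicit abstention; the abstention string, the `QuestionCase` record and the `abstention_accuracy` function are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

NOT_IN_REPORT = "not stated in the report"  # hypothetical abstention answer


@dataclass(frozen=True)
class QuestionCase:
    text: str
    known_answer: Optional[str]  # None marks a negative test question

    @property
    def is_positive(self) -> bool:
        return self.known_answer is not None


def abstention_accuracy(answer_fn, questions):
    """Fraction of questions on which the model abstains exactly when it should."""
    correct = 0
    for q in questions:
        abstained = answer_fn(q.text).strip().lower() == NOT_IN_REPORT
        correct += abstained != q.is_positive
    return correct / len(questions)


# Placeholder questions, invented for illustration.
questions = [
    QuestionCase("Which EGFR alteration was detected?", "exon 19 deletion"),
    QuestionCase("What is the patient's blood pressure?", None),
]


def always_answers(question):
    return "exon 19 deletion"


def careful_model(question):
    return "exon 19 deletion" if "EGFR" in question else NOT_IN_REPORT
```

Here `always_answers` is penalised on the negative question, whereas `careful_model` is counted as correct on both.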
In Figure 3, an exemplary method 136 of generating a proposal for a medical treatment of a patient is shown. The method 136 comprises using the method 110 of generating information on a health condition of a patient as elsewhere disclosed herein. The method 136 comprises, in step i. (denoted by reference number 112), providing a computer-readable report on cancer data of the patient (denoted by reference number 138) to the method of generating context-related information on a health condition of a patient, wherein the method further comprises providing, in step iii. (denoted by reference number 116), a question relating to at least one possible treatment of the patient (denoted by reference number 140).
In Figure 4, an exemplary computer system 142 for generating information on a health condition of a patient, specifically by performing the method of generating information on a health condition of a patient 110 as elsewhere disclosed herein, is shown.
The computer system 142 comprises:
I. a retrieving module 144 configured for retrieving a computer-readable report on cancer data of the patient;
II. an embedding module 146 configured for creating a vector store, preferably in a vector module 148, on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store;
III. an input module 150 configured for providing computer-readable question data comprising information on a context-related user question;
IV. a processing module 152 comprising at least one trained language model, specifically at least one trained large language model, the processing module being configured for automatically generating, by using the vector store, a context-related answer to the user question;
V. a searching module 154 configured for automatically performing a literature search in at least one external data source on at least one subject related to the user question; and
VI. an evaluation module 156 configured for creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v..
The computer system 142 may further comprise an OCR module 158 configured for performing optical character recognition (OCR).
The computer system 142 further may comprise an outputting module 160 configured for outputting the information on the health condition of a patient enriched with the at least one item of literature information. The outputting module 160 may be a frontend.
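The interplay of modules I. to VI. can be sketched as a simple composition of callables. This is only an architectural illustration: the class name `ComputerSystem`, the `patient_id` argument and the lambda stubs are assumptions, and each stub would be replaced by a real module in practice.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ComputerSystem:
    """Wires modules I. to VI. together; each module is passed in as a plain callable."""
    retrieve_report: Callable[[str], str]          # I.   retrieving module
    build_vector_store: Callable[[str], object]    # II.  embedding module
    read_question: Callable[[], str]               # III. input module
    generate_answer: Callable[[object, str], str]  # IV.  processing module
    search_literature: Callable[[str], list]       # V.   searching module
    enrich: Callable[[str, list], dict]            # VI.  evaluation module

    def run(self, patient_id: str) -> dict:
        report = self.retrieve_report(patient_id)
        store = self.build_vector_store(report)
        question = self.read_question()
        answer = self.generate_answer(store, question)
        literature = self.search_literature(question)
        return self.enrich(answer, literature)


# Stub modules with placeholder behaviour, for illustration only.
system = ComputerSystem(
    retrieve_report=lambda pid: f"report for {pid}",
    build_vector_store=lambda report: [report],
    read_question=lambda: "Which therapy options are mentioned?",
    generate_answer=lambda store, q: f"answer from {len(store)} chunk(s)",
    search_literature=lambda q: ["placeholder literature item"],
    enrich=lambda answer, lit: {"answer": answer, "literature": lit},
)
result = system.run("patient-001")
```

Passing the modules in as callables keeps each one replaceable, for example to swap the searching module for a different external data source without touching the rest.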
Further, a computer program comprising instructions is disclosed (not shown) which, when the program is executed by a computer or a computer network, cause the computer or computer system to perform at least one of the method of generating information on a health condition of a patient as elsewhere disclosed herein; the training method as elsewhere disclosed herein; the method of generating a proposal for a medical treatment of a patient as elsewhere disclosed herein. Further, a computer-readable storage medium is disclosed (not shown), specifically a non-transient computer-readable medium, comprising instructions which, when the instructions are executed by a computer or a computer system, cause the computer or computer network to perform at least one of the method of generating information on a health condition of a patient as elsewhere disclosed herein; the training method as elsewhere disclosed herein; the method of generating a proposal for a medical treatment of a patient as elsewhere disclosed herein.
List of reference numbers
110 computer-implemented method of generating information on a health condition of a patient
112 retrieving a computer-readable report on cancer data
114 creating a vector store on the report on cancer data
116 retrieving computer-readable question data comprising information on a user question
118 using at least one trained language model
120 automatically performing a literature search in at least one external data source
122 creating the information on the health condition of a patient
124 outputting the information on the health condition of a patient
126 training method of training a language model
128 providing a foundation model
130 providing a test report set
132 defining a test question set
134 fine-tuning the foundation model
136 method of generating a proposal for a medical treatment of a patient
138 providing a computer-readable report on cancer data of the patient
140 providing a question relating to at least one possible treatment of the patient
142 computer system
144 retrieving module
146 embedding module
148 vector module
150 input module
152 processing module
154 searching module
156 evaluation module
158 OCR module
160 outputting module
Claims
1. A computer-implemented method of generating information on a health condition of a patient, comprising:
i. retrieving a computer-readable report on cancer data of the patient;
ii. creating a vector store on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store;
iii. retrieving computer-readable question data comprising information on a user question, specifically a context-related user question;
iv. using at least one trained language model, specifically at least one large language model, for automatically generating, by using the vector store, an answer to the user question, specifically a context-related answer to the user question;
v. automatically performing a literature search in at least one external data source on at least one subject related to the user question; and
vi. creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v..
2. The method according to the preceding claim, wherein the external data source comprises at least one of: the internet; a publication server; a data base comprising results of cancer genomics research, a data base comprising medical studies and/or pharmaceutical studies; a data base comprising associations between DNA alterations and sensitivity or resistance to at least one specific therapy.
3. The method according to any one of the preceding claims, wherein the report on cancer data comprises genomic data of the patient, wherein the answer to the user question comprises at least one of: information on at least one biomarker or a plurality of biomarkers; information indicating genomic clues to resistance that imply the patient’s lack of benefit from at least one specific therapy; information on at least one potential pathogenic or likely pathogenic DNA alteration in the patient’s genome that raises the possibility of inherited germline alterations; information on an indication of a disease recurrence or a probability of a lack of the disease recurrence.
4. The method according to any one of the preceding claims, wherein step iv. comprises context-customizing the answer to the user, wherein the context-customizing comprises taking into account at least one item of information on a categorization of the user.
5. The method according to the preceding claim, wherein the context-customizing comprises taking into account at least one item of information on the intended purpose of the answer, wherein the information categorizing the purpose into a category is selected from the group consisting of: a patient-doctor consultation; a preparation of a tumor board; evaluating patient or subject suitability for a clinical trial, assigning subjects to clinical trial treatment arms.
6. The method according to any one of the preceding claims, wherein the information on the health condition of a patient further comprises quality information characterizing at least one of: the quality of the computer-readable report on the cancer data; a specimen quality of at least one specimen of the patient forming basis of the report; a specimen quantity of at least one specimen of the patient forming the basis of the report; quality information characterizing the completeness of the report; quality information identifying anomalous results in the report; quality information on the reputation of a laboratory generating the report; quality information identifying certification of a laboratory generating the report; quality information identifying one or more of validation and at least one ISO standard for laboratories.
7. The method according to any one of the preceding claims, wherein the at least one subject on which the literature search is performed in step v. comprises recent medical and/or pharmaceutical studies on at least one topic related to at least one aspect of the cancer data of the patient.
8. The method according to any one of the preceding claims, wherein the information on the health condition of a patient further comprises information, generated by the trained language model, on the source of the item of literature information, wherein the information on the source of the item of literature information contains information on at least one of: a peer-reviewed status of a publication forming basis of the literature information; published guidelines from at least one expert; a number of subjects in a trial forming basis of the literature information; a phase of a clinical trial forming basis of the literature information; a molecular biological plausibility of a connection between a laboratory finding and the potential association to mechanisms or drugs that benefit or not benefit a particular patient’s medical care.
9. The method according to any one of the preceding claims, wherein the answer to the user question comprises a summary of at least one portion of the report on cancer data of the patient having relevance to the user question.
10. The method according to the preceding claim, wherein the trained language model is further configured, such that a connection to at least one portion of at least one document of the external data source is made on which the item of literature information is based.
11. A training method of training a language model, specifically a large language model, for use in step iv. of the method according to any one of the preceding claims, the method comprising:
- providing a foundation model;
- providing a test report set, the test report set comprising at least one computer-readable report on cancer data of a patient;
- defining a test question set, the test question set comprising a plurality of training questions relating to the test report of the test question set, wherein the answers to the test questions are known;
- fine-tuning the foundation model by using the test report set in step i. of the method, by using the test question set in step iii. of the method, and by comparing the answers generated in step iv. with the known answers to the test questions and adapting parameters of the foundation model; and
- performing the method according to any one of the preceding claims, wherein the fine-tuned foundation model is used as the at least one trained language model in step iv..
12. A method of generating a proposal for a medical treatment of a patient, the method comprising using the method of generating information on a health condition of a patient according to any one of the preceding claims referring to a method of generating information on a health condition of a patient, wherein the method comprises, in step i., providing a computer-readable report on cancer data of the patient to the method of generating context-related information on a health condition of a patient, wherein the method further comprises providing, in step iii., a question relating to at least one possible treatment of the patient.
13. A computer system (142) for generating information on a health condition of a patient, specifically by performing the method according to any one of the preceding claims referring to a computer-implemented method of generating information on a health condition of a patient, the computer system comprising:
I. a retrieving module (144) configured for retrieving a computer-readable report on cancer data of the patient;
II. an embedding module (146) configured for creating a vector store on the report on cancer data, by
• tokenizing the report and creating tokens;
• embedding the tokens, thereby creating vectors; and
• storing the vectors in the vector store;
III. an input module (150) configured for providing computer-readable question data comprising information on a context-related user question;
IV. a processing module (152) comprising at least one trained language model, specifically at least one trained large language model, the processing module being configured for automatically generating, by using the vector store, a context-related answer to the user question;
V. a searching module (154) configured for automatically performing a literature search in at least one external data source on at least one subject related to the user question; and
VI. an evaluation module (156) configured for creating the information on the health condition of a patient, by enriching the answer obtained in step iv. with at least one item of literature information obtained by the literature search in step v..
14. A computer program comprising instructions which, when the program is executed by a computer or a computer network, cause the computer or computer system to perform at least one of: the method according to any one of the preceding claims referring to a computer-implemented method of generating information on a health condition of a patient; the training method according to any one of the preceding claims referring to a training method of training a language model; the method according to any one of the preceding claims referring to a method of generating a proposal for a medical treatment of a patient.
15. A computer-readable storage medium, specifically a non-transient computer-readable medium, comprising instructions which, when the instructions are executed by a computer or a computer system, cause the computer or computer network to perform at least one of: the method according to any one of the preceding claims referring to a computer-implemented method of generating information on a health condition of a patient; the training method according to any one of the preceding claims referring to a training method of training a language model; the method according to any one of the preceding claims referring to a method of generating a proposal for a medical treatment of a patient.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24184785.4 | 2024-06-26 | | |
| EP24184785 | 2024-06-26 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2026003128A1 (en) | 2026-01-02 |
Family
ID=91700257
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2025/067992 Pending WO2026003128A1 (en) | 2024-06-26 | 2025-06-25 | Method and system for generating information on a health condition of a patient |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2026003128A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190108898A1 (en) | 2017-10-09 | 2019-04-11 | Peter Gulati | System and method for increasing efficiency of medical laboratory data interpretation, real time clinical decision support, and patient communications |
| US20210082561A1 (en) | 2019-09-13 | 2021-03-18 | RAD AI, Inc. | Method and system for automatically generating a radiology impression |
| US20210118559A1 (en) | 2019-10-22 | 2021-04-22 | Tempus Labs, Inc. | Artificial intelligence assisted precision medicine enhancements to standardized laboratory diagnostic testing |
| US20240029848A1 (en) | 2022-07-22 | 2024-01-25 | Pangaea Data Limited | Systems and methods for generating a text report and simulating health care journey |
| US20240095455A1 (en) | 2022-07-14 | 2024-03-21 | Cadence Solutions, Inc. | Systems and methods for question-answering using a multi-modal end to end learning system |
Non-Patent Citations (12)
| Title |
|---|
| ALBERT Q. JIANG ET AL., MIXTRAL OF EXPERTS, vol. 2401, 2024, pages 04088, Retrieved from the Internet <URL:https://arxiv.org/abs/2401.04088> |
| DAIXUAN CHENG, SHAOHAN HUANG, FURU WEI: "Adapting Large Language Models via Reading Comprehension", THE TWELFTH INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2024, Retrieved from the Internet <URL:https://openreview.net/forum?id=y886UXPEZ0> |
| GU, YU; TINN, ROBERT; CHENG, HAO; LUCAS, MICHAEL; USUYAMA, NAOTO; LIU, XIAODONG; NAUMANN, TRISTAN; GAO, JIANFENG; POON, HOIFUNG: "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing", ACM TRANSACTIONS ON COMPUTING FOR HEALTHCARE, vol. 3, 2021, pages 2637 - 8051, Retrieved from the Internet <URL:https://arxiv.org/abs/2007.15779> |
| HYUNG WON CHUNG ET AL., SCALING INSTRUCTION-FINETUNED LANGUAGE MODELS, vol. 2210, 2022, pages 11416, Retrieved from the Internet <URL:https://arxiv.org/abs/2210.11416> |
| LIANG WANG, NAN YANG, XIAOLONG HUANG, BINXING JIAO, LINJUN YANG, DAXIN JIANG, RANGAN MAJUMDER, FURU WEI, TEXT EMBEDDINGS BY WEAKLY-SUPERVISED CONTRASTIVE PRE-TRAINING, vol. 2212, 2024, pages 03533, Retrieved from the Internet <URL:https://arxiv.org/pdf/2212.03533> |
| MARAH ABDIN ET AL., PHI-3 TECHNICAL REPORT: A HIGHLY CAPABLE LANGUAGE MODEL LOCALLY ON YOUR PHONE, vol. 2404, 2024, pages 14219, Retrieved from the Internet <URL:https://arxiv.org/abs/2404.14219> |
| META-LLAMA/LLAMA3, 20 June 2024 (2024-06-20), Retrieved from the Internet <URL:https://github.com/meta-llama/llama3/tree/main> |
| OPENBIOLLMS: ADVANCING OPEN-SOURCE LARGE LANGUAGE MODELS FOR HEALTHCARE AND LIFE SCIENCES, 20 June 2024 (2024-06-20), Retrieved from the Internet <URL:https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B> |
| RENQIAN LUO, LIAI SUN, YINGCE XIA, TAO QIN, SHENG ZHANG, HOIFUNG POON, TIE-YAN LIU: "BioGPT: generative pre-trained transformer for biomedical text generation and mining", BRIEFINGS IN BIOINFORMATICS, vol. 23, November 2022 (2022-11-01), pages 409, Retrieved from the Internet <URL:https://doi.org/10.1093/bib/bbac409> |
| ZEHAN LI, XIN ZHANG, YANZHAO ZHANG, DINGKUN LONG, PENGJUN XIE, MEISHAN ZHANG, TOWARDS GENERAL TEXT EMBEDDINGS WITH MULTI-STAGE CONTRASTIVE LEARNING, vol. 2308, 2023, pages 03281, Retrieved from the Internet <URL:https://arxiv.org/abs/2308.03281> |
| ZHU LIBING ET AL.: "Testing and Validation of a Custom Retrained Large Language Model for the Supportive Care of HN Patients with External Knowledge Base", CANCERS, vol. 16, no. 13, 24 June 2024 (2024-06-24), pages 2311, XP093239606, DOI: 10.3390/cancers16132311 |
| ZHU LIBING ET AL: "Testing and Validation of a Custom Retrained Large Language Model for the Supportive Care of HN Patients with External Knowledge Base", CANCERS, vol. 16, no. 13, 24 June 2024 (2024-06-24), CH, pages 2311, XP093239606, ISSN: 2072-6694, DOI: 10.3390/cancers16132311 * |