
WO2020222202A1 - System and method for phrase comparison consolidation and reconciliation - Google Patents

Info

Publication number
WO2020222202A1
Authority
WO
WIPO (PCT)
Prior art keywords
phrase
semantic
term
similarity
syntactic
Prior art date
Application number
PCT/IB2020/054165
Other languages
French (fr)
Inventor
Ron TENENBAUM
Michael MERRY
Original Assignee
The Clinician Pte. Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Clinician Pte. Ltd
Priority to SG11202111653XA
Priority to AU2020265819A
Publication of WO2020222202A1
Priority to IL287674A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present invention relates to a computerized system and method for comparison, consolidation and reconciliation of phrases, and more particularly questions.
  • the invention is particularly suited for use in domains having a high degree of semantic / ontological structure, such as medicine, law and engineering.
  • the invention may also be used in other, less formalised or more generic fields or contexts, for example to process content posed to online question-and-answer forums.
  • The “syntax” of a phrase relates to its structure, without regard to its meaning.
  • Syntactic analysis identifies each word in the sentence as being a particular part of speech (e.g., article, subject, relative pronoun), and determines the relationships of each of these with each other (e.g. the relative pronoun relating to the subject of the previous clause); these will of course vary from one language to the next.
  • “Semantics” relates to the meaning, or definition, of the specific words within a phrase or sentence. “Ontologies” are terms used to group together concepts, phenomena, articles, et cetera that share common properties or characteristics. So ontologies can be thought of as “high-level” semantics in describing any number of members of a given category; with each of those members also having a specific semantic definition peculiar to it. The relationship between ontologies and semantics, that is, between the high-level category and the particular object, is often represented using a “subject” PREDICATE “object” structure; for example, “knee IS A joint”.
  • Domain-specific ontologies are ontologies developed for, and used within, particular technical domains.
  • the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) is an example of a domain-specific ontology in the medical field (Lee, Keizer, Lau, & Cornet, 2014).
  • Semantic analysis is also used in other ways in the medical field, again as an extension of keyword-based information retrieval.
  • US 2014/0316797 A1 discloses a system for creating a consolidated snapshot of a patient’s medication information by querying a range of sources.
  • US20120221347A1 discloses a system for reconciling data regarding a given patient sourced from a range of distinct databases; and also discloses using semantic analysis (mining of variables) to identify the relevant clinical question being posed and propose a number of reporting options (see for example [0099] and [0234]).
  • US8898798B2 discloses the use of semantic analysis for mining of large anonymized patient data banks, to determine ontological population trends and the like; as well as using markers (demographic, genotypic, phenotypic) to identify groups of potentially suitable candidates for clinical trials et cetera.
  • ZEDOC™ is one example of a question-retrieval system used in the medical field, with a particular focus on Patient Experience Measures (PREMs) and Patient-Reported Outcome Measures (PROMs).
  • ZEDOC™ contains a large database of questions (over 1000 questionnaires, averaging around 10 questions each) categorized by subject matter, i.e. specific medical domains and sub-domains.
  • Patient information is entered, and the system compares this to the questions in the database and identifies those which are relevant. For each subject matter category (corresponding to each of the identified questions), the system then creates a questionnaire. The set of questionnaires is put into a bundle and sent to the patient, who fills in each of them.
  • semantic analysis in a medical context has certain limitations. Due to the breadth and complexity of the medical field, a given semantic or ontological term is bound to return a large number of “hits” from across a range of subject matter categories; which will result in the system generating and presenting to the patient a large number of questionnaires. Across these, there is likely to be much repetition: both in a domain-specific context (such as asking about pain scores in different ways) and in a non-domain-specific context (such as asking about things like name, age, and gender).
  • a computerized system for comparing a first and second phrase comprising at least a storage and a processor; the processor configured to, for each phrase: parse the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, query a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); retrieve from the storage a phrase template, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element; insert the semantic term and at least one of the remaining distinct syntactic elements into the portion of the phrase template that corresponds, respectively, to their type; the processor further configured to: use a similarity metric stored in the storage to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or
  • the first and second phrase pertain to a technical domain having a high degree of semantic / ontological structure.
  • the technical domain is one of: medicine, law, or engineering.
  • the first and second phrase comprise a first and second sentence; and more preferably, a first and second question.
  • the distinct syntactic elements comprise one or more of: noun, verb, adverb, article, subject, object, pronoun, relative pronoun.
  • the distinct syntactic elements further comprise one or more domain-specific syntactic categories corresponding to common concepts or phrase formats within the technical domain.
  • the concepts preferably include one or more of: “time / duration”, “frequency”, “severity”, and “clinical finding”; and the formats preferably include one or more of: “type of question” and “type of answer”.
  • one or more of the domain-specific syntactic categories are further subdivided into subcategories.
  • the concept “clinical finding” is preferably further subdivided into the sub-concepts of “diagnosis” and “location” / “body structure”.
  • the semantic database corresponds to the relevant technical domain.
  • the semantic database is preferably provided by the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT).
  • the semantic database is stored in the storage.
  • the semantic database is located externally of the system, such as hosted or provided by a third party.
  • a plurality of the distinct syntactic elements are queried in the semantic database, and a corresponding semantic or ontological term is identified for each, such that there are a plurality of “semantic terms”.
  • the phrase template is configured appropriately to the technical domain to which the first and second phrase pertain.
  • the phrase template is configured with regard to the one or more domain-specific syntactic categories (and optionally subcategories), that is, concepts commonly arising within the technical domain and / or common formats of phrases occurring within the domain.
  • the concepts preferably include one or more of: “time / duration”, “frequency”, “severity”, and “clinical finding”; and the formats preferably include one or more of: “type of question” and “type of answer”.
  • the domain-specific syntactic categories (and optionally subcategories) relating to common concepts at least partly define the types of semantic or ontological term into which the phrase template is divided; and the domain-specific syntactic categories (and optionally subcategories) relating to common phrase formats at least partly define the types of syntactic element into which the phrase template is divided.
  • the plurality of semantic terms are inserted into the portion of the phrase template that corresponds to their respective type.
  • a plurality of the remaining distinct syntactic elements are also inserted into the portion of the phrase template that corresponds to their respective type.
  • the similarity metric is configured appropriately to the technical domain to which the first and second phrase pertain.
  • the similarity metric assigns a ranked weighting to each of the respective types of semantic or ontological term and syntactic element within the first and second phrase template.
  • the semantic or ontological terms generally have a higher ranked weighting than the syntactic elements.
  • the consolidation or reconciliation of the first and second phrase comprises one or more of: identifying or labelling the first or second phrase as a duplicate; deleting one of the first or second phrase; or merging the first and second phrase.
  • merging the first and second phrase comprises replacing the at least one semantic term in the second phrase with one of: the corresponding at least one semantic term in the first phrase; or the word or group of words corresponding to the at least one semantic term in the first phrase.
  • the computerized system is configured for comparing a plurality of phrases.
  • the storage comprises a plurality of phrase templates and similarity metrics corresponding to a plurality of technical domains; and the system either contains in storage, or is configured to query, a plurality of semantic databases corresponding to a plurality of technical domains.
  • a computerized interface platform comprising a computerized system substantially as described above, the interface platform comprising at least a storage (“the platform storage”) and a processor (“the platform processor”), the platform processor configured to: query an information database in the platform storage for relevant information; retrieve from a phrase database in the platform storage a first and second phrase relating to the relevant information; using the computerized system, compare phrase templates corresponding to the first and second phrase and, if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged; and send the consolidated phrase or the first or second phrases (as the case may be) to an end user.
  • the computerized interface platform is configured for use in the technical field of medicine, and the relevant information is information pertaining to a particular patient (being the end user).
  • a plurality of consolidated phrases or first and second phrases is sent to the end user as one or more questionnaire(s).
  • the platform processor is further configured to receive from the end user responses to the consolidated phrases or the first or second phrases (as the case may be), and to store said responses in the platform storage.
  • a computerized method of comparing a first and second phrase using a system comprising at least a storage and a processor, the method comprising the steps of the processor, for each phrase: parsing the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, querying a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); inserting the semantic term and at least one of the remaining distinct syntactic elements into the phrase template retrieved from the storage, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element, and the semantic term and the at least one of the remaining distinct syntactic elements are inserted, respectively, into the portion of the phrase template that corresponds to their type; the method further comprising the steps of the processor: using a similarity metric to quantify the similarity between the phrase templates corresponding to the first and second
  • the present invention provides a number of advantages over the prior art, including:
  • FIGURE 1A is a schematic showing the elements and operation of the existing ZEDOC™ system
  • FIGURE 1B is a schematic showing the elements and operation of the ZEDOC™ system when combined with the system of the present invention
  • FIGURE 2 is a schematic showing how the system of the present invention processes an inputted phrase
  • FIGURE 3 is a schematic showing an alternative embodiment of the system of the present invention
  • FIGURE 4 is a schematic showing how the system of the present invention may be used in a broader clinical context.
  • FIG. 1A is a schematic showing the operation of the existing ZEDOC™ question retrieval system (generally indicated by 100).
  • the processing portion (104) of the system (known as ZEDOC™ Core) first requests the relevant information from the database (102) to be able to create the questionnaires to send to the patient.
  • This information (which could be in the form of keywords entered by a user, such as a clinician) will typically include things like symptoms reported by the patient during their most recent consultation; medications the patient is taking; or lifestyle habits of the patient.
  • the processor then sweeps the database (102) for questions / questionnaires having relevance to the inputted patient information.
  • the set of identified questionnaires is put into a bundle and sent to the patient (108), such as via email (106).
  • the patient then responds to all of the questionnaires, for example by using a smartphone application (110), and the responses are then returned to the ZEDOC™ system and stored on its server.
  • the ZEDOC™ system is guided only by the information / keywords from the database, which it then compares to existing questionnaires. It has no ability to compare questions within the respective questionnaires to each other, by assessing them syntactically or semantically. Due to the size and complexity of the database, a search for any given term is likely to result in a large number of “hits”; meaning a large number of questionnaires will be sent to the patient. Within these, there is likely to be much repetition of questions, asking for the same information but in slightly different ways.
  • Figure 1B schematically illustrates the ZEDOC™ system (100) used in combination with the system of the present invention (generally indicated by 200); which, although not shown, comprises at least a storage and a processor.
  • the questionnaires (202) generated by ZEDOC™ are passed to the system of the present invention. This compares all of the questions in all of the questionnaires to each other, and, where these are sufficiently similar, reconciles / consolidates the questions by merging them, labelling them as duplicates, or deleting some of them; this process is generally indicated by (204). An example of part of this process (generally indicated by 300) is shown in Figure 2.
  • the system outputs the consolidated set of questionnaires (206) to the ZEDOC™ system for sending to the patient as per Figure 1A.
  • a much smaller number of questions (and / or questionnaires) will ultimately be sent to the patient.
  • the process employed by the system (200) may be algorithmically represented as follows:
  • ZEDOC™ presents the questions identified by its database to the system of the present invention. For each question (302), the system first applies “syntactic parsing” (304) to break the sentence down into its distinct syntactic elements or “building blocks”, each of which might include one word or a number of words.
  • There are known algorithms for syntactic parsing that can be used in the system of the present invention, one example of which is SyntaxNet developed by Google.
  • additional “domain-specific syntactic categories” can be set up, for instance reflecting common themes, concepts, or phrase formats within a given domain - such as, for medical questions, the themes of time / duration, frequency, severity, and / or clinical finding, and the formats of “type of question” and / or “type of answer”.
  • the phrase “in the last 14 days” is identified as belonging to the domain-specific syntactic category of time / duration; while “extreme” would be identified as belonging to the domain-specific syntactic category of “severity”. (Note, this may alternatively or additionally be done at the phrase template stage, discussed below).
  • once syntactic parsing is completed, selected syntactic elements are cross-referenced to a semantic database (306), as schematically indicated by arrows (308).
  • the elements referred to the semantic database will be verbs, nouns, verb/noun phrases, or adverbial phrases.
  • other syntactic elements may also be so referred.
  • the system may be configured to refer to the semantic database any elements that have not, in the syntactic analysis step, been identified as belonging to one or more predetermined syntactic categories.
  • the semantic database (306) should be an ontological database suited to the technical domain at issue - for instance, for the medical domain, the SNOMED-CT semantic database is particularly appropriate.
  • the phrase “knee pain” matches the SNOMED-CT term 30989003 Knee pain.
  • the terms “pain” and “knee” individually also match SNOMED-CT terms, as does the term “day”.
  • a word like “extreme” would, if present, be matched to the SNOMED term 12565001 Extreme (qualifier value).
  • the semantic database (306) may be stored in the system’s storage. However, it is equally possible for the semantic database to be hosted / provided by an external source, and queried remotely by the system’s processor. For example, many ontologies / ontological databases are hosted on university websites.
  • phrase template (schematically indicated by 314), which is divided by type of semantic or ontological term and by type of syntactic element.
  • the specifics of how a given template is structured / divided are informed by the relevant technical field or domain: in particular, by common concepts (and sub-concepts) that arise within that domain, and common phrase formats (and sub-formats) occurring within that domain.
  • the division of the phrase template may be informed by the “domain-specific syntactic categories” discussed above, into which the question has already been parsed.
  • common concepts include time / duration, frequency, and severity; and / or clinical finding (including the sub-concepts of “diagnosis” and “location” / “body structure”).
  • semantic or ontological division of the phrase template will be along these lines (or at least include these).
  • Common formats include “type of question” and “type of answer”; and thus the syntactic division of the phrase template will be along these lines (or at least include these).
  • Some or all of the semantic terms of the phrase (identified by reference to the semantic database as above) are inserted into the phrase template (314), as schematically indicated by arrow (310); as are some or all of the remaining syntactic elements, as schematically indicated by arrow (312); wherein each term / element is inserted into its appropriate “type” within the phrase template (314).
  • the term “usually” has been identified at the syntactic analysis stage as corresponding to the syntactic category of “frequency”, and thus is inserted into the appropriate place in the phrase template.
  • the term “pain” has been identified at the semantic analysis stage as corresponding to a semantic or ontological term of the type “clinical finding” (and subtype “diagnosis”), and thus is also inserted into the appropriate place in the phrase template; as is the term “knee”, which has been identified as corresponding to a semantic or ontological term of the type “clinical finding” (and subtype “location” / “body structure”).
  • the system is configured to then analyze the similarity of each question to each of the other questions, to thereby detect duplication and, if necessary or desired, effect consolidation / reconciliation between the questions.
  • the similarity metric may be of any suitable kind (having regard to the technical domain), which the skilled person will readily identify; one example being the use of cosine similarity.
  • each type of term and element within the respective phrase templates is assigned a ranked weighting. Particularly in semantically / ontologically highly structured domains such as medicine, it is desirable for the similarity between phrases to be assessed primarily by reference to the semantic / ontological terms therein, and accordingly these should have a higher ranked weighting than the remaining (syntactic) terms.
  • steps are taken to consolidate or reconcile these. This can include identifying or labelling the phrases as duplicates, merging the phrases together, or deleting one of the phrases altogether.
  • One way of merging phrases together would be to replace a given term in one of the phrases with the corresponding term from another of the phrases, to form a “hybrid” phrase.
  • For instance, where a pair of semantic terms in two phrases have been found to be ontologically similar, the relevant term in the first phrase might be replaced by the relevant term in the other. This could either be accomplished by substituting in the formal semantic / ontological term identified by the semantic database, or by substituting in the informal term that was originally used in the first phrase.
  • This system is able to rank questions in an order of similarity that is highly plausible and also consistent. This was established by selecting a small number of questions from ZEDOC™ and submitting these for comparison by, on the one hand, seven human trial candidates, and on the other hand the system of the invention. The number of questions had to be kept very small in order to be manageable by the human candidates. Even so, the results show the efficacy of the system.
  • the system of the invention offers a number of significant advantages. Most fundamentally, due to using both semantic and syntactic analysis, it is capable of identifying questions that are phrased differently but have the same meaning; thereby eliminating or significantly ameliorating the problem of duplicated or repetitive questions being posed to users (such as patients). Furthermore, it is capable of comparing, and in particular assessing the similarity between, an immense number of individual questions or questionnaires stored in a database in a very short period of time. It is capable of doing so across technical domains (so long as there is sufficient ontological commonality) as well as across different languages. It is also capable of doing this not only with a high degree of accuracy, but also with consistency and objectivity. These are all things which could never feasibly be accomplished manually or in the absence of such a computerized system.
  • One possible alternative use for the invention is represented schematically in Figure 3: namely, as a questionnaire recommender and adapter system.
  • This would receive the user’s search query (either in the form of a keyword(s) or a phrase / question) and identify questionnaires which, while not identical, are potentially useful; and further would optionally recommend or implement a replacement or adaptation to the identified questionnaire, so as to match the user’s query.
  • search for questionnaires regarding shoulder pain and retrieve two, one which is for hip pain/mobility (404) and the other for knee pain/mobility (406).
  • These are ontologically similar (due to the similarity of joints, as opposed to, for example, stomach pain), and the hip/knee concept can be replaced with “shoulder” to create an adapted questionnaire (406) customized to the user’s search query.
  • Such a questionnaire recommender and adapter system might use much the same question parsing and reconciliation process as described above; wherein the similarity metric is again configured to place primary emphasis on the similarity of the semantic / ontological terms, and lesser emphasis on syntactic structure of the questions.
  • questions relating to shoulder pain, knee pain, or hip pain would all be recognized as ontologically similar and weighted accordingly; then also weighted by syntactic similarity.
  • the similarity metric may be configured to give precedence to the syntactic similarity between the sentences, i.e. look first to the structure / overall format of the question, and then to the similarity of the semantic terms therein.
  • Figure 4 details how the system (100) of the present invention fits into the wider clinical context (502).
  • the details of what the clinicians (504) need to know are specified by information from the Clinical Data Repository (506) which is processed by the Electronic Health Record (EHR) (508).
  • This information is passed through the hospital integration engines (510) to the ZEDOC™ (100) integration engine and passed to the ZEDOC™ Core system (104).
  • Reconciliation of the questionnaires takes place by the system (200) of the present invention, as described above.
  • the reconciled questionnaires are posed to the patient (108), and the data is passed back to the ZEDOC™ Core (104) system. From there, the data is passed back through the integration engines (510) to the EHR system (508) and into the Clinical Data Repository (506) where it can be consumed by clinicians (504) and other hospital staff.
  • non-adherence to questionnaires is a cost factor for PROMs delivery. Higher adherence decreases the cost of delivery (as fewer in-clinic staff may be needed, and fewer follow-ups are needed to get patient responses).
  • the system of the present invention provides a value increase. The increase in information from the clinically-validated questionnaires will help improve clinical care and outcomes. Additionally, as this can also be used for PREMs, higher adherence rates will result in improved organizational efficiency and quality, as better information is provided to hospitals with regard to patient experience.
  • the system of the present invention also mitigates the need to identify and examine newly created PROMs instruments in a progressively expanding pool that spans across multiple health domains.
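
As a rough illustration of the similarity metric described in the definitions above, the following Python sketch compares two already-filled phrase templates using a weighted cosine similarity, with the semantic / ontological slots weighted more heavily than the syntactic ones. The slot names, the weight values, the 0.8 threshold and the function names are all assumptions made for this example; the source names cosine similarity only as one possible metric and gives no concrete weights.

```python
import math

# Hypothetical slot weights (not from the source): semantic / ontological
# slots are ranked above syntactic / format slots.
WEIGHTS = {
    "clinical_finding.diagnosis": 3.0,
    "clinical_finding.location": 3.0,
    "severity": 2.0,
    "time_duration": 1.0,
    "frequency": 1.0,
    "question_type": 0.5,
    "answer_type": 0.5,
}


def weighted_cosine(t1, t2):
    """Cosine similarity over (slot, value) features scaled by slot weight."""
    def vec(template):
        return {(slot, value): WEIGHTS.get(slot, 0.5)
                for slot, value in template.items()}

    v1, v2 = vec(t1), vec(t2)
    dot = sum(w * v2[f] for f, w in v1.items() if f in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def compare(t1, t2, threshold=0.8):
    """Return the decision and the score; the threshold is an assumed value."""
    score = weighted_cosine(t1, t2)
    decision = "consolidate" if score > threshold else "leave unchanged"
    return decision, score


# Illustrative templates (slot values are made up for the example).
knee_usually = {"clinical_finding.diagnosis": "pain",
                "clinical_finding.location": "knee",
                "frequency": "usually", "question_type": "yes_no"}
knee_often = dict(knee_usually, frequency="often")
stomach_usually = dict(knee_usually, **{"clinical_finding.location": "stomach"})

print(compare(knee_usually, knee_often))
print(compare(knee_usually, stomach_usually))
```

Under these assumed weights, two questions that differ only in a low-weight slot such as frequency still score above the threshold, while a change of body location pulls the score well below it.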

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a system for comparing a first and second phrase, a method of using said system, and an interface platform comprising said system. The system is configured to parse each phrase into distinct syntactic elements; query a semantic database to identify a semantic or ontological term ("the semantic term") corresponding to at least one of said distinct syntactic elements; and insert the semantic term and at least one of the remaining distinct syntactic elements into an appropriate portion of a phrase template. The system is then configured to use a similarity metric to quantify the similarity between the first and second phrase templates; and: if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged. In using both semantic and syntactic analysis to determine the similarity between the first and second phrase, the system enables improved detection of phrases that are in substance similar or repetitive, thereby reducing the number and repetitiveness of phrases (such as questions) ultimately posed to a user and in turn improving user participation and related outcomes.
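
The first half of the pipeline summarized above (parse, look up semantic terms, fill a template) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the two small lexicons stand in for a real syntactic parser and a real semantic database, the slot names are invented, and the mapping of "ache" to "pain" is an assumption added to show normalization.

```python
import re

# Hypothetical stand-in for domain-specific syntactic categories.
CATEGORY_PHRASES = {
    "in the last 14 days": "time_duration",
    "usually": "frequency",
    "extreme": "severity",
}

# Hypothetical stand-in for a semantic database lookup:
# surface word -> (template slot, normalized semantic term).
SEMANTIC_TERMS = {
    "pain": ("clinical_finding.diagnosis", "pain"),
    "ache": ("clinical_finding.diagnosis", "pain"),  # assumed synonym
    "knee": ("clinical_finding.location", "knee"),
}


def fill_template(phrase):
    """Map a phrase to a dict of template slot -> value."""
    text = phrase.lower()
    template = {}
    # Step 1: pull out (possibly multi-word) domain-specific categories.
    for surface, slot in CATEGORY_PHRASES.items():
        if surface in text:
            template[slot] = surface
            text = text.replace(surface, " ")
    # Step 2: look up each remaining word in the semantic lexicon.
    leftovers = []
    for word in re.findall(r"[a-z0-9]+", text):
        if word in SEMANTIC_TERMS:
            slot, term = SEMANTIC_TERMS[word]
            template[slot] = term
        else:
            leftovers.append(word)
    # Step 3: keep unmatched words as the remaining syntactic elements.
    template["other"] = " ".join(leftovers)
    return template


print(fill_template("Did you usually have knee pain in the last 14 days?"))
print(fill_template("Have you had extreme knee ache?"))
```

Because both "pain" and "ache" normalize to the same semantic term in this toy lexicon, differently worded questions end up with matching values in the diagnosis slot, which is what makes the later template comparison meaningful.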

Description

System and Method for Phrase Comparison, Consolidation and Reconciliation
FIELD OF THE INVENTION
The present invention relates to a computerized system and method for comparison, consolidation and reconciliation of phrases, and more particularly questions. The invention is particularly suited for use in domains having a high degree of semantic / ontological structure, such as medicine, law and engineering. However, the invention may also be used in other, less formalised or more generic fields or contexts, for example to process content posed to online question-and-answer forums.
BACKGROUND
There is a need in many industries, such as the medical industry, to use automated research and data collection processes, often involving questionnaires for people to fill in and return for analysis. As the world becomes increasingly automated, these types of research and data collection questionnaires are often automatically generated. Due to the variety and complexity of natural language, there are multiple ways of asking any given question, so when delivering automated questionnaires, it is likely that there are multiple identical or similar questions that exist across multiple questionnaires.
The “syntax” of a phrase, such as a sentence, relates to its structure, without regard to its meaning. Syntactic analysis identifies each word in the sentence as being a particular part of speech (e.g., article, subject, relative pronoun), and determines the relationships of each of these with each other (e.g. the relative pronoun relating to the subject of the previous clause); these will of course vary from one language to the next.
“Semantics” relates to the meaning, or definition, of the specific words within a phrase or sentence. “Ontologies” are terms used to group together concepts, phenomena, articles, et cetera that share common properties or characteristics. So ontologies can be thought of as “high-level” semantics in describing any number of members of a given category; with each of those members also having a specific semantic definition peculiar to it. The relationship between ontologies and semantics, that is, between the high-level category and the particular object, is often represented using a “subject” PREDICATE “object” structure; for example, “knee IS A joint”.
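By way of illustration only, the “subject” PREDICATE “object” structure can be sketched as a small set of triples. The names `Triple`, `TRIPLES` and `is_a` are hypothetical; only “knee IS A joint” comes from the text, and the second triple is an assumed addition used to show a two-step lookup.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# "knee IS A joint" is the example from the text; the second triple is an
# assumption added only to demonstrate following more than one link.
TRIPLES: Set[Triple] = {
    Triple("knee", "IS A", "joint"),
    Triple("joint", "IS A", "body structure"),
}


def is_a(term: str, category: str, triples: Set[Triple] = TRIPLES) -> bool:
    """Return True if `term` reaches `category` by following IS A links."""
    seen = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        if current == category:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Follow every IS A edge that starts at the current term.
        frontier.extend(
            t.obj for t in triples
            if t.subject == current and t.predicate == "IS A"
        )
    return False
```

A real ontology such as SNOMED-CT holds far more relationships and predicate types than this toy set.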
“Domain-specific ontologies” are ontologies developed for, and used within, particular technical domains. The systematised nomenclature of medicine clinical terms (SNOMED-CT) is an example of a domain-specific ontology in the medical field (Lee, Keizer, Lau, & Cornet, 2014).
It is known to use semantic analysis to automatically process (e.g. extract, create, or answer) questions in different contexts and applications. For instance, the process known as “question retrieval” focuses on taking information provided by a user and comparing this against a database of pre-existing topics. This is used to, for example, match questions posted on online question-and-answer forums to previously-asked questions / answers.
A more refined version of question retrieval is used in the medical field. Medicine, being highly formal and structured, lends itself to semantic analysis, and ontologies have accordingly been used in question retrieval in the medical field as an extension of keyword-based information retrieval (Khan, Mcleod, & Hovy, 2004). That is to say, the ontology of a given search term is identified, and related terms within that ontology are then identified and information relating to these is returned to the user as search results.
Semantic analysis is also used in other ways in the medical field, again as an extension of keyword-based information retrieval. For instance, US 2014/0316797 A1 discloses a system for creating a consolidated snapshot of a patient’s medication information by querying a range of sources. Similarly, US20120221347A1 discloses a system for reconciling data regarding a given patient sourced from a range of distinct databases; and also discloses using semantic analysis (mining of variables) to identify the relevant clinical question being posed and propose a number of reporting options (see for example [0099] and [0234]). US8898798B2 discloses the use of semantic analysis for mining of large anonymized patient data banks, to determine ontological population trends and the like; as well as using markers (demographic, genotypic, phenotypic) to identify groups of potentially suitable candidates for clinical trials et cetera.
ZEDOC™ is one example of a question-retrieval system used in the medical field, with a particular focus on Patient Experience Measures (PREMs) and Patient-Reported Outcome Measures (PROMs). ZEDOC™ contains a large database of questions (over 1000 questionnaires, averaging around 10 questions each) categorized by subject matter, i.e. specific medical domains and sub-domains. Patient information is entered, and the system compares this to the questions in the database and identifies those which are relevant. For each subject matter category (corresponding to each of the identified questions), the system then creates a questionnaire. The set of questionnaires is put into a bundle and sent to the patient, who fills in each of them.
However, semantic analysis (including question retrieval) in a medical context has certain limitations. Due to the breadth and complexity of the medical field, a given semantic or ontological term is bound to return a large number of “hits” from across a range of subject matter categories; which will result in the system generating and presenting to the patient a large number of questionnaires. Across these, there is likely to be much repetition: both in a domain-specific context (such as asking about pain scores in different ways) and in a non-domain-specific context (such as asking about things like name, age, and gender).
This imposes a response burden on the patient, particularly if they have the same experience with multiple clinicians. Typically, patients will end up becoming frustrated and discouraged on having to answer the same or similar questions repeatedly; and are unlikely to complete all the questions, leading to gaps in the information ultimately received by the clinician. If a patient is being asked to complete multiple different questionnaires, it makes for a better user experience and better outcomes, to ask the same question once, rather than multiple times.
Drafting a customized questionnaire for each patient would avoid this problem, but would of course be unacceptably time-consuming and inefficient, and would defeat the purpose of automated question retrieval systems which aim to relieve clinicians of the need to do this.
Equally, manually perusing and comparing pre-prepared questionnaire templates, such as those on ZEDOC™, and comparing these against each other would be a completely infeasible task when it is considered that there may be thousands of potentially relevant questions. Comparing each of these to all of the others would entail making in the region of 50 million comparisons - more than could be done in the average lifetime. This is without even considering humans’ cognitive and analytical limitations. Humans will almost inevitably be familiar with a particular, narrow field rather than across fields, even if they are related. Thus they will have a limited ability to assess the relevance of a questionnaire using the same ontology but from a different field. Obviously, there are also practical limitations such as language barriers, as well as the fact that human comparison will inevitably be subjective, and the outcome will vary from person to person.
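The figure of roughly 50 million comparisons can be checked with a short calculation, assuming a pool of about 10,000 questions (around 1000 questionnaires of around 10 questions each, as described for ZEDOC™ above). The function name is hypothetical; the count of unordered pairs among n items is n(n - 1)/2.

```python
from math import comb


def pairwise_comparisons(n_questions: int) -> int:
    """Number of unordered question pairs: n * (n - 1) / 2."""
    return comb(n_questions, 2)


# Assumed pool size: about 1000 questionnaires x about 10 questions each.
n = 1000 * 10
print(pairwise_comparisons(n))  # 49995000
```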
An automated way of identifying similar, equivalent, or overlapping questions is therefore necessary.
It is accordingly an object of the present invention to provide a computerized system and method for comparison, consolidation and reconciliation of phrases that addresses some of the problems of the prior art, or at least to provide the public with a useful choice.
STATEMENTS OF THE INVENTION
According to one aspect of the invention, there is provided a computerized system for comparing a first and second phrase, the system comprising at least a storage and a processor; the processor configured to, for each phrase: parse the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, query a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); retrieve from the storage a phrase template, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element; insert the semantic term and at least one of the remaining distinct syntactic elements into the portion of the phrase template that corresponds, respectively, to their type; the processor further configured to: use a similarity metric stored in the storage to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged.
Preferably, the first and second phrase pertain to a technical domain having a high degree of semantic / ontological structure.
More preferably, the technical domain is one of: medicine, law, or engineering.
Preferably, the first and second phrase comprise a first and second sentence; and more preferably, a first and second question.
Preferably, the distinct syntactic elements comprise one or more of: noun, verb, adverb, article, subject, object, pronoun, relative pronoun.
Preferably, the distinct syntactic elements further comprise one or more domain-specific syntactic categories corresponding to common concepts or phrase formats within the technical domain. For example, where the domain is medicine and the phrase is a question, the concepts preferably include one or more of: “time / duration”, “frequency”, “severity”, and “clinical finding”; and the formats preferably include one or more of: “type of question” and “type of answer”.
More preferably, one or more of the domain-specific syntactic categories are further subdivided into subcategories. For example, the concept “clinical finding” is preferably further subdivided into the sub-concepts of “diagnosis” and “location” / “body structure”.
Preferably, the semantic database corresponds to the relevant technical domain. For example, where the domain is medicine, the semantic database is preferably provided by the systematized nomenclature of medicine clinical terms (SNOMED-CT). Preferably, the semantic database is stored in the storage. Alternatively, the semantic database is located externally of the system, such as hosted or provided by a third party.
Preferably, a plurality of the distinct syntactic elements (or words or groups of words corresponding thereto) are queried in the semantic database, and a corresponding semantic or ontological term is identified for each, such that there are a plurality of “semantic terms”.
Preferably, the phrase template is configured appropriately to the technical domain to which the first and second phrase pertain.
More preferably, the phrase template is configured with regard to the one or more domain-specific syntactic categories (and optionally subcategories), that is, concepts commonly arising within the technical domain and / or common formats of phrases occurring within the domain. For example, where the domain is medicine and the phrase is a question, the concepts preferably include one or more of: “time / duration”, “frequency”, “severity”, and “clinical finding”; and the formats preferably include one or more of: “type of question” and “type of answer”.
Preferably, the domain-specific syntactic categories (and optionally subcategories) relating to common concepts at least partly define the types of semantic or ontological term into which the phrase template is divided; and the domain-specific syntactic categories (and optionally subcategories) relating to common phrase formats at least partly define the types of syntactic element into which the phrase template is divided.
Preferably, the plurality of semantic terms are inserted into the portion of the phrase template that corresponds to their respective type.
Preferably, a plurality of the remaining distinct syntactic elements are also inserted into the portion of the phrase template that corresponds to their respective type.
Preferably, the similarity metric is configured appropriately to the technical domain to which the first and second phrase pertain.
Preferably, the similarity metric assigns a ranked weighting to each of the respective types of semantic or ontological term and syntactic element within the first and second phrase template. Preferably, the semantic or ontological terms generally have a higher ranked weighting than the syntactic elements.
Preferably, the consolidation or reconciliation of the first and second phrase comprises one or more of: identifying or labelling the first or second phrase as a duplicate; deleting one of the first or second phrase; or merging the first and second phrase.
Preferably, merging the first and second phrase comprises replacing the at least one semantic term in the second phrase with one of: the corresponding at least one semantic term in the first phrase; or the word or group of words corresponding to the at least one semantic term in the first phrase.
Preferably, the computerized system is configured for comparing a plurality of phrases.
Preferably, the storage comprises a plurality of phrase templates and similarity metrics corresponding to a plurality of technical domains; and the system either contains in storage, or is configured to query, a plurality of semantic databases corresponding to a plurality of technical domains.
According to another aspect of the invention, there is provided a computerized interface platform comprising a computerized system substantially as described above, the interface platform comprising at least a storage (“the platform storage”) and a processor (“the platform processor”), the platform processor configured to: query an information database in the platform storage for relevant information; retrieve from a phrase database in the platform storage a first and second phrase relating to the relevant information; using the computerized system, compare phrase templates corresponding to the first and second phrase and, if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged; and send the consolidated phrase or the first or second phrases (as the case may be) to an end user.
Preferably, the computerized interface platform is configured for use in the technical field of medicine, and the relevant information is information pertaining to a particular patient (being the end user).
Preferably, a plurality of consolidated phrases or first and second phrases (as the case may be) is sent to the end user as one or more questionnaire(s).
Preferably, the platform processor is further configured to receive from the end user responses to the consolidated phrases or the first or second phrases (as the case may be), and to store said responses in the platform storage.
According to another aspect of the invention, there is provided a computerized method of comparing a first and second phrase using a system comprising at least a storage and a processor, the method comprising the steps of the processor, for each phrase: parsing the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, querying a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); inserting the semantic term and at least one of the remaining distinct syntactic elements into the phrase template retrieved from the storage, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element, and the semantic term and the at least one of the remaining distinct syntactic elements are inserted, respectively, into the portion of the phrase template that corresponds to their type; the method further comprising the steps of the processor: using a similarity metric to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidating or reconciling the first and second phrase.
The present invention provides a number of advantages over the prior art, including:
Enabling the identification, comparison and reconciliation / consolidation of potentially relevant questions from a vast pool of pre-prepared questions, with a high degree of uniformity / consistency and to a high level of accuracy;
Enabling this across multiple fields / domains using the same or similar ontology, and across multiple different languages;
As a result, reducing the amount of repetition in questions posed to a user, and / or the number of questionnaires sent to the user for completion;
In turn increasing questionnaire response rates, and hence improving the amount and quality of information provided by the user;
In turn increasing data accuracy, desired outcomes and efficiency of systems, and decreasing cost of service delivery; and
At the very least, providing the public with a useful choice.
BRIEF DESCRIPTION OF FIGURES
Further aspects and advantages of the invention will become apparent with reference to the accompanying Figures, which are given by way of example only and in which:
FIGURE 1A is a schematic showing the elements and operation of the existing ZEDOC™ system;
FIGURE 1B is a schematic showing the elements and operation of the ZEDOC™ system when combined with the system of the present invention;
FIGURE 2 is a schematic showing how the system of the present invention processes an inputted phrase;
FIGURE 3 is a schematic showing an alternative embodiment of the system of the present invention; and
FIGURE 4 is a schematic showing how the system of the present invention may be used in a broader clinical context.
DETAILED DESCRIPTION OF FIGURES
Figure 1A is a schematic showing the operation of the existing ZEDOC™ question retrieval system (generally indicated by 100). The processing portion (104) of the system (known as ZEDOC™ Core) first requests the relevant information from the database (102) to be able to create the questionnaires to send to the patient. This information (which could be in the form of keywords entered by a user, such as a clinician) will typically include things like symptoms reported by the patient during their most recent consultation; medications the patient is taking; or lifestyle habits of the patient.
The processor then sweeps the database (102) for questions / questionnaires having relevance to the inputted patient information.
The set of identified questionnaires is put into a bundle and sent to the patient (108), such as via email (106). The patient then responds to all of the questionnaires, for example by using a smartphone application (110) and the responses are then returned to the ZEDOC™ system and stored in its server.
In doing this, the ZEDOC™ system is guided only by the information / keywords from the database, which it then compares to existing questionnaires. It has no ability to compare questions within the respective questionnaires to each other, by assessing them syntactically or semantically. Due to the size and complexity of the database, a search for any given term is likely to result in a large number of “hits”; meaning a large number of questionnaires will be sent to the patient. Within these, there is likely to be much repetition of questions, asking for the same information but in slightly different ways. For example, “On a scale of 1 to 10, have you had pain in the last 14 days?” and “In the last 14 days, what level (1-10) has your pain been?” are two syntactically different but semantically equivalent sentences (i.e., questions that have the same meaning) where the answers can most likely be interchanged. Question-retrieval systems with no syntactic or semantic analysis capability will not be able to pick up on this, so both questions (and as many more equivalent questions as are found in the database) will be sent to the patient. This will inevitably result in frustration and low response rates.
Figure 1B schematically illustrates the ZEDOC™ system (100) used in combination with the system of the present invention (generally indicated by 200); which, although not shown, comprises at least a storage and a processor.
Once the questionnaires (202) are generated by ZEDOC™, they are passed to the system of the present invention. This compares all of the questions in all of the questionnaires to each other, and, where these are sufficiently similar, reconciles / consolidates the questions by merging them, labelling them as duplicates, or deleting some of them - this process is generally indicated by (204). An example of part of this process (generally indicated by 300) is shown in Figure 2. The system outputs the consolidated set of questionnaires (206) to the ZEDOC™ system for sending to the patient as per Figure 1A. Thus, with the use of the system of the present invention, a much smaller number of questions (and / or questionnaires) will ultimately be sent to the patient.
The process employed by the system (200) may be algorithmically represented as follows:
Data: Set of questions
Result: Reconciled set of questions with duplicates removed
Create a domain-specific syntactic template (example above) on the basis of a domain-specific ontology (e.g., SNOMED-CT);
Collect all questions that are to be delivered to a patient in one place;
forall Questions do
    Apply syntactic parsing (existing methodologies) to break the sentence down into syntactic structures;
    For each noun- or verb-phrase, look up this term in the corresponding domain-specific ontology;
    Using the semantic knowledge, insert each noun- or verb-phrase into the appropriate part of the syntactic template (e.g., “Knee pain” is a “Clinical Finding”);
end
Calculate a similarity metric across all questions;
For the questions where the similarity metric is above a threshold, label them as duplicates;
(Optional): An individual can review these questions to determine whether or not these are appropriate to reconcile;
return (The set of questions without the duplicates)
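The control flow of the above algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: `reconcile`, `toy_template` and `jaccard` are hypothetical names, and the toy template and Jaccard similarity are simple stand-ins for the syntactic parsing, ontology lookup and similarity metric described in the text.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set

Template = Dict[str, Set[str]]


def reconcile(
    questions: List[str],
    to_template: Callable[[str], Template],
    similarity: Callable[[Template, Template], float],
    threshold: float,
) -> List[str]:
    """Return the questions with later duplicates removed."""
    templates = [to_template(q) for q in questions]
    duplicates = set()
    for i, j in combinations(range(len(questions)), 2):
        if i in duplicates or j in duplicates:
            continue
        if similarity(templates[i], templates[j]) > threshold:
            duplicates.add(j)  # keep the earlier question, label the later one
    return [q for k, q in enumerate(questions) if k not in duplicates]


def toy_template(question: str) -> Template:
    # Stand-in for parsing + ontology lookup: one slot holding longer words.
    words = {w.strip("?,.()").lower() for w in question.split()}
    return {"terms": {w for w in words if len(w) > 3}}


def jaccard(a: Template, b: Template) -> float:
    # Stand-in similarity metric: overlap of the two term sets.
    x, y = a["terms"], b["terms"]
    return len(x & y) / len(x | y) if x | y else 0.0
```

Passing the template builder and the similarity metric in as arguments mirrors the text's point that both are chosen per technical domain. The optional human review step is omitted here.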
Once ZEDOC™ presents the questions identified by its database to the system of the present invention, for each question (302) the system firstly applies “syntactic parsing” (304) to break the sentence down into its distinct syntactic elements or “building blocks”; each of which might include one word, or a number of words. There are known algorithms for syntactic parsing that can be used in the system of the present invention, one example of which is SyntaxNet developed by Google.
Common syntactic elements into which the sentence might be broken include generic parts of speech such as noun, verb, article, subject, object, pronoun, relative pronoun, et cetera. Thus, in the example in Figure 2, the system recognizes the terms “pain” and “knee” as belonging to the same syntactic element (noun); and also potentially recognizes that they may alternatively form the phrase “knee pain”.
It is also within the scope of the invention for other types of elements to be identified during the syntactic parsing step, tailored to the particular domain or subject-matter in question. That is to say, additional “domain-specific syntactic categories” can be set up, for instance reflecting common themes, concepts, or phrase formats within a given domain - such as, for medical questions, the themes of time / duration, frequency, severity, and / or clinical finding, and the formats of “type of question” and / or “type of answer”. Thus, for example the phrase “in the last 14 days” is identified as belonging to the domain-specific syntactic category of time / duration; while “extreme” would be identified as belonging to the domain-specific syntactic category of “severity”. (Note, this may alternatively or additionally be done at the phrase template stage, discussed below).
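As a rough illustration of how domain-specific syntactic categories might be identified, the following sketch tags spans of a question using regular expressions. The patterns and the names `CATEGORY_PATTERNS` and `tag_categories` are assumptions made for this example; the text does not specify how the categories are detected, and a practical system would likely use a richer rule set or a trained tagger.

```python
import re
from typing import Dict, List

# Hypothetical patterns covering three of the categories named in the text.
CATEGORY_PATTERNS = {
    "time / duration": re.compile(
        r"\b(?:in|over) the (?:last|past) \d+ (?:days?|weeks?|months?)\b",
        re.IGNORECASE,
    ),
    "frequency": re.compile(
        r"\b(?:usually|often|sometimes|rarely|never|always)\b", re.IGNORECASE
    ),
    "severity": re.compile(
        r"\b(?:mild|moderate|severe|extreme)\b", re.IGNORECASE
    ),
}


def tag_categories(phrase: str) -> Dict[str, List[str]]:
    """Map each category to the spans of `phrase` that match its pattern."""
    return {
        category: [m.group(0) for m in pattern.finditer(phrase)]
        for category, pattern in CATEGORY_PATTERNS.items()
    }
```

Applied to a question containing “in the last 14 days” and “extreme”, this returns those spans under “time / duration” and “severity” respectively.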
Once syntactic parsing is completed, selected ones of the syntactic elements are cross-referenced to a semantic database (306), as schematically indicated by arrows (308). Typically, the elements referred to the semantic database will be verbs, nouns, verb/noun phrases, or adverbial phrases. However, other syntactic elements may also be so referred. For instance, the system may be configured to refer to the semantic database any elements that have not, in the syntactic analysis step, been identified as belonging to one or more predetermined syntactic categories. For example, if severity is a predetermined syntactic category, then the term “extreme” will be identified as relating to this category and thus will not be referred to the semantic database; while the term “last 14 days” will be identified as falling outside of this category and thus will be referred to the semantic database.
The semantic database (306) should be an ontological database suited to the technical domain at issue - for instance, for the medical domain, the SNOMED-CT semantic database is particularly appropriate. In the Figure 2 example, the phrase “knee pain” matches the SNOMED-CT term 30989003 Knee pain. The terms “pain” and “knee” individually also match SNOMED-CT terms, as does the term “day”. Similarly, a word like “extreme” would, if present, be matched to the SNOMED-CT term 12565001 Extreme (qualifier value).
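The lookup step might be sketched as below, using an in-memory dictionary as a stand-in for the semantic database. Only the two identifiers quoted in the text are included. The names `SEMANTIC_DB`, `lookup` and `match_longest`, and the greedy longest-match strategy for preferring “knee pain” over its individual words, are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-in for a semantic database. Only these two identifiers appear in
# the text; a real deployment would query SNOMED-CT itself.
SEMANTIC_DB: Dict[str, Tuple[str, str]] = {
    "knee pain": ("30989003", "Knee pain"),
    "extreme": ("12565001", "Extreme (qualifier value)"),
}


def lookup(element: str) -> Optional[Tuple[str, str]]:
    """Return (identifier, preferred term) for an element, or None."""
    return SEMANTIC_DB.get(element.strip().lower())


def match_longest(tokens: List[str]) -> List[Tuple[str, Tuple[str, str]]]:
    """Greedily match the longest token run found in the toy database."""
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            hit = lookup(candidate)
            if hit is not None:
                matches.append((candidate, hit))
                i += n
                break
        else:
            i += 1  # no entry starts at this token; move on
    return matches
```

The same interface could equally wrap a remotely hosted ontology, in line with the option of querying an external source.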
The semantic database (306) may be stored in the system’s storage. However, it is equally possible for the semantic database to be hosted / provided by an external source, and queried remotely by the system’s processor. For example, many ontologies / ontological databases are hosted on university websites.
Once the semantic analysis stage has been completed, the system then retrieves from storage a phrase template (schematically indicated by 314), which is divided by type of semantic or ontological term and by type of syntactic element. The specifics of how a given template is structured / divided are informed by the relevant technical field or domain: in particular, by common concepts (and sub-concepts) that arise within that domain, and common phrase formats (and sub-formats) occurring within that domain. In this regard, the division of the phrase template may be informed by the “domain-specific syntactic categories” discussed above, into which the question has already been parsed. For instance, in a medical context, as noted above common concepts include time / duration, frequency, and severity; and / or clinical finding (including the sub-concepts of “diagnosis” and “location” / “body structure”). Thus the semantic or ontological division of the phrase template will be along these lines (or at least include these). Common formats include “type of question” and “type of answer”; and thus the syntactic division of the phrase template will be along these lines (or at least include these).
Some or all of the semantic terms of the phrase (identified by reference to the semantic database as above) are inserted into the phrase template (314), as schematically indicated by arrow (310); as are some or all of the remaining syntactic elements, as schematically indicated by arrow (312); wherein each term / element is inserted into its appropriate “type” within the phrase template (314). For instance, in the Figure 2 example, the term “usually” has been identified at the syntactic analysis stage as corresponding to the syntactic category of “frequency”, and thus is inserted into the appropriate place in the phrase template. The term “pain” has been identified at the semantic analysis stage as corresponding to a semantic or ontological term of the type “clinical finding” (and subtype “diagnosis”), and thus is also inserted into the appropriate place in the phrase template; as is the term “knee”, which has been identified as corresponding to a semantic or ontological term of the type “clinical finding” (and subtype “location” / “body structure”).
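One possible representation of such a phrase template is a mapping from slot name to the terms placed in that slot. The sketch below fills a template with the three terms from the Figure 2 example as described above. The slot names follow the concepts and formats named in the text, but the data structure and the names `SLOTS`, `new_template` and `fill_template` are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# Slots taken from the concepts, sub-concepts and formats named in the text.
SLOTS = (
    "time / duration",
    "frequency",
    "severity",
    "clinical finding: diagnosis",
    "clinical finding: body structure",
    "type of question",
    "type of answer",
)


def new_template() -> Dict[str, List[str]]:
    """Create an empty phrase template with one list per slot."""
    return {slot: [] for slot in SLOTS}


def fill_template(typed_items: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Insert each (slot, term) pair into the slot matching its type."""
    template = new_template()
    for slot, value in typed_items:
        if slot not in template:
            raise KeyError(f"unknown slot: {slot}")
        template[slot].append(value)
    return template


# Terms and types as described for the Figure 2 example.
filled = fill_template([
    ("frequency", "usually"),
    ("clinical finding: diagnosis", "pain"),
    ("clinical finding: body structure", "knee"),
])
```

Slots with no matching term simply stay empty, so two templates can still be compared slot by slot.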
Once each question that has been passed to the system by ZEDOC™ has been fed into a dedicated phrase template in the above manner, the system is configured to then analyze the similarity of each question to each of the other questions, to thereby detect duplication and, if necessary or desired, effect consolidation / reconciliation between the questions.
It does this by applying a similarity metric to the questions. The similarity metric may be of any suitable kind (having regard to the technical domain), which the skilled person will readily identify; one example being the use of cosine similarity. As part of the similarity metric, each type of term and element within the respective phrase templates is assigned a ranked weighting. Particularly in semantically / ontologically highly structured domains such as medicine, it is desirable for the similarity between phrases to be assessed primarily by reference to the semantic / ontological terms therein, and accordingly these should have a higher ranked weighting than the remaining (syntactic) terms.
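A weighted cosine similarity over filled templates could look like the following sketch. The weight values, the slot names used as keys, and the functions `to_vector` and `cosine` are assumptions for illustration; the text specifies only that cosine similarity is one suitable metric and that semantic / ontological terms should rank above syntactic ones.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical ranked weights: the semantic / ontological slot ranks highest.
SLOT_WEIGHTS = {
    "clinical finding": 3.0,
    "time / duration": 2.0,
    "severity": 2.0,
    "type of question": 1.0,
}

Vector = Dict[Tuple[str, str], float]


def to_vector(template: Dict[str, List[str]]) -> Vector:
    """Turn a filled template into a sparse vector of weighted (slot, term) features."""
    vec: Vector = {}
    for slot, terms in template.items():
        weight = SLOT_WEIGHTS.get(slot, 1.0)
        for term in terms:
            key = (slot, term.lower())
            vec[key] = vec.get(key, 0.0) + weight
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With these weights, two questions that share a clinical finding and a time frame but differ in question type score 13/14, so the shared semantic content dominates the result.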
If a pair of phrases are found to have a similarity over a threshold value, steps are taken to consolidate or reconcile these. This can include identifying or labelling the phrases as duplicates, merging the phrases together, or deleting one of the phrases altogether.
One way of merging phrases together would be to replace a given term in one of the phrases with the corresponding term from another of the phrases, to form a “hybrid” phrase. For instance, where a pair of semantic terms in two phrases have been found to be ontologically similar, the relevant term in the first phrase might be replaced by the relevant term in the other. This could either be accomplished by substituting in the formal semantic / ontological term identified by the semantic database, or by substituting in the informal term that was originally used in the first phrase.
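The term substitution described above can be sketched as a simple string replacement. The function name `merge_phrases` and its parameters are hypothetical, and “knee ache” is an invented informal term used only for the example; a practical implementation would operate on the parsed elements rather than on raw text.

```python
from typing import Optional


def merge_phrases(
    target_phrase: str,
    target_term: str,
    source_term: str,
    formal_term: Optional[str] = None,
) -> str:
    """Replace `target_term` in `target_phrase` to form a hybrid phrase.

    If `formal_term` is given, the formal ontology term is substituted in;
    otherwise the informal term from the other phrase is used.
    """
    if target_term not in target_phrase:
        raise ValueError("term to replace not found in phrase")
    replacement = formal_term if formal_term is not None else source_term
    return target_phrase.replace(target_term, replacement, 1)
```

For example, replacing “knee ache” with “knee pain” in “In the last 14 days, what level (1-10) has your knee ache been?” yields a hybrid question that uses the other phrase's wording.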
The system of the invention as described above has been shown to reduce the number of resulting questions / questionnaires significantly. For instance, trials have shown that use of the system of the present invention in conjunction with the ZEDOC™ system produced significantly fewer questions / questionnaires than when the same query was run through the ZEDOC™ system without using the system of the present invention.
The system has also been shown to have a high level of accuracy across the full range of difficulty levels, when tested on randomly chosen questions from a database of standard clinical PROMs. In particular, there is strong evidence that:
• This system is able to clearly identify identical questions
• This system is able to clearly identify entirely dissimilar questions
• This system is able to rank questions in an order of similarity that is highly plausible and also consistent.
This was established by selecting a small number of questions from ZEDOC™ and submitting these for comparison by, on the one hand, seven human trial candidates, and on the other hand the system of the invention. The question number had to be kept very small in order to be manageable by the human candidates. Even so, the results show the efficacy of the system.
The questions were placed into pairs, some pairs being very similar to each other, others entirely dissimilar, and still others with an intermediate degree of similarity. 250 question-pairs were randomly selected from these three categories.
As expected, the entirely similar and entirely dissimilar questions were identified with near-perfect accuracy by both the system and the human candidates. Regarding the more difficult, intermediate questions, the overall ranking of similarity was similar between the system and the human candidates. However, between the human candidates there was clearly subjectivity in their assessment of the intermediate questions’ similarity, and hence variation in the similarity rankings assigned to these questions. In contrast, the system of the invention maintained significant consistency in its similarity ranking of these intermediate questions, demonstrating its efficacy and objectivity.
It will be appreciated that the system of the invention offers a number of significant advantages. Most fundamentally, due to using both semantic and syntactic analysis, it is capable of identifying questions that are phrased differently but have the same meaning, thereby eliminating or significantly ameliorating the problem of duplicated or repetitive questions being posed to users (such as patients). Furthermore, it is capable of comparing, and in particular assessing the similarity between, an immense number of individual questions or questionnaires stored in a database in a very short period of time. It is capable of doing so across technical domains (so long as there is sufficient ontological commonality) as well as across different languages. It is also capable of doing this not only with a high degree of accuracy, but also with consistency and objectivity. These are all things which could never feasibly be accomplished manually or in the absence of such a computerized system.
Significantly reducing the number of questions a person has to respond to also significantly reduces the amount of time required to complete a whole questionnaire. Shorter questionnaires generally have higher adherence rates, and the removal of repetition (which inevitably makes people disengage) will increase adherence rates further. As such, more and better information (i.e. the responses from more users) will be provided, which will ultimately improve the desired outcomes.
It should be appreciated that, while the description of Figures 1B and 2 has referred specifically to comparing questions in medical questionnaires, the present invention is by no means limited to this. The system described herein has equal application in a wide variety of technical domains, though preferably those with a high degree of semantic / ontological structure, such as medicine, law and engineering. It can also be used to compare and reconcile not just questions, but also other types of sentences / phrases; and such comparison may not only be for consolidation purposes, but could also be for many other purposes. For example, it could be used in a research or assessment context to identify past examination questions that are similar to one now being composed. There can be as few or as many questions (or phrases) as is applicable in a given situation; and these can be already nested within questionnaires or other documents, or stored as stand-alone questions yet to be converted into questionnaires.
One possible alternative use for the invention is represented schematically in Figure 3: namely, as a questionnaire recommender and adapter system. This would receive the user’s search query (either in the form of a keyword(s) or a phrase / question) and identify questionnaires which, while not identical, are potentially useful; and further would optionally recommend or implement a replacement or adaptation to the identified questionnaire, so as to match the user’s query. For example, one could search for questionnaires regarding shoulder pain, and retrieve two, one which is for hip pain/mobility (404) and the other for knee pain/mobility (406). These are ontologically similar (due to the similarity of joints, as opposed to, for example, stomach pain), and the hip/knee concept can be replaced with “shoulder” to create an adapted questionnaire (406) customized to the user’s search query.
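Purely as an illustrative sketch, the recommend-then-adapt behaviour of the shoulder / hip / knee example might look like the following. The toy ontology `PARENT`, the `QUESTIONNAIRES` store and the functions `recommend` and `adapt` are hypothetical; a real deployment would query a semantic database rather than a hard-coded dictionary, and the question wordings are invented for the example.

```python
# Toy ontology mapping each concept to its parent category (hypothetical data).
PARENT = {
    "shoulder": "joint",
    "hip": "joint",
    "knee": "joint",
    "stomach": "abdomen",
}

# Toy questionnaire store: name -> (body-part concept, list of questions).
QUESTIONNAIRES = {
    "hip pain/mobility": ("hip", ["How severe is your hip pain?",
                                  "Can you bend your hip fully?"]),
    "knee pain/mobility": ("knee", ["How severe is your knee pain?"]),
    "stomach pain": ("stomach", ["How severe is your stomach pain?"]),
}


def recommend(query_concept: str) -> list[str]:
    """Return questionnaires whose concept shares a parent with the query concept."""
    parent = PARENT[query_concept]
    return [name for name, (concept, _) in QUESTIONNAIRES.items()
            if concept != query_concept and PARENT.get(concept) == parent]


def adapt(name: str, query_concept: str) -> list[str]:
    """Replace the questionnaire's own concept with the query concept."""
    concept, questions = QUESTIONNAIRES[name]
    return [q.replace(concept, query_concept) for q in questions]


recommended = recommend("shoulder")
adapted = adapt("hip pain/mobility", "shoulder")
```

Here the stomach pain questionnaire is not recommended because its concept does not share the “joint” parent with the query, which mirrors the distinction drawn in the example above.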
Such a questionnaire recommender and adapter system might use much the same question parsing and reconciliation process as described above; wherein the similarity metric is again configured to place primary emphasis on the similarity of the semantic / ontological terms, and lesser emphasis on syntactic structure of the questions. Thus questions relating to shoulder pain, knee pain, or hip pain would all be recognized as ontologically similar and weighted accordingly; then also weighted by syntactic similarity.
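One way such a weighting could be realised, offered only as a minimal sketch, is a weighted average over the slots of two phrase templates, with semantic slots carrying a larger weight than syntactic slots. The weight values, the partial score of 0.5 for terms sharing an ontological parent, the toy `PARENT` ontology and the template layout are all assumptions made for illustration and are not specified in the description.

```python
SEMANTIC_WEIGHT = 3.0   # assumed value: primary emphasis
SYNTACTIC_WEIGHT = 1.0  # assumed value: lesser emphasis

# Toy ontology: term -> parent concept (hypothetical data).
PARENT = {
    "shoulder pain": "joint pain",
    "knee pain": "joint pain",
    "hip pain": "joint pain",
    "stomach pain": "abdominal pain",
}


def semantic_score(a: str, b: str) -> float:
    """1.0 for identical terms, 0.5 for terms sharing a parent, else 0.0."""
    if a == b:
        return 1.0
    parent_a, parent_b = PARENT.get(a), PARENT.get(b)
    if parent_a is not None and parent_a == parent_b:
        return 0.5
    return 0.0


def syntactic_score(a: str, b: str) -> float:
    """Exact-match comparison of syntactic slot contents."""
    return 1.0 if a == b else 0.0


def similarity(t1: dict, t2: dict) -> float:
    """Weighted average of slot scores across both templates."""
    total, weight_sum = 0.0, 0.0
    for kind, weight, score in (("semantic", SEMANTIC_WEIGHT, semantic_score),
                                ("syntactic", SYNTACTIC_WEIGHT, syntactic_score)):
        for slot in set(t1[kind]) | set(t2[kind]):
            a, b = t1[kind].get(slot), t2[kind].get(slot)
            weight_sum += weight
            # A slot missing from either template contributes zero.
            if a is not None and b is not None:
                total += weight * score(a, b)
    return total / weight_sum if weight_sum else 0.0


def template(symptom: str) -> dict:
    """Build a one-slot-per-kind template for the example."""
    return {"semantic": {"symptom": symptom},
            "syntactic": {"question_form": "how severe"}}


shoulder_vs_knee = similarity(template("shoulder pain"), template("knee pain"))
shoulder_vs_stomach = similarity(template("shoulder pain"), template("stomach pain"))
```

With these assumed weights, two joint-pain questions of identical form score higher than a joint-pain question compared with a stomach-pain question of the same form, since the semantic slot dominates the result.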
Alternatively, in some cases it might be appropriate for the similarity metric to be configured to give precedence to the syntactic similarity between the sentences, i.e. to look first to the structure / overall format of the question, and then to the similarity of the semantic terms therein.
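As a hedged sketch of this alternative configuration, candidates could be ranked lexicographically, comparing syntactic similarity first and using semantic similarity only to order candidates with equal syntactic scores. The `overlap` measure (fraction of slots with identical content), the template layout and the example data are assumptions for illustration; the description does not prescribe a particular formula.

```python
def overlap(a: dict, b: dict) -> float:
    """Fraction of slots (from either template) whose contents match exactly."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return matches / len(keys)


def rank_syntax_first(query: dict, candidates: dict) -> list[str]:
    """Order candidate names by (syntactic, semantic) similarity, highest first."""
    def key(name: str) -> tuple[float, float]:
        cand = candidates[name]
        return (overlap(query["syntactic"], cand["syntactic"]),
                overlap(query["semantic"], cand["semantic"]))
    return sorted(candidates, key=key, reverse=True)


# Made-up templates for the example.
query = {"syntactic": {"form": "rating", "tense": "present"},
         "semantic": {"symptom": "pain", "site": "shoulder"}}
candidates = {
    # Same structure, one of two semantic slots matches.
    "A": {"syntactic": {"form": "rating", "tense": "present"},
          "semantic": {"symptom": "pain", "site": "knee"}},
    # Different structure, identical semantic terms.
    "B": {"syntactic": {"form": "yes/no", "tense": "present"},
          "semantic": {"symptom": "pain", "site": "shoulder"}},
    # Same structure, no semantic slots match.
    "C": {"syntactic": {"form": "rating", "tense": "present"},
          "semantic": {"symptom": "fatigue", "site": "knee"}},
}

ranking = rank_syntax_first(query, candidates)
```

Under this ordering, candidate B ranks last despite its identical semantic terms, because its question format differs from that of the query.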
Figure 4 details how the system (100) of the present invention fits into the wider clinical context (502). In particular, what the clinicians (504) need to know is specified by information from the Clinical Data Repository (506), which is processed by the Electronic Health Record (EHR) (508). This information is passed through the hospital integration engines (510) to the ZEDOC™ (100) integration engine and on to the ZEDOC™ Core system (104). Reconciliation of the questionnaires is performed by the system (200) of the present invention, as described above. The reconciled questionnaires are posed to the patient (108), and the data is passed back to the ZEDOC™ Core system (104). From there, the data is passed back through the integration engines (510) to the EHR system (508) and into the Clinical Data Repository (506), where it can be consumed by clinicians (504) and other hospital staff.
In a clinical context, non-adherence to questionnaires is a cost factor for PROMs delivery. Higher adherence decreases the cost of delivery (as fewer in-clinic staff may be needed, and fewer follow-ups are needed to get patient responses). As such, the system of the present invention provides a value increase. The increase in information from the clinically-validated questionnaires will help improve clinical care and outcomes. Additionally, as this can also be used for PREMs, higher adherence rates will result in improved organizational efficiency and quality, as better information is provided to hospitals with regard to patient experience. The system of the present invention also mitigates the need to identify and examine newly created PROMs instruments in a progressively expanding pool that spans multiple health domains.
It will of course be realized that while the foregoing has been given by way of illustrative example of this invention, all such and other modifications and variations thereto as would be apparent to persons skilled in the art are deemed to fall within the broad scope and ambit of this invention as is hereinbefore described. If any reference numeral(s) is/are used in a claim or claims then such reference numeral(s) should not be considered as limiting the scope of that respective claim or claim(s) to any particular embodiment of the drawings.
It is acknowledged that the term ‘comprise’ may, under varying jurisdictions, be attributed with either an exclusive or an inclusive meaning. For the purpose of this specification, and unless otherwise noted, the term ‘comprise’ shall have an inclusive meaning - i.e. it will be taken to mean an inclusion of not only the listed components it directly references, but also other non-specified components or elements. This rationale will also be used when the term ‘comprised’ or ‘comprising’ is used in relation to one or more steps in a method or process.

Claims

1. A computerized system for comparing a first and second phrase, the system comprising at least a storage and a processor; the processor configured to, for each phrase: parse the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, query a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); retrieve from the storage a phrase template, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element; insert the semantic term and at least one of the remaining distinct syntactic elements into the portion of the phrase template that corresponds, respectively, to their type; the processor further configured to: use a similarity metric stored in the storage to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged.
2. The system of claim 1, wherein the first and second phrase pertain to a technical domain having a high degree of semantic / ontological structure.
3. The system of claim 1 or claim 2, wherein the first and second phrase comprise a first and second sentence; and more preferably, a first and second question.
4. The system of any one of the preceding claims, wherein the distinct syntactic elements comprise one or more of: noun, verb, adverb, article, subject, object, pronoun, relative pronoun.
5. The system of claim 4 when dependent on claim 2, wherein the distinct syntactic elements further comprise one or more domain-specific syntactic categories corresponding to common concepts and / or phrase formats within the technical domain.
6. The system of claim 5, wherein one or more of the domain-specific syntactic categories is further subdivided into subcategories.
7. The system of any one of the preceding claims, wherein a plurality of the distinct syntactic elements (or words or groups of words corresponding thereto) are queried in the semantic database, and a corresponding semantic or ontological term is identified for each, such that there are a plurality of “semantic terms”.
8. The system of claim 5 or claim 6, wherein the phrase template is configured with regard to the one or more domain-specific syntactic categories and / or subcategories.
9. The system of claim 8, wherein the domain-specific syntactic categories and / or subcategories relating to common concepts at least partly define the types of semantic or ontological term into which the phrase template is divided; and the domain-specific syntactic categories and / or subcategories relating to common phrase formats at least partly define the types of syntactic element into which the phrase template is divided.
10. The system of claim 7, wherein the plurality of semantic terms are inserted into the portion of the phrase template that corresponds to their respective type, and a plurality of the remaining distinct syntactic elements are also inserted into the portion of the phrase template that corresponds to their respective type.
11. The system of any one of the preceding claims, wherein the similarity metric assigns a ranked weighting to each of the respective types of semantic or ontological term and syntactic element within the first and second phrase template.
12. The system of claim 11, wherein the semantic or ontological terms generally have a higher ranked weighting than the syntactic elements.
13. The system of any one of the preceding claims, wherein the consolidation or reconciliation of the first and second phrase comprises one or more of: identifying or labelling the first or second phrase as a duplication; deleting one of the first or second phrase; or merging the first and second phrase.
14. The system of claim 13, wherein merging the first and second phrase comprises replacing the at least one semantic term in the second phrase with one of: the corresponding at least one semantic term in the first phrase; or the word or group of words corresponding to the at least one semantic term in the first phrase.
15. The system of any one of the preceding claims, wherein the system is configured for comparing a plurality of phrases.
16. The system of any one of the preceding claims, wherein the storage comprises a plurality of phrase templates and similarity metrics corresponding to a plurality of technical domains; and the system either contains in the storage, or is configured to query, a plurality of semantic databases corresponding to a plurality of technical domains.
17. A computerized interface platform comprising the computerized system of claim 1, the interface platform comprising at least a storage (“the platform storage”) and a processor (“the platform processor”), the platform processor configured to: query an information database in the platform storage for relevant information; retrieve from a phrase database in the platform storage a first and second phrase relating to the relevant information; using the computerized system, compare phrase templates corresponding to the first and second phrase and, if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged; and send the consolidated phrase or the first or second phrases (as the case may be) to an end user.
18. Preferably, the consolidated phrase or the first or second phrases (as the case may be) are sent to the end user as one or more questionnaire(s).
19. Preferably, the platform processor is further configured to receive from the end user responses to the consolidated phrase or the first or second phrases (as the case may be), and to store said responses in the platform storage.
20. A computerized method of comparing a first and second phrase using a system comprising at least a storage and a processor, the method comprising the steps of the processor, for each phrase: parsing the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, querying a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); inserting the semantic term and at least one of the remaining distinct syntactic elements into the phrase template retrieved from the storage, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element, and the semantic term and the at least one of the remaining distinct syntactic elements are inserted, respectively, into the portion of the phrase template that corresponds to their type; the method further comprising the steps of the processor: using a similarity metric to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidating or reconciling the first and second phrase.
PCT/IB2020/054165 2019-05-02 2020-05-01 System and method for phrase comparison consolidation and reconciliation WO2020222202A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
SG11202111653XA SG11202111653XA (en) 2019-05-02 2020-05-01 System and method for phrase comparison consolidation and reconciliation
AU2020265819A AU2020265819A1 (en) 2019-05-02 2020-05-01 System and method for phrase comparison consolidation and reconciliation
IL287674A IL287674A (en) 2019-05-02 2021-10-28 System and method for phrase comparison consolidation and reconciliation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201903964X 2019-05-02
SG10201903964X 2019-05-02

Publications (1)

Publication Number Publication Date
WO2020222202A1 (en) 2020-11-05

Family

ID=73028865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2020/054165 WO2020222202A1 (en) 2019-05-02 2020-05-01 System and method for phrase comparison consolidation and reconciliation

Country Status (4)

Country Link
AU (1) AU2020265819A1 (en)
IL (1) IL287674A (en)
SG (1) SG11202111653XA (en)
WO (1) WO2020222202A1 (en)

Also Published As

Publication number Publication date
AU2020265819A1 (en) 2021-12-02
SG11202111653XA (en) 2021-11-29
IL287674A (en) 2021-12-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20798690; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2020265819; Country of ref document: AU; Date of ref document: 20200501; Kind code of ref document: A)
122 Ep: pct application non-entry in european phase (Ref document number: 20798690; Country of ref document: EP; Kind code of ref document: A1)