WO2020222202A1

WO2020222202A1 - System and method for phrase comparison consolidation and reconciliation

Info

Publication number: WO2020222202A1
Application number: PCT/IB2020/054165
Authority: WO
Inventors: Ron TENENBAUM; Michael MERRY
Original assignee: The Clinician Pte. Ltd
Priority date: 2019-05-02
Filing date: 2020-05-01
Publication date: 2020-11-05
Also published as: AU2020265819A1; SG11202111653XA; IL287674A

Abstract

The present invention relates to a system for comparing a first and second phrase, a method of using said system, and an interface platform comprising said system. The system is configured to parse each phrase into distinct syntactic elements; query a semantic database to identify a semantic or ontological term ("the semantic term") corresponding to at least one of said distinct syntactic elements; and insert the semantic term and at least one of the remaining distinct syntactic elements into an appropriate portion of a phrase template. The system is then configured to use a similarity metric to quantify the similarity between the first and second phrase templates; and: if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged. In using both semantic and syntactic analysis to determine the similarity between the first and second phrase, the system enables improved detection of phrases that are in substance similar or repetitive, thereby reducing the number and repetitiveness of phrases (such as questions) ultimately posed to a user and in turn improving user participation and related outcomes.

Description

System and Method for Phrase Comparison, Consolidation and Reconciliation

FIELD OF THE INVENTION

The present invention relates to a computerized system and method for comparison, consolidation and reconciliation of phrases, and more particularly questions. The invention is particularly suited for use in domains having a high degree of semantic / ontological structure, such as medicine, law and engineering. However, the invention may also be used in other, less formalised or more generic fields or contexts, for example to process content posed to online question-and-answer forums.

BACKGROUND

There is a need in many industries, such as the medical industry, to use automated research and data collection processes, often involving questionnaires for people to fill in and return for analysis. As the world becomes increasingly automated, these types of research and data collection questionnaires are often automatically generated. Due to the variety and complexity of natural language, there are multiple ways of asking any given question, so when delivering automated questionnaires, it is likely that there are multiple identical or similar questions that exist across multiple questionnaires.

The “syntax” of a phrase, such as a sentence, relates to its structure, without regard to its meaning. Syntactic analysis identifies each word in the sentence as being a particular part of speech (e.g., article, subject, relative pronoun), and determines the relationships of each of these with each other (e.g. the relative pronoun relating to the subject of the previous clause); these will of course vary from one language to the next.

“Semantics” relates to the meaning, or definition, of the specific words within a phrase or sentence. “Ontologies” are terms used to group together concepts, phenomena, articles, et cetera that share common properties or characteristics. So ontologies can be thought of as “high-level” semantics in describing any number of members of a given category; with each of those members also having a specific semantic definition peculiar to it. The relationship between ontologies and semantics, that is, between the high-level category and the particular object, is often represented using a “subject” PREDICATE “object” structure; for example, “knee IS A joint”.

“Domain-specific ontologies” are ontologies developed for, and used within, particular technical domains. The systematised nomenclature of medicine clinical terms (SNOMED-CT) is an example of a domain-specific ontology in the medical field (Lee, Keizer, Lau, & Cornet, 2014).

It is known to use semantic analysis to automatically process (e.g. extract, create, or answer) questions in different contexts and applications. For instance, the process known as “question retrieval” focuses on taking information provided by a user and comparing this against a database of pre-existing topics. This is used to, for example, match questions posted on online question-and-answer forums to previously-asked questions / answers.

A more refined version of question retrieval is used in the medical field. Medicine, being highly formal and structured, lends itself to semantic analysis, and ontologies have accordingly been used in question retrieval in the medical field as an extension of keyword-based information retrieval (Khan, Mcleod, & Hovy, 2004). That is to say, the ontology of a given search term is identified, and related terms within that ontology are then identified and information relating to these is returned to the user as search results.

Semantic analysis is also used in other ways in the medical field, again as an extension of keyword-based information retrieval. For instance, US 2014/0316797 A1 discloses a system for creating a consolidated snapshot of a patient’s medication information by querying a range of sources. Similarly, US20120221347A1 discloses a system for reconciling data regarding a given patient sourced from a range of distinct databases; and also discloses using semantic analysis (mining of variables) to identify the relevant clinical question being posed and propose a number of reporting options (see for example [0099] and [0234]). US8898798B2 discloses the use of semantic analysis for mining of large anonymized patient data banks, to determine ontological population trends and the like; as well as using markers (demographic, genotypic, phenotypic) to identify groups of potentially suitable candidates for clinical trials et cetera.

ZEDOC™ is one example of a question-retrieval system used in the medical field, with a particular focus on Patient-Reported Experience Measures (PREMs) and Patient-Reported Outcome Measures (PROMs). ZEDOC™ contains a large database of questions (over 1000 questionnaires, averaging around 10 questions each) categorized by subject matter, i.e. specific medical domains and sub-domains. Patient information is entered, and the system compares this to the questions in the database and identifies those which are relevant. For each subject matter category (corresponding to each of the identified questions), the system then creates a questionnaire. The set of questionnaires is put into a bundle and sent to the patient, who fills in each of them.

However, semantic analysis (including question retrieval) in a medical context has certain limitations. Due to the breadth and complexity of the medical field, a given semantic or ontological term is bound to return a large number of “hits” from across a range of subject matter categories; which will result in the system generating and presenting to the patient a large number of questionnaires. Across these, there is likely to be much repetition: both in a domain-specific context (such as asking about pain scores in different ways) and in a non-domain-specific context (such as asking about things like name, age, and gender).

This imposes a response burden on the patient, particularly if they have the same experience with multiple clinicians. Typically, patients will end up becoming frustrated and discouraged on having to answer the same or similar questions repeatedly; and are unlikely to complete all the questions, leading to gaps in the information ultimately received by the clinician. If a patient is being asked to complete multiple different questionnaires, it makes for a better user experience and better outcomes, to ask the same question once, rather than multiple times.

Drafting a customized questionnaire for each patient would avoid this problem, but would of course be unacceptably time-consuming and inefficient, and would defeat the purpose of automated question retrieval systems which aim to relieve clinicians of the need to do this.

Equally, manually perusing pre-prepared questionnaire templates, such as those on ZEDOC™, and comparing these against each other would be a completely infeasible task when it is considered that there may be thousands of potentially relevant questions. Comparing each of these to all of the others would entail making in the region of 50 million comparisons - more than could be done in the average lifetime. This is without even considering humans’ cognitive and analytical limitations. Humans will almost inevitably be familiar with a particular, narrow field rather than across fields, even if they are related. Thus they will have a limited ability to assess the relevance of a questionnaire using the same ontology but from a different field. There are also practical limitations such as language barriers, as well as the fact that human comparison will inevitably be subjective, and the outcome will vary from person to person.
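As a rough check on the figure of 50 million, and assuming the approximate database size quoted above for ZEDOC™ (about 1000 questionnaires of about 10 questions each, which is an estimate rather than an exact count), the number of distinct question pairs can be worked out as follows:

```latex
n \approx 1000 \times 10 = 10\,000, \qquad
\binom{n}{2} = \frac{n(n-1)}{2} = \frac{10\,000 \times 9\,999}{2} = 49\,995\,000 \approx 5 \times 10^{7}
```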

An automated way of identifying similar, equivalent, or overlapping questions is therefore necessary.

It is accordingly an object of the present invention to provide a computerized system and method for comparison, consolidation and reconciliation of phrases that addresses some of the problems of the prior art, or at least to provide the public with a useful choice.

STATEMENTS OF THE INVENTION

According to one aspect of the invention, there is provided a computerized system for comparing a first and second phrase, the system comprising at least a storage and a processor; the processor configured to, for each phrase: parse the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, query a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); retrieve from the storage a phrase template, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element; insert the semantic term and at least one of the remaining distinct syntactic elements into the portion of the phrase template that corresponds, respectively, to their type; the processor further configured to: use a similarity metric stored in the storage to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged.

Preferably, the first and second phrase pertain to a technical domain having a high degree of semantic / ontological structure.

More preferably, the technical domain is one of: medicine, law, or engineering.

Preferably, the first and second phrase comprise a first and second sentence; and more preferably, a first and second question.

Preferably, the distinct syntactic elements comprise one or more of: noun, verb, adverb, article, subject, object, pronoun, relative pronoun.

Preferably, the distinct syntactic elements further comprise one or more domain-specific syntactic categories corresponding to common concepts or phrase formats within the technical domain. For example, where the domain is medicine and the phrase is a question, the concepts preferably include one or more of: “time / duration”, “frequency”, “severity”, and “clinical finding”; and the formats preferably include one or more of: “type of question” and “type of answer”.

More preferably, one or more of the domain-specific syntactic categories are further subdivided into subcategories. For example, the concept “clinical finding” is preferably further subdivided into the sub-concepts of “diagnosis” and “location” / “body structure”.

Preferably, the semantic database corresponds to the relevant technical domain. For example, where the domain is medicine, the semantic database is preferably provided by the systematized nomenclature of medicine clinical terms (SNOMED-CT). Preferably, the semantic database is stored in the storage. Alternatively, the semantic database is located externally of the system, such as hosted or provided by a third party.

Preferably, a plurality of the distinct syntactic elements (or words or groups of words corresponding thereto) are queried in the semantic database, and a corresponding semantic or ontological term is identified for each, such that there are a plurality of “semantic terms”.

Preferably, the phrase template is configured appropriately to the technical domain to which the first and second phrase pertain.

More preferably, the phrase template is configured with regard to the one or more domain-specific syntactic categories (and optionally subcategories), that is, concepts commonly arising within the technical domain and / or common formats of phrases occurring within the domain. For example, where the domain is medicine and the phrase is a question, the concepts preferably include one or more of: “time / duration”, “frequency”, “severity”, and “clinical finding”; and the formats preferably include one or more of: “type of question” and “type of answer”.

Preferably, the domain-specific syntactic categories (and optionally subcategories) relating to common concepts at least partly define the types of semantic or ontological term into which the phrase template is divided; and the domain-specific syntactic categories (and optionally subcategories) relating to common phrase formats at least partly define the types of syntactic element into which the phrase template is divided.

Preferably, the plurality of semantic terms are inserted into the portion of the phrase template that corresponds to their respective type.

Preferably, a plurality of the remaining distinct syntactic elements are also inserted into the portion of the phrase template that corresponds to their respective type.

Preferably, the similarity metric is configured appropriately to the technical domain to which the first and second phrase pertain.

Preferably, the similarity metric assigns a ranked weighting to each of the respective types of semantic or ontological term and syntactic element within the first and second phrase template. Preferably, the semantic or ontological terms generally have a higher ranked weighting than the syntactic elements.

Preferably, the consolidation or reconciliation of the first and second phrase comprises one or more of: identifying or labelling the first or second phrase as a duplicate; deleting one of the first or second phrase; or merging the first and second phrase.

Preferably, merging the first and second phrase comprises replacing the at least one semantic term in the second phrase with one of: the corresponding at least one semantic term in the first phrase; or the word or group of words corresponding to the at least one semantic term in the first phrase.

Preferably, the computerized system is configured for comparing a plurality of phrases.

Preferably, the storage comprises a plurality of phrase templates and similarity metrics corresponding to a plurality of technical domains; and the system either contains in storage, or is configured to query, a plurality of semantic databases corresponding to a plurality of technical domains.

According to another aspect of the invention, there is provided a computerized interface platform comprising a computerized system substantially as described above, the interface platform comprising at least a storage (“the platform storage”) and a processor (“the platform processor”), the platform processor configured to: query an information database in the platform storage for relevant information; retrieve from a phrase database in the platform storage a first and second phrase relating to the relevant information; using the computerized system, compare phrase templates corresponding to the first and second phrase and, if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged; and send the consolidated phrase or the first or second phrases (as the case may be) to an end user.

Preferably, the computerized interface platform is configured for use in the technical field of medicine, and the relevant information is information pertaining to a particular patient (being the end user).

Preferably, a plurality of consolidated phrases or first and second phrases (as the case may be) is sent to the end user as one or more questionnaire(s).

Preferably, the platform processor is further configured to receive from the end user responses to the consolidated phrases or the first or second phrases (as the case may be), and to store said responses in the platform storage.

According to another aspect of the invention, there is provided a computerized method of comparing a first and second phrase using a system comprising at least a storage and a processor, the method comprising the steps of the processor, for each phrase: parsing the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, querying a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); inserting the semantic term and at least one of the remaining distinct syntactic elements into the phrase template retrieved from the storage, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element, and the semantic term and the at least one of the remaining distinct syntactic elements are inserted, respectively, into the portion of the phrase template that corresponds to their type; the method further comprising the steps of the processor: using a similarity metric to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidating or reconciling the first and second phrase.

The present invention provides a number of advantages over the prior art, including:

Enabling the identification, comparison and reconciliation / consolidation of potentially relevant questions from a vast pool of pre-prepared questions, with a high degree of uniformity / consistency and to a high level of accuracy;

Enabling this across multiple fields / domains using the same or similar ontology, and across multiple different languages;

As a result, reducing the amount of repetition in questions posed to a user, and / or the number of questionnaires sent to the user for completion;

In turn increasing questionnaire response rates, and hence improving the amount and quality of information provided by the user;

In turn increasing data accuracy, desired outcomes and efficiency of systems, and decreasing cost of service delivery; and

At the very least, providing the public with a useful choice.

BRIEF DESCRIPTION OF FIGURES

Further aspects and advantages of the invention will become apparent with reference to the accompanying Figures, which are given by way of example only and in which:

FIGURE 1A is a schematic showing the elements and operation of the existing ZEDOC™ system;

FIGURE 1B is a schematic showing the elements and operation of the ZEDOC™ system when combined with the system of the present invention;

FIGURE 2 is a schematic showing how the system of the present invention processes an inputted phrase;

FIGURE 3 is a schematic showing an alternative embodiment of the system of the present invention; and

FIGURE 4 is a schematic showing how the system of the present invention may be used in a broader clinical context.

DETAILED DESCRIPTION OF FIGURES

Figure 1A is a schematic showing the operation of the existing ZEDOC™ question retrieval system (generally indicated by 100). The processing portion (104) of the system (known as ZEDOC™ Core) first requests the relevant information from the database (102) to be able to create the questionnaires to send to the patient. This information (which could be in the form of keywords entered by a user, such as a clinician) will typically include things like symptoms reported by the patient during their most recent consultation; medications the patient is taking; or lifestyle habits of the patient.

The processor then sweeps the database (102) for questions / questionnaires having relevance to the inputted patient information.

The set of identified questionnaires is put into a bundle and sent to the patient (108), such as via email (106). The patient then responds to all of the questionnaires, for example by using a smartphone application (110) and the responses are then returned to the ZEDOC™ system and stored in its server.

In doing this, the ZEDOC™ system is guided only by the information / keywords from the database, which it then compares to existing questionnaires. It has no ability to compare questions within the respective questionnaires to each other, by assessing them syntactically or semantically. Due to the size and complexity of the database, a search for any given term is likely to result in a large number of “hits”; meaning a large number of questionnaires will be sent to the patient. Within these, there is likely to be much repetition of questions, asking for the same information but in slightly different ways. For example, “On a scale of 1 to 10, have you had pain in the last 14 days?” and “In the last 14 days, what level (1-10) has your pain been?” are two syntactically different but semantically equivalent sentences (i.e., questions that have the same meaning) where the answers can most likely be interchanged. Question-retrieval systems with no syntactic or semantic analysis capability will not be able to pick up on this, so both questions (and as many more equivalent questions as are found in the database) will be sent to the patient. This will inevitably result in frustration and low response rates.

Figure 1B schematically illustrates the ZEDOC™ system (100) used in combination with the system of the present invention (generally indicated by 200); which, although not shown, comprises at least a storage and a processor.

Once the questionnaires (202) are generated by ZEDOC™, they are passed to the system of the present invention. This compares all of the questions in all of the questionnaires to each other, and, where these are sufficiently similar, reconciles / consolidates the questions by merging them, labelling them as duplicates, or deleting some of them - this process is generally indicated by (204). An example of part of this process (generally indicated by 300) is shown in Figure 2. The system outputs the consolidated set of questionnaires (206) to the ZEDOC™ system for sending to the patient as per Figure 1A. Thus, with the use of the system of the present invention, a much smaller number of questions (and / or questionnaires) will ultimately be sent to the patient.

The process employed by the system (200) may be algorithmically represented as follows:

Data: Set of questions

Result: Reconciled set of questions with duplicates removed

Create a domain-specific syntactic template (example above) on the basis of domain-specific ontology (e.g., SNOMED-CT);

Collect all questions that are to be delivered to a patient in one place;

forall Questions do

    Apply syntactic parsing (existing methodologies) to break the sentence down into syntactic structures;

    For each noun- or verb-phrase, look up this term in the corresponding domain-specific ontology;

    Using the semantic knowledge, insert each noun- or verb-phrase into the appropriate part of the syntactic template (e.g., “Knee pain” is a “Clinical Finding”);

end

Calculate a similarity metric across all questions;

For those questions where the similarity metric is above a threshold, label them as duplicates;

(Optional): An individual can review these questions to determine whether or not these are appropriate to reconcile;

return (The set of questions without the duplicates)
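The listing above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by the system: the names `reconcile`, `toy_template` and `toy_similarity` are hypothetical, the template is reduced to a dictionary with two slots, and the parsing and similarity steps are crude stand-ins for the syntactic parser, ontology lookup and similarity metric described below.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set

Template = Dict[str, str]  # slot name -> content (hypothetical, simplified shape)


def reconcile(
    questions: List[str],
    to_template: Callable[[str], Template],
    similarity: Callable[[Template, Template], float],
    threshold: float,
) -> List[str]:
    """Return the questions with later near-duplicates removed."""
    templates = [to_template(q) for q in questions]
    duplicates: Set[int] = set()
    for i, j in combinations(range(len(questions)), 2):
        if i in duplicates or j in duplicates:
            continue  # one of the pair has already been labelled a duplicate
        if similarity(templates[i], templates[j]) > threshold:
            duplicates.add(j)  # keep the earlier question, label the later one
    return [q for k, q in enumerate(questions) if k not in duplicates]


# Toy stand-ins for the parsing / ontology / metric stages (illustration only).
def toy_template(question: str) -> Template:
    words = question.lower().replace("?", "").split()
    return {
        "finding": "pain" if "pain" in words else "",
        "duration": "14 days" if "14" in words else "",
    }


def toy_similarity(a: Template, b: Template) -> float:
    slots = set(a) | set(b)
    # Fraction of template slots whose contents agree.
    return sum(a.get(s) == b.get(s) for s in slots) / len(slots)


questions = [
    "On a scale of 1 to 10, have you had pain in the last 14 days?",
    "In the last 14 days, what level (1-10) has your pain been?",
    "What is your date of birth?",
]
result = reconcile(questions, toy_template, toy_similarity, threshold=0.9)
print(result)
```

In this sketch the second pain question is dropped and the other two are kept. The optional human review step of the listing would sit between labelling and removal.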

Once ZEDOC™ presents the questions identified by its database to the system of the present invention, for each question (302) the system firstly applies “syntactic parsing” (304) to break the sentence down into its distinct syntactic elements or “building blocks”; each of which might include one word, or a number of words. There are known algorithms for syntactic parsing that can be used in the system of the present invention, one example of which is SyntaxNet developed by Google.

Common syntactic elements into which the sentence might be broken include generic parts of speech such as noun, verb, article, subject, object, pronoun, relative pronoun, et cetera. Thus, in the example in Figure 2, the system recognizes the terms “pain” and “knee” as belonging to the same syntactic element (noun); and also potentially recognizes that they may alternatively form the phrase “knee pain”.

It is also within the scope of the invention for other types of elements to be identified during the syntactic parsing step, tailored to the particular domain or subject-matter in question. That is to say, additional “domain-specific syntactic categories” can be set up, for instance reflecting common themes, concepts, or phrase formats within a given domain - such as, for medical questions, the themes of time / duration, frequency, severity, and / or clinical finding, and the formats of “type of question” and / or “type of answer”. Thus, for example the phrase “in the last 14 days” is identified as belonging to the domain-specific syntactic category of time / duration; while “extreme” would be identified as belonging to the domain-specific syntactic category of “severity”. (Note, this may alternatively or additionally be done at the phrase template stage, discussed below).
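One simple way to picture such domain-specific syntactic categories is as a set of pattern rules applied to the phrase. The sketch below assumes a small, hypothetical rule set (`CATEGORY_PATTERNS`) and a hypothetical helper `tag_categories`; the text does not prescribe regular expressions, and a practical rule set would be far larger.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical patterns for three of the medical categories named in the text.
CATEGORY_PATTERNS: List[Tuple[str, "re.Pattern[str]"]] = [
    ("time / duration",
     re.compile(r"\b(?:in|over|during) the (?:last|past) \d+ (?:day|week|month)s?\b", re.I)),
    ("severity", re.compile(r"\b(?:mild|moderate|severe|extreme)\b", re.I)),
    ("frequency", re.compile(r"\b(?:never|rarely|sometimes|usually|always)\b", re.I)),
]


def tag_categories(phrase: str) -> Dict[str, List[str]]:
    """Map each category to the (lower-cased) spans of the phrase that match it."""
    found: Dict[str, List[str]] = {}
    for category, pattern in CATEGORY_PATTERNS:
        matches = [m.group(0).lower() for m in pattern.finditer(phrase)]
        if matches:
            found[category] = matches
    return found


tags = tag_categories("Have you usually had extreme knee pain in the last 14 days?")
print(tags)
```

Elements left untagged by such rules (here, “knee pain”) are the natural candidates for referral to the semantic database in the next step.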

Once syntactic parsing is completed, selected ones of the syntactic elements are cross-referenced to a semantic database (306), as schematically indicated by arrows (308). Typically, the elements referred to the semantic database will be verbs, nouns, verb/noun phrases, or adverbial phrases. However, other syntactic elements may also be so referred. For instance, the system may be configured to refer to the semantic database any elements that have not, in the syntactic analysis step, been identified as belonging to one or more predetermined syntactic categories. For example, if severity is a predetermined syntactic category, then the term “extreme” will be identified as relating to this category and thus will not be referred to the semantic database; while the term “last 14 days” will be identified as falling outside of this category and thus will be referred to the semantic database.

The semantic database (306) should be an ontological database suited to the technical domain at issue - for instance, for the medical domain, the SNOMED-CT semantic database is particularly appropriate. In the Figure 2 example, the phrase “knee pain” matches the SNOMED-CT term 30989003 Knee pain. The terms “pain” and “knee” individually also match SNOMED-CT terms, as does the term “day”. Similarly, a word like “extreme” would, if present, be matched to the SNOMED-CT term 12565001 Extreme (qualifier value).
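The lookup can be illustrated with a toy in-memory dictionary standing in for the semantic database. In the sketch below, `TOY_ONTOLOGY` and `lookup_terms` are hypothetical names; the only two entries are the codes quoted above, and the greedy longest-match strategy (preferring “knee pain” over “knee” and “pain” separately) is an assumption made for illustration rather than something the text requires. A real system would query SNOMED-CT itself.

```python
from typing import Dict, List, Optional, Tuple

Concept = Tuple[str, str]  # (code, preferred term)

# Toy stand-in for the semantic database, holding only the codes quoted in the text.
TOY_ONTOLOGY: Dict[str, Concept] = {
    "knee pain": ("30989003", "Knee pain"),
    "extreme": ("12565001", "Extreme (qualifier value)"),
}


def lookup_terms(words: List[str], max_len: int = 3) -> List[Tuple[str, Optional[Concept]]]:
    """Greedy longest-match lookup of word groups against the toy ontology."""
    results: List[Tuple[str, Optional[Concept]]] = []
    i = 0
    while i < len(words):
        # Try the longest candidate group first, then progressively shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n]).lower()
            if candidate in TOY_ONTOLOGY:
                results.append((candidate, TOY_ONTOLOGY[candidate]))
                i += n
                break
        else:
            results.append((words[i].lower(), None))  # no semantic term found
            i += 1
    return results


matches = lookup_terms(["extreme", "knee", "pain"])
print(matches)
```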

The semantic database (306) may be stored in the system’s storage. However, it is equally possible for the semantic database to be hosted / provided by an external source, and queried remotely by the system’s processor. For example, many ontologies / ontological databases are hosted on university websites.

Once the semantic analysis stage has been completed, the system then retrieves from storage a phrase template (schematically indicated by 314), which is divided by type of semantic or ontological term and by type of syntactic element. The specifics of how a given template is structured / divided are informed by the relevant technical field or domain: in particular, by common concepts (and sub-concepts) that arise within that domain, and common phrase formats (and sub-formats) occurring within that domain. In this regard, the division of the phrase template may be informed by the “domain-specific syntactic categories” discussed above, into which the question has already been parsed. For instance, in a medical context, as noted above common concepts include time / duration, frequency, and severity; and / or clinical finding (including the sub-concepts of “diagnosis” and “location” / “body structure”). Thus the semantic or ontological division of the phrase template will be along these lines (or at least include these). Common formats include “type of question” and “type of answer”; and thus the syntactic division of the phrase template will be along these lines (or at least include these).

Some or all of the semantic terms of the phrase (identified by reference to the semantic database as above) are inserted into the phrase template (314), as schematically indicated by arrow (310); as are some or all of the remaining syntactic elements, as schematically indicated by arrow (312); wherein each term / element is inserted into its appropriate “type” within the phrase template (314). For instance, in the Figure 2 example, the term “usually” has been identified at the syntactic analysis stage as corresponding to the syntactic category of “frequency”, and thus is inserted into the appropriate place in the phrase template. The term “pain” has been identified at the semantic analysis stage as corresponding to a semantic or ontological term of the type “clinical finding” (and subtype “diagnosis”), and thus is also inserted into the appropriate place in the phrase template; as is the term “knee”, which has been identified as corresponding to a semantic or ontological term of the type “clinical finding” (and subtype “location” / “body structure”).
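A phrase template of this kind can be pictured as a record with one group of slots per division. The sketch below is one possible shape, assuming a hypothetical `PhraseTemplate` class whose slot names are taken from the medical examples in the text; the actual layout of the template (314) is not specified beyond those divisions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhraseTemplate:
    # Semantic / ontological division (slot names follow the medical examples).
    semantic: Dict[str, List[str]] = field(default_factory=lambda: {
        "clinical finding: diagnosis": [],
        "clinical finding: body structure": [],
        "time / duration": [],
        "frequency": [],
        "severity": [],
    })
    # Syntactic division by phrase format.
    syntactic: Dict[str, List[str]] = field(default_factory=lambda: {
        "type of question": [],
        "type of answer": [],
    })

    def insert(self, slot: str, value: str) -> None:
        """Place a term or element into the slot matching its type."""
        target = self.semantic if slot in self.semantic else self.syntactic
        if slot not in target:
            raise KeyError(f"unknown template slot: {slot}")
        target[slot].append(value)


# The Figure 2 example as described in the text.
template = PhraseTemplate()
template.insert("frequency", "usually")
template.insert("clinical finding: diagnosis", "pain")
template.insert("clinical finding: body structure", "knee")
print(template)
```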

Once each question that has been passed to the system by ZEDOC™ has been fed into a dedicated phrase template in the above manner, the system is configured to then analyze the similarity of each question to each of the other questions, to thereby detect duplication and, if necessary or desired, effect consolidation / reconciliation between the questions.

It does this by applying a similarity metric to the questions. The similarity metric may be of any suitable kind (having regard to the technical domain), which the skilled person will readily identify; one example being the use of cosine similarity. As part of the similarity metric, each type of term and element within the respective phrase templates is assigned a ranked weighting. Particularly in semantically / ontologically highly structured domains such as medicine, it is desirable for the similarity between phrases to be assessed primarily by reference to the semantic / ontological terms therein, and accordingly these should have a higher ranked weighting than the remaining (syntactic) terms.
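As an illustration of a weighted cosine similarity over filled templates, the sketch below turns each template into a sparse vector keyed by (slot, value) and scales each entry by a per-slot weight. The weights in `SLOT_WEIGHTS`, the slot values “yes/no” and “rating”, and the function names are all hypothetical; the text specifies only that semantic / ontological slots should rank above syntactic ones.

```python
import math
from collections import Counter
from typing import Dict, List

Template = Dict[str, List[str]]

# Hypothetical ranked weights: semantic slots outrank syntactic ones.
SLOT_WEIGHTS: Dict[str, float] = {
    "clinical finding": 3.0,
    "time / duration": 2.0,
    "severity": 2.0,
    "type of question": 1.0,
}


def to_vector(template: Template) -> Counter:
    """Sparse vector keyed by (slot, value), each entry scaled by its slot weight."""
    vec: Counter = Counter()
    for slot, values in template.items():
        for value in values:
            vec[(slot, value)] += SLOT_WEIGHTS.get(slot, 1.0)
    return vec


def weighted_cosine(a: Template, b: Template) -> float:
    va, vb = to_vector(a), to_vector(b)
    dot = sum(w * vb[k] for k, w in va.items())
    norm = math.sqrt(sum(w * w for w in va.values())) * math.sqrt(sum(w * w for w in vb.values()))
    return dot / norm if norm else 0.0


q1 = {"clinical finding": ["pain"], "time / duration": ["14 days"], "type of question": ["yes/no"]}
q2 = {"clinical finding": ["pain"], "time / duration": ["14 days"], "type of question": ["rating"]}
q3 = {"clinical finding": ["nausea"], "time / duration": ["14 days"], "type of question": ["yes/no"]}

same_finding = weighted_cosine(q1, q2)   # differs only in a syntactic slot
other_finding = weighted_cosine(q1, q3)  # differs in a semantic slot
print(round(same_finding, 3), round(other_finding, 3))
```

With these weights, a mismatch in the clinical finding lowers the score far more than a mismatch in the question format, which is the behaviour the ranked weighting is meant to produce.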

If a pair of phrases are found to have a similarity over a threshold value, steps are taken to consolidate or reconcile these. This can include identifying or labelling the phrases as duplicates, merging the phrases together, or deleting one of the phrases altogether.

One way of merging phrases together would be to replace a given term in one of the phrases with the corresponding term from another of the phrases, to form a “hybrid” phrase. For instance, where a pair of semantic terms in two phrases have been found to be ontologically similar, the relevant term in the first phrase might be replaced by the relevant term in the other. This could either be accomplished by substituting in the formal semantic / ontological term identified by the semantic database, or by substituting in the informal term that was originally used in that other phrase.
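The substitution described above can be sketched as a simple text replacement. The helper `merge_by_substitution` and both example wordings are hypothetical and purely illustrative; a practical merge would also need to handle capitalization, inflection and word order.

```python
import re


def merge_by_substitution(phrase: str, words_to_replace: str, replacement: str) -> str:
    """Replace the words that mapped to a semantic term with a term from the other phrase."""
    pattern = re.compile(r"\b" + re.escape(words_to_replace) + r"\b", re.IGNORECASE)
    return pattern.sub(replacement, phrase, count=1)


# Hypothetical phrase whose words "knee ache" mapped to the same concept
# as the words "sore knee" in another phrase.
phrase = "How often does your knee ache bother you?"

# Option 1: substitute the formal ontological term (lower-cased for readability).
hybrid_formal = merge_by_substitution(phrase, "knee ache", "knee pain")

# Option 2: substitute the informal wording used in the other phrase.
hybrid_informal = merge_by_substitution(phrase, "knee ache", "sore knee")

print(hybrid_formal)
print(hybrid_informal)
```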

The system of the invention as described above has been shown to reduce the number of resulting questions / questionnaires significantly. For instance, trials have shown that use of the system of the present invention in conjunction with the ZEDOC™ system produced significantly fewer questions / questionnaires than when the same query was run through the ZEDOC™ system without using the system of the present invention.

The system has also been shown to have a high level of accuracy across the full range of difficulty levels, when tested on randomly chosen questions from a database of standard clinical PROMs. In particular, there is strong evidence that:

• This system is able to clearly identify identical questions

• This system is able to clearly identify entirely dissimilar questions

• This system is able to rank questions in an order of similarity that is highly plausible and also consistent. This was established by selecting a small number of questions from ZEDOC™ and submitting these for comparison by, on the one hand, seven human trial candidates, and on the other hand the system of the invention. The question number had to be kept very small in order to be manageable by the human candidates. Even so, the results show the efficacy of the system.

The questions were placed into pairs, some pairs being very similar to each other, others entirely dissimilar, and still others with an intermediate degree of similarity. 250 question-pairs were randomly selected from these three categories.

As expected, the entirely similar and entirely dissimilar questions were identified with near-perfect accuracy by both the system and the human candidates. Regarding the more difficult, intermediate questions, overall ranking of similarity was similar as between the system and the human candidates. However, between the human candidates there was clearly subjectivity in their assessment of the intermediate questions’ similarity, and hence variation in the similarity rankings assigned to these questions. In contrast, the system of the invention maintained significant consistency in its similarity ranking of these intermediate questions, demonstrating its efficacy and objectivity.

It will be appreciated that the system of the invention offers a number of significant advantages. Most fundamentally, due to using both semantic and syntactic analysis, it is capable of identifying questions that are phrased differently but have the same meaning; thereby eliminating or significantly ameliorating the problem of duplicated or repetitive questions being posed to users (such as patients). Furthermore, it is capable of comparing, and in particular assessing the similarity between, an immense number of individual questions or questionnaires stored in a database in a very short period of time. It is capable of doing so across technical domains (so long as there is sufficient ontological commonality) as well as across different languages. It is also capable of doing this not only with a high degree of accuracy, but also with consistency and objectivity. These are all things which could never feasibly be accomplished manually or in the absence of such a computerized system.

Significantly reducing the number of questions a person has to respond to also significantly reduces the amount of time required to complete a whole questionnaire. Shorter questionnaires generally have higher adherence rates, and the removal of repetition (which inevitably makes people disengage) will increase adherence rates further. As such, more and better information (i.e. the responses from more users) will be provided, which will ultimately improve the desired outcomes.

It should be appreciated that, while the description of Figures 1B and 2 has referred specifically to comparing questions in medical questionnaires, the present invention is by no means limited to this. The system described herein has equal application in a wide variety of technical domains, though preferably those with a high degree of semantic / ontological structure, such as medicine, law and engineering. It can also be used to compare and reconcile not just questions, but also other types of sentences / phrases; and such comparison may not only be for consolidation purposes, but could also be for many other purposes: for example, in a research or assessment context, to identify past examination questions that are similar to one now being composed. There can be as few or as many questions (or phrases) as is applicable in a given situation; and these can be already nested within questionnaires or other documents, or stored as stand-alone questions yet to be converted into questionnaires.

One possible alternative use for the invention is represented schematically in Figure 3: namely, as a questionnaire recommender and adapter system. This would receive the user’s search query (either in the form of a keyword(s) or a phrase / question) and identify questionnaires which, while not identical, are potentially useful; and further would optionally recommend or implement a replacement or adaptation to the identified questionnaire, so as to match the user’s query. For example, one could search for questionnaires regarding shoulder pain, and retrieve two, one which is for hip pain/mobility (404) and the other for knee pain/mobility (406). These are ontologically similar (due to the similarity of joints, as opposed to, for example, stomach pain), and the hip/knee concept can be replaced with “shoulder” to create an adapted questionnaire (406) customized to the user’s search query.

Such a questionnaire recommender and adapter system might use much the same question parsing and reconciliation process as described above; wherein the similarity metric is again configured to place primary emphasis on the similarity of the semantic / ontological terms, and lesser emphasis on syntactic structure of the questions. Thus questions relating to shoulder pain, knee pain, or hip pain would all be recognized as ontologically similar and weighted accordingly; then also weighted by syntactic similarity.
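The shoulder / hip / knee example might be sketched as below. The toy ontology, the questionnaire records and the `recommend` and `adapt` helpers are hypothetical and greatly simplified: ontological similarity is reduced here to sharing a parent concept, whereas a real semantic database would supply a far richer hierarchy.

```python
# Toy ontology (hypothetical): each body structure maps to a parent concept.
PARENT = {
    "shoulder": "joint",
    "hip": "joint",
    "knee": "joint",
    "stomach": "abdominal organ",
}


def recommend(query_term, questionnaires):
    """Return questionnaires on a different but ontologically similar concept."""
    parent = PARENT.get(query_term)
    if parent is None:
        return []
    return [
        q
        for q in questionnaires
        if q["body_structure"] != query_term
        and PARENT.get(q["body_structure"]) == parent
    ]


def adapt(questionnaire, query_term):
    """Replace the questionnaire's body structure with the queried one."""
    old = questionnaire["body_structure"]
    return {
        "body_structure": query_term,
        "questions": [t.replace(old, query_term) for t in questionnaire["questions"]],
    }


questionnaires = [
    {"body_structure": "hip", "questions": ["Does hip pain limit your daily activities?"]},
    {"body_structure": "knee", "questions": ["Do you usually have knee pain?"]},
    {"body_structure": "stomach", "questions": ["Do you have stomach pain after meals?"]},
]

matches = recommend("shoulder", questionnaires)
adapted = adapt(matches[0], "shoulder")
```

Here the stomach questionnaire is not recommended because its concept has a different parent from the queried joint.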

Alternatively, in some cases it might be appropriate for the similarity metric to be configured to give precedence to the syntactic similarity between the sentences, i.e. look first to the structure / overall format of the question, and then to the similarity of the semantic terms therein.

Figure 4 details how the system (100) of the present invention fits into the wider clinical context (502). In particular, the details of what the clinicians (504) need to know are specified by information from the Clinical Data Repository (506) which is processed by the Electronic Health Record (EHR) (508). This information is passed through the hospital integration engines (510) to the ZEDOC™ (100) integration engine and passed to the ZEDOC™ Core system (104). Reconciliation of the questionnaires is performed by the system (200) of the present invention, as described above. The reconciled questionnaires are posed to the patient (108), and the data is passed back to the ZEDOC™ Core (104) system. From there, the data is passed back through the integration engines (510) to the EHR system (508) and into the Clinical Data Repository (506) where it can be consumed by clinicians (504) and other hospital staff.

In a clinical context, non-adherence to questionnaires is a cost-factor for PROMs delivery. Higher adherence decreases the cost of delivery (as fewer in-clinic staff may be needed, and fewer follow ups are needed to get patient responses). As such, the system of the present invention provides a value increase. The increase in information from the clinically-validated questionnaires will help improve clinical care and outcomes. Additionally, as this can also be used for PREMs, higher adherence rates will result in improved organizational efficiency and quality, as better information is provided to hospitals with regards to patient experience. The system of the present invention also mitigates the need to identify and examine newly created PROMs instruments in a progressively expanding pool that spans across multiple health domains.

It will of course be realized that while the foregoing has been given by way of illustrative example of this invention, all such and other modifications and variations thereto as would be apparent to persons skilled in the art are deemed to fall within the broad scope and ambit of this invention as is hereinbefore described. If any reference numeral(s) is/are used in a claim or claims then such reference numeral(s) should not be considered as limiting the scope of that respective claim(s) to any particular embodiment of the drawings.

It is acknowledged that the term ‘comprise’ may, under varying jurisdictions, be attributed with either an exclusive or an inclusive meaning. For the purpose of this specification, and unless otherwise noted, the term ‘comprise’ shall have an inclusive meaning - i.e. it will be taken to mean an inclusion of not only the listed components it directly references, but also other non-specified components or elements. This rationale will also be used when the term ‘comprised’ or ‘comprising’ is used in relation to one or more steps in a method or process.

Claims

1. A computerized system for comparing a first and second phrase, the system comprising at least a storage and a processor; the processor configured to, for each phrase: parse the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, query a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); retrieve from the storage a phrase template, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element; insert the semantic term and at least one of the remaining distinct syntactic elements into the portion of the phrase template that corresponds, respectively, to their type; the processor further configured to: use a similarity metric stored in the storage to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged.

2. The system of claim 1, wherein the first and second phrase pertain to a technical domain having a high degree of semantic / ontological structure.

3. The system of claim 1 or claim 2, wherein the first and second phrase comprise a first and second sentence; and more preferably, a first and second question.

4. The system of any one of the preceding claims, wherein the distinct syntactic elements comprise one or more of: noun, verb, adverb, article, subject, object, pronoun, relative pronoun.

5. The system of claim 4 when dependent on claim 2, wherein the distinct syntactic elements further comprise one or more domain-specific syntactic categories corresponding to common concepts and / or phrase formats within the technical domain.

6. The system of claim 5, wherein one or more of the domain-specific syntactic categories is further subdivided into subcategories.

7. The system of any one of the preceding claims, wherein a plurality of the distinct syntactic elements (or words or groups of words corresponding thereto) are queried in the semantic database, and a corresponding semantic or ontological term is identified for each, such that there are a plurality of “semantic terms”.

8. The system of claim 5 or claim 6, wherein the phrase template is configured with regard to the one or more domain-specific syntactic categories and / or subcategories.

9. The system of claim 8, wherein the domain-specific syntactic categories and / or subcategories relating to common concepts at least partly define the types of semantic or ontological term into which the phrase template is divided; and the domain-specific syntactic categories and / or subcategories relating to common phrase formats at least partly define the types of syntactic element into which the phrase template is divided.

10. The system of claim 7, wherein the plurality of semantic terms are inserted into the portion of the phrase template that corresponds to their respective type, and a plurality of the remaining distinct syntactic elements are also inserted into the portion of the phrase template that corresponds to their respective type.

11. The system of any one of the preceding claims, wherein the similarity metric assigns a ranked weighting to each of the respective types of semantic or ontological term and syntactic element within the first and second phrase template.

12. The system of claim 11, wherein the semantic or ontological terms generally have a higher ranked weighting than the syntactic elements.

13. The system of any one of the preceding claims, wherein the consolidation or reconciliation of the first and second phrase comprises one or more of: identifying or labelling the first or second phrase as a duplication; deleting one of the first or second phrase; or merging the first and second phrase.

14. The system of claim 13, wherein merging the first and second phrase comprises replacing the at least one semantic term in the second phrase with one of: the corresponding at least one semantic term in the first phrase; or the word or group of words corresponding to the at least one semantic term in the first phrase.

15. The system of any one of the preceding claims, wherein the system is configured for comparing a plurality of phrases.

16. The system of any one of the preceding claims, wherein the storage comprises a plurality of phrase templates and similarity metrics corresponding to a plurality of technical domains; and the system either contains in the storage, or is configured to query, a plurality of semantic databases corresponding to a plurality of technical domains.

17. A computerized interface platform comprising the computerized system of claim 1, the interface platform comprising at least a storage (“the platform storage”) and a processor (“the platform processor”), the platform processor configured to: query an information database in the platform storage for relevant information; retrieve from a phrase database in the platform storage a first and second phrase relating to the relevant information; using the computerized system, compare phrase templates corresponding to the first and second phrase and, if the similarity between the phrase templates is above a threshold value, consolidate or reconcile the first and second phrase; or, if the similarity between the phrase templates is below the threshold value, leave the first and second phrases unchanged; and send the consolidated phrase or the first or second phrases (as the case may be) to an end user.

18. Preferably, the consolidated phrase or the first or second phrases (as the case may be) are sent to the end user as one or more questionnaire(s).

19. Preferably, the platform processor is further configured to receive from the end user responses to the consolidated phrase or the first or second phrases (as the case may be), and to store said responses in the platform storage.

20. A computerized method of comparing a first and second phrase using a system comprising at least a storage and a processor, the method comprising the steps of the processor, for each phrase: parsing the phrase into distinct syntactic elements, each distinct syntactic element comprising a word or group of words; for the word or group of words corresponding to at least one of the distinct syntactic elements, querying a semantic database to identify a corresponding semantic or ontological term (“the semantic term”); inserting the semantic term and at least one of the remaining distinct syntactic elements into the phrase template retrieved from the storage, wherein the phrase template is divided by type of semantic or ontological term and by type of syntactic element, and the semantic term and the at least one of the remaining distinct syntactic elements are inserted, respectively, into the portion of the phrase template that corresponds to their type; the method further comprising the steps of the processor: using a similarity metric to quantify the similarity between the phrase templates corresponding to the first and second phrase; and if the similarity between the phrase templates is above a threshold value, consolidating or reconciling the first and second phrase.