WO2020234388A1

WO2020234388A1 - Propensity score based assessment of patient data

Info

Publication number: WO2020234388A1
Application number: PCT/EP2020/064134
Authority: WO
Inventors: Fabian SCHMICH; Anna BAUER-MEHREN; Janick WEBERPALS; Fabian J. THEIS
Original assignee: F Hoffmann La Roche AG; Hoffmann La Roche Inc
Current assignee: F Hoffmann La Roche AG; Hoffmann La Roche Inc
Priority date: 2019-05-22
Filing date: 2020-05-20
Publication date: 2020-11-26
Anticipated expiration: 2021-11-22

Abstract

The invention relates to a computer-implemented method for assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment. The method comprises: - providing (102) a data set (200) comprising a plurality of data records (214, 218, 222, 224) of patients, each data record comprising feature values and a label indicating a treatment status; - training (104) a neural network (410) on the data set for providing a trained neural network; - computing (106), by the trained neural network, a feature vector for each of at least a subset of the patients, the vector being a dimensionally reduced representation of the feature values; - computing (108) propensity scores by performing a regression analysis of each feature vector and the label of the said data record, the propensity score being indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment; - outputting (120, 122) whether or not the data records of the subset of patients are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment as a function of the propensity scores.

Description

PROPENSITY SCORE BASED ASSESSMENT OF PATIENT DATA

Field of the invention

The invention relates to the analysis of medical data, and in particular to the field of medical data analysis using propensity scores.

Background and related art

Propensity score (PS) techniques are popular analytical approaches to balance patient characteristics in non-experimental studies. Peter C. Austin defines a propensity score as a subject's probability of treatment, conditional on observed baseline covariates. Conditional on the true propensity score, treated and untreated subjects have similar distributions of observed baseline covariates (Statistics in Medicine, 2009; 28:3083-3107, published online 15 September 2009 in Wiley InterScience, DOI: 10.1002/sim.3697). Propensity scores are used for identifying treated and untreated subjects having similar distributions of measured baseline covariates. Today, different approaches for computing a propensity score are used.
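In symbols, the definition cited above can be written as follows. The notation is an assumption introduced here for illustration only: Z denotes the treatment indicator (1 = treated, 0 = untreated), X the vector of observed baseline covariates, and e(x) the propensity score. The second expression restates the balancing property, i.e. that the covariates are independent of the treatment status once the true propensity score is conditioned on.

```latex
e(x) = \Pr(Z = 1 \mid X = x), \qquad X \perp\!\!\!\perp Z \mid e(X)
```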

For example, according to one approach referred to herein as "manual variable selection - MSS", a domain expert, e.g. a doctor or clinician, manually selects a set of patient-related features such as age, gender, and/or physiologic parameters that are considered to be of relevance for the efficacy of a particular treatment.

A problem with this approach is that the feature selection is highly subjective and, as a consequence, the propensity scores used in different medical facilities may be based on different sets of patient-related features and hence may not be comparable. Furthermore, the manual selection of the relevant features used for computing the propensity score bears the risk that relevant features are neglected and/or that actually irrelevant features are included in the propensity score computation and may thus reduce the quality of the computed propensity score. Furthermore, the manual selection of features is burdensome, so there is a tendency to select far fewer features than are available and could be used for improving the accuracy of the computed propensity scores.

According to another approach, high-dimensional propensity score algorithms are used to estimate treatment effects among prescription drug initiators. According to this approach, the original variable selection procedure based on the estimated bias of each variable using unadjusted associations between confounders and exposure (RRCE) and disease outcome (RRCD) is augmented by alternative strategies, e.g. the use of models considering more than 1,500 variables jointly (Lasso, Bayesian, etc.) and using prediction statistics or likelihood-ratio statistics for covariate prioritization (Sebastian Schneeweiss et al., "Variable Selection for Confounding Adjustment in High-dimensional Covariate Spaces When Analyzing Healthcare Databases", Epidemiology 2017;28:237-248). A problem associated with these approaches is that the large number of evaluated features sometimes decreases the accuracy of the propensity score, presumably because some features actually act as noise.

As a consequence of the above-mentioned problems, groups of "similar" and hence "comparable" patients obtained based on state-of-the-art approaches of computing a propensity score have often been observed not to be well balanced. As a result, these approaches often are not able to accurately determine whether patient groups having been identified based on a similar propensity score are non-biased and hence may be used as a valid data basis for determining the efficacy of a particular treatment with respect to a particular disease.

Summary

It is an objective of the present invention to provide for an improved method of assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment and a corresponding system as specified in the independent claims. Embodiments of the invention are given in the dependent claims. Embodiments of the present invention can be freely combined with each other if they are not mutually exclusive.

In one aspect, the invention relates to a method for assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment. The method comprises:

- providing a data set comprising a plurality of data records, each data record representing a patient and comprising a plurality of patient-related feature values and a label indicating whether the patient has received a particular medical treatment;

- training a neural network on the data set for providing a trained neural network;

- computing, by the trained neural network, a feature vector for each of at least a subset of the data records of the data set, each feature vector being a dimensionally reduced representation of the feature values of the respective data record input to the trained neural network;

- computing a propensity score for each of the data records for which a feature vector was computed, the computation of the propensity score comprising performing a regression analysis of the feature vector and the label of the said data record, the propensity score being indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment;

- identifying pairs of first and second data records having similar propensity scores, wherein the first data record of each pair represents a patient having received the particular medical treatment, wherein the second data record in each pair represents a patient not having received the particular medical treatment;

- creating a treated data record group selectively comprising all first data records of the identified data record pairs; this group is also referred to as "virtual" "treated" data record group;

- creating a control data record group selectively comprising all second data records of the identified data record pairs; this group is also referred to as "virtual" "control" data record group;

- statistically analyzing the treated data record group and the control data record group for determining if the treated data record group and the control data record group are similar;

- if the treated data record group and the control data record group are similar, outputting that the treated data record group and the control data record group are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment; and

- if the treated data record group and the control data record group are dissimilar, outputting that the treated data record group and the control data record group are not suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment.
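The pairing and group-creation steps listed above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the text only requires pairs with "similar" propensity scores, so the greedy 1:1 nearest-neighbour strategy, the function name `match_pairs` and the `caliper` value of 0.05 are assumptions chosen for the example.

```python
def match_pairs(scores, treated, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on the propensity score.

    scores  -- one propensity score per data record
    treated -- True if the record's label says the patient was treated
    caliper -- hypothetical maximum score difference for a "similar" pair
    """
    treated_idx = [i for i, t in enumerate(treated) if t]
    free_controls = {i for i, t in enumerate(treated) if not t}
    pairs = []
    for i in treated_idx:
        if not free_controls:
            break
        # Closest control record that has not been matched yet
        # (ties are broken by the lower record index).
        j = min(free_controls, key=lambda c: (abs(scores[c] - scores[i]), c))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            free_controls.remove(j)
    return pairs


# Made-up scores and labels for six data records.
scores = [0.81, 0.30, 0.78, 0.33, 0.10, 0.55]
treated = [True, True, False, False, False, True]

pairs = match_pairs(scores, treated)
treated_group = [i for i, _ in pairs]   # "virtual" treated data record group
control_group = [j for _, j in pairs]   # "virtual" control data record group
```

In this toy data the treated record with score 0.55 finds no control within the caliper and is therefore left out of both virtual groups.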

This may be advantageous as applicant has observed that neural networks are able to more accurately determine whether or not a particular set of patient data is suited for being used in a database study for accurately determining the efficacy, effectiveness and/or safety of a medical treatment. Applicant has observed that using a neural network for computing a dimensionally reduced vector representation of patient-related feature values, in combination with a regression analysis of this feature vector with respect to a label indicating whether a patient receives the treatment of interest, is particularly well suited to computing a propensity score that represents a non-biased, feature-based probability that a patient receives the treatment of interest more accurately than state-of-the-art approaches.

According to embodiments, the determining if the treated data record group and the control data record group are similar comprises determining if the two groups comprise at least a minimum number of data records to support a statistically accurate and significant assessment of the efficacy, effectiveness and/or safety of a medical treatment. For example, if the propensity scores of the "treated" and "control" patients are so dissimilar that it is not possible to create virtual "treated" and "control" groups of sufficient size, then the method can comprise outputting that the data set or the subset of data records selected for feature vector computation does not allow creating virtual patient groups of sufficient size to perform a valid statistical assessment of the treatment effect.

According to embodiments, some or all steps of the computer-implemented method, in particular the computation of the feature vectors, the regression analysis, the propensity score computation, the creation of the "treated" and "control" data record groups, the statistical analysis of the similarity of the two groups and the outputting of the result are executed fully automatically. Hence, an automated quality assessment of any data set that is to be used for database studies may be provided. This may be beneficial as a manual feature selection for computing the propensity score and further manual steps for assessing the quality of a data set may be avoided.

In the following, the treated data record group and the control data record group will also be referred to as "virtual data record groups" and the respective patient groups will be referred to as "virtual (treated and control) patient groups", because these virtual data record and patient groups are determined only computationally and do not (necessarily) correspond to ("empirical") patient groups defined by a clinic or a clinical study and are typically not identical to the "real/empirical" sub-groups of "treated" and "non-treated" (or "otherwise treated") patients represented by the data records in the data set.

According to some embodiments, the regression analysis performed for computing the propensity score can be a logistic regression analysis performed by a statistical software program such as, for example, R, or can be performed using machine-learning based regression analysis approaches, e.g. regression analysis using support vector machines (SVMs), random forests or neural networks. The performing of the regression analysis can comprise minimizing the difference between a predicted label of a data record representing a patient and the label (indicating the treatment status of the patient) that is actually assigned to the patient. Thereby, the predicted label is computed based on the feature vector generated by the neural network.

According to some embodiments, the statistical analysis of the treated data record group and the control data record group comprises performing a Kolmogorov-Smirnov test, and/or computing a p-value of a D statistic and/or performing a Student's t-test on the propensity score distribution or feature value distributions of the said two "virtual" patient groups.
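The logistic regression variant of the propensity score computation described above can be illustrated with a small, dependency-free sketch. It is not the statistical program the text refers to: the batch gradient descent fit, the learning rate, the epoch count and the two-dimensional toy feature vectors are all assumptions made for the example. The fit minimizes the log-loss between the predicted and the actual treatment label, and the fitted probability serves as the propensity score.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(vectors, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, k = len(vectors), len(vectors[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for d in range(k):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def propensity_scores(vectors, w, b):
    """Predicted probability of treatment for each feature vector."""
    return [sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) for x in vectors]


# Made-up feature vectors (as a bottleneck layer might output them)
# and treatment labels (1 = treated, 0 = not treated).
vectors = [[0.1, 0.9], [0.2, 0.7], [0.3, 0.8], [0.75, 0.3],
           [0.6, 0.6], [0.7, 0.4], [0.8, 0.2], [0.9, 0.1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(vectors, labels)
scores = propensity_scores(vectors, w, b)
```

The magnitude of each fitted weight gives a rough indication of how strongly the corresponding feature vector component predicts the treatment status.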

According to embodiments, the statistical analysis comprises: determining a first distribution of the feature values of the data records of the treated data record group and a second distribution of the feature values of the data records of the control data record group; and comparing the first and the second distribution. The treated data record group and the control data record group are determined to be similar if the similarity of the two distributions exceeds a feature-value-distribution-similarity-threshold. For example, this threshold can be specified as a p-value of a Kolmogorov-Smirnov test or as a maximum mean square deviation between the feature values of the data records of the two groups. The treated data record group and the control data record group are determined to be dissimilar if the similarity of the two distributions is below the feature-value-distribution-similarity-threshold.

This may be advantageous as the similarity of groups is determined based on objective, reproducible criteria. Furthermore, there already exist several statistical programs such as R that allow performing various tests for determining the similarity of the distribution of one or more parameter values.

According to embodiments, the comparing of the first and second distributions comprises computing mean differences of the feature values of the data records in the treated data record group and the control data record group.
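A minimal sketch of such a mean-difference comparison is given below. The raw difference follows the text directly; the standardized variant (difference divided by the pooled standard deviation) is a common balance diagnostic that is added here as an assumption and is not named in the text. The function names and the two-feature toy data (age, BMI) are hypothetical.

```python
import math
from statistics import mean, variance


def mean_differences(treated_rows, control_rows):
    """Raw difference of the per-feature means (treated minus control)."""
    return [mean(t) - mean(c)
            for t, c in zip(zip(*treated_rows), zip(*control_rows))]


def standardized_mean_differences(treated_rows, control_rows):
    """Mean difference divided by the pooled standard deviation (assumption)."""
    result = []
    for t, c in zip(zip(*treated_rows), zip(*control_rows)):
        diff = mean(t) - mean(c)
        pooled_sd = math.sqrt((variance(t) + variance(c)) / 2.0)
        if pooled_sd == 0.0:
            # No spread in either group: balanced only if the means coincide.
            result.append(0.0 if diff == 0.0 else math.copysign(math.inf, diff))
        else:
            result.append(diff / pooled_sd)
    return result


# One row per data record, one column per feature (here: age, BMI).
treated_rows = [[60.0, 25.0], [70.0, 27.0]]
control_rows = [[62.0, 30.0], [68.0, 32.0]]

raw = mean_differences(treated_rows, control_rows)
smd = standardized_mean_differences(treated_rows, control_rows)
```

In this toy data the two groups agree on mean age but differ by 5 BMI units, so only the second feature would count against balance.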

This may be advantageous as this approach can be used for determining if the health status of the patients assigned to the virtual group of the "treated" patients is sufficiently similar to the health status of the patients assigned to the virtual "control" group. In other words, the determination of the similarity of the two virtual groups allows automatically determining if the two virtual patient groups are "balanced", i.e., suited for a statistical comparison and analysis for determining if the treatment has a positive or negative effect on the health of the patients. The feature value distribution comparison may comprise displaying the distributions of the individual features of the patients, thereby enabling a user, e.g. a clinician, to visually assess the contribution of the individual features to the similarity or dissimilarity of the individual groups.

According to embodiments, the statistical analysis comprises: determining a first distribution of the propensity scores of the data records of the treated data record group and a second distribution of the propensity scores of the data records of the control data record group; and comparing the first and the second distribution. The treated data record group and the control data record group are determined to be similar if the similarity of the two distributions exceeds a propensity-score-distribution-similarity-threshold. The treated data record group and the control data record group are determined to be dissimilar if the similarity of the two distributions is below the propensity-score-distribution-similarity-threshold.

The propensity score is a score that is derived from several feature values, and applicant has observed that based on the comparison of the propensity score distributions of the two virtual groups alone, a computationally efficient and highly accurate alternative method for determining the similarity of the two virtual patient groups is provided.

According to embodiments, both the similarity of the propensity score distributions of the two virtual groups and the similarity of one or more feature value distributions of the two virtual groups are determined for computationally assessing the similarity and comparability of the two virtual patient groups, which is considered a prerequisite for performing any further assessment of the treatment effect on the basis of the two virtual patient groups.

According to embodiments, the comparing of the first and second distributions comprises performing a Kolmogorov-Smirnov test of the propensity scores of the data records in the treated data record group and of the control data record group.
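The two-sample Kolmogorov-Smirnov comparison can be illustrated by computing the D statistic, i.e. the largest gap between the empirical cumulative distribution functions of the two groups. The sketch below is an assumption-laden illustration: it stops at D and applies a hypothetical cut-off `MAX_D`, whereas a statistics package would normally also supply the p-value mentioned in the text.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D statistic.

    D is the largest absolute difference between the two empirical
    cumulative distribution functions, evaluated at every observed value.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up propensity scores of the two virtual groups.
treated_ps = [0.31, 0.42, 0.55, 0.63]
control_ps = [0.30, 0.44, 0.52, 0.66]

MAX_D = 0.3  # hypothetical similarity threshold, not taken from the text
d_stat = ks_statistic(treated_ps, control_ps)
groups_similar = d_stat <= MAX_D
```

A small D means the two distributions nearly coincide, so "similarity exceeds the threshold" corresponds here to D staying below the chosen cut-off.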

According to embodiments, the method further comprises: if the output indicates that the data set is suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment, automatically performing a further statistical analysis of the treated data record group and the control data record group using patient-related feature values obtained during or after treatment of the patients in the treated patient group.

Preferably, the feature values of the patients in the data set used for training the neural network comprise only patient-related feature values obtained at one or more time points before the treatment started. In contrast, the further statistical analysis of the feature values of the patient-related data records in the two virtual patient groups uses patient-related feature values obtained during or after the treatment of the patients in the "virtual" group of treated patients. This may ensure that the feature values considered for determining if the treated and control virtual patient groups are balanced do not comprise feature values affected by the treatment. This is beneficial, because the purpose of the propensity score based creation of virtual patient groups is the creation of patient groups having a similar health status before the treatment is started. This is a prerequisite for determining if and to what extent a particular treatment has an effect on the patients.

Embodiments of the invention may allow performing a database study to assess the efficacy, effectiveness and/or safety of a treatment immediately after it has been determined that the virtually created patient groups are similar and well-balanced.

According to embodiments, if the output indicates that the data set is not suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment, the method comprises automatically selecting a subset of not-yet selected data records from the data set, if any; and repeating the propensity score computation, the identification of pairs of first and second data records having similar propensity scores, the creation of the treated data record group, the creation of the control data record group, the statistical analysis and the outputting of the result of the statistical analysis.

Hence, according to embodiments, a statistical analysis on "virtual" patient cohorts for determining treatment efficacy, effectiveness and/or safety is automatically performed selectively in case the data quality (group similarity) of the virtual patient groups formed based on the propensity score is high enough. Otherwise, a warning message is output. For example, a warning message can be any indication that the computationally created "treated" and "control" patient groups are so dissimilar that they cannot be used for statistically accurately assessing the efficacy, effectiveness and/or safety of the treatment. The warning message can be, for example, a text string displayed on a graphical user interface, a dialog window, a sound, an e-mail sent to a user or any other form of message or signal that is adapted to inform a user of the result of the determination of the similarity of the two compared virtual patient groups. In addition, or alternatively, the method may comprise selecting a different subset of data records from the totality of data records used for training the neural network and re-computing and analyzing the virtual "treated" and "control" patient groups. This may greatly ease and increase the efficiency of evaluating the suitability of a patient data set for virtual clinical studies and may allow automatically selecting sub-sets of a larger data set that allow the propensity-score based creation of virtual treated and control patient groups being sufficiently balanced for further analysis of a treatment effect.

According to embodiments, the neural network is an autoencoder.

Previously, autoencoders have mainly been used in the image analysis domain for creating a dimensionally reduced representation of digital images. Autoencoders comprise a reduction side and a reconstruction side. The reduction side of an autoencoder learns, during training, to reduce a set of variable values provided as input to a reduced variable set comprising fewer variables than originally input to the autoencoder, whereby the reduced set of variables is nevertheless highly characteristic and descriptive of the original input. During the training, the reconstruction side of the autoencoder learns to reconstruct the original input, e.g. the set of originally provided input variables (e.g. pixel intensity values).

A loss function computes the deviations of the input variable values (e.g. original image pixels) from the reconstructed variable values (e.g. image pixels computed by the reconstruction side), and the machine learning model of the autoencoder is modified during training such that the deviation is minimized. Applicant has surprisingly observed that the reduction and reconstruction capabilities of autoencoders, which have hitherto mainly been used in the field of image analysis, can be used for increasing accuracy and speed of propensity score computation. This is because the reduction part of the autoencoder is adapted to learn during the training process to compute a feature vector of reduced dimensionality from a larger set of feature values of a patient which are input to the autoencoder. Hence, the reduction part of the autoencoder learns to identify a comparatively small number of features of a patient which are highly characteristic of the health state of a patient. It is not necessary to provide a manually annotated data set where the health status of a patient is manually categorized and labeled. Hence, using autoencoders for computing the propensity score allows assessing the usability of data sets being free of explicit or manually added annotations that characterize the health state of a patient. This may allow making use of many existing patient data sets which are not annotated or are inconsistently annotated regarding the health state of the patients. The autoencoder simply takes the totality of available patient-related features, such as physiologic parameters, age, gender, health history and genetic parameters, and learns, in a self-supervised manner, to create a feature vector being a dimensionally reduced representation of all these feature values. As a consequence, all further processing steps, in particular the regression analysis for computing a propensity score from the feature vector, are tremendously accelerated, because only a feature vector that is much smaller than the set of originally available patient-related feature values has to be processed.

Hence, using autoencoders may allow computing high-quality propensity scores also for patient data sets which are free of a consistent annotation of the health status of the patients. Furthermore, thanks to the computation of the feature vector as a dimensionally reduced representation of the health status of the patient, the regression analysis and the computation of the propensity score may be significantly accelerated and the CPU consumption may be reduced.

According to embodiments, the neural network comprises a bottleneck layer.

A "bottleneck layer" as used herein is a layer that contains fewer nodes than its previous layers. In particular, it can be a layer of a neural network comprising the fewest number of nodes. It can be used to obtain a representation of the input with reduced dimensionality.

According to embodiments, the neural network is configured to perform a nonlinear dimensional reduction of the feature values input to the neural network for computing the feature vector. For example, several autoencoders comprise a bottleneck layer adapted to compute a feature vector from a plurality of input feature values by performing a nonlinear dimensional reduction of the input feature values. This may be advantageous, because applicant has observed that several clinically relevant patient-related features interrelate with each other in a nonlinear manner. For example, the BMI may not correlate significantly with certain cardiovascular problems provided the BMI is within a first value range. However, if the BMI is in a second value range, the effects on cardiovascular problems may increase strongly in a nonlinear manner. Hence, autoencoders adapted to perform a nonlinear dimensional reduction of the input feature values have been observed to be particularly appropriate and useful for identifying interdependencies of different patient-related features and for identifying a feature vector of reduced size that nevertheless accurately characterizes the health status of the patient.

According to one embodiment, the bottleneck layer is a layer having about 10-300 nodes ("neurons"), e.g. 128, 64 or 32 nodes, and most preferably only 16 nodes. For example, a 5-layer autoencoder could comprise a 128-node bottleneck layer. The feature vector computed by an 11-layer autoencoder can be smaller, because a 5-layer autoencoder performs less dimensionality reduction than an 11-layer autoencoder. According to embodiments, the bottleneck layer is configured to generate a feature vector being a nonlinear lower dimensional representation of the totality of feature values of a patient input to the neural network (during training or at test phase).

According to embodiments, the neural network comprises an input layer comprising more than 200 neurons, each of the neurons corresponding to a respective one of the patient-related features.

According to embodiments, the neural network comprises 5 to 15 layers in total, preferably 5 to 11 layers in total, most preferably 5 layers in total.

According to embodiments, the neural network comprises an encoder part comprising a plurality of layers whose neurons use an activation function being a rectifier function and comprising a decoder part comprising a plurality of layers whose neurons use an activation function being a sigmoidal function.
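The layer layout described in the preceding embodiments can be made concrete with a forward pass through an untrained toy autoencoder. Everything numeric here is an assumption picked from within the stated ranges: 256 input neurons (more than 200), a 16-node bottleneck, five layers in total, rectifier (ReLU) activations on the encoder side and sigmoid activations on the decoder side. The weights are random, so the output only demonstrates shapes and value ranges, not a learned representation.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_layer(n_in, n_out, rng):
    """Random weights (one row per output neuron) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    return [activation(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


# input -> hidden -> bottleneck -> hidden -> output (5 layers, assumed sizes)
SIZES = [256, 64, 16, 64, 256]
rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(SIZES, SIZES[1:])]


def encode(features):
    """Encoder part: rectifier activations down to the bottleneck layer."""
    for weights, biases in layers[:2]:
        features = dense(features, weights, biases, relu)
    return features


def decode(vector):
    """Decoder part: sigmoid activations back to the input size."""
    for weights, biases in layers[2:]:
        vector = dense(vector, weights, biases, sigmoid)
    return vector


patient = [rng.random() for _ in range(SIZES[0])]  # normalized toy features
feature_vector = encode(patient)                   # 16 values
reconstruction = decode(feature_vector)            # 256 values in (0, 1)
```

Because the decoder ends in a sigmoid, its outputs lie between 0 and 1, which is one reason to scale the input feature values to the same range before training.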

Applicant has observed that the above-mentioned combination of a rectifier function and a sigmoidal function improves the learning effect and improves the ability of the autoencoder to compute a strong dimensional reduction of the input feature values.

According to embodiments, the neural network is trained using a number of at least 50 epochs (e.g. 100 epochs, more preferentially at least 200 epochs, e.g. 256 epochs) and/or a batch size of 60-80 data records (e.g. 64 data records).

According to embodiments, the neural network is trained using a noise value of 3-7%, in particular 5%. This means that about 3-7% of the feature values that are input to the neural network during training are randomly set to zero or to a different computationally generated data value. This feature may help avoid overfitting during training.
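The zeroing variant of this noise step can be sketched in a few lines. The function name `add_masking_noise` and the seeded random generator are assumptions for the example; the 5% rate is the value named in the text.

```python
import random


def add_masking_noise(values, rate=0.05, rng=random):
    """Return a copy of `values` in which roughly `rate` of the entries,
    chosen independently at random, are replaced by zero."""
    return [0.0 if rng.random() < rate else v for v in values]


rng = random.Random(42)       # seeded only to make the example repeatable
clean = [1.0] * 10_000        # toy input without any zeros
noisy = add_masking_noise(clean, rate=0.05, rng=rng)
zeroed_fraction = noisy.count(0.0) / len(noisy)
```

During training, the corrupted values would be fed into the network while the reconstruction is still compared against the clean values, so the network cannot simply copy its input.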

Applicant has observed that neural networks, in particular autoencoders, having the above-mentioned number of layers, epochs, activation function, noise function and/or nodes are particularly suited for computing a feature vector that can be used as a basis for fast and accurate propensity score computation.

According to embodiments, the neural network is configured to reconstruct the feature values of a patient input to the neural network from the feature vector. The neural network is trained using a loss function that determines a deviation of the reconstructed feature values of a patient from the patient-related feature values actually input to the neural network.

According to embodiments, the performing of the regression analysis of the feature vector and the label comprises determining which feature values of the feature vector have the strongest predictive power in respect to the treatment status of a patient indicated in the label, and computing the propensity score as a function of the determined predictive power of the feature values in the feature vectors and the label.

For example, an autoencoder can be used as the neural network. The autoencoder is adapted to learn to reconstruct the totality of input feature values from a dimensionally reduced set of feature values stored in a feature vector in an unsupervised (or "self-supervised") manner. The autoencoder hence does not need explicit labels indicating the health state of a patient. The reconstruction (decoder) part of the autoencoder can use a cross-entropy function to compute the deviation of the predicted (reconstructed) input vector from the real input vector, thereby learning the most important features which are to be stored in the bottleneck layer and which are to be output by the bottleneck layer of the trained autoencoder in the form of a feature vector. The propensity score is computed afterwards using the feature vector output by the bottleneck layer of the trained neural network as variables in the regression analysis.

According to embodiments, the neural network is trained using a loss function. The loss function is a cross-entropy loss function, in particular a binary cross-entropy loss function.
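As an illustration, a binary cross-entropy reconstruction loss for a single data record might look as follows. The sketch assumes that the input feature values have been scaled to the range 0 to 1 and that the reconstructed values come from a sigmoid output layer; the function name and the clipping constant `eps` are hypothetical.

```python
import math


def binary_cross_entropy(inputs, reconstructed, eps=1e-12):
    """Mean binary cross-entropy between input feature values in [0, 1]
    and the values reconstructed by the decoder."""
    total = 0.0
    for x, r in zip(inputs, reconstructed):
        r = min(max(r, eps), 1.0 - eps)  # keep log() away from 0 and 1
        total -= x * math.log(r) + (1.0 - x) * math.log(1.0 - r)
    return total / len(inputs)


uninformative = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])  # equals ln 2
close = binary_cross_entropy([1.0, 0.0], [0.9, 0.1])          # much smaller
```

The loss shrinks as the reconstruction approaches the input, which is the quantity the training procedure minimizes.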

According to embodiments, the method comprises normalizing the feature values before they are input into the neural network. For example, the normalization can comprise a standardization or scaling of feature values. For example, the normalization can be performed using a min-max function.
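A min-max normalization of one feature column can be sketched as below. The function name and the handling of a constant column (mapped to zeros) are assumptions made for the example.

```python
def min_max_scale(column):
    """Rescale one feature column linearly so that its smallest value
    becomes 0 and its largest value becomes 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: no spread to rescale (assumed convention).
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


ages = [40.0, 55.0, 70.0]          # toy values for one feature
scaled_ages = min_max_scale(ages)  # [0.0, 0.5, 1.0]
```

Scaling each feature separately keeps features with large numeric ranges, such as body weight, from dominating features with small ranges.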

According to embodiments, the trained neural network comprises an encoder part, a bottleneck layer and a decoder part. The computing of the feature vector for each of the data records comprises, for each of the data records of at least the subset of data records of the data set:

- storing the feature vector computed by the bottleneck layer as a function of the feature values of a data record on a volatile or non-volatile storage medium;

- inputting the stored feature vector and the label of the data set into a statistical analysis program (e.g. R); and

- performing, by the statistical analysis program, the regression analysis (in particular logistic regression analysis) of the feature vector values in the feature vector and the label for computing the propensity score for the patient represented by the data record.
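The hand-over of the stored feature vectors to a statistical analysis program, as listed above, could for instance go through a CSV file. The sketch below is one possible arrangement, not the described implementation: the column names (`record_id`, `z1`, `z2`, ..., `treated`), the function name and the in-memory buffer standing in for a file are all assumptions.

```python
import csv
import io


def write_feature_vectors(handle, record_ids, feature_vectors, labels):
    """Write one CSV row per data record: id, feature vector values, label."""
    writer = csv.writer(handle)
    k = len(feature_vectors[0])
    header = ["record_id"] + [f"z{d + 1}" for d in range(k)] + ["treated"]
    writer.writerow(header)
    for record_id, vector, label in zip(record_ids, feature_vectors, labels):
        writer.writerow([record_id, *vector, label])


# An in-memory buffer stands in for a file on a storage medium.
buffer = io.StringIO()
write_feature_vectors(
    buffer,
    record_ids=["p1", "p2"],
    feature_vectors=[[0.12, 0.88], [0.75, 0.31]],
    labels=[1, 0],
)
csv_lines = buffer.getvalue().splitlines()
```

In R, such a file could then be loaded with `read.csv` and passed to `glm` with `family = binomial` to fit the logistic regression.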

According to embodiments, the feature values are selected from a group comprising:

patient age, patient gender, patient eating habits, patient drinking habits, patient smoking habit, patient weight, patient BMI, patient height, patient health insurance status, patient hospital visit information, patient body temperature, patient pulse rate, patient respiration rate (rate of breathing), patient blood pressure, patient physiological values, in particular blood parameter values, patient genomic features, patient history features, geographic region of the patient, current medication of a patient, comorbidities of a patient, patient metabolic features, and any combination of two or more of the aforementioned features.

According to embodiments, the feature values are free of an indication if the patient was treated or not. This may be beneficial as it is ensured that the neural network does not learn to compute a dimensionally reduced characterization of the health status of a patient based on an attribute (treated or not) that itself is a result of the assessment of the health status by the clinician in charge. This may increase accuracy of the trained neural network.

According to embodiments, all feature values of a patient input to the neural network are indicative of patient-related features observed at one or more time points before the treatment started. Preferably, in case the method determines that the virtual patient groups are balanced, the assessment of the efficacy, effectiveness and/or safety of the treatment is performed on patient-related feature values that are identical or similar to the above-mentioned feature values but have been measured or otherwise observed during or after the patients of the "treated" group received the treatment. This may ensure that the neural network does not learn any bias based on the historical and inconsistent treatment decisions of many different clinicians and that at the same time, if the data set allows performing a database trial, this "virtual" database trial is performed on health-related data that really reflects a patient's health during or after treatment.

According to embodiments, the patients are cancer patients and the treatment is selected from a group comprising: immunotherapy, chemotherapy, targeted cancer therapy, and any combination of two or more of the aforementioned therapies. However, many other treatment approaches exist and the method can be used according to embodiments of the invention for determining the suitability of data sets for later analysis in respect to many different treatment forms including those addressing non-cancer related diseases.

In a further aspect, the invention relates to a method for assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment. The method comprises:

- providing a data set comprising a plurality of data records, each data record

- training a neuronal network on the data set for providing a trained neural network;

- computing, by the trained neural network, a feature vector for each of at least a subset of the data records of the data set, each feature vector being a dimensionally reduced representation of the feature values of the respective data record input to the trained neural network;

- using the computed propensity scores for computationally creating a treated data record group and a control data record group, the treated data record group selectively comprising data records having assigned a label indicating that the patient represented by the said data record has received the treatment, the control data record group selectively comprising data records having assigned a label indicating that the patient did not receive the treatment, the propensity scores of the data records in the treated data record group being similar to the propensity scores in the control data record group; and

- predicting whether or not the treated data record group and the control data record group are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment as a function of the similarity of the treated data record group and the control data record group.

According to embodiments, the using the computed propensity scores for computationally creating a treated data record group and a control data record group comprises:

- creating a treated data record group as a data record group selectively comprising all first data records of the identified data record pairs; and

- creating the control data record group as a data record group selectively comprising all second data records of the identified data record pairs.
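The pairing of treated and control data records by propensity score can be illustrated with a short sketch. It assumes greedy 1:1 nearest-neighbour matching without replacement and a maximum allowed score difference; the text does not prescribe a particular matching algorithm, and the names `match_pairs` and `caliper` as well as the example scores are hypothetical.

```python
def match_pairs(scores, treated, caliper=0.05):
    """Greedy 1:1 matching without replacement (illustrative assumption).

    scores  -- dict: record id -> propensity score
    treated -- dict: record id -> True if the label says "treatment received"
    caliper -- hypothetical maximum score difference allowed within a pair
    """
    treated_ids = sorted((i for i in scores if treated[i]), key=lambda i: scores[i])
    control_ids = [i for i in scores if not treated[i]]
    pairs = []
    for t in treated_ids:
        if not control_ids:
            break
        # Closest control record that has not been paired yet.
        c = min(control_ids, key=lambda i: abs(scores[i] - scores[t]))
        if abs(scores[c] - scores[t]) <= caliper:
            pairs.append((t, c))
            control_ids.remove(c)
    return pairs


# Made-up propensity scores and labels for five data records.
scores = {"A": 0.81, "B": 0.80, "C": 0.35, "D": 0.33, "E": 0.10}
treated = {"A": True, "B": False, "C": True, "D": False, "E": False}

pairs = match_pairs(scores, treated)
treated_group = [first for first, _ in pairs]    # all first data records
control_group = [second for _, second in pairs]  # all second data records
```

In this toy input, record "E" has no treated counterpart within the caliper and is therefore left out of both groups.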

According to embodiments, the prediction whether or not the treated data record group and the control data record group are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment as a function of the similarity of the treated data record group and the control data record group comprises:

All embodiments and examples mentioned for different aspects of the invention such as method, computer-readable storage medium and system can freely be combined with each other.

In a further aspect, the invention relates to a computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a computer-implemented method according to any one of the embodiments and examples described herein.

In a further aspect the invention relates to a system for assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment. The system comprises a user interface and a storage medium. The storage medium comprises a trained neuronal network and a (e.g. monolithic or distributed) quality assessment program.

The trained neural network has been trained on a data set comprising a plurality of data records, each data record representing a patient and comprising a plurality of patient-related feature values and a label indicating whether the patient has received a particular medical treatment. The trained neural network is configured to compute a feature vector for each of at least a subset of the data records of the data set. Each feature vector is a dimensionally reduced representation of the feature values of the respective data record input to the neural network.

The quality assessment program is configured for:

- computing a propensity score for each of the data records for which a feature vector was computed. The computation of the propensity score comprises performing a regression analysis of the feature vector and the label of the said data record. The propensity score is indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment;

- creating a treated data record group selectively comprising all first data records of the identified data record pairs;

- creating a control data record group selectively comprising all second data records of the identified data record pairs;

- if the treated data record group and the control data record group are similar, outputting, via the user interface, that the treated data record group and the control data record group are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment; and

- if the treated data record group and the control data record group are dissimilar, outputting, via the user interface, that the treated data record group and the control data record group are not suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment.
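How similarity of the two groups might be quantified can be sketched with the two measures named in the figure descriptions of this document, the standardized mean difference (SMD) and the Kolmogorov-Smirnov D-statistic. The sketch below is an assumption-laden illustration: the pooled-standard-deviation form of the SMD, the limit of 0.1 (a common rule of thumb, not taken from this text) and all function names are hypothetical choices.

```python
from math import sqrt
from statistics import mean, variance


def smd(treated_values, control_values):
    """Standardized mean difference using the pooled standard deviation."""
    diff = mean(treated_values) - mean(control_values)
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled_sd


def ks_d(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


def groups_are_similar(treated_values, control_values, smd_limit=0.1):
    """Hypothetical decision rule: balanced if |SMD| stays below the limit."""
    return abs(smd(treated_values, control_values)) < smd_limit


# Made-up ages of paired treated and control patients.
ages_treated = [60, 62, 64, 66]
ages_control = [61, 62, 64, 65]
balanced = groups_are_similar(ages_treated, ages_control)
d_statistic = ks_d(ages_treated, ages_control)
```

A real assessment would apply such a check to every relevant feature and to the propensity score distributions themselves, not to a single feature as in this toy input.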

The quality assessment program can be, for example, a set of one or more software functions, software modules and/or software programs that are orchestrated such that a method can be performed which comprises computing propensity scores for the feature vectors output by the trained neural network, using the propensity scores for creating and comparing virtual patient groups, and outputting whether or not the virtual patient groups can be used as basis for assessing the efficacy, effectiveness and/or safety of a treatment. The quality assessment program can be implemented with any programming language, e.g. C, C++, .Net, Java, Python or the like. The quality assessment program can be implemented e.g. in the form of a set of software functions and/or modules that are part of or that constitute a software framework as described, for example, with reference to figure 4.

A "propensity score" as used herein is a numerical value being indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment. Thereby, the propensity score is computed as a function of the feature values. For example, the feature values of a patient can be stored as feature values of a data record representing the patient. For example, the propensity score can be descriptive of a patient's probability of receiving a particular treatment, whereby this probability depends on feature values of the patient. These feature values can also be described as observed baseline covariates of the patient.

The term "efficacy" as used herein is the performance of a treatment, whereby the treatment is preferably performed under controlled circumstances, e.g. in randomized trials. For example, in case some or all patients represented by the data records in the data set are patients having participated in a randomized trial, the method can be used to check whether the patient cohorts were indeed balanced and whether the data set allows assessing the efficacy of the treatment to cure or improve a particular disease. A "treated" patient group is a group of patients known to have received a particular treatment of interest, e.g. immunotherapy. Accordingly, a data record in the data set representing a "treated" patient has assigned a label indicating that this patient has received the treatment of interest.

A "control" patient group is a group of patients known not to have received the particular treatment of interest. These patients may not have received any treatment at all and/or may have received one or more treatments other than the treatment of interest. Accordingly, a data record in the data set representing a "control" patient has assigned a label indicating that this patient has not received the treatment of interest and/or has received a different treatment.

The term "effectiveness" as used herein describes relative positive effects of a treatment under everyday/ routine care scenarios. For example, the data set can be a large data set collected from different clinics, different patient groups within and/or outside of a clinical study and may hence be highly heterogeneous. The data set may not comprise any predefined patient cohorts or study branches. Nevertheless, the propensity score may allow selecting "treated" and "control" patient sub-groups from this highly heterogeneous data set for determining - without performing any real medical study and based on available data alone - whether a particular treatment had a positive (or negative) effect. According to preferred embodiments, the analysis of the data set is performed for determining whether the data set or a subset thereof allows determining the effectiveness of a particular drug. This allows determining if the data set or subset is suited for performing a database study on patient data that was collected under 'real-world' rather than "clinical study/controlled" conditions.

The "safety" of a treatment (or of a drug used in the treatment), also known as pharmacovigilance, is descriptive of the number and type of undesired side effects that the treatment (or respective drug) typically has or can have on the patient.

A "data set" as used herein is a collection of patient data. Preferably, a plurality of feature values of each of a plurality of patients is stored in the form of a respective data record. According to preferred embodiments, some or all of the patient data is collected during clinical routine and outside of a clinical study. In some embodiments, the patients represented by the data records of the data set comprise patients from many different clinics.

A "data record" as used herein is a data structure used for storing medical data of a particular patient. According to some embodiments, a data record is a collection of fields, possibly of different data types, typically in fixed number and sequence. For example, each data record can be stored in the form of a table line in a relational database. Alternatively, each data record can be stored as a separate file or as an element of a semi-structured document, e.g. an XML file, on a data storage medium.

A "treatment" as used herein is a medical treatment, i.e., an attempted remediation of a health problem, usually following a diagnosis. Typically, but not necessarily, a therapy involves the application of a particular drug, e.g. at defined concentrations and/or intervals, or the application of a combination of multiple drugs.

A "label" as used herein is a data value, e.g. a string or a Boolean or a numerical value, that represents and specifies a patient-related data value. In particular, the label can indicate whether a patient receives a particular treatment or not. For example, the label could be "true" for all patients that receive a particular treatment and "false" for all patients who don't.

A "feature value" as used herein is a data value, e.g. an alphanumeric string, a numerical value, a Boolean value or the like. For example, the feature value can be a patient-related data value and can be indicative of a health-related attribute of the patient, e.g. age, gender, BMI, comorbidities, metabolite concentrations in the blood or other body fluids, respiratory attributes, cardiovascular attributes and the like.

A "feature vector" as used herein is a data structure that is generated by a neural network from a set of feature values input to the neural network, whereby the generated feature vector is a dimensionally reduced representation of the feature values of the respective data record input to the neural network. In particular, the dimensional reduction may comprise automatically identifying and omitting features whose values strongly correlate with other features and hence are not required for characterizing the patient, in particular the health state of a patient. The feature vector can be, for example, a vector or array of data values output by the bottleneck layer of a neural network having learned to reconstruct the input feature values from the generated feature vector during the training phase. That the feature vector is a "dimensionally reduced representation of a set of feature values input to the neural network" means that the feature vector comprises less feature values than have been input into the neural network per patient during the training phase.

A "neural network" or "artificial neural network" as used herein is software composed of artificial neurons or nodes and being adapted - after a training phase - for solving artificial intelligence (AI) problems. The connections of the neurons are modeled as weights. An activation function controls the amplitude of the output.

An "autoencoder" as used herein is a type of artificial neural network used to learn efficient data codings in an unsupervised (or "self-supervised") manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. According to embodiments, an autoencoder is a software adapted to perform a data compression algorithm (the encoder) and a data decompression algorithm (the decoder) where the compression and decompression functions are 1) data-specific, 2) lossy, and 3) learned automatically from examples rather than engineered by a human. According to embodiments, the autoencoder is a deep, preferably fully-connected, autoencoder. According to other embodiments, the autoencoder is a 1D convolutional autoencoder or a sequence-to-sequence autoencoder. Examples of autoencoder implementations that can be used according to embodiments of the invention are described for example in the following publications: I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning", MIT Press, 2016; or Miotto, Riccardo, et al. "Deep patient: an unsupervised representation to predict the future of patients from the electronic health records", Scientific reports 6 (2016), 26094; or Shickel, Benjamin, et al. "Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis", IEEE journal of biomedical and health informatics 22.5 (2017), 1589-1604. A good introduction to autoencoders is provided on the homepage of the Stanford university: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/.
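The interplay of encoder, bottleneck and decoder can be illustrated with a deliberately tiny example. The sketch below is not one of the autoencoder architectures of the embodiments: it assumes a purely linear network with two input features, a single bottleneck value and no activation function, trained by plain gradient descent on made-up records whose second feature is always twice the first. All names and numbers are hypothetical.

```python
def encode(enc, x):
    """Bottleneck output: one value summarizing a two-feature record."""
    return enc[0] * x[0] + enc[1] * x[1]


def reconstruction_loss(enc, dec, records):
    """Mean squared error between the records and their reconstructions."""
    total = 0.0
    for x in records:
        z = encode(enc, x)
        total += sum((dec[j] * z - x[j]) ** 2 for j in range(2))
    return total / len(records)


def train_autoencoder(records, steps=500, lr=0.05):
    """Full-batch gradient descent on a linear 2 -> 1 -> 2 autoencoder."""
    enc = [0.1, 0.1]  # encoder weights (2 inputs -> 1 bottleneck value)
    dec = [0.1, 0.1]  # decoder weights (1 bottleneck value -> 2 outputs)
    n = len(records)
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in records:
            z = encode(enc, x)
            err = [dec[j] * z - x[j] for j in range(2)]  # reconstruction error
            dz = sum(2 * err[j] * dec[j] for j in range(2))  # gradient w.r.t. z
            for j in range(2):
                g_dec[j] += 2 * err[j] * z / n
                g_enc[j] += dz * x[j] / n
        enc = [enc[j] - lr * g_enc[j] for j in range(2)]
        dec = [dec[j] - lr * g_dec[j] for j in range(2)]
    return enc, dec


# Toy records: the second feature is fully determined by the first one.
records = [(t, 2 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
initial_loss = reconstruction_loss([0.1, 0.1], [0.1, 0.1], records)
enc, dec = train_autoencoder(records)
final_loss = reconstruction_loss(enc, dec, records)
feature_vectors = [encode(enc, x) for x in records]  # one value per record
```

Because the two input features are perfectly correlated, a single bottleneck value is enough to reconstruct both, which is the kind of redundancy a dimensionally reduced representation exploits.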

According to embodiments, the open-source neural-network library Keras is used for defining and training the autoencoder. Keras comprises several types of autoencoders and useful functions in the context of neural networks and machine learning. Keras is written in Python and is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML.

A "regression analysis of the feature vector and the label" as used herein is a process of automatically and statistically estimating the relationships among the feature values in the feature vector and the label indicating if the patient received the treatment. It can comprise analyzing the feature values of the feature vector and the value of the label for automatically predicting, or learning to predict, the propensity score from the feature vector. The propensity score indicates the likelihood of a patient for receiving a treatment and hence the likelihood of a data record for having assigned a label indicating that the patient receives the treatment. Regression analysis helps to understand and model how the typical value of the dependent variable (here: the value of the label) changes when any one of the values in the feature vector is varied, while the other feature values in the feature vector are held fixed.
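For the logistic regression analysis that the embodiments name as a particular case, the relationship between feature vector and label can be written compactly. The notation is an assumption introduced here for illustration only: z_i denotes the feature vector of data record i, T_i its label (1 if the treatment was received, 0 otherwise), and beta_0, beta the regression coefficients estimated from the data.

```latex
e(z_i) \;=\; P(T_i = 1 \mid z_i) \;=\; \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta^{\top} z_i)\bigr)},
\qquad
\log\frac{e(z_i)}{1 - e(z_i)} \;=\; \beta_0 + \beta^{\top} z_i
```

In this form the propensity score e(z_i) lies between 0 and 1, and each coefficient describes how the log-odds of receiving the treatment change when one value of the feature vector is varied while the others are held fixed.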

A "machine learning program (MLP)" as used herein is a software program or other piece of software that has been or that can be trained in a training process and that - as a result of the training process - has learned to perform some predictive and/or data processing tasks based on the provided training data. Thus, an MLP can be a program code that is at least partially not explicitly specified by a programmer, but that is implicitly learned and modified in a data-driven learning process that builds one or more implicit or explicit models from sample inputs. Machine learning may employ supervised or unsupervised learning. Effective machine learning is often difficult because finding patterns is hard and often not enough training data are available. Common examples of MLPs are neuronal networks and support vector machines.

A "rectifier function" as used herein is a type of activation function often termed "ReLU", whereby ReLU stands for Rectified Linear Units. It is an activation function defined as the positive part of its argument: f(x) = x⁺ = max(0, x) where x is the input to a neuron. Despite its name and appearance, it is not linear and has been observed to provide better performance and learning speed than sigmoidal functions. In contrast, a sigmoid activation function takes a real value as input and outputs another value between 0 and 1. It is non-linear, continuously differentiable, monotonic, and has a fixed output range.
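The two activation functions can be written down in a few lines. This is a minimal sketch in plain Python; the function names are illustrative, and the sigmoid is evaluated in a numerically stable form to avoid overflow for large negative inputs.

```python
import math


def relu(x):
    """Rectifier: the positive part of the argument, max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: maps any real value into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids computing exp of a large positive number
    return e / (1.0 + e)
```

For example, `relu(-2.0)` yields 0.0 and `relu(3.5)` yields 3.5, whereas `sigmoid(0.0)` yields 0.5.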

Brief description of the drawings

In the following, embodiments of the invention are explained in greater detail, by way of example only, making reference to the drawings in which:

Figure 1 depicts a flowchart of a method according to an embodiment of the invention;

Figure 2 depicts a dataset comprising a plurality of data records respectively representing a "treated" or a "control" patient and having assigned a propensity score, whereby treated and non-treated patients having a highly similar propensity score are computationally grouped into a pair;

Figure 3 depicts the paired "treated" and "control" data records, the totality of paired data records representing a treated patient forming a "treated data record group", the totality of paired data records representing a "control" patient forming a "control data record group";

Figure 4 is a block diagram of a computer system according to an embodiment of the invention;

Figure 5 depicts a network architecture of an autoencoder program adapted to compute a compressed representation of a digital image;

Figure 6 depicts a network architecture of an autoencoder program being adapted to compute a feature vector used as a compressed representation of the feature values of a patient;

Figure 7 depicts a network architecture of an autoencoder program according to figure 6 comprising 11 layers;

Figure 8 depicts the propensity score distributions of the "treated" and "control" data records in the data set (empirical, not virtual groups);

Figure 9 depicts the standardized mean differences (SMD) for feature values of the data records representing "treated" and "control" patients (non-treated or "otherwise treated" patients) in the totality of patients in the data set ("empirical" patient groups);

Figure 10 depicts the D-statistics according to Kolmogorov-Smirnov of the propensity score distributions of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score;

Figure 11 depicts the D-statistics according to Kolmogorov-Smirnov of the propensity score distributions of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score, wherein a different similarity threshold for creating the virtual patient groups is used;

Figure 12 depicts the D-statistics according to Kolmogorov-Smirnov of the propensity score distributions of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score, wherein a different similarity threshold for creating the virtual patient groups is used;

Figure 13 depicts the D-statistics according to Kolmogorov-Smirnov of the propensity score distributions of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score, wherein a different similarity threshold for creating the virtual patient groups is used;

Figure 14 depicts the baseline characteristics of the feature value distributions for a list of features of the data records of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score, wherein a particular similarity threshold for creating the virtual patient groups is used;

Figure 15 depicts the baseline characteristics of the feature value distributions for a list of features of the data records of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score, wherein a different similarity threshold for creating the virtual patient groups is used;

Figure 16 depicts the baseline characteristics of the feature value distributions for a list of features of the data records of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score, wherein a different similarity threshold for creating the virtual patient groups is used;

Figure 17 depicts the standardized mean differences of a list of features of the data records representing "treated" and "control" (non-paired) patients in the subset of "advanced NSCLC patients" in the data set;

Figure 18 is a Kaplan-Meier plot showing the survival probabilities of two different treatment arms of patients for the time after initiation of (second line) treatment obtained by comparing the virtual "treated" and "control" groups;

Figure 19 shows changes in the propensity score and feature value distributions of treated vs. "control" patients before and after the PS-based pairing, wherein the propensity scores are computed based on manually selected features;

Figure 20 shows changes in the propensity score and feature value distributions of treated vs. "control" patients before and after the PS-based pairing, wherein the propensity scores are computed based on features selected using PCA;

Figure 21 shows changes in the propensity score and feature value distributions of treated vs. "control" patients before and after the PS-based pairing, wherein the propensity scores are computed based on features selected using a 5- layered autoencoder;

Figure 22 shows training and validation losses observed in autoencoders comprising 5, 7, 9 and 11 layers.

Figure 1 depicts a flowchart of a method according to an embodiment of the invention. In the following, the method will be described for various embodiments and examples making reference also to an exemplary system architecture depicted in figure 4 and to the illustrations of patient data shown in figures 2 and 3.

The method can be used e.g. for assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment.

First in step 102, a data set 200 comprising a plurality of data records 214, 218, 222, 224 is provided. Each data record represents a patient and comprises a plurality of patient-related feature values and a label indicating whether the patient has received a particular medical treatment. For example, the data can be patient data for many thousands, tens of thousands or hundreds of thousands of patients. The data may have been collected over a period of several years by different doctors and clinics in different cities and countries. Patients may be of different ages and sex, may have different diagnoses, and may have received different medications or treatment regimens. Even with the same diagnosis, the type of treatment received can vary greatly, as treatment depends on the judgement of the doctor, guidelines from clinics and health insurers, and last but not least, on the drugs available and approved at a given time. The data may include data from clinical trials, but typically the data consist entirely or partially of data from the field of clinical or medical routine. Hence, due to its size, the data set may be of interest for medical practitioners and the pharmaceutical industry as a candidate data set for performing a database study in order to identify "virtual" patient groups that have a similar health status at a particular time (before a treatment is applied) and that allow - by comparing the health status of the patient group at a later time (when or after receiving a particular treatment) with those patients who did not - to determine in hindsight if the treatment was effective and/or safe. The problem of those "virtual", purely data-driven medical studies is that they are not based on an explicit, controlled creation of real patient groups (treatment and control) having a similar health status before the treatment is started. 
Hence, embodiments of the invention try to computationally identify - in hindsight - "virtual" patient groups in the data set whose health state is similar and hence allows treatment effect evaluation based on a comparison of those "virtual" treatment and control groups.

For example, each data record can correspond to a respective patient and comprise a plurality of patient- and health-related feature values such as age, sex, BMI, body temperature, blood pressure, information regarding genetic markers, metabolite concentrations in the blood or other body fluids or the like. Preferably, for each feature like "body temperature", a plurality of feature values is comprised in the data record, whereby each feature value has assigned a timestamp being indicative of the time when the feature was observed or measured. Preferably, only feature values having a timestamp of a time before a patient received a treatment of interest (if any) are used for training the neural network and computing a propensity score as described below. In case it was possible to create virtual patient groups which are sufficiently similar to perform a database study for the treatment of interest, preferably only those feature values having a timestamp of a time during or after a patient received the treatment of interest (if any) are used for assessing the efficacy, effectiveness and/or safety of the treatment of interest.
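A data record with timestamped feature values, and the selection of the values observed before the treatment started, might be represented as in the following sketch. The class and field names are hypothetical, and the handling of "control" patients (all observations are treated as baseline because no start date of the treatment of interest exists) is a simplifying assumption rather than something specified in this text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Observation:
    feature: str       # e.g. "body temperature"
    value: float
    observed_on: date  # timestamp of the measurement


@dataclass
class DataRecord:
    patient_id: str
    observations: List[Observation]
    treatment_start: Optional[date]  # None: treatment of interest not received

    @property
    def label(self) -> bool:
        """True if the patient received the treatment of interest."""
        return self.treatment_start is not None


def baseline_observations(record: DataRecord) -> List[Observation]:
    """Observations made strictly before the treatment of interest started."""
    if record.treatment_start is None:
        return list(record.observations)  # assumption: no cut-off for controls
    return [o for o in record.observations
            if o.observed_on < record.treatment_start]


# Made-up record: one measurement before and one after the treatment start.
record = DataRecord(
    patient_id="P1",
    observations=[
        Observation("body temperature", 36.8, date(2019, 1, 10)),
        Observation("body temperature", 38.1, date(2019, 3, 1)),
    ],
    treatment_start=date(2019, 2, 1),
)
baseline = baseline_observations(record)
```

Only the January measurement would be passed to the neural network and the propensity score computation in this example; the March measurement would be reserved for a later outcome assessment.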

According to one example, the data set comprises health-related data of 160,000 patients, 57,000 of whom are lung cancer patients. Of these lung cancer patients, 10,000 have received immunotherapy as second line treatment while the other 47,000 patients did not receive immunotherapy as second line therapy. In order to assess the efficacy, effectiveness and/or safety of immunotherapy as second line therapy, the data records of the 57,000 lung cancer patients can be used as "the data set" in all subsequent processing steps.

Preferably, the data records of the data set are stored in a relational database such as MySQL, PostgreSQL or Oracle as relational database management systems allow an efficient retrieval and processing of a large number of data records. However, it is also possible to store the data set in the form of a plurality of files or any other suitable data structure or format.

Next in step 104, a neural network 410, preferably an autoencoder, is trained on the data set, e.g. the 57,000 lung cancer patient records, in order to generate a trained neural network that is capable of computing, in response to receiving a plurality of patient related feature values as input, a feature vector being a dimensionally reduced representation of the feature values having been input to the trained neural network. Suitable autoencoder architectures and training parameters are described, for example, with reference to figures 5, 6 and 7 in greater detail.

Next in step 106, at least a subset of the data set (57,000 lung cancer patients), e.g. the data records of one thousand randomly chosen lung cancer patients, is selected in order to assess whether this subset of patients comprises and allows computationally identifying "virtual" groups of "treated" and "control" lung cancer patients that are highly similar and hence comparable in respect to all relevant health status related aspects. The "control" group can consist of patients having received a different treatment, e.g. a treatment with docetaxel. This is a prerequisite to be able to statistically assess whether or not immunotherapy has a positive or negative effect, if at all, on patient health and survival rates. The size of the subset strongly depends on the size of the available data set, the heterogeneity of the data set and other factors, e.g. the expected strength of the treatment effect. In some embodiments, the subset of data records selected for further examination can comprise the totality of data records in the original data set or can comprise only a small fraction of the data set, e.g. 5% or 1%. In the following, the further processing of the randomly selected subset of 1000 data records will be described in greater detail, bearing in mind that in other embodiments, the size of the subset may be significantly smaller or larger and may even comprise the total original data set. Typically, the selection is performed randomly and automatically, but it is also possible that a user selects the subset, whereby the user or the selection algorithm has to ensure that the selected subset comprises multiple data records of both treated and "control" patients.
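A random selection that guarantees multiple records of both groups can be sketched as follows. The function name, the retry strategy and the minimum of two records per group are assumptions made for this illustration; the text only requires that the subset comprise multiple treated and multiple "control" records.

```python
import random


def select_subset(labels, size, seed=None, max_tries=100):
    """Draw a random subset containing at least two treated and two control records.

    labels -- dict: record id -> True if the patient received the treatment
    size   -- number of data records to select
    """
    rng = random.Random(seed)
    ids = sorted(labels)
    for _ in range(max_tries):
        subset = rng.sample(ids, size)
        n_treated = sum(1 for i in subset if labels[i])
        if n_treated >= 2 and size - n_treated >= 2:
            return subset
    raise ValueError("could not draw a subset containing both patient groups")


# Made-up data set: 100 records, every fifth one labelled as treated.
labels = {record_id: record_id % 5 == 0 for record_id in range(100)}
subset = select_subset(labels, size=20, seed=1)
```

A stratified draw that samples a fixed number of records from each group separately would be an alternative design that avoids the retry loop.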

The trained neural network 410 computes a feature vector for each of the patient records. This is performed by inputting the feature values of a data record into the trained neural network. The trained neural network is configured to compute a feature vector as a dimensionally reduced representation of the feature values of the data record having been input to the trained neural network. Preferably, in case of data records of patients having received the treatment of interest, selectively those feature values are input into the neural network that have been obtained before this treatment started. The feature vector generated by the neural network comprises a smaller number of features than have been input to the neural network. Nevertheless, thanks to the training phase, the feature values in the feature vector are highly characteristic of the health status of a particular patient. It should also be mentioned that the label indicating whether or not a patient received a particular treatment is not considered to be a "feature value" and is not input to the neural network (neither in the training phase nor in the test/production phase) in order to ensure that the neural network learns to compute the dimensionally reduced feature vector being characteristic of the health status of a patient irrespective of the decision of a medical expert to provide the treatment, because this decision is based on subjective factors and an unknown set of patient related feature values and hence the consideration of the "treatment"-label could reduce the predictive power of the feature vector.

Next in step 108, the feature vectors computed by the neural network 410 are provided to a different software or software module 412 which is configured to compute a propensity score for each of the feature vectors (and for the respective data records and patients) using regression analysis. The regression analysis software 412 receives the feature vectors and the labels of the said data records as input and performs a regression analysis, in particular a logistic regression analysis, in order to determine how the values in the feature vector correlate with/have an impact on the likelihood that a particular patient record has assigned a "treatment received" label. During the regression analysis, a mathematical model is computed that allows computing a propensity score of a patient from a feature vector provided as input, whereby the features of the feature vector correspond to the features identified during the training phase of the neural network as being particularly characteristic of the health status of a patient. As mentioned above, a propensity score is a numerical value that is indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment. The way the propensity score is computed according to embodiments of the invention ensures that the propensity score is able to accurately represent this likelihood for a particular data set under examination.
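Merely as an illustrative sketch of step 108, a logistic regression of the treatment label on the feature vectors can be fitted by plain gradient descent as shown below. The function names, the learning rate, the number of epochs and the two-dimensional toy feature vectors are assumptions of this sketch; an actual embodiment may use any regression library instead.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(vectors, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(vectors), len(vectors[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def propensity_scores(vectors, w, b):
    """Propensity score = modelled probability of the "treatment received" label."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in vectors]


# Hypothetical bottleneck feature vectors and labels (1 = treated, 0 = control).
vectors = [[0.1, 0.9], [0.2, 0.8], [0.4, 0.6], [0.6, 0.4], [0.8, 0.2], [0.9, 0.1]]
labels = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic(vectors, labels)
scores = propensity_scores(vectors, w, b)
```

Each resulting score lies between 0 and 1 and can be assigned to the data record from which the respective feature vector was derived.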

Next in step 110, a further software function that can be implemented, for example, in a group selector software or software module 414, is used for identifying pairs of first and second data records having similar propensity scores, whereby the first data record of each pair represents a patient having received the medical treatment of interest (e.g. immunotherapy) and whereby the second data record of each pair represents a patient not having received the treatment of interest. Whether or not a particular patient has received the treatment of interest is indicated in the label, so the group building function can be executed quickly and involves analyzing the label and the propensity score assigned to and/or computed for a particular data record. Step 110 is illustrated by a combination of figures 2 and 3. Figure 2 shows a patient data set 200 or a subset thereof comprising a plurality of data records for which a propensity score has been computed. Data records having assigned a "treatment received" label are positioned to the left side of the block representing the data set 200 while data records having assigned a "no treatment received" label are positioned to the right side of the block. "Treated" and "control" data records (respectively representing a patient) are sorted according to their propensity score. The identification of first and second data record pairs can be implemented as follows: a first one of the "treated" data records is selected, for example the one 212 having the highest propensity score. Then, the group selector 414 identifies the one of the "control" data records having the most similar propensity score, e.g. data record 214. In addition, the group selector 414 checks whether the "distance" between the two data records 212, 214, which is computed as the difference of the propensity scores of the two data records, exceeds a predefined distance threshold 206. The threshold is indicated in figure 2 by a dotted box surrounding each data record having assigned the "treated" label. The threshold is only evaluated/valid in vertical direction, as this direction represents the propensity score value. In the example depicted in figure 2, the propensity score difference between data records 212 and 214 exceeds the threshold 206. Therefore, the two data records 212 and 214 are not paired.
The algorithm proceeds with selecting the next one of the data records having assigned a "treated" label and not having been selected previously, in this case data record 220. Again, the one of the "control" data records is identified that has the most similar propensity score to data record 220. The group selector 414 checks whether the distance between the two data records 220 and 218 exceeds the distance threshold 206. In this case, the threshold is not exceeded and the "treated" data record 220 and the "control" data record 218 are grouped into a pair of a first 220 and second 218 data record. The method continues until all data records having assigned a "treated" label have been processed. Hence, in case the data set 200 comprises 500 "treated" data records, the maximum theoretically possible number of data record pairs is 500. Typically, however, the number of pairs is lower, because in some cases the propensity score difference to the "nearest" "control" data record may exceed the threshold for pair selection 206. In figure 2, the totality 208 of treated patients and respective data records having been successfully paired is indicated in the form of a rhombus having a bold outline. The totality 210 of "control" patients/data records having been successfully paired is indicated in the form of circles having a bold outline. The totality of "treated", not-paired 202 and "control", not-paired 204 patients is indicated in the form of a circle or rhombus having a thin outline.

Next in step 112, a treated data record group 302 selectively comprising all first data records 208 of the identified data record pairs is automatically created.

In addition, in step 114, a control data record group 304 selectively comprising all second data records 210 of the identified data record pairs is automatically created.
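For illustration only, the pairing of step 110 and the group creation of steps 112 and 114 can be sketched as follows. The propensity score values assigned to the records 212 to 224 are invented for this example, and the removal of an already paired "control" record from further pairing (matching without replacement) is an assumption of the sketch, since the description leaves this point open.

```python
def pair_records(treated, control, threshold):
    """Greedy 1:1 pairing of treated and control records by propensity score.

    `treated` and `control` map a record identifier to its propensity score.
    """
    available = dict(control)
    pairs = []
    # Process treated records starting with the highest propensity score.
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        # Control record with the most similar propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= threshold:
            pairs.append((t_id, c_id))
            del available[c_id]  # assumption: matching without replacement
    return pairs


# Hypothetical propensity scores for the records shown in figure 2.
treated = {212: 0.95, 220: 0.60}
control = {214: 0.70, 218: 0.58, 222: 0.30, 224: 0.10}

pairs = pair_records(treated, control, threshold=0.1)
treated_group = [t for t, _ in pairs]   # corresponds to group 302
control_group = [c for _, c in pairs]   # corresponds to group 304
```

With these invented values, record 212 remains unpaired because its nearest "control" record 214 is further away than the threshold, whereas records 220 and 218 form a pair.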

The group 302 can be considered as a "virtual" treatment arm of a database study, i.e., a "virtual" group of patients having received or being selected for receiving a treatment of interest. The group 304 can be considered as a "virtual" control arm of a database study, i.e., a "virtual" group of patients not having received the treatment of interest. The two "virtual" patient groups 302, 304 derived from the data record pairs of figure 2 are depicted in figure 3.

Next in step 116, the treated data record group 302 and the control data record group 304 are statistically analyzed for determining if the treated data record group and the control data record group are similar. Different approaches for statistically determining group similarity or dissimilarity can be used (e.g. the mean of squared differences of the feature values of the different group members and/or Kolmogorov-Smirnov statistics of the propensity score distributions of the two groups). For example, the group similarity analysis can be performed by a further software function or a further software application, e.g. the group similarity analyzer 416 depicted in figure 4.
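As a non-limiting sketch of one of the similarity measures mentioned above, the two-sample Kolmogorov-Smirnov D-statistic of two propensity score samples can be computed as follows. The function name `ks_d_statistic`, the toy scores and the decision threshold `MAX_D` are hypothetical and chosen for this example only.

```python
def ks_d_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past all values <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical propensity scores of a virtual treated and a virtual control group.
treated_scores = [0.1, 0.4, 0.7]
control_scores = [0.2, 0.5, 0.9]
d_value = ks_d_statistic(treated_scores, control_scores)

MAX_D = 0.5  # hypothetical threshold: smaller D means more similar groups
groups_similar = d_value <= MAX_D
```

A D-value of 0 indicates identical empirical distributions, while a D-value of 1 indicates that the two samples do not overlap at all.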

In case the treated data record group 302 and the control data record group 304 are similar (e.g. in case a similarity score is computed that is larger than a minimum similarity threshold), in a further step 120 an output is generated indicating that the treated data record group and the control data record group are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment. For example, the output can be a text message displayed on a GUI to a user 420 via a display screen and/or an audio message or the like.

In case the treated data record group and the control data record group are dissimilar, in a further step 122 an output is generated indicating that the treated data record group and the control data record group are not suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment.

Figure 4 is a block diagram of a computer system 400 according to an embodiment of the invention. The system comprises one or more processors 402 and a storage medium 404, e.g. a volatile or non-volatile storage medium. The storage medium can be a combination of multiple local or remote storage media. The storage medium comprises a data set 200 comprising a plurality of data records comprising health-related data of patients. Some of the data records represent patients having received, at a particular moment in their life, a treatment of interest, e.g. immunotherapy as second line therapy for treating lung cancer.

The storage medium comprises a training and analysis framework 408 comprising a plurality of functions which are executed within the framework in an orchestrated manner such that the method according to embodiments of the invention, as illustrated and described for example with reference to figure 1, is executed. For example, the framework can comprise a neural network 410, in particular an autoencoder, having been trained on the data set 200. The framework can also comprise some functions required for or involved in the training process. For example, the training and analysis software framework can comprise some software routines, modules and/or applications for normalizing the feature values before they are input to the untrained or trained neural network.
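The description does not specify how the feature values are normalized. Purely as an example, and under the assumption that a min-max scaling to the range 0 to 1 is used (which fits a decoder with sigmoid outputs), such a normalization routine could look as follows; the function name `min_max_normalize` and the toy values are hypothetical.

```python
def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns are mapped to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


# Hypothetical raw feature values: one row per patient, one column per feature.
raw = [[50, 1.0, 7], [70, 3.0, 7], [60, 2.0, 7]]
normalized = min_max_normalize(raw)
```

Other schemes, e.g. standardization to zero mean and unit variance, are equally conceivable.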

Furthermore, the framework 408 can comprise, according to embodiments, a routine for randomly selecting a subset of the data set in order to compute propensity scores for the data records of the subset and for computationally assessing if this selected subset comprises virtual treatment and control groups of patients who are sufficiently similar.

According to some embodiments, the framework is configured to input each of the data records of the selected subset (to be more particular, the health related feature values of a patient represented by this data record (and in case of "treated patients": feature values having been obtained before this patient received the treatment)) into the trained neural network for obtaining a respective feature vector being a dimensionally reduced representation of the feature values having been input to the trained neural network. The framework is configured to forward the feature vectors computed by the trained neural network 410 together with the treated/control-labels of the respective data records to a regression analysis module or application program 412. For example, the module or application program performing the regression analysis can be a machine learning program or any other type of software that is capable of computing a regression analysis. For example, the regression analysis software or module 412 can be interoperable with a statistical analysis toolbox or library 418, e.g. with R. In the regression analysis, propensity scores are computed for each of the feature vectors output by the neural network.

The framework 408 is configured to forward the computed propensity scores together with the respective treated/control-labels and data records or data record identifiers to a group selector module or application program 414. The group selector 414 is configured to analyze the propensity scores and labels of the selected subset of data records and to automatically create pairs of "treated" and "control" data records fulfilling the condition that the propensity scores of the two compared data records are sufficiently similar, i.e. that their difference does not exceed a similarity threshold. For example, the propensity score differences could be normalized to a range of 0-1.0, wherein "0" represents identity of propensity scores and "1" means that the propensity scores are as dissimilar as they can be. In this case, the similarity threshold for pair building could be 0.1, wherein only data records whose propensity score difference is smaller than 0.1 are paired. When the group selector 414 has finished identifying pairs of data records having a sufficiently similar propensity score, the group selector is configured to create a treated data record group 302 from all data records having been included into one of the pairs and having assigned a "treated" label and to create a control data record group 304 from all data records having been included into one of the pairs and having assigned a "nontreated" label.

The two groups 302, 304 are forwarded by the framework 408 or otherwise provided to a group similarity analyzer 416. The group similarity analyzer 416 is configured to statistically analyze the treated group and the control group for determining if the two virtually created groups are sufficiently similar to allow performing a database-driven, statistical evaluation of the effects of a treatment. If the treated and the control groups are sufficiently similar, the analyzer 416 is configured to output, e.g. via the screen 406, a message that the two groups are suitable for accurately determining the efficacy, effectiveness and/or safety of a particular treatment. Otherwise, an output is generated and displayed or otherwise provided to the user 420 indicating that the two groups are not suited for such analysis. In this case, the user or the framework may select a different, not yet selected subset of the complete data set for computing the feature vectors and propensity scores for the data records of the different subset.

According to some embodiments, some or all of the modules or applications 410, 412, 414, 416 have access to or comprise a statistical analysis toolbox 418, e.g. R or a similar program or library.

Figure 4 shows only one possible example for a system architecture that is able to implement a method according to embodiments of the invention and as depicted, for example, in figure 1. For example, the logistic regressor 412 and the group selector 414 can be modules of a first application program while the group similarity analyzer is part of a second application program. It is also possible that the framework 408 is implemented as a standalone software application and the functionalities 410, 412, 414, 416 and 418 are merely sub-routines or sub-modules of this application program.

Figure 5 depicts a network architecture of an autoencoder 500 adapted to compute a compressed representation 512 of an original digital image 508 provided as input. The figure is a partial adaptation of a figure published online on https://blog.keras.io/building-autoencoders-in-keras.html. The autoencoder comprises an input layer 502, a bottleneck layer 504 comprising fewer nodes than the input layer, and an output layer 506. Typically, an autoencoder comprises additional layers (not shown). In functional terms, the autoencoder comprises an encoder part 510 adapted to compute a compressed representation 512 of an original input image 508 that nevertheless allows an accurate characterization of the original input. The autoencoder further comprises a decoder part 514 adapted to reconstruct the original image by computing a reconstructed input image 516 that is at least approximately identical or highly similar to the original input. During the training phase of the autoencoder, the autoencoder learns, in a self-supervised manner, to identify a small set of features to be represented in the bottleneck layer 504 which are sufficiently strongly predictive of the features of the original image that they allow the decoder part to reconstruct the input from the reduced set of features comprised in the bottleneck layer only. During the training phase, a loss function quantifying the difference between the reconstructed 516 and the original 508 image is minimized. Autoencoders are commonly used in the field of image compression and image processing but have hitherto not been used in the context of propensity score computation. Applicant has observed that the use of autoencoders allows computing feature vectors and propensity scores which provide a more accurate basis for determining if a data set allows performing a statistical assessment of the efficacy, effectiveness and/or safety of a treatment than state-of-the-art methods.

Figure 6 depicts a network architecture of an autoencoder 600 being adapted to compute a feature vector used as a compressed representation of the feature values of a patient according to an embodiment of the invention. The autoencoder comprises an input layer 602 comprising more than 300 nodes, one, two, three or four further encoding layers, a bottleneck layer 604 having e.g. 16, 32, 34 or 128 hidden units, one, two, three or four decoding layers 610, and an output layer 606 comprising more than 300 nodes. The output of the bottleneck layer of the trained neural network is used for performing a logistic regression with respect to the treated/control label to compute a propensity score. Preferably, the activation functions connecting the layers in the encoder part are ReLU functions while the activation functions connecting the layers in the decoder part are sigmoid functions. Applicant has observed that autoencoders having the above specified number of encoding and decoding layers and/or the above mentioned bottleneck layer size and/or activation functions are particularly suited for computing feature vectors which accurately characterize the health status of a patient.
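The data flow through such an autoencoder can be illustrated by the following untrained toy sketch. The layer sizes (8, 4, 2, 4, 8 nodes) are deliberately tiny and hypothetical, the weights are random, and all function names are assumptions of this example; the sketch merely shows ReLU activations in the encoder part, sigmoid activations in the decoder part, and the extraction of the bottleneck activations as feature vector.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def dense(layer, v):
    """Apply one fully connected layer given as (weights, biases)."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, biases)]


def build_autoencoder(sizes, seed=0):
    """`sizes` lists the node counts from input layer to output layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))  # untrained: random weights
    return layers


def encode(layers, sizes, v):
    """Run only the encoder part and return the bottleneck activations."""
    bottleneck_index = sizes.index(min(sizes))
    for layer in layers[:bottleneck_index]:
        v = relu(dense(layer, v))
    return v


def reconstruct(layers, sizes, v):
    """Run encoder and decoder and return the reconstructed input."""
    bottleneck_index = sizes.index(min(sizes))
    v = encode(layers, sizes, v)
    for layer in layers[bottleneck_index:]:
        v = sigmoid(dense(layer, v))
    return v


def mse(a, b):
    """Reconstruction loss that a training procedure would minimize."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


sizes = [8, 4, 2, 4, 8]                      # hypothetical toy architecture
layers = build_autoencoder(sizes)
features = [0.1 * i for i in range(8)]       # normalized feature values of one patient
feature_vector = encode(layers, sizes, features)
loss = mse(features, reconstruct(layers, sizes, features))
```

In an actual embodiment the weights would be obtained by minimizing the reconstruction loss over the data set, typically with a deep learning framework rather than with hand-written code.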

Figure 7 depicts a network architecture 700 of an autoencoder program according to figure 6 comprising 11 layers 702-722. The number of nodes in each layer is given in the form of the number below each bar representing one of the layers.

Figure 8 depicts a plot 800 depicting a propensity score distribution 806 of the "treated" data records and a propensity score distribution 804 of the "control" data records in the data set (before and irrespective of any data record pairing). The treated and non-treated patient groups are empirical groups as specified in the data set, not virtually computed groups. The vertical distance 802 indicates the Kolmogorov-Smirnov D-statistic being indicative of the maximum vertical deviation between the two propensity score distributions.

Figure 9 depicts a plot 900 showing the standardized mean differences of a list of features of the data records representing "treated" and "control" (non-paired) patients in the totality of patients in the data set. "Non-paired" means that no pairing is considered. All patients in the data set having assigned the "treated" label and their respective feature values are compared with all patients in the data set having assigned a "control" label and their respective feature values. So this plot indicates the similarity of the "empirical" treated and "control" patient groups. A "control" label may indicate that the patient was not treated with the treatment of interest, e.g. immunotherapy, or that the patient was treated with another therapy, e.g. docetaxel.

The list of features comprises features such as age, various metabolite concentrations and other biomarkers such as albumin, ALP, ALT, ST, Bilirubin, but also features like the BMI, gender, heart rate and the like.

The plot reveals that the empirical treated and control patient groups are highly dissimilar in respect to features such as "Albumin", "Gender" and "Lymphocytes" but are similar in respect to features like "ALP", "Calcium" and "LDH". The vertical line at 0.1 can be regarded as a similarity threshold, wherein a feature positioned to the left of the line is considered to be "similar" and a feature positioned to the right of the line is considered to be "dissimilar" in the two compared empirical groups. As many features are located on the right side of the 0.1 SMD similarity threshold, plot 900 indicates that the empirical "treated" and "control" patient groups are quite dissimilar and do as such not allow a valid statistical assessment of the efficacy, effectiveness and/or safety of a treatment. The variable "SMD" stands for the standardized mean differences of the treated and control patients in the data set (empirical rather than virtual patient groups).

Figure 10 depicts the D-statistics according to Kolmogorov-Smirnov of the propensity score distributions of a "treated" and a "control" group created based on six alternative approaches of computing the propensity score. In this case, the "treated" and "control" groups are "virtual" patient groups obtained by pairing data records based on the propensity score and grouping the paired data records. The smaller the D-value, the more similar the two virtual groups. The higher the similarity of the virtual patient groups, the higher the quality of any statistical, database based evaluation of the efficacy, effectiveness or safety of a particular treatment performed on these two virtual groups. Figure 10 is used for illustrating the quality of the propensity score computed based on feature values or feature vectors having been selected in accordance with six different approaches. It illustrates the similarity of virtual "treated" and "control" groups obtained for a plurality of different data record subsets of an original data set. Each individual value/symbol in the plot represents a D-value of a D-statistic according to Kolmogorov-Smirnov used for comparing the propensity score distributions of virtual "treated" and "control" groups obtained for a particular subset.

In order to identify the virtual "treated" and "control" groups, a similarity threshold based pairing of data records was performed as described with reference to figures 2 and 3, whereby a similarity threshold 206 was used that is equal to 0.2 standard deviations of the propensity score logit.
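A threshold expressed in standard deviations of the propensity score logit can, by way of example, be derived as sketched below. The use of the sample standard deviation over all scores and the comparison of the scores on the logit scale are assumptions of this sketch, as are the function names and toy scores.

```python
import math
import statistics


def logit(p):
    """Log-odds of a propensity score p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def caliper_from_logit_sd(scores, factor=0.2):
    """Return `factor` times the standard deviation of the logit-transformed scores."""
    return factor * statistics.stdev([logit(p) for p in scores])


def within_caliper(ps_treated, ps_control, caliper):
    # Assumption: the distance is measured on the logit scale.
    return abs(logit(ps_treated) - logit(ps_control)) <= caliper


scores = [0.2, 0.5, 0.8]  # hypothetical propensity scores
threshold_206 = caliper_from_logit_sd(scores, factor=0.2)
```

The same helper yields the threshold used for figure 11 when called with `factor=0.55`.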

The D-values of column 950 indicate virtual treated/control group differences in respect to the propensity score having been computed based on a manual feature selection.

The D-values of column 952 indicate virtual treated/control group differences in respect to the propensity score having been computed based on patient features selected using a principal component analysis (PCA).

The D-values of column 954 indicate virtual treated/control group differences in respect to the propensity score having been computed based on a feature selection/reduction performed by a 5-layer autoencoder.

The D-values of column 956 indicate virtual treated/control group differences in respect to the propensity score having been computed based on a feature selection/reduction performed by a 7-layer autoencoder.

The D-values of column 958 indicate virtual treated/control group differences in respect to the propensity score having been computed based on a feature selection/reduction performed by a 9-layer autoencoder.

The D-values of column 960 indicate virtual treated/control group differences in respect to the propensity score having been computed based on a feature selection/reduction performed by an 11-layer autoencoder.

As can be inferred from figure 10, all four autoencoders provide better propensity scores than the manual feature selection and PCA based approaches. Better propensity scores allow creating virtual patient groups that are more similar and hence comparable with each other. Furthermore, the plot indicates that the autoencoder comprising 11 layers performs best.

Figure 11 basically depicts the same plot as figure 10, the only difference being that a similarity threshold 206 was used that is equal to 0.55 standard deviations of the propensity score logit.

Figure 12 basically depicts the same plot as figure 10, the only difference being that a similarity threshold 206 was used that is equal to threshold value "5".

Figure 13 basically depicts the same plot as figure 10, the only difference being that a similarity threshold 206 was used that is equal to threshold value "10".

Hence, it can be inferred from figures 10-13 that the autoencoder-based approach consistently and reproducibly outperforms manual feature selection and PCA based approaches and is quite robust in respect to the use of different similarity thresholds during pairing.

Figure 14 depicts the baseline characteristics of the feature value distributions for a list of features of the data records of a "treated" and a "control" group created based on the above described six alternative approaches of selecting features for computing the propensity score. The horizontal axis represents the standardized mean differences (SMD) for feature values of the virtually created treated and control group. The higher the fraction of features whose respective symbol is positioned to the left of the vertical threshold line, the higher the similarity of the patients in the virtual control and treatment groups and the better the quality of any statistical analysis performed on the virtual control and treatment groups for assessing the efficacy, effectiveness and/or safety of a treatment. Figure 14 illustrates the similarities of virtual treatment and control groups for a list of features, whereby the pairing step illustrated in figure 2 for obtaining the virtual groups was performed using a similarity threshold 206 of 0.2 standard deviations of the propensity score logit.
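For illustration, the standardized mean difference of a feature and the fraction of features lying below the 0.1 threshold line can be computed as sketched below. The description does not give the SMD formula; the sketch assumes the commonly used definition based on the pooled standard deviation, and the feature names and values are invented.

```python
import math
import statistics


def smd(treated_values, control_values):
    """Absolute standardized mean difference using the pooled standard deviation."""
    mean_t = statistics.mean(treated_values)
    mean_c = statistics.mean(control_values)
    var_t = statistics.variance(treated_values)
    var_c = statistics.variance(control_values)
    pooled_sd = math.sqrt((var_t + var_c) / 2.0)
    if pooled_sd == 0:
        # No spread in either group: identical means are perfectly balanced.
        return 0.0 if mean_t == mean_c else math.inf
    return abs(mean_t - mean_c) / pooled_sd


def fraction_balanced(treated, control, threshold=0.1):
    """Return the fraction of features with SMD below `threshold` and all SMDs.

    `treated` and `control` map a feature name to the list of its values.
    """
    smds = {name: smd(treated[name], control[name]) for name in treated}
    balanced = [name for name, value in smds.items() if value < threshold]
    return len(balanced) / len(smds), smds


# Hypothetical feature values of a virtual treated and a virtual control group.
treated = {"age": [60, 62, 64], "albumin": [30, 32, 34]}
control = {"age": [60, 62, 64], "albumin": [40, 42, 44]}
fraction, smds = fraction_balanced(treated, control)
```

In this invented example only one of the two features lies to the left of the threshold line, so that half of the features would be regarded as "similar".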

Figure 15 basically depicts the same plot as figure 14, the only difference being that a similarity threshold 206 was used that is 0.55 standard deviations of the propensity score logit.

Figure 16 basically depicts the same plot as figure 14, the only difference being that a similarity threshold 206 was used that is equal to threshold value "10".

Figure 17 depicts a plot 1700 showing the standardized mean differences of a list of features of the data records representing "treated" and "control" (non-paired) patients in the subset of "advanced NSCLC patients" in the data set. A comparison of this plot with plot 900 of figure 9 reveals that the empirical treated and nontreated patient groups are more similar if just the subgroup of NSCLC patients is considered. Nevertheless, the dissimilarity of the empirical groups is still quite high.

Figure 18 shows Kaplan-Meier plots of the survival probabilities of two different "virtual" treatment arms of patients after the initiation of second line treatment, for patient groups receiving docetaxel ("control" group) or cancer immunotherapy ("treated" group). The curve 1802 represents the virtual "treated" group having received immunotherapy as second line treatment. The curve 1804 represents the virtual control group having received a docetaxel treatment as second line treatment. This analysis is performed on virtual treated and control groups having been computationally derived from a data set 200 according to embodiments of the invention. This means that the virtual "treated" and "control" patient groups were created based on the identification of pairs of "treated" and "otherwise treated=control" patients and respective data records having the most similar propensity score and the creation of a virtual "treated" and a virtual "control" group by respectively combining first and second members of each of the pairs. This analysis is performed only in case the similarity of the virtually created "treated" and "control" groups exceeds a minimum similarity threshold.
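Survival curves of the kind shown in figure 18 are obtained with the Kaplan-Meier estimator, which can be sketched as follows. The function name `kaplan_meier` and the follow-up times and event flags are hypothetical; an actual analysis would typically rely on an established survival analysis library.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] for each time with an observed event.

    `events[i]` is True if the event (e.g. death) was observed for patient i
    and False if the patient was censored at `durations[i]`.
    """
    survival = 1.0
    curve = []
    at_risk = len(durations)
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients leave the risk set
    return curve


# Hypothetical follow-up times (months) of one virtual treatment arm.
durations = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
```

Computing one such curve for the virtual "treated" group and one for the virtual "control" group yields the two step functions that are compared in the plot.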

Figure 19 shows six plots illustrating changes in the propensity score and feature value distributions of "treated" vs. "control" patients before and after the PS-based pairing, wherein the propensity scores are computed based on manually selected features.

The upper three plots respectively show propensity score distributions (1902), box plots of propensity scores (1904) and feature value SMDs (1906) of empirical treatment and control group as given in the form of data records of the data set.

The lower three plots respectively show propensity score distributions (1908), box plots of propensity scores (1910) and feature value SMDs (1912) of computed virtual treatment and control groups, whereby for the creation of the virtual patient groups, a propensity score based pairing of patient records was performed as illustrated in figures 2 and 3, and whereby the features used for computing the propensity scores were selected using a manual feature selection approach.

A comparison of the upper three plots indicating dissimilarities of empirical patient groups with the lower three plots indicating dissimilarities of virtual patient groups reveals that the manual feature selection for computing propensity scores in the construction of propensity score based virtual groups can drastically reduce the dissimilarity of groups. Nevertheless, the virtual treated and control data record groups are still recognizably different.

Figure 20 shows six plots illustrating changes in the propensity score and feature value distributions of "treated" vs. "control" patients before and after the PS-based pairing, wherein the propensity scores are computed based on features selected using PCA.

The upper three plots respectively show propensity score distributions (2002), box plots of propensity scores (2004) and feature value SMDs (2006) of empirical treatment and control group as given in the form of data records of the data set.

The lower three plots respectively show propensity score distributions (2008), box plots of propensity scores (2010) and feature value SMDs (2012) of computed virtual treatment and control groups, whereby for the creation of the virtual patient groups, a propensity score based pairing of patient records was performed as illustrated in figures 2 and 3, and whereby the features used for computing the propensity scores were selected using a PCA based approach.

A comparison of the upper three plots indicating dissimilarities of empirical patient groups with the lower three plots indicating dissimilarities of virtual patient groups reveals that the PCA-based feature selection for computing propensity scores in the construction of propensity score based virtual groups can drastically reduce the dissimilarity of groups. Nevertheless, the virtual treated and control data record groups are still recognizably different.

Figure 21 shows six plots illustrating changes in the propensity score and feature value distributions of "treated" vs. "control" patients before and after the PS-based pairing, wherein the propensity scores are computed based on features selected using a 5-layer autoencoder.

The upper three plots respectively show propensity score distributions (2102), box plots of propensity scores (2104) and feature value SMDs (2106) of empirical treatment and control group as given in the form of data records of the data set.

The lower three plots respectively show propensity score distributions (2108), box plots of propensity scores (2110) and feature value SMDs (2112) of computed virtual treatment and control groups, whereby for the creation of the virtual patient groups, a propensity score based pairing of patient records was performed as illustrated in figures 2 and 3, and whereby the features used for computing the propensity scores were selected using a 5-layer autoencoder.

A comparison of the upper three plots indicating dissimilarities of empirical patient groups with the lower three plots indicating dissimilarities of virtual patient groups reveals that the use of a 5-layer autoencoder for computing propensity scores in the construction of propensity score based virtual groups can drastically reduce the dissimilarity of groups. The virtual treated and control data record groups appear to be basically identical. Hence, a comparison of the plots in figure 21 with the plots in figures 19 and 20 reveals that the 5-layer autoencoder outperforms manual feature selection and PCA.

Figure 22 comprises four plots 2202-2208 showing training and validation losses observed in autoencoders comprising 5, 7, 9 and 11 layers. The plot 2208 may be interpreted to reveal some minor overfitting effects. The plots reveal that the losses are quickly minimized during training and that already 25 epochs in the case of the 5-layer autoencoder or 50 epochs in the case of the other autoencoders may be sufficient to train the autoencoder. Nevertheless, the training typically comprises more than 200 epochs, e.g. 250 epochs.

List of reference numerals

100 method

102-122 steps

200 patient data set

202 "treated" patient (and corresponding data record), not assigned to a virtual group

204 "control" patient (and corresponding data record), not assigned to a virtual group

206 similarity threshold for identifying patients having similar propensity scores

208 treated patient (and corresponding data record) assigned to the "treated" group

210 nontreated patient (and corresponding data record) assigned to the "control" group

212 treated patient (data record)

214 nontreated patient (data record)

216 similarity threshold applied on 212

218 nontreated patient (data record)

220 treated patient (data record)

222 nontreated patient (data record)

224 nontreated patient (data record)

302 virtually created "treated" data record group

304 virtually created "control" data record group

400 computer system

402 processor(s)

404 storage medium

406 display with graphical user interface

408 training and analysis software framework

410 neural network

412 logistic regressor

414 group selector

416 group similarity analyzer

418 statistical analysis toolbox

420 user

422 feature values after/during treatment

500 autoencoder

502 input layer

504 bottleneck layer

506 output layer

508 input image

510 encoding part

512 compressed representation of the original input image

514 decoding part

516 reconstructed input image

600 autoencoder architecture

602 input layer

604 bottleneck layer

606 output layer

608 one or more layers of encoder part

610 one or more layers of decoder part

700 autoencoder architecture

702 input layer

708-714 layers of encoder part

704 bottleneck layer

716-722 layers of decoder part

706 output layer

800 plot showing two propensity score distributions

802 deviation computed according to Kolmogorov-Smirnov statistics

804 propensity score distribution of the nontreated virtual group

806 propensity score distribution of the treated patient group

900 plot showing feature value differences of "treated" and "control" patients (not assigned to any virtual group)

950 manual selection of features and feature values

952 principal component analysis (PCA)

954 5-layer autoencoder

956 7-layer autoencoder

958 9-layer autoencoder

960 11-layer autoencoder

1700 plot showing feature value differences of virtual "treated" and "control" patient groups

1802 Kaplan-Meier curve of NSCLC patients receiving immune therapy

1804 Kaplan-Meier curve of NSCLC patients receiving docetaxel therapy

1902 PS distribution of original "treated" and "control" patients, PS based on manual feature selection

1904 box plot of PSs of original "treated" and "control" patients, PS based on manual feature selection

1906 SMD of feature values of original "treated" and "control" patients, PS based on manual feature selection

1908 PS distribution of virtual "treated" and "control" patients, PS based on manual feature selection

1910 box plot of PSs of virtual "treated" and "control" patients, PS based on manual feature selection

1912 SMD of virtual "treated" and "control" patients, PS based on manual feature selection

2002 PS distribution of original "treated" and "control" patients, PS based on PCA

2004 box plot of PSs of original "treated" and "control" patients, PS based on PCA

2006 SMD of feature values of original "treated" and "control" patients, PS based on PCA

2008 PS distribution of virtual "treated" and "control" patients, PS based on PCA

2010 box plot of PSs of virtual "treated" and "control" patients, PS based on PCA

2012 SMD of virtual "treated" and "control" patients, PS based on PCA

2102 PS distribution of original "treated" and "control" patients, PS based on 5-layer autoencoder

2104 box plot of PSs of original "treated" and "control" patients, PS based on 5-layer autoencoder

2106 SMD of feature values of original "treated" and "control" patients, PS based on 5-layer autoencoder

2108 PS distribution of virtual "treated" and "control" patients, PS based on 5-layer autoencoder

2110 box plot of PSs of virtual "treated" and "control" patients, PS based on 5-layer autoencoder

2112 SMD of virtual "treated" and "control" patients, PS based on 5-layer autoencoder

2202 training and validation loss of 5-layer autoencoder

2204 training and validation loss of 7-layer autoencoder

2206 training and validation loss of 9-layer autoencoder

2208 training and validation loss of 11-layer autoencoder

Claims


1. A computer-implemented method for determining the efficacy, effectiveness and/or safety of a medical treatment retrospectively, the method comprising:

- providing (102) a data set (200) comprising a plurality of data records (214, 218, 222, 224), each data record representing a patient and comprising a plurality of patient-related feature values being indicative of the patient's health status and a label indicating whether the patient has received a particular medical treatment;

- training (104) a neural network (410) on the data set such that a trained neural network is provided, the trained neural network having learned to compute from the feature values of a patient provided as input a dimensionally reduced representation of the feature values and the health status of the patient;

- computing (106), by the trained neural network, a feature vector for each of at least a subset of the data records of the data set, each feature vector being a dimensionally reduced representation of the feature values of the respective data record input to the trained neural network;

- computing (108) a propensity score for each of the data records as a function of the computed feature vector of the data record, the computation of the propensity score comprising performing a regression analysis of the feature vector and the label of the said data record, the propensity score being indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment;

- applying a data filter on the data set for identifying a treated data record group (302) representing patients having received the medical treatment and a control data record group (304) representing patients not having received the medical treatment, wherein the two groups comprise patients of identical or similar health status, the applying of the data filter comprising:

o identifying (110) pairs of first and second data records having similar propensity scores, wherein the first data record of each pair represents a patient having received the particular medical treatment, wherein the second data record in each pair represents a patient not having received the particular medical treatment;

o creating (112) a treated data record group (302) selectively comprising all first data records of the identified data record pairs;

o creating (114) a control data record group (304) selectively comprising all second data records of the identified data record pairs;

- statistically analyzing (116) the treated data record group and the control data record group obtained in the filtering step for determining if the health status of the patients in the treated data record group and of the patients in the control data record group are similar;

- if the treated data record group and the control data record group are similar, performing a comparative statistical analysis of the health status of the patients of the treated data record group after receiving the treatment and of the health status of the patients of the control data record group for accurately determining the efficacy, effectiveness and/or safety of the medical treatment; and

- if the treated data record group and the control data record group are dissimilar, outputting a warning not to perform a comparative statistical analysis of the health status of the patients of the treated data record group and the control data record group and/or repeating the computing (106) of the feature vectors, the computing of the propensity scores and the application of the data filter on a different subset of the data records of the data set.
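
The pairing and group creation of steps 110 to 114 can be sketched as a greedy one-to-one nearest-neighbour pairing within a similarity threshold (compare reference numerals 206 and 216). This is an illustration under assumptions only: the claim does not prescribe a particular pairing algorithm, and the function name, record identifiers, scores and threshold value below are hypothetical.

```python
def pair_by_propensity(scores, is_treated, threshold):
    """Pair each treated record with the closest unused control record.

    scores:     dict mapping record id -> propensity score
    is_treated: dict mapping record id -> True if the patient was treated
    threshold:  maximum allowed propensity score difference within a pair
    Returns (treated_group, control_group) as lists of record ids, where
    treated_group[i] and control_group[i] form one pair.
    """
    treated = sorted((r for r in scores if is_treated[r]), key=lambda r: scores[r])
    available = {r for r in scores if not is_treated[r]}
    treated_group, control_group = [], []
    for t in treated:
        if not available:
            break
        # Closest remaining control record; the id breaks ties deterministically.
        best = min(available, key=lambda c: (abs(scores[c] - scores[t]), c))
        if abs(scores[best] - scores[t]) <= threshold:
            treated_group.append(t)
            control_group.append(best)
            available.remove(best)  # each control record is used at most once
    return treated_group, control_group


# Hypothetical propensity scores for three treated and three control records.
scores = {"t1": 0.80, "t2": 0.30, "t3": 0.55, "c1": 0.78, "c2": 0.33, "c3": 0.05}
is_treated = {r: r.startswith("t") for r in scores}
treated_group, control_group = pair_by_propensity(scores, is_treated, threshold=0.05)
print(treated_group, control_group)
```

In this example the records "t3" and "c3" have no partner within the threshold and are therefore left out of both groups, which mirrors the records 202 and 204 that are not assigned to a virtual group.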

2. The computer-implemented method of claim 1, further comprising: treating second patients with the medical treatment based on the efficacy, effectiveness and/or safety of the medical treatment.

3. A computer-implemented method for assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment, the method comprising:

- providing (102) a data set (200) comprising a plurality of data records (214, 218, 222, 224), each data record representing a patient and comprising a plurality of patient-related feature values and a label indicating whether the patient has received a particular medical treatment;

- training (104) a neural network (410) on the data set for providing a trained neural network;

- computing (106), by the trained neural network, a feature vector for each of at least a subset of the data records of the data set, each feature vector being a dimensionally reduced representation of the feature values of the respective data record input to the trained neural network;

- computing (108) a propensity score for each of the data records for which a feature vector was computed, the computation of the propensity score comprising performing a regression analysis of the feature vector and the label of the said data record, the propensity score being indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment;

- identifying (110) pairs of first and second data records having similar propensity scores, wherein the first data record of each pair represents a patient having received the particular medical treatment, wherein the second data record in each pair represents a patient not having received the particular medical treatment;

- creating (112) a treated data record group (302) selectively comprising all first data records of the identified data record pairs;

- creating (114) a control data record group (304) selectively comprising all second data records of the identified data record pairs;

- statistically analyzing (116) the treated data record group and the control data record group for determining if the treated data record group and the control data record group are similar;

- if the treated data record group and the control data record group are similar, outputting (120) that the treated data record group and the control data record group are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment; and

- if the treated data record group and the control data record group are dissimilar, outputting (122) that the treated data record group and the control data record group are not suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment.

4. The computer-implemented method of any one of the previous claims, wherein the statistical analysis (116) comprises:

- determining a first distribution of the feature values of the data records of the treated data record group and a second distribution of the feature values of the data records of the control data record group,

- comparing the first and the second distribution;

wherein the treated data record group and the control data record group are determined to be similar if the similarity of the two distributions exceeds a feature-value-distribution-similarity-threshold; and

wherein the treated data record group and the control data record group are determined to be dissimilar if the similarity of the two distributions is below the feature-value-distribution-similarity-threshold.

5. The computer-implemented method of any one of the previous claims 1-3, wherein the statistical analysis comprises:

- determining a first distribution (806) of the propensity scores of the data records of the treated data record group and a second distribution (808) of the propensity scores of the data records of the control data record group,

- comparing the first and the second distribution;

wherein the treated data record group and the control data record group are determined to be similar if the similarity (802) of the two distributions exceeds a propensity-score-distribution-similarity-threshold; and

wherein the treated data record group and the control data record group are determined to be dissimilar if the similarity of the two distributions is below the propensity-score-distribution-similarity-threshold.

6. The computer-implemented method of claim 5, the comparing of the first and second distributions comprising performing a Kolmogorov-Smirnov test of the propensity scores of the data records in the treated data record group and of the control data record group.
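
The comparison of the two propensity score distributions can be sketched with the two-sample Kolmogorov-Smirnov statistic, i.e. the largest vertical distance between the two empirical cumulative distribution functions (compare the deviation 802). The sketch computes only the statistic D and no p-value; how D is mapped to the similarity threshold of claim 5 is not specified here, so any cut-off applied to D is an assumption. Note that D measures deviation, so a smaller D means more similar groups. The function name and sample scores are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is sufficient to evaluate them at every data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical propensity scores of a treated and a control group.
d_disjoint = ks_statistic([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
d_interleaved = ks_statistic([0.2, 0.4, 0.6, 0.8], [0.3, 0.5, 0.7, 0.9])
print(d_disjoint, d_interleaved)
```

A library routine such as `scipy.stats.ks_2samp` additionally returns a p-value and would normally be preferred in practice; the pure-Python version is shown only to make the computed quantity explicit.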

7. The computer-implemented method of any one of the previous claims, further comprising:

if the output indicates that the data set is suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment, automatically performing a further statistical analysis of the treated data record group and the control data record group using patient-related feature values obtained during or after treatment of the patients in the treated patient group for determining the efficacy, effectiveness and/or safety of the treatment in improving patient health at the offset time in comparison to the control group of patients represented by the control data record group; and/or

if the output indicates that the data set is not suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment, automatically selecting a subset of not-yet selected data records from the data set, if any; and repeating the propensity score computation, the identification of pairs of first and second data records having similar propensity scores, the creation of the treated data record group, the creation of the control data record group, the statistical analysis and the outputting of the result of the statistical analysis.

8. The computer-implemented method of any one of the previous claims, the neural network being an autoencoder (500, 600, 700).

9. The computer-implemented method of any one of the previous claims, the neural network comprising a bottleneck layer (504, 604, 704).

10. The computer-implemented method of any one of previous claims, the neural network comprising 5 to 15 layers in total, preferably 5 to 11 layers in total, most preferably 5 layers in total.

11. The computer-implemented method of any one of previous claims, the neural network comprising an encoder part (510, 608, 708-714) comprising a plurality of layers whose neurons use an activation function being a rectifier function and comprising a decoder part (514, 610, {716-722, 706}) comprising a plurality of layers whose neurons use an activation function being a sigmoidal function.

12. The computer-implemented method of any one of previous claims, the neural network being configured to reconstruct the feature values of a patient input to the neural network from the feature vector, the neural network being trained using a loss function that determines a deviation of the reconstructed feature values of a patient from the patient-related feature values actually input to the neural network.
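
The reconstruction principle can be made concrete with a deliberately tiny forward pass. The sketch assumes one rectifier (ReLU) encoder layer, one sigmoidal decoder layer and a mean squared error as the "deviation"; the claim does not fix the loss function, and the autoencoders discussed in the description have 5 or more layers. The weights are hand-picked rather than trained, the feature values are assumed to be scaled to the range 0 to 1, and all names are hypothetical.

```python
from math import exp


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def autoencoder_forward(features, enc_w, enc_b, dec_w, dec_b):
    """Return (feature vector at the bottleneck, reconstructed feature values)."""
    code = dense(features, enc_w, enc_b, relu)            # encoder part
    reconstruction = dense(code, dec_w, dec_b, sigmoid)   # decoder part
    return code, reconstruction


def reconstruction_loss(features, reconstruction):
    """Mean squared deviation between input and reconstruction (assumed loss)."""
    return sum((f - r) ** 2 for f, r in zip(features, reconstruction)) / len(features)


# Three hypothetical scaled feature values compressed to a 2-dimensional vector.
features = [0.0, 1.0, 0.5]
enc_w, enc_b = [[1, 0, 0], [0, 1, 0]], [0, 0]
dec_w, dec_b = [[0, 0], [0, 0], [0, 0]], [0, 0, 0]
code, reconstruction = autoencoder_forward(features, enc_w, enc_b, dec_w, dec_b)
loss = reconstruction_loss(features, reconstruction)
print(code, reconstruction, loss)
```

Training would adjust the weights so that this loss decreases over the data set; the bottleneck output `code` is what serves as the dimensionally reduced feature vector.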

13. The computer-implemented method of any one of previous claims, the performing of the regression analysis of the feature vector and the label comprising determining which feature values of the feature vector have the strongest predictive power in respect to the treatment status of a patient indicated in the label, and computing the propensity score as a function of the determined predictive power of the feature values in the feature vectors and the label.
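
The regression step can be sketched as a logistic regression fitted by plain batch gradient descent, with the fitted model's predicted probability of treatment used as the propensity score. This is a sketch under assumptions: the learning rate, the number of epochs, the function names and the one-dimensional feature vectors are hypothetical, and a statistics package would normally replace the hand-written optimizer.

```python
from math import exp


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def fit_logistic(vectors, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias of a logistic regression by gradient descent."""
    n, dim = len(vectors), len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def propensity_score(vector, weights, bias):
    """Estimated probability that a patient with this feature vector is treated."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, vector)) + bias)


# Hypothetical 1-dimensional feature vectors and treatment labels (1 = treated).
vectors = [[0.1], [0.2], [0.4], [0.6], [0.8], [0.9]]
labels = [0, 0, 1, 0, 1, 1]
weights, bias = fit_logistic(vectors, labels)
ps_low = propensity_score([0.1], weights, bias)
ps_high = propensity_score([0.9], weights, bias)
print(ps_low, ps_high)
```

When the feature vector components are on comparable scales, the magnitude of each fitted weight gives a rough indication of how strongly that component predicts the treatment status.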

14. The computer-implemented method of any one of previous claims, the trained neural network comprising an encoder part, a bottleneck layer and a decoder part, the computing of the feature vector for each of the data records comprising, for each of the data records of at least the subset of data records of the data set:

inputting the stored feature vector and the label of the data set into a statistical analysis program (408, 412, 422); and

performing, by the statistical analysis program, the regression analysis (412) of the feature vector values in the feature vector and the label for computing the propensity score for the patient represented by the data record.

15. The computer-implemented method of any one of previous claims, the feature values being selected from a group comprising: patient age, patient gender, patient eating habits, patient drinking habits, patient smoking habit, patient weight, patient BMI, patient height, patient health insurance status, patient hospital visit information, patient body temperature, patient pulse rate, patient respiration rate (rate of breathing), patient blood pressure, patient physiological values, in particular blood parameter values, patient genomic features, patient history features, geographic region of the patient, current medication of a patient, comorbidities of a patient, patient metabolic features, and any combination of two or more of the aforementioned features.

16. The computer-implemented method of any one of previous claims, the feature values being free of an indication if the patient was treated or not.

17. A system (400) for determining the efficacy, effectiveness and/or safety of a medical treatment retrospectively, the system comprising:

- a user interface (406);

- a storage medium (404) comprising a trained neural network (410),

o the trained neural network having been trained on a data set (200) comprising a plurality of data records, each data record representing a patient and comprising a plurality of patient-related feature values being indicative of the patient's health status and a label indicating whether the patient has received a particular medical treatment;

o the trained neural network being configured to compute a feature vector for each of at least a subset of the data records of the data set, each feature vector being a dimensionally reduced representation of the feature values and the health status of a patient represented by the respective data record input to the neural network;

- a decision program (408) configured for

o computing (106) a propensity score for each of the data records as a function of the feature vector computed for the data record, the computation of the propensity score comprising performing a regression analysis of the feature vector and the label of the said data record, the propensity score being indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment;

o applying a data filter on the data set for identifying a treated data record group (302) representing patients having received the medical treatment and a control data record group (304) representing patients not having received the medical treatment, wherein the two groups comprise patients of identical or similar health status, the applying of the data filter comprising:

■ identifying (110) pairs of first and second data records having similar propensity scores, wherein the first data record of each pair represents a patient having received the particular medical treatment, wherein the second data record in each pair represents a patient not having received the particular medical treatment;

■ creating (112) a treated data record group (302) selectively comprising all first data records of the identified data record pairs;

■ creating (114) a control data record group (304) selectively comprising all second data records of the identified data record pairs;

o statistically analyzing (116) the treated data record group and the control data record group obtained in the filtering step for determining if the health status of the patients in the treated data record group and of the patients in the control data record group are similar;

o if the treated data record group and the control data record group are similar, performing a comparative statistical analysis of the health status of the patients of the treated data record group after receiving the treatment and of the health status of the patients of the control data record group for accurately determining the efficacy, effectiveness and/or safety of the medical treatment; and

o if the treated data record group and the control data record group are dissimilar, outputting a warning not to perform a comparative statistical analysis of the health status of the patients of the treated data record group and the control data record group and/or repeating the computing (106) of the feature vectors, the computing of the propensity scores and the application of the data filter on a different subset of the data records of the data set.

18. The system of claim 17, further comprising the data filter.

19. A system (400) for assessing the suitability of patient data for accurately determining the efficacy, effectiveness and/or safety of a medical treatment, the system comprising:

- a user interface (406);

- a storage medium (404) comprising a trained neural network (410),

o the trained neural network having been trained on a data set (200) comprising a plurality of data records, each data record representing a patient and comprising a plurality of patient-related feature values and a label indicating whether the patient has received a particular medical treatment;

o the trained neural network being configured to compute a feature vector for each of at least a subset of the data records of the data set, each feature vector being a dimensionally reduced representation of the feature values of the respective data record input to the neural network;

- a quality assessment program (408) configured for

o computing (106) a propensity score for each of the data records for which a feature vector was computed, the computation of the propensity score comprising performing a regression analysis of the feature vector and the label of the said data record, the propensity score being indicative of the likelihood that a patient having a particular set of feature values will receive a particular medical treatment;

o identifying (110) pairs of first and second data records having similar propensity scores, wherein the first data record of each pair represents a patient having received the particular medical treatment, wherein the second data record in each pair represents a patient not having received the particular medical treatment;

o creating (112) a treated data record group (302) selectively comprising all first data records of the identified data record pairs;

o creating (114) a control data record group (304) selectively comprising all second data records of the identified data record pairs;

o statistically analyzing (116) the treated data record group and the control data record group for determining if the treated data record group and the control data record group are similar;

o if the treated data record group and the control data record group are similar, outputting (120), via the user interface, that the treated data record group and the control data record group are suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment; and

o if the treated data record group and the control data record group are dissimilar, outputting (122), via the user interface, that the treated data record group and the control data record group are not suitable for accurately determining the efficacy, effectiveness and/or safety of the medical treatment.