
WO2025168228A1 - Stable classification by components for interpretable machine learning - Google Patents

Stable classification by components for interpretable machine learning

Info

Publication number
WO2025168228A1
WO2025168228A1 PCT/EP2024/072755
Authority
WO
WIPO (PCT)
Prior art keywords
scbc
class
probabilities
detection
data processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2024/072755
Other languages
French (fr)
Inventor
Sascha Saralajew
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Laboratories Europe GmbH
Original Assignee
NEC Laboratories Europe GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Laboratories Europe GmbH filed Critical NEC Laboratories Europe GmbH
Publication of WO2025168228A1 publication Critical patent/WO2025168228A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present disclosure is directed to artificial intelligence and machine learning technologies concerned with the training and application of interpretable machine learning data processing architectures, models, and systems. More specifically, the present disclosure relates to Stable Classification by Components (SCBC) for interpretable machine learning.
  • Application scenarios include medical and pharmaceutical applications, as well as healthcare in general such as interpretable and secure diagnosis and treatment recommendation systems.
  • CBC: Classification-by-Components
  • CBC architectures are designed to provide, for example, fully interpretable deep probabilistic neural network variants, making them suitable for applications where understanding a model's prediction is essential, e.g., when providing a medical diagnosis, financial advice, etc.
  • CBC architectures are inspired by prototype-based learning, where the classification process involves analyzing inputs through a set of components or prototypes.
  • Non-Unique Optimal Configurations: In CBC architectures, the configuration of probabilities for a given classification output is invariant to scaling. This means that the reasoning probabilities can be scaled by an arbitrary value without changing the classification outcome. As a result, the optimal classification output state is not unique, leading to multiple possible configurations that can produce the same result. This non-uniqueness can result in low-confidence detections being considered optimal, which runs counter to the desired properties of the model.
  • Numerical Instabilities: Division operations typically involved in computing the class probabilities can lead to numerical instabilities during the learning process. These instabilities further complicate training and can result in poor model performance.
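  • As a schematic illustration of these two issues (the exact CBC formula is given in Ref. [1]; the toy score below only mimics its algebraic shape, a ratio of two expressions that are linear in the reasoning weights):

```python
import numpy as np

# Toy score with the same algebraic shape as normalized CBC reasoning:
# a ratio of two expressions that are both linear in the reasoning weights r.
# (Schematic only -- the exact CBC formula is given in Ref. [1].)
def normalized_score(r, evidence, normalizer):
    return (r @ evidence) / (r @ normalizer)

r = np.array([0.2, 0.5, 0.3])
evidence = np.array([0.9, 0.1, 0.4])     # e.g., detection-weighted agreement
normalizer = np.array([1.0, 1.0, 1.0])

s1 = normalized_score(r, evidence, normalizer)
s2 = normalized_score(10.0 * r, evidence, normalizer)  # scale r arbitrarily
assert np.isclose(s1, s2)  # scaling r leaves the score unchanged -> non-unique optimum

# Numerical instability: a near-zero denominator makes the ratio explode.
tiny = np.array([1e-12, 1e-12, 1e-12])
unstable = normalized_score(r, evidence, tiny)
assert unstable > 1e9
```

The assertion shows why the optimum cannot be unique: every rescaling of r attains the same score, and the division can amplify small denominators without bound.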
  • a first aspect of the present disclosure relates to a stable classification by components (SCBC) data processing architecture (e.g., a convolutional neural network architecture, or a chained SCBC architecture, etc.), configured to classify input data (105) into one or more classes (e.g., classifying X-ray images, MRI images, or similar digital image data into several disease diagnosis classes, etc.).
  • SCBC data processing architecture comprises: a component detection module configured to compare the input data to a set of detection components, representing data patterns relevant for the classification (e.g., to a set of image patches associated with abnormal tissues or organ structures visible in medical images), and to determine a detection probability for each detection component based on the comparison.
  • the SCBC data processing architecture disclosed herein further comprises a probabilistic reasoning module (120) coupled to the component detection module and configured to compute one or more class prediction probabilities for the one or more classes, based on the determined detection probabilities, a set of class-specific prior probabilities for the determined detection probabilities, and a set of class-specific reasoning probabilities for the determined detection probabilities.
  • the class-specific prior probabilities for the detection components make it possible to avoid numerical instabilities (e.g., by avoiding numerical division operations) and to work with unique class prediction probabilities. Consequently, the SCBC data processing architecture disclosed herein fixes the problems of conventional CBC systems discussed above.
  • Another aspect of the present disclosure relates to an SCBC data processing system, comprising data processing circuitry coupled to memory and configured to initialize and execute the SCBC data processing architecture disclosed herein for processing of input data to be classified into one or more classes, wherein the memory is configured for storing the set of detection components, the set of reasoning probabilities, and the set of class-specific prior probabilities for the detection components of the SCBC data processing architecture.
  • the data processing system is configured to train a SCBC classifier model based on the SCBC data processing architecture, based on the stored set of detection components, based on the stored set of reasoning probabilities, based on the set of class-specific prior probabilities, and based on a training data set comprising a plurality of class-labeled input data, wherein training the SCBC classifier model comprises determining a trained set of detection components, a trained set of reasoning probabilities, and / or a trained set of class-specific prior probabilities.
  • Another aspect of the disclosure relates to a method for generating a trained SCBC classifier model, comprising obtaining a training data set comprising a plurality of class-labeled input data, and generating the trained SCBC classifier model by performing a model parameter training algorithm, using the SCBC data processing system described above and discussed in more detail below, as well as based on a corresponding training data set.
  • Another aspect of the disclosure relates to a method for classifying input data into one or more classes, comprising initializing a trained SCBC classifier model generated by the training method disclosed herein and classifying the input data by processing the input data with the initialized trained SCBC classifier model. The one or more classification probabilities for the one or more classes are then outputted based on the classifying.
  • Fig.1 illustrates a typical CBC data processing architecture and system
  • Fig.2 illustrates an exemplary two-stage CBC process comprising component detection and probabilistic reasoning for class prediction
  • Fig.3 depicts a probability tree diagram used in conventional CBC reasoning
  • Fig.4 shows a probability tree diagram for a stable CBC (SCBC) architecture according to aspects disclosed herein
  • Fig.5 shows an exemplary SCBC data processing architecture and system according to aspects disclosed herein
  • Fig.6 shows an exemplary SCBC data processing architecture and system according to aspects disclosed herein
  • Fig.7 shows another SCBC data processing architecture according to a further different embodiment disclosed herein
  • Fig.8 shows a data processing system for obtaining a SCBC classifier model based on the SCBC architecture disclosed herein
  • Fig.9 shows a flow chart for an exemplary method for obtaining a SCBC classifier model based on the SCBC architecture disclosed herein;
  • exemplary embodiments / implementations of the various aspects disclosed herein are described in more detail, with reference to the drawings.
  • the computing systems, architectures and apparatuses of the present disclosure may employ standard hardware components (e.g., a set of on-premises edge computing hardware and / or cloud-based computing resources connected to each other via conventional wired or wireless networking technology).
  • application-specific hardware, e.g., circuitry for training SCBC data processing architectures / models and / or circuitry for executing trained SCBC data processing architectures / models for data classification, may also be employed.
  • computing hardware may be configured to execute software instructions (e.g., retrieved from collocated or remote memory circuitry) to execute the computer-implemented methods discussed herein.
  • the various modules of the data processing architectures and systems disclosed herein may be implemented via application specific hardware components such as application specific integrated circuits, ASICs, and / or field programmable gate arrays, FPGAs, and / or similar components and / or application specific software modules being executed on multi-purpose data and signal processing equipment such as CPUs, GPUs, DSPs and / or systems on a chip, SOCs, or similar components or any combination thereof.
  • the various computing (sub)-systems discussed herein may be implemented, at least in part, on multi-purpose data processing equipment such as edge computing servers.
  • SCBC model training subsystems or processes discussed herein may be implemented, at least in part, on multi-purpose cloud-based data processing equipment such as a set of cloud servers and similar technology, such as GPU-based computing systems optimized for model training, as known to the skilled person.
  • neural networks and similar data processing architectures and models may employ interconnected layers of nonlinear processing units to predict an output for a received input.
  • Some neural networks include hidden layers in addition to an output layer. The output of each (hidden) layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.
  • Each layer of the network generates an output from a received input in accordance with current values of a respective set of network parameters (processing unit connection weights, activation function parameters, etc.).
  • some neural networks and similar architectures such as CBC models are interpretable. Training such models, e.g., via supervised learning using a training set, unsupervised learning, or combinations thereof, results – inter alia – in determining the network parameters of the neural networks implementing the component detection functions and / or the probabilistic reasoning function disclosed herein as accurately as possible. For example, Ričards Marcinkevičs & Julia E. Vogt, Interpretable and explainable machine learning: A methods-centric overview with concrete examples; available at: https://wires.onlinelibrary.wiley.com/doi/epdf/10.1002/widm.1493, provides a general overview of the field of interpretable machine learning as known to the skilled person.
  • the data processing architectures, systems and methods described herein have several advantages. For instance, aspects disclosed herein make it possible to generate an interpretable and provably robust architecture based on a probability model that is defined by a probability tree diagram such as the example shown in Fig.4.
  • each weight or subset of weights has a clear meaning in the classification process that can be understood by a human expert. Moreover, this also means that the interpretation of these weights reflects the true reasoning process of the model.
  • Such models that are interpretable by design are preferred for high-stakes decisions, such as in medical diagnosis and treatment recommendation systems.
  • the model can also learn that a component is ignored (illustrated by the "x" in Fig.2).
  • the probabilistic reasoning module 120 typically uses a set of reasoning probabilities, stored in memory 125 for each detection component to perform this computation.
  • the probabilistic reasoning process is defined by a probability tree diagram such as the probability tree diagram of Fig.3, ensuring that the model's reasoning is fully interpretable. This is depicted in the second part of Fig.1 and the right and middle part of Fig 2.
  • the reasoning probabilities associated with each detection component are crucial for the probabilistic reasoning process and are learned during the training phase (e.g., during a supervised model training phase based on a set of labeled training data).
  • each weight in the model can be interpreted and thus has a precise meaning for an expert using the model (for further details see Ref. [1]). This property can be used to interpret, for instance, why a model is fooled by an adversarial example or what caused the model to predict a certain class.
  • the final stage of the CBC architecture of Fig.1 is an output module, which provides the class prediction probabilities 130. These probabilities indicate the likelihood of the input data belonging to each class. The output can be used to make decisions, such as applying a specific treatment in a medical application.
  • the overall data classification process follows a probabilistic model and is fully interpretable.
  • Fig.1 demonstrates how the CBC architecture processes input data through component detection and probabilistic reasoning to produce interpretable and reliable class predictions. The architecture ensures that each step in the process is transparent and understandable, making it suitable for high-stakes applications where interpretability is essential.
  • Fig.3 shows a probability tree diagram that may be used in a CBC architecture such as the one shown in Fig.1.
  • the first level in the probability tree diagram (see left side in Fig.3) is related to the class prior (random variable c) which can be fixed or can be learned.
  • the component priors (random variable k).
  • these priors are conventionally defined independently of the class label and, similar to before, can be fixed or learned. However, it was recommended in Ref. [1] to keep them fixed.
  • This level is followed by the importance variable (random variable I) which defines that a component can be important – contributes to the classification process for this class – or is unimportant.
  • the next level is the reasoning random variable R. This variable defines that a component k must be detected to provide evidence for the given class.
  • the detection probability (random variable D), which defines how likely it is that a given component is detected in the given input x.
  • tangent distances or divergences can be used to construct detection probability functions.
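  • As a sketch of one common construction (not the only option; tangent distances or divergences could be used instead), a detection probability can be obtained by squashing the Euclidean distance between the input and a component through a Gaussian; the bandwidth sigma is an illustrative hyperparameter:

```python
import numpy as np

def detection_probabilities(x, components, sigma=1.0):
    """Map the distance between input x and each detection component to a
    probability in (0, 1]; a Gaussian over the Euclidean distance is one
    common construction (sigma is an illustrative bandwidth parameter)."""
    dists = np.linalg.norm(components - x, axis=1)
    return np.exp(-(dists ** 2) / (2.0 * sigma ** 2))

components = np.array([[0.0, 0.0], [1.0, 1.0]])   # illustrative components
x = np.array([0.0, 0.0])
d = detection_probabilities(x, components)
assert d[0] == 1.0          # zero distance -> certain detection
assert 0.0 < d[1] < 1.0     # larger distance -> lower detection probability
```

A smaller sigma makes the detection more local, i.e., a component is only detected when the input lies very close to it.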
  • a random variable for agreement A is defined.
  • An agreement is a path in the probability tree diagram where the reasoning R and the detection D agree, which means that either a component is detected (D = 1) and should be detected (R = 1), or a component is not detected (D = 0) and should not be detected (R = 0) – the solid paths in Fig.3 indicate the paths of such agreement. These paths indicate the configurations of the random variables that provide evidence for the given class c.
  • a CBC classifier model can be trained by learning the components (e.g., representative image patches in case of image classification, see Fig.2 for an example) and the reasoning probabilities rc such that the class prediction probability pc(x) is maximal for the correct class (compared to all alternative classes). Even though such a classifier model has superior interpretability compared to standard neural networks, it has two major drawbacks: First, the configuration of probabilities for a given pc(x) is invariant to scaling and therefore not unique in the optimal output state. Second, the division operations involved in computing pc(x) can cause numerical instabilities during training.
  • Fig.4 shows a probability tree diagram according to aspects of the present disclosure.
  • the inventors found that it is possible to overcome the limitations of conventional CBC architectures by proposing an alternative formulation for the probability tree diagram (see Fig.4 and compare it with the region marked by the dotted box in Fig.3).
  • the resulting architecture is denoted herein as Stable CBC (SCBC).
  • the architectural change entails defining the component priors to be class dependent, which is somewhat counterintuitive compared to ordinary CBCs where the priors are class independent. As an illustration, consider the task of tumor classification in x-ray images.
  • In conventional CBCs, each detection component has the same occurrence probability (prior) for each class (no matter whether the priors are learned or kept fixed), which means that for the prediction / classification "there is a tumor in the input data" the healthy and tumorous tissue components are equally likely. This is counterintuitive: in an input x-ray image with only healthy tissue (prediction "there is no tumor in the image"), the occurrence of tumorous components should be unlikely.
  • This problem is corrected in the SCBC data processing architecture, systems and methods disclosed herein by learning class-specific priors for the detection components, denoted by bc.
  • pc(x) is not invariant to scaling with respect to rc, and no division operation is needed for calculating pc(x).
  • the loss can be lower bounded in terms of the achievable robustness and, thus, another loss can be formulated on the lower bound such that the robustness of the model being trained can be optimized.
  • Fig.5 shows an exemplary stable classification by components (SCBC) data processing architecture (also illustrated in Fig.6), that is configured to classify input data 505 (e.g., medical image data) into one or more classes (e.g., different medical diagnoses).
  • the illustrated SCBC data processing architecture comprises a component detection module 510 configured to compare the input data 505 to a set of detection components, representing data patterns relevant for the classification, and to determine a detection probability for each detection component based on the comparison.
  • the detection components can be stored in a data storage 515 and can be predefined by expert knowledge and / or can be trained as discussed below.
  • the SCBC data processing architecture further comprises a probabilistic reasoning module 520 coupled to the component detection module 510 and configured to compute one or more class prediction probabilities 530 for the one or more classes based on the determined detection probabilities, a set of class-specific prior probabilities for the determined detection probabilities, and a set of class-specific reasoning probabilities for the determined detection probabilities which all may be stored in a data storage 525.
  • the component detection module 510 may be configured to output a detection probability vector d(x) with dimension K as input to the probabilistic reasoning module 520, wherein K equals the number of detection components.
  • the probabilistic reasoning module 520 may also be configured to compute a class prediction probability pc(x) for each class based on the detection probability vector d(x), based on a set of C reasoning probability vectors rc with dimension K, wherein C equals the number of classes, and based on a set of C class- specific prior probability vectors bc with dimension K such that pc(x) is not invariant to scaling with respect to rc; and the computation of pc(x) does not involve a numerical division operation.
  • the class prediction probability pc(x) for each class can be computed by pc(x) = bcT · (rc ⊙ d(x) + (1 − rc) ⊙ (1 − d(x))), wherein the operator ⊙ is a Hadamard product, the operator · a scalar product, and T a transpose operation.
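  • A minimal NumPy sketch of this computation, assuming the agreement-path form bcT · (rc ⊙ d(x) + (1 − rc) ⊙ (1 − d(x))) implied by the operators named above (the concrete vectors are illustrative):

```python
import numpy as np

def scbc_class_probability(d, r_c, b_c):
    """Class prediction probability for one class, assuming the form
    p_c(x) = b_c^T (r_c * d(x) + (1 - r_c) * (1 - d(x))).
    Each term is the probability of an agreement path (detected and should be
    detected, or not detected and should not be detected), weighted by the
    class-specific component prior b_c. No division is involved, and scaling
    r_c changes the result, so the optimal configuration is unique."""
    agreement = r_c * d + (1.0 - r_c) * (1.0 - d)   # Hadamard products
    return float(b_c @ agreement)                    # scalar product

d = np.array([0.9, 0.1, 0.8])      # detection probabilities
r_c = np.array([1.0, 0.0, 1.0])    # reasoning: detect, don't detect, detect
b_c = np.array([0.5, 0.2, 0.3])    # class-specific component priors (sum to 1)

p = scbc_class_probability(d, r_c, b_c)
assert 0.0 <= p <= 1.0
# Not invariant to scaling of r_c (unlike conventional CBC reasoning):
assert p != scbc_class_probability(d, 0.5 * r_c, b_c)
```

Because bc is a probability vector and each agreement term lies in [0, 1], the result is a valid probability without any normalizing division.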
  • multiple SCBC blocks can be stacked (e.g., so that they are chained) if a proper binarization is applied between the SCBC blocks (see also Ref. [3]).
  • the derived framework can be used to build convolutional-like architectures where the components are slid over the input and the final output probability is the accumulation of several probabilities (for each sliding position; see Ref. [1]).
  • each part in the SCBC architecture has a dedicated meaning through its connection to the probability events. Similar to CBCs, each part can be efficiently initialized by expert knowledge, if available, and, if required, can be kept fixed during training of a corresponding SCBC classifier model based on the architecture.
  • known approaches to increase the capacity of a classifier model can be applied too. For instance, one can define multiple reasoning and/or bias vectors per class and resolve the multiple outputs by a kind of majority vote or averaging.
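  • A sketch of this capacity extension, assuming the agreement-path class probability described herein and simple averaging as the resolution strategy (the vectors are illustrative):

```python
import numpy as np

def scbc_probability_multi(d, r_list, b_list):
    """Capacity-extension sketch: several reasoning/prior vector pairs for
    one class, resolved by averaging the individual class probabilities
    (majority voting would be an alternative resolution strategy)."""
    probs = [float(b @ (r * d + (1.0 - r) * (1.0 - d)))
             for r, b in zip(r_list, b_list)]
    return float(np.mean(probs))

d = np.array([0.9, 0.1])                      # detection probabilities
r_list = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
b_list = [np.array([0.6, 0.4]), np.array([0.5, 0.5])]
p = scbc_probability_multi(d, r_list, b_list)
assert np.isclose(p, 0.5)   # mean of the two per-pair probabilities 0.9 and 0.1
```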
  • the SCBC data processing architecture disclosed herein may further comprise a binary gate module 610 for gating the forwarding of one or more of the determined detection probabilities based on a gating threshold value.
  • the SCBC data processing architecture disclosed herein may also comprise one or more additional component detection modules 620 and one or more corresponding additional probabilistic reasoning modules 630 sequenced in a series of stages each including a corresponding additional binary gating module 640.
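  • The chaining of stages with binary gating could be sketched as follows; the stage parameters and the 0.5 gating threshold are illustrative assumptions:

```python
import numpy as np

def scbc_stage(d, R, B):
    """One SCBC stage: row c of R / B holds the reasoning / prior vector of
    class c; returns one class prediction probability per class."""
    return (B * (R * d + (1.0 - R) * (1.0 - d))).sum(axis=1)

def binary_gate(p, threshold=0.5):
    """Binarization between chained SCBC stages (the 0.5 threshold is an
    illustrative assumption)."""
    return (p >= threshold).astype(float)

d0 = np.array([0.9, 0.2, 0.7])                        # stage-1 detections
R1 = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])     # per-class reasoning
B1 = np.array([[0.4, 0.3, 0.3], [0.5, 0.2, 0.3]])     # per-class priors

p1 = scbc_stage(d0, R1, B1)     # first-stage class probabilities
d1 = binary_gate(p1)            # binarized input for the next SCBC stage
assert p1.shape == (2,)
assert set(d1.tolist()) <= {0.0, 1.0}
```

The gate ensures the next stage receives crisp detection values, which is the "proper binarization" the chained architecture requires.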
  • SCBC based classifier models are fully interpretable, and the following model-specific visualizations (also called explanations herein) can thus be created to ease the interpretation of the learned probabilities and reasoning process of a trained SCBC classifier model:
  • Visualization of the learned detection components makes it possible to understand what the building blocks of the classifier model are, e.g., to explain and understand what common patterns in the data are used for classification.
  • the detection components can be visualized in a meaningful way because they are defined in the input space. Therefore, any visualization technique that can be applied on the input data can be applied to the detection components too.
  • the detection probability function can be analyzed and the discrimination computation can be visualized.
  • a discrimination can be defined by the Euclidean distance; within the Euclidean distance, the discrimination operation (the operation that compares the input x and a component κ) can be the difference x − κ. Therefore, the visualization of the differences can be defined by the difference operation. This can differ depending on the detection probability function. For instance, in case of a detection probability function constructed over a cosine similarity measure, the discrimination could be defined by a product operation.
  • This visualization can be summarized in a heatmap and/or matrix-like shape. Each value is again a probability that indicates whether the absence or the detection of a detection component will contribute to the reasoning process. Such a plot / visualization can be created for each class.
  • Visualization of agreement (where something was detected that should have been detected) helps to understand which and how a detection component contributed to the prediction / classification of the class label. Such a visualization can also be summarized in a heatmap and/or a matrix-like shape. Such a plot visualizes the paths of agreement for each detection component (see solid lines in Fig.4). Again, such a plot / visualization can be created for each class.
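  • Such a class-by-component heatmap could, for instance, be computed as follows (a sketch assuming the agreement-path probabilities described herein; the resulting matrix can be rendered with any heatmap plotting tool):

```python
import numpy as np

def agreement_matrix(d, R, B):
    """Class-by-component matrix of weighted agreement contributions:
    entry (c, k) = b_{c,k} * (r_{c,k} d_k + (1 - r_{c,k})(1 - d_k)).
    Rows sum to the class prediction probabilities, so the matrix can be
    rendered directly as the heatmap described above."""
    return B * (R * d + (1.0 - R) * (1.0 - d))

d = np.array([0.9, 0.1])                      # detection probabilities
R = np.array([[1.0, 0.0], [0.0, 1.0]])        # per-class reasoning vectors
B = np.array([[0.5, 0.5], [0.5, 0.5]])        # per-class component priors

M = agreement_matrix(d, R, B)
assert M.shape == (2, 2)
assert np.allclose(M.sum(axis=1), [0.9, 0.1])   # rows sum to class probabilities
```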
  • generating a trained SCBC classifier model based on the SCBC data processing architecture disclosed herein may comprise: defining a number of detection components and a number of reasoning vectors (which could be the number of classes). If required, a convolutional-like architecture or a chained SCBC-like architecture can optionally be defined. Next, initialize all vectors with suitable values. If available, initialize the vectors by expert knowledge; for instance, pick training data samples (or parts of them) as detection components. If no expert knowledge is available, initialize the vectors with suitable random noise.
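  • The initialization steps above can be sketched as follows; all concrete sizes, seeds, and noise scales are illustrative assumptions:

```python
import numpy as np

def init_scbc_parameters(X_train, num_components, num_classes, seed=0):
    """Initialization sketch following the steps above. Without expert
    knowledge, training samples are picked as detection components, the
    reasoning vectors start near 0.5 with small random noise, and the
    class-specific priors start uniform. (All concrete choices here are
    illustrative assumptions.)"""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_train), size=num_components, replace=False)
    components = X_train[idx].copy()          # training samples as components
    R = np.clip(0.5 + 0.01 * rng.standard_normal((num_classes, num_components)),
                0.0, 1.0)
    B = np.full((num_classes, num_components), 1.0 / num_components)
    return components, R, B

X_train = np.arange(40, dtype=float).reshape(10, 4)   # toy training data
components, R, B = init_scbc_parameters(X_train, num_components=3, num_classes=2)
assert components.shape == (3, 4)
assert R.shape == (2, 3) and ((R >= 0) & (R <= 1)).all()
assert np.allclose(B.sum(axis=1), 1.0)      # priors are valid distributions
```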
  • the systems and methods disclosed herein may be used for interpretable and robust peptide binding prediction. For example, a black-box model that predicts whether two peptides will bind could be monitored and compared to a corresponding SCBC classifier model as disclosed herein.
  • the data source may be sequencing data with corresponding labels.
  • the SCBC classifier model disclosed herein may be trained in the same manner as the black-box model. Together with the black-box model, the model disclosed herein creates an ensemble. During prediction, both models predict whether the peptides will bind based on the sampled peptides. If both models predict that the peptides will bind and the prediction of the SCBC model is highly confident, the prediction is accepted, because the model is provably robust, which means that the prediction can be trusted. On the other hand, if the confidence is low (the labels might agree or disagree), the SCBC model expresses uncertainty, so the prediction of the ensemble should be evaluated by experts. If the SCBC model predicts a different label with high confidence, the prediction of the ensemble shouldn't be trusted.
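  • The ensemble decision logic described above can be sketched as follows; the concrete confidence thresholds and return labels are illustrative assumptions:

```python
def ensemble_decision(blackbox_label, scbc_label, scbc_confidence,
                      high_confidence=0.9, low_confidence=0.6):
    """Decision logic sketch for the black-box + SCBC ensemble described
    above; the confidence thresholds are illustrative assumptions."""
    if scbc_confidence >= high_confidence and scbc_label == blackbox_label:
        return "accept"          # robust, agreeing prediction can be trusted
    if scbc_confidence < low_confidence:
        return "expert_review"   # SCBC expresses uncertainty
    if scbc_label != blackbox_label:
        return "reject"          # confident disagreement: don't trust ensemble
    return "expert_review"

assert ensemble_decision("bind", "bind", 0.95) == "accept"
assert ensemble_decision("bind", "no_bind", 0.95) == "reject"
assert ensemble_decision("bind", "bind", 0.5) == "expert_review"
```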
  • the prediction of the SCBC classifier model can be checked by experts because the model is fully interpretable.
  • the output of the SCBC model may be binary (binding or no binding).
  • the sampled peptides may be coming from a patient and a potential personalized medicine candidate.
  • the provably robust prediction can be used to identify promising candidates for a personalized medicine. If the candidate is promising (high binding probability), the medicine is synthesized. If the candidate is not promising, a new potential candidate can be sampled. Additionally, experts can understand why the model predicts that peptides don't bind. In this way, experts can gather knowledge in order to purposefully guide/control the sampling of potential candidates.
  • the systems and methods disclosed herein may be used for interpretable and robust medical image classification, e.g., for polyp classification.
  • a SCBC classifier model may be trained on available training data to classify polyps imaged during an endoscopy into benign and malignant.
  • the data source may for example be polyp images including the label (benign or malignant) or reference/standard images for polyps of the classes benign and malignant.
  • aspects of the present disclosure can be also applied to other kind of image data (besides RGB images from a video camera) such as x-ray images as mentioned above.
  • the SCBC models disclosed herein may be trained on labeled images to provide an interpretable and provably robust classification of polyp images.
  • the components might be initialized by the reference/standard images and might be kept fixed during training so that decisions are based on comparisons with the agreed standard/reference images.
  • the output of the training may be a network model that is fully interpretable and provably robust so that, given a polyp image, a robust and interpretable decision can be provided.
  • the trained model can be used to train students by showing them for reliable (robust) predictions why a polyp is benign or malignant based on the reference/standard images.
  • visualization techniques as discussed above may be used. Consequently, such a SCBC system can be used to create reliable teaching systems.
  • the system can be used to assist doctors during an endoscopy. Because the predictions are robust and explainable, the system is trustworthy.
  • the disclosed systems and methods may be used to train a model on patient electronic health record data to classify diseases in an interpretable way.
  • the data source may be health record data of patients including labels for certain diseases.
  • expert knowledge can be provided and integrated into the SCBC classifier model.
  • the model may be trained on the data to provide an interpretable and provably robust classification of disease.
  • the output of the model is not a probability vector that sums up to 1, but a vector where each element is a probability.
  • the network model can therefore be trained to classify multiple diseases per patient (several probabilities for different diseases can be close to 1 simultaneously).
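  • This multi-label behavior can be sketched as follows; the 0.5 decision threshold and the disease names are illustrative assumptions:

```python
import numpy as np

def multilabel_diagnoses(class_probabilities, labels, threshold=0.5):
    """Because SCBC class outputs are independent probabilities (they need
    not sum to 1), several diseases can be flagged for one patient by
    thresholding each probability separately (0.5 is an illustrative cut-off)."""
    p = np.asarray(class_probabilities, dtype=float)
    return [label for label, prob in zip(labels, p) if prob >= threshold]

# Two class probabilities are high at the same time -- both get flagged.
flagged = multilabel_diagnoses([0.92, 0.10, 0.81],
                               ["disease_a", "disease_b", "disease_c"])
assert flagged == ["disease_a", "disease_c"]
```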
  • Fig.8 shows an SCBC data processing system 800, comprising data processing circuitry 810 coupled to memory 820 and configured to initialize and execute a SCBC data processing architecture as disclosed herein for processing of input data to be classified into one or more classes.
  • the memory 820 may be configured for storing the set of detection components, the set of reasoning probabilities, and the set of class-specific prior probabilities for the detection components of the SCBC data processing architecture as discussed above.
  • the data processing system 800 may be configured to train a SCBC classifier model based on the SCBC data processing architecture disclosed herein, e.g., based on the stored set of detection components, based on the stored set of reasoning probabilities, based on the set of class-specific prior probabilities, and based on a training data set comprising a plurality of class- labeled input data as discussed above.
  • training of the SCBC classifier model may comprise determining a trained set of detection components, a trained set of reasoning probabilities, and / or a trained set of class-specific prior probabilities. As also discussed above, some of these model parameters may also be provided by expert knowledge and may be kept fixed during training.
  • training may also comprise training the SCBC classifier model end-to-end by optimizing a loss function that includes a probability gap and / or a lower bound for prediction robustness – as discussed in detail for some examples above.
  • Training the SCBC classifier model may also comprise maximizing the class prediction probability for the correct class indicated by the class labels of the class-labeled input data of the training data set.
  • maximizing the class prediction probability for the correct class may comprise optimizing a margin loss function depending on a difference between the class prediction probability for the correct class and a class prediction probability for a highest probable incorrect class as well as on a hyperparameter that controls a probability gap of the SCBC classifier model.
  • the SCBC data processing system disclosed herein may further be configured to determine one or more explanations for the trained set of detection components, the trained set of reasoning probabilities, the trained set of class-specific prior probabilities, a reasoning process, and/or a classification robustness, and for displaying the one or more explanations to facilitate debugging and understanding of the SCBC classifier model by an expert user – as also discussed in more detail above.
  • Fig.9 shows a flow chart for an exemplary method for obtaining a SCBC classifier model based on the SCBC architecture disclosed herein.
  • a training data set comprising a plurality of class-labeled input data is obtained, e.g., a set of medical images with labels for a certain disease, etc.
  • Fig.10 shows a flow chart for an exemplary method for classifying input data into one or more classes using a SCBC classifier model as disclosed herein.
  • the method comprises initializing, in step 1010, a trained SCBC classifier model generated by a model training method as disclosed herein, classifying, in step 1020, the input data by processing the input data with the initialized trained SCBC classifier model, and outputting, in step 1030, one or more classification probabilities for the one or more classes based on the classifying.
  • the method may further comprise determining a set of significant detection components, and optionally, a set of corresponding reasoning probabilities and class-specific prior probabilities, and outputting one or more explanations for the learned components, the learned detection probabilities, the reasoning process, and/or the classification robustness to facilitate debugging and understanding of the SCBC classifier model as discussed above.
  • Fig.11 shows a data processing system 1100 for classifying input data into one or more classes using a SCBC classifier model as disclosed herein.
  • the data processing system 1100 comprises processing circuitry 1110 coupled to memory 1120, and is configured to classify the input data by executing a method for classifying input data into one or more classes using a SCBC classifier model as disclosed herein, e.g., a model trained by the method of Fig.9.
  • the present disclosure also relates to a computer program comprising instructions which, when executed by a data processing system, carry out the steps of any of the methods disclosed herein.
  • the term component is intended to be broadly construed as hardware, software, or a combination of hardware and software.
  • a processor or processing circuitry is implemented in hardware, software, or a combination of hardware and software.
  • a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members.
  • “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c).
  • the phrase “based on” shall not be construed as a reference to a closed set of information, one or more conditions, one or more factors, or the like.
  • the phrase “based on A” (where “A” may be information, a condition, a factor, or the like) shall be construed as “based at least on A” unless specifically recited differently.
  • the term “or” is an inclusive “or” unless limiting language is used relative to the alternatives listed. For example, reference to “X being based on A or B” shall be construed as including within its scope X being based on A, X being based on B, and X being based on A and B.
  • reference to “X being based on A or B” refers to “at least one of A or B” or “one or more of A or B” due to “or” being inclusive.
  • reference to “X being based on A, B, or C” shall be construed as including within its scope X being based on A, X being based on B, X being based on C, X being based on A and B, X being based on A and C, X being based on B and C, and X being based on A, B, and C.
  • reference to “X being based on A, B, or C” refers to “at least one of A, B, or C” or “one or more of A, B, or C” due to “or” being inclusive.
  • reference to “X being based on only one of A or B” shall be construed as including within its scope X being based on A as well as X being based on B, but not X being based on A and B.
  • process diagrams such as Fig.9 and Fig.10 do not necessarily indicate a particular order or sequence of steps. For example, steps may also be performed in a different order or, if hardware capabilities allow it, simultaneously, without deviating from the scope of the present disclosure.
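To make the margin-loss training objective from the bullets above concrete, the following sketch penalizes the model until the class prediction probability of the correct class exceeds that of the highest probable incorrect class by a configurable probability gap. All function and parameter names (`margin_loss`, `margin`) are illustrative assumptions, not the claimed implementation:

```python
import numpy as np

def margin_loss(class_probs, correct_class, margin=0.3):
    """Margin loss over per-class prediction probabilities.

    class_probs: 1-D array of class prediction probabilities (need not sum to 1).
    correct_class: index of the labeled (correct) class.
    margin: hyperparameter controlling the desired probability gap between
            the correct class and the highest probable incorrect class.
    """
    p_correct = class_probs[correct_class]
    incorrect = np.delete(class_probs, correct_class)
    p_best_wrong = incorrect.max()
    # Loss is zero once the correct class beats the best wrong class by `margin`.
    return max(0.0, margin - (p_correct - p_best_wrong))
```

For example, with class probabilities [0.9, 0.4, 0.2] and correct class 0, the gap 0.9 − 0.4 = 0.5 already exceeds a margin of 0.3, so the loss is zero; with correct class 1, the loss is 0.3 − (0.4 − 0.9) = 0.8.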

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a stable classification by components (SCBC) data processing architecture, configured to classify input data into one or more classes, comprising: a component detection module configured to compare the input data to a set of detection components, representing data patterns relevant for the classification, and determine a detection probability for each detection component based on the comparison. The SCBC data processing architecture further comprises a probabilistic reasoning module configured to compute one or more class prediction probabilities for the one or more classes based on the determined detection probabilities, a set of class-specific prior probabilities for the determined detection probabilities, and a set of class-specific reasoning probabilities for the determined detection probabilities. Application scenarios include medical and pharmaceutical applications, as well as healthcare in general such as interpretable and secure diagnosis and treatment recommendation systems. Related SCBC data processing system, methods and computer programs are also disclosed, as well as corresponding model training methods and systems.

Description

August 12, 2024 NEC Laboratories Europe GmbH N173901WO KAU/Bmn Stable Classification by Components for Interpretable Machine Learning [0001] The present application claims benefit of the European Patent Application no. 24156043.2 filed on February 6, 2024, which is expressly incorporated herein in its entirety by reference. Technical field [0002] The present disclosure is directed to artificial intelligence and machine learning technologies concerned with the training and application of interpretable machine learning data processing architectures, models, and systems. More specifically, the present disclosure relates to Stable Classification by Components (SCBC) for interpretable machine learning. Application scenarios include medical and pharmaceutical applications, as well as healthcare in general such as interpretable and secure diagnosis and treatment recommendation systems. Background [0003] In the realm of machine learning, interpretability and robustness are critical, especially in high-stakes applications such as healthcare, finance, and autonomous systems. One promising approach to achieving interpretability in neural networks and similar data processing architectures is the Classification-By-Components (CBC) architecture. CBC architectures are designed to provide, for example, fully interpretable deep probabilistic neural network variants, making them suitable for applications where understanding a model's prediction is essential, e.g., when providing a medical diagnosis, financial advice, etc. [0004] CBC architectures are inspired by prototype-based learning, where the classification process involves analyzing inputs through a set of components or prototypes. These components are either learned or predefined, e.g., based on expert knowledge, and are used to match parts of the input data. 
The classification decision is then made based on the detection probabilities of these components, following a probabilistic reasoning process defined by a probability tree diagram (for an example see Fig.3). [0005] In this context, the following publications are of general relevance: [1] Saralajew et al.: “Classification-by-Components: Probabilistic Modeling of Reasoning over a Set of Components.” NeurIPS 2019. [2] DE102019119087 A1: “Komponentenbasierte Verarbeitung von Eingangsgrößen.” Saralajew et al.2019. [3] WO2024/083360 A1 “Chained Classification by Components for Interpretable Machine Learning” [4] Saralajew et al.: “Fast Adversarial Robustness Certification of Nearest Prototype Classifiers for Arbitrary Seminorms.” NeurIPS 2020 Summary [0006] Despite the promise of CBC architectures, they suffer from several significant drawbacks that hinder their practical application, in particular in high-risk scenarios, such as medical diagnosis and treatment recommendation systems, financial advice systems, etc.: - Training Difficulties: One of the primary challenges with conventional CBC architectures is the difficulty in training them effectively. The training process often converges to suboptimal solutions, making it hard to achieve the desired performance. This issue arises due to the complex nature of the probabilistic reasoning process and the interactions between the components and their detection probabilities. - Non-Unique Optimal Configurations: In CBC architectures, the configuration of probabilities for a given classification output is invariant to scaling. This means that the reasoning probabilities can be scaled by an arbitrary value without changing the classification outcome. As a result, the optimal classification output state is not unique, leading to multiple possible configurations that can produce the same result. 
This non-uniqueness can result in low-confidence detections being considered optimal, which is counterintuitive to the desired property of the model. - Numerical Instabilities: Division operations typically involved in computing the class probabilities can lead to numerical instabilities during the learning process. These instabilities further complicate the training process and can result in poor model performance. [0007] These issues, individually and collectively, make conventional CBC architectures challenging to train and apply in real-world scenarios, limiting their effectiveness and reliability in high-risk applications. It is therefore an object of the present disclosure to address such issues of prior art technologies based on conventional CBC or similar approaches. [0008] In the following, it is assumed that a reader skilled in the art has fundamental knowledge of conventional artificial intelligence and machine learning technologies and in particular of interpretable ML approaches such as the CBC architectures known in the art. Thus, for conciseness, terminology and concepts such as neural network or machine learning types or architectures, associated training algorithms and loss functions, etc. that have been presented in relevant textbooks and review articles known to the skilled person (see for example: Deep Learning; available at: https://www.deeplearningbook.org) are not defined and / or explained in detail herein. For example, it is assumed in the following that the skilled reader knows the structure and training methods of conventional CBC data processing architectures and systems – as also described in the above-mentioned references [1] to [4]. 
[0009] A first aspect of the present disclosure relates to a stable classification by components (SCBC) data processing architecture (e.g., a convolutional neural network architecture, or a chained SCBC architecture, etc.), configured to classify input data (105) into one or more classes (e.g., classifying X-ray images, MRI images, or similar digital image data into several disease diagnosis classes etc.). The SCBC data processing architecture disclosed herein comprises: a component detection module configured to compare the input data to a set of detection components, representing data patterns relevant for the classification (e.g., to a set of image patches associated with abnormal tissues or organ structures visible in medical images), and to determine a detection probability for each detection component based on the comparison. The SCBC data processing architecture disclosed herein further comprises a probabilistic reasoning module (120) coupled to the component detection module and configured to compute one or more class prediction probabilities for the one or more classes, based on the determined detection probabilities, a set of class-specific prior probabilities for the determined detection probabilities, and a set of class-specific reasoning probabilities for the determined detection probabilities. [0010] As discussed in more detail below with reference to the drawings and potential application scenarios / use cases, such a SCBC data processing architecture thus allows for improved training of SCBC classifier models which allow for more accurate, robust and interpretable predictions. In particular, as explained in detail below, with reference to the probability tree of Fig.4, the class-specific prior probabilities for the detection components allow to avoid numerical instabilities (e.g., by avoiding numerical division operations) and to work with unique class prediction probabilities. 
Consequently, the SCBC data processing architecture disclosed herein fixes the problems of conventional CBC systems discussed above. The inventors found that in conventional CBC architectures the priors for detecting components are defined independently of the class labels. This approach can be counterintuitive in certain applications. For example, in medical image classification, the occurrence probability of components representing healthy tissue and tumorous tissue should differ based on the class (e.g., presence or absence of a tumor). The class-independent priors fail to capture this distinction, leading to less accurate and interpretable models. [0011] Another aspect of the present disclosure relates to an SCBC data processing system, comprising data processing circuitry coupled to memory and configured to initialize and execute the SCBC data processing architecture disclosed herein for processing of input data to be classified into one or more classes, wherein the memory is configured for storing the set of detection components, the set of reasoning probabilities, and the set of class-specific prior probabilities for the detection components of the SCBC data processing architecture. The data processing system is configured to train a SCBC classifier model based on the SCBC data processing architecture, based on the stored set of detection components, based on the stored set of reasoning probabilities, based on the set of class-specific prior probabilities, and based on a training data set comprising a plurality of class-labeled input data, wherein training the SCBC classifier model comprises determining a trained set of detection components, a trained set of reasoning probabilities, and / or a trained set of class-specific prior probabilities. 
[0012] Another aspect of the disclosure relates to a method for generating a trained SCBC classifier model, comprising obtaining a training data set comprising a plurality of class-labeled input data, and generating the trained SCBC classifier model by performing a model parameter training algorithm, using the SCBC data processing system described above and discussed in more detail below, as well as based on a corresponding training data set. [0013] Another aspect of the disclosure relates to a method for classifying input data into one or more classes, comprising initializing a trained SCBC classifier model generated by the training method disclosed herein and classifying the input data by processing the input data with the initialized trained SCBC classifier model. The one or more classification probabilities for the one or more classes are then outputted based on the classifying. Brief description of the drawings [0014] Various aspects of the present disclosure are described in more detail in the following by reference to the accompanying drawings. 
Fig.1 illustrates a typical CBC data processing architecture and system; Fig.2 illustrates an exemplary two-stage CBC process comprising component detection and probabilistic reasoning for class prediction; Fig.3 depicts a probability tree diagram used in conventional CBC reasoning; Fig.4 shows a probability tree diagram for a stable CBC (SCBC) architecture according to aspects disclosed herein; Fig.5 shows an exemplary SCBC data processing architecture and system according to aspects disclosed herein; Fig.6 shows an exemplary SCBC data processing architecture and system according to aspects disclosed herein; Fig.7 shows another SCBC data processing architecture according to a further different embodiment disclosed herein; Fig.8 shows a data processing system for obtaining a SCBC classifier model based on the SCBC architecture disclosed herein; Fig.9 shows a flow chart for an exemplary method for obtaining a SCBC classifier model based on the SCBC architecture disclosed herein; Fig.10 shows a flow chart for an exemplary method for classifying input data using a SCBC classifier model as disclosed herein; Fig.11 shows a data processing system for classifying input data using a SCBC classifier model as disclosed herein. Detailed description of exemplary embodiments / implementations [0015] In the following, some exemplary embodiments / implementations of the various aspects disclosed herein are described in more detail, with reference to the drawings. Naturally, the computing systems, architectures and apparatuses of the present disclosure may employ standard hardware components (e.g., a set of on-premises edge computing hardware and / or cloud-based computing resources connected to each other via conventional wired or wireless networking technology). 
In some implementations, application-specific hardware (e.g., circuitry for training SCBC data processing architectures / models and / or circuitry for executing a trained SCBC data processing architecture / model for data classification) may also be employed. Further, such computing hardware may be configured to execute software instructions (e.g., retrieved from collocated or remote memory circuitry) to execute the computer-implemented methods discussed herein. [0016] While specific feature combinations are described in the following paragraphs with respect to the exemplary embodiments of the present disclosure, it is to be understood that not all features of the discussed embodiments have to be present for realizing the disclosure, which is defined by the subject matter of the claims. The disclosed embodiments may be modified by combining certain features of one embodiment with one or more technically and functionally compatible features of other embodiments. Specifically, the skilled person will understand that features, components, processing steps and / or functional elements of one embodiment can be combined with technically compatible features, processing steps, components and / or functional elements of any other embodiment of the present disclosure as long as covered by the disclosure as specified by the appended claims. [0017] Moreover, the various embodiments discussed herein can be implemented in hardware, software or a combination thereof. 
For instance, the various modules of the data processing architectures and systems disclosed herein may be implemented via application specific hardware components such as application specific integrated circuits, ASICs, and / or field programmable gate arrays, FPGAs, and / or similar components and / or application specific software modules being executed on multi-purpose data and signal processing equipment such as CPUs, GPUs, DSPs and / or systems on a chip, SOCs, or similar components or any combination thereof. [0018] For instance, the various computing (sub)-systems discussed herein may be implemented, at least in part, on multi-purpose data processing equipment such as edge computing servers. Similarly, SCBC model training subsystems or processes discussed herein may be implemented, at least in part, on multi-purpose cloud-based data processing equipment such as a set of cloud-servers and similar technology such as GPU-based computing systems optimized for model training as known to the skilled person. [0019] Generally, neural networks and similar data processing architectures and models may employ interconnected layers of nonlinear processing units to predict an output for a received input. Some neural networks include hidden layers in addition to an output layer. The output of each (hidden) layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of network parameters (processing unit connection weights, activation function parameters, etc.). As known to the skilled person and discussed, in part, in some of the prior art references mentioned above, some neural networks and similar architectures, such as CBC models, are interpretable. 
Training such models, e.g., via supervised learning using a training set, unsupervised learning, or combinations thereof, results – inter alia – in determining the network parameters of the neural networks implementing the component detection functions and / or the probabilistic reasoning function disclosed herein as accurately as possible. For example, Ričards Marcinkevičs & Julia E. Vogt: Interpretable and explainable machine learning: A methods-centric overview with concrete examples; available at: https://wires.onlinelibrary.wiley.com/doi/epdf/10.1002/widm.1493 provides a general overview of the field of interpretable machine learning as known to the skilled person. [0020] As described in further detail below with reference to the drawings, as compared to the prior art technologies, the data processing architectures, systems and methods described herein have several advantages. For instance, aspects disclosed herein allow to generate an interpretable and provably robust architecture based on a probability model that is defined by a probability tree diagram such as the example shown in Fig.4. In particular, using such a probability tree diagram solves the unstable training and non-uniqueness present in ordinary CBCs by removing the explicit indefinite reasoning state and by making the detection component priors class dependent. Further, making the detection class probability function class dependent, such that the model can account for different similarity regions around each component, allows to make the detection probability function better adaptable to data variations. In addition, as discussed below, providing a lower bound on the robustness allows that the model can be trained to be provably robust and that the robustness of each data point can be assessed by a human expert operator. [0021] Moreover, to date, there essentially are two main themes to tackle the interpretability of machine learning methods. 
The first one is to build post-hoc explanations (see P.J.G. Lisboa, S. Saralajew, et al.: The coming of age of interpretable and explainable machine learning models, in Neurocomputing, 2023; available at: https://www.sciencedirect.com/science/article/abs/pii/S0925231223001893). [0022] These explanations are often based on approximations or relaxations of the network after the network is trained. The problem with these explanations is that it is unclear to which extent they are faithful (reflect the true reasoning process of the model). Aspects disclosed herein allow to derive explanations as part of the training process in this category. The second approach is models that have a built-in interpretability, like k-nearest neighbors methods, which are well known in the art. With built-in interpretability it is meant that by design, e.g., through constraints (not by adding regularization terms to the training loss), each weight or subset of weights has a clear meaning in the classification process that can be understood by a human expert. Moreover, this also means that the interpretation of these weights reflects the true reasoning process of the model. Such models that are interpretable by design are preferred for high-stakes decisions, such as in medical diagnosis and treatment recommendation systems. However, currently, there is no method available that can build similar fully interpretable architectures that are easy to train. The present disclosure addresses this problem by presenting an approach to create fully interpretable architectures that are easily end-to-end trainable. Moreover, the SCBC systems and methods disclosed herein are provably robust so that the computed output probabilities are calibrated and constitute trustworthy confidence measures. [0023] Fig.1 illustrates the overall structure of a Classification-By-Components (CBC) architecture, which is designed to provide a fully interpretable deep probabilistic classification. 
An input module is responsible for receiving the input data 105 that needs to be classified. The input data can be of various types, such as images, text, or other forms of structured data. In the component detection module 110, the input data 105 can be analyzed by using a set of detection components, which can be either learned or predefined (e.g., by using expert knowledge) and can be stored in a memory 115. The analysis can be performed by searching for matches of parts of the input data 105 with the detection components with respect to an appropriate probability measure. The detection probability measure for a given component is high if the component has a match with the input or parts thereof and becomes small otherwise. This is depicted in the first part of Fig.1 and the left part of Fig.2. [0024] The probabilistic reasoning module 120 is responsible for analyzing the probabilities of the detected components to reason about the class of the given input. This reasoning is done by computing, for each class, the probability that the respective detection components have been detected (illustrated by matching "1" in the example of Fig.2) or have not been detected (illustrated by matching "0" in Fig.2). Additionally, to avoid that each component has to be considered for the computation of a class probability, the model can also learn that a component is ignored (illustrated by the "x" in Fig.2). The probabilistic reasoning module 120 typically uses a set of reasoning probabilities, stored in memory 125 for each detection component to perform this computation. The probabilistic reasoning process is defined by a probability tree diagram such as the probability tree diagram of Fig.3, ensuring that the model's reasoning is fully interpretable. This is depicted in the second part of Fig.1 and the right and middle part of Fig.2. 
The reasoning probabilities associated with each detection component are crucial for the probabilistic reasoning process and are learned during the training phase (e.g., during a supervised model training phase based on a set of labeled training data). Due to the relation to a probability tree diagram, each weight in the model can be interpreted and thus has a precise meaning for an expert using the model (for further details see Ref. [1]). This property can be used to interpret, for instance, why a model is fooled by an adversarial example or what caused the model to predict a certain class. [0025] The final stage of the CBC architecture of Fig.1 is an output module, which provides the class prediction probabilities 130. These probabilities indicate the likelihood of the input data belonging to each class. The output can be used to make decisions, such as applying a specific treatment in a medical application. The overall data classification process follows a probabilistic model and is fully interpretable. Additionally, the model is fully differentiable, allowing all model weights (e.g., detection components and the reasoning probabilities) to be learned via end-to-end training, e.g., by maximizing the correct class probability (for further details see below and Ref. [1]). [0026] Fig.1 demonstrates how the CBC architecture processes input data through component detection and probabilistic reasoning to produce interpretable and reliable class predictions. The architecture ensures that each step in the process is transparent and understandable, making it suitable for high-stakes applications where interpretability is essential. [0027] Fig.3 shows a probability tree diagram that may be used in a CBC architecture such as the one shown in Fig.1. The first level in the probability tree diagram (see left side in Fig.3) is related to the class prior (random variable c) which can be fixed or can be learned. On the next level are the component priors (random variable k). 
Importantly, these priors are conventionally defined independent of the class label and, similar to before, can be fixed or learned. However, it was recommended to keep them fixed in Ref. [1]. This level is followed by the importance variable (random variable I) which defines that a component can be important – contributes to the classification process for this class – or is unimportant. The next level is the reasoning random variable R. This variable defines that a component k must be detected to provide evidence for the given class. On the last level is the detection probability (random variable D), which defines how likely it is that a given component is detected in the given input x. As an example, for the computation of the detection probability d_k(x) for component k, consider the standard RBF kernel: d_k(x) = exp(−d_E(x, k)² / σ), [0028] where d_E is the Euclidean distance and σ is a temperature parameter. Note that the detection probability function can be more complex, and the only requirement to be fulfilled is that d_k(x) ∈ [0,1] and that d_k(x) = 1 if x = k. [0029] Thus, for instance, also tangent distances or divergences can be used to construct detection probability functions. Finally, a random variable for agreement A is defined. An agreement is a path in the probability tree diagram where the reasoning R and the detection D agree, which means that either a component is detected (D) and should be detected (R) or a component is not detected (D̄) and should not be detected (R̄) – the solid paths in Fig.3 indicate the paths of such agreement. These paths indicate the paths where the configuration of the random variables provide evidence for the given class c. 
Computing over this probability tree diagram the so-called probability for an agreement A under the condition that the components are important (because otherwise they cannot contribute to the classification process), denoted by p(A | I, c, x), yields: [0030] where each element i in the vector d(x) models the detection probability d_k(x) for the component k = i; each element i in the vector r_c^± models the joint probability p(I, R | k, c) if “+” and p(I, ¬R | k, c) if “−” for the component k = i; and 1 is a vector consisting of ones. [0031] Next, a CBC classifier model can be trained by learning the components κ (e.g., representative image patches in the case of image classification, see Fig.2 for an example) and the reasoning probabilities r_c^± such that p_c(x) is maximal for the correct class (compared to all alternative classes). Even if such a classifier model has a superior interpretability compared to standard neural networks, it has two major drawbacks. First, the configuration of probabilities for a given p_c(x) is invariant to scaling and therefore not unique in the optimal output state. One can easily show that the reasoning probabilities r_c^± can be scaled by an arbitrary value γ > 0 (as long as γ · r_c^± ∈ [0,1]^K remains valid) without changing the value of p_c(x). This means that in the optimal classification output state, the configuration of probabilities is not unique and, even worse, small reasoning probabilities – which mean, for instance, no confident detection of a component – could create optimal classification outputs. This behavior is counterintuitive to the desired property of the model: optimal classification outputs should only be achieved for confident reasoning configurations (i.e., this component should (or should not) definitely be detected). Second, the division operation in the calculation of p_c(x) above can lead to numerical instabilities in the training process.
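The scaling-invariance drawback can be made concrete with a short numerical sketch. The following NumPy snippet assumes the vectorized form of the CBC agreement probability summarized above (positive/negative reasoning vectors r_c^± and normalization by the total importance); the concrete numbers and names are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

def cbc_probability(d, r_pos, r_neg):
    """CBC agreement probability with positive/negative reasoning vectors
    r_c^+ / r_c^-, normalized by the total importance (note the division)."""
    numerator = r_pos @ d + r_neg @ (1.0 - d)
    return numerator / np.sum(r_pos + r_neg)

d = np.array([0.9, 0.1, 0.8])       # detection probabilities d(x)
r_pos = np.array([0.6, 0.0, 0.5])   # "component should be detected"
r_neg = np.array([0.0, 0.4, 0.0])   # "component should not be detected"

p = cbc_probability(d, r_pos, r_neg)
# Scaling both reasoning vectors by gamma > 0 cancels out in the division,
# so the optimal configuration is not unique:
gamma = 0.5
p_scaled = cbc_probability(d, gamma * r_pos, gamma * r_neg)
assert abs(p - p_scaled) < 1e-12
```

The assertion passes for any admissible gamma, which illustrates why even low-confidence reasoning vectors can produce the same (optimal) output in a conventional CBC.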
Both drawbacks together result in the problem that conventional CBC-based classifier models are difficult to train and often converge to suboptimal configurations. [0032] Fig.4 shows a probability tree diagram according to aspects of the present disclosure. The inventors found that it is possible to overcome the limitations of conventional CBC architectures by proposing an alternative formulation for the probability tree diagram (see Fig.4 and compare it with the regions marked by the dotted box in Fig.3). The resulting architecture is denoted herein as Stable CBC (SCBC). [0033] The architectural change entails defining the component priors to be class dependent, which is somewhat counterintuitive in ordinary CBCs where the priors are class independent. As an illustration, consider the task of tumor classification in x-ray images. Further, assume that there are learned detection components for healthy tissue and tumorous tissue. In ordinary CBCs each detection component has the same occurrence probability (prior) for each class (no matter if the priors are learned or kept fixed), which means that for the prediction / classification of “there is a tumor in the input data” the healthy and tumorous tissue components are equally likely. This is counterintuitive, as in an input x-ray image with only healthy tissue (prediction “there is no tumor in the image”), the occurrence of tumorous components should be unlikely. [0034] This problem is corrected in the SCBC data processing architecture, systems and methods disclosed herein by learning class-specific priors for the detection components, denoted by p(k|c) in Fig.4. However, if this change is made, it must be noted that this is redundant to having a random variable for importance: if the class-specific prior p(k|c) is low for a certain component, this is equivalent to the component being unimportant for the class.
Consequently, we must note that SCBCs model the importance of components via p(k|c), whereas CBCs model the importance via p(I|k, c). Even if this change seems minor, the impact on the resulting class probability equation is significant. For instance, repeating the derivation of p_c(x) = p(A|c, x) over the probability tree diagram of Fig.4, one obtains: p_c(x) = p(A|c, x) = ((2 r_c − 1) ∘ d(x) − r_c + 1)^T ∙ b_c, [0035] where b_c is the vector of prior probabilities p(k|c) for the given class, r_c is the vector of reasoning probabilities p(R|k, c) for the given class, 1 is a vector of ones, and ∘ is the Hadamard product, the operator ∙ a scalar product, and T a transpose operation as known in the art. Considering this new result for p_c(x), the following can be noted: p_c(x) is not invariant to scaling with respect to r_c, and there is no division operation needed for calculating p_c(x). [0036] Consequently, the architectural change of learnable class-specific component priors fixes the problems of CBCs. The learnable parameters in the SCBC data processing architectures, systems and methods disclosed herein (see Fig.5 for an SCBC architecture overview) are the detection components κ (similar to CBCs), the reasoning probabilities r_c ∈ [0,1]^K, and the component priors b_c ∈ [0,1]^K with Σ_k b_{c,k} = 1. Therefore, SCBCs and CBCs have the same number of parameters. Similar to CBCs, SCBCs are optimized / trained by maximizing the probability p_c(x) of the correct class, for instance by optimizing the so-called margin loss L(x, y) = max(p_ĉ(x) − p_y(x) + β, 0), [0037] where β ∈ [0,1] is a hyperparameter that controls the probability gap of the learned model (the crispness of the classifier), p_y(x) is the probability of the correct class y, and p_ĉ(x) is the probability of the highest probable incorrect class.
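The SCBC class probability and the margin loss above can be sketched in a few lines of NumPy. All concrete numbers and function names are illustrative assumptions; the formulas follow the derivation just given:

```python
import numpy as np

def scbc_probability(d, r, b):
    """SCBC class probability p_c(x) = ((2 r_c - 1) * d(x) - r_c + 1)^T b_c:
    no division, and not invariant to scaling of r_c."""
    return ((2.0 * r - 1.0) * d - r + 1.0) @ b

def margin_loss(p_correct, p_best_wrong, beta):
    """Margin loss max(p_chat(x) - p_y(x) + beta, 0)."""
    return max(p_best_wrong - p_correct + beta, 0.0)

d = np.array([0.9, 0.1, 0.8])   # detection probabilities d(x)
r = np.array([1.0, 0.0, 0.9])   # reasoning probabilities r_c
b = np.array([0.5, 0.3, 0.2])   # class-specific priors b_c (sum to 1)

p = scbc_probability(d, r, b)
p_scaled = scbc_probability(d, 0.5 * r, b)
assert p != p_scaled   # scaling r_c changes the output, unlike in CBCs
```

In contrast to the CBC formulation, shrinking the reasoning probabilities here directly lowers the class probability, so confident reasoning configurations are rewarded.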
Moreover, the loss can be lower bounded in terms of the achievable robustness and, thus, another loss can be formulated on the lower bound such that the robustness of the model being trained can be optimized. The robustness ‖δ(x, y)‖ of an input x with the predicted label y is specified as the magnitude of the minimal shift applied to x such that the predicted class label changes. For instance, the lower bound on the robustness for the detection probability is given by (for a correctly classified sample) with q = ((2 r_y − 1) ∘ b_y − (2 r_ĉ − 1) ∘ b_ĉ)^T d(x), where (x, y) is a pair comprising the input x and the true label y and ĉ is the label of the highest probable incorrect class. [0038] Furthermore, if the model is learned with β = 1 (crisp classification), it can be shown that this can only be achieved if the model realizes a form of Nearest Prototype Classification (NPC) and, in this case, the loss becomes the well-known NPC contrastive loss form that was shown to maximize the robustness (see Ref. [4]). In addition to this major change, we propose to learn the hyperparameters of the detection computation in a class- and component-specific manner (similar to the individual scaling parameters in a t-SNE). For instance, one can learn individual temperatures σ_{c,k} for each component and class: [0039] The advantage of this approach is that it can account for different similarity regions around a component or different scalings of components, which affect the Euclidean distance computation. Consider again the example of x-ray image classification. If a learned component represents tumorous tissue, then the decay of the exponential should be strong for the healthy class, which means small σ_{c,k} (because small changes should immediately be flagged as not similar), and soft for the tumorous class, which means large σ_{c,k}, so that slight differences in the image with respect to the component still produce reasonably large detection probabilities.
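The effect of class- and component-specific temperatures can be illustrated with a small sketch. The RBF-like kernel form used below, d_{c,k}(x) = exp(−‖x − κ_k‖² / σ_{c,k}), is an assumption consistent with the stated requirements (values in [0,1], equal to 1 for an exact match); names and numbers are illustrative:

```python
import numpy as np

def detection_probabilities(x, components, sigma):
    """RBF-style detections with one temperature per class and component.
    sigma has shape (C, K); the squared distances (shape (K,)) broadcast
    against it to give a (C, K) matrix of detection probabilities."""
    sq_dist = np.sum((components - x) ** 2, axis=1)
    return np.exp(-sq_dist / sigma)

x = np.array([0.2, 0.8])
components = np.array([[0.2, 0.8],   # kappa_0 matches the input exactly
                       [1.0, 0.0]])  # kappa_1 is far from the input
sigma = np.array([[0.1, 0.1],   # class 0: strong decay (strict matching)
                  [1.0, 1.0]])  # class 1: soft decay (tolerant matching)

d = detection_probabilities(x, components, sigma)
assert np.isclose(d[0, 0], 1.0) and np.isclose(d[1, 0], 1.0)  # exact match
assert d[0, 1] < d[1, 1]  # smaller temperature -> faster decay with distance
```

The same distance thus yields a low detection probability for the strict class and a higher one for the tolerant class, mirroring the healthy/tumorous tissue example above.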
[0040] Fig.5 shows an exemplary stable classification by components (SCBC) data processing architecture (also illustrated in Fig.6), which is configured to classify input data 505 (e.g., medical image data) into one or more classes (e.g., different medical diagnoses). The illustrated SCBC data processing architecture comprises a component detection module 510 configured to compare the input data 505 to a set of detection components, representing data patterns relevant for the classification, and to determine a detection probability for each detection component based on the comparison. As shown in Fig.5, the detection components can be stored in a data storage 515 and can be predefined by expert knowledge and / or can be trained as discussed below. The SCBC data processing architecture further comprises a probabilistic reasoning module 520 coupled to the component detection module 510 and configured to compute one or more class prediction probabilities 530 for the one or more classes based on the determined detection probabilities, a set of class-specific prior probabilities for the determined detection probabilities, and a set of class-specific reasoning probabilities for the determined detection probabilities, which all may be stored in a data storage 525. [0041] In some implementations, the component detection module 510 may be configured to output a detection probability vector d(x) with dimension K as input to the probabilistic reasoning module 520, wherein K equals the number of detection components.
Further, the probabilistic reasoning module 520 may also be configured to compute a class prediction probability pc(x) for each class based on the detection probability vector d(x), based on a set of C reasoning probability vectors rc with dimension K, wherein C equals the number of classes, and based on a set of C class-specific prior probability vectors bc with dimension K such that pc(x) is not invariant to scaling with respect to rc; and the computation of pc(x) does not involve a numerical division operation. In some implementations, the class prediction probability pc(x) for each class can be computed by pc(x) = ((2 rc − 1) ∘ d(x) − rc + 1)T ∙ bc, wherein the operator ∘ is a Hadamard product, the operator ∙ a scalar product, and T a transpose operation. [0042] Moreover, as illustrated in Fig.5 and in more detail in Fig.7, multiple SCBC blocks can be stacked (e.g., so that they are chained) if a proper binarization is applied between the SCBC blocks (see also Ref. [3]). Furthermore, again similar to CBCs, the derived framework can be used to build convolutional-like architectures where the components are slid over the input and the final output probability is the accumulation of several probabilities (for each sliding position; see Ref. [1]). Additionally, because each part in the SCBC architecture has a dedicated meaning by the connection to the probability events, similar to CBCs, each part can be efficiently initialized by expert knowledge, if available, and, if required, can be fixed during training of a corresponding SCBC classifier model based on the architecture. This means that if feasible values for the detection components are known beforehand, the components can be initialized with these values and can be kept fixed during model training. Further, known approaches to increase the capacity of a classifier model can be applied too. For instance, one can define multiple reasoning and/or bias vectors per class and resolve the multiple outputs by a kind of majority vote or averaging.
The same is possible for the detection components: define multiple realizations for each detection component and pick the active component by some kind of majority vote or averaging over the computed detection probabilities. [0043] Thus, as shown in more detail in Fig.7, the SCBC data processing architecture disclosed herein may further comprise a binary gate module 610 for gating the forwarding of one or more of the determined detection probabilities based on a gating threshold value. In some implementations, the SCBC data processing architecture disclosed herein may also comprise one or more additional component detection modules 620 and one or more corresponding additional probabilistic reasoning modules 630 sequenced in a series of stages, each including a corresponding additional binary gating module 640. [0044] As already stated above, SCBC-based classifier models are fully interpretable, and the following model-specific visualizations (also called explanations herein) can thus be created to ease the interpretation of the learned probabilities and reasoning process of a trained SCBC classifier model: [0045] Visualization of the learned detection components makes it possible to understand what the building blocks of the classifier model are, e.g., to explain and understand what common patterns in the data are used for classification. The detection components can be visualized in a meaningful way because they are defined in the input space. Therefore, any visualization technique that can be applied to the input data can be applied to the detection components too. [0046] Visualization of differences between input and detection components to understand where deviations are. This answers the question where a detection component is similar to a part of the input data and / or what causes a high or low detection probability. For this purpose, the detection probability function can be analyzed and the discrimination computation can be visualized.
For example, considering the exemplary RBF-kernel-like detection probability function discussed above, a discrimination can be defined by the Euclidean distance, and within the Euclidean distance the discrimination operation (the operation that compares the component κ and the input x) can be the difference κ − x. Therefore, the visualization of the differences can be defined by the difference operation. This can differ depending on the detection probability function. For instance, in case of a detection probability function constructed over a cosine similarity measure, the discrimination could be defined by a product operation. [0047] Visualization of learned reasoning probabilities to gain knowledge about what the classifier model has learned. Specifically, each weight (except for the component-related weights) has a probabilistic meaning. Therefore, the probabilities can be purposefully visualized, e.g., in heatmaps or similar visualizations, to show the distribution of the probabilities. For instance, the bias probability vector b_c can be visualized for an individual class to analyze which detection components are most important for this specific class. The same can be performed for the other probability values to learn which component should be detected or which component has been detected. Further model-specific visualizations / explanations include: [0048] Visualization of the learned reasoning process to understand / explain what triggers the activation of a particular class. This visualization can be summarized in a heatmap and/or matrix-like shape, such as: [0049] Each value is again a probability that indicates whether the absence or the detection of a detection component will contribute to the reasoning process. Such a plot / visualization can be created for each class.
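Such a per-class reasoning "heatmap" can be assembled as a small matrix. The exact layout of the plot is not fixed by the description, so the arrangement below (row 0 for prior-weighted positive reasoning, row 1 for prior-weighted negative reasoning) is an illustrative assumption:

```python
import numpy as np

def reasoning_matrix(r_c, b_c):
    """2 x K matrix for one class: row 0 holds the prior-weighted probability
    that detecting component k supports the class, row 1 the same for the
    absence of component k."""
    return np.vstack([b_c * r_c, b_c * (1.0 - r_c)])

r_c = np.array([0.9, 0.1, 0.5])   # reasoning probabilities for one class
b_c = np.array([0.5, 0.3, 0.2])   # class-specific component priors (sum to 1)
m = reasoning_matrix(r_c, b_c)
# Each column sums to the component prior, so the whole matrix sums to 1:
assert np.isclose(m.sum(), 1.0)
```

Rendering `m` with any heatmap tool then shows, per component, whether its detection or its absence is expected to contribute to the class.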
[0050] Visualization of agreement (where something was detected that should have been detected) to understand which and how a detection component contributed to the prediction / classification of the class label. Such a visualization can be summarized in a heatmap and/or a matrix-like shape, such as: [0051] Such a plot visualizes the paths of agreement for each detection component (see solid lines in Fig.4). Again, such a plot / visualization can be created for each class. The sum over all the probabilities in this plot yields the output probability p_c(x). [0052] Similar to before, disagreement (where something was detected that should not have been detected) can be visualized to inspect which information contradicts the model's reasoning process. Again, it can be visualized as a heatmap and/or matrix. The construction is similar to the agreement visualization except that the disagreement is visualized (see dashed lines in Fig.4), e.g., via: [0053] Such a visualization can be created for each class, and a high value means that it causes disagreement. The sum over all the probabilities in this plot yields the output probability for a disagreement, 1 − p_c(x), which means how likely it is that the prediction is incorrect. This also shows immediately that all probabilities in this plot will be zero if, and only if, p_c(x) = 1. [0054] In some aspects, generating a trained SCBC classifier model based on the SCBC data processing architecture disclosed herein may comprise: defining a number of detection components and a number of reasoning vectors (which could be the number of classes). If required, optionally, a convolutional-like architecture or a chained SCBC-like architecture can be defined. Next, initializing all vectors with suitable values. If available, initialize the vectors by expert knowledge; for instance, pick training data samples (or parts of them) as detection components. If no expert knowledge is available, initialize the vectors by suitable random noise.
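The agreement / disagreement decomposition can be checked numerically. The per-component products below follow the probability tree description (agreement: detected and should be, or absent and should be absent; disagreement: the two mixed cases); the concrete numbers are illustrative assumptions:

```python
import numpy as np

d = np.array([0.9, 0.2, 0.7])   # detection probabilities d(x)
r = np.array([0.8, 0.1, 0.6])   # reasoning probabilities r_c
b = np.array([0.4, 0.4, 0.2])   # class-specific priors b_c (sum to 1)

# Per-component agreement contributions (solid paths in Fig.4):
agree = b * (r * d + (1.0 - r) * (1.0 - d))
# Per-component disagreement contributions (dashed paths in Fig.4):
disagree = b * (r * (1.0 - d) + (1.0 - r) * d)

p_c = agree.sum()
assert np.isclose(disagree.sum(), 1.0 - p_c)  # complements, since sum(b) = 1
assert np.isclose(p_c, ((2 * r - 1) * d - r + 1) @ b)  # matches the SCBC formula
```

The two assertions verify the properties stated above: the agreement plot sums to p_c(x), the disagreement plot to 1 − p_c(x), and the disagreement values all vanish exactly when p_c(x) = 1.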
And, end-to-end training of the SCBC classifier model on a given training dataset, e.g., by optimizing the probability gap or the lower bound for the robustness. [0055] The obtained SCBC classifier model can then be used to predict the classes of given inputs, to compute the robustness for predicted class labels, and / or to provide the explanations in the form of the visualizations to debug and understand the model – as discussed above. [0056] In one embodiment, the systems and methods disclosed herein may be used for interpretable and robust peptide binding prediction. For example, a black-box model that predicts whether two peptides will bind could be monitored and compared to a corresponding SCBC classifier model as disclosed herein. For such an application, the data source may be sequencing data with corresponding labels. [0057] The SCBC classifier model disclosed herein may be trained in the same manner as the black-box model. Together with the black-box model, the model disclosed herein creates an ensemble. During prediction, both models predict whether the peptides will bind based on the sampled peptides. If both models predict that the peptides will bind and if the prediction of the SCBC model is highly confident, the prediction is accepted because the model is provably robust, which means that the prediction can be trusted. On the other hand, if the confidence is low (the labels might agree or disagree), the SCBC model expresses uncertainty so that the prediction of the ensemble should be evaluated by experts. If the SCBC model predicts a different label with high confidence, the prediction of the ensemble should not be trusted. In all scenarios, the prediction of the SCBC classifier model can be checked by experts because the model is fully interpretable. The output of the SCBC model may be binary (binding or no binding). The sampled peptides may be coming from a patient and a potential personalized medicine candidate.
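The ensemble decision logic described above can be sketched as a simple rule. The confidence threshold and the return labels are illustrative assumptions, not values from the disclosure:

```python
def ensemble_decision(black_box_label, scbc_label, scbc_confidence,
                      confidence_threshold=0.9):
    """Decision rule for the black-box + SCBC ensemble: accept confident
    agreement, reject confident contradiction, defer everything else."""
    if scbc_confidence >= confidence_threshold:
        if scbc_label == black_box_label:
            return "accept"        # confident, provably robust agreement
        return "reject"            # confident SCBC contradiction
    return "expert_review"         # low confidence: defer to experts

assert ensemble_decision("bind", "bind", 0.97) == "accept"
assert ensemble_decision("bind", "no_bind", 0.95) == "reject"
assert ensemble_decision("bind", "bind", 0.55) == "expert_review"
```

Because the SCBC branch is interpretable, every one of these outcomes can additionally be inspected by an expert using the visualizations discussed above.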
The provably robust prediction can be used to identify promising candidates for a personalized medicine. If the candidate is promising (high binding probability), the medicine is synthesized. If the candidate is not promising, a new potential candidate can be sampled. Additionally, experts can understand why the model predicts that peptides do not bind. By this, experts can gather knowledge in order to purposefully guide/control the sampling of potential candidates. [0058] In another implementation, the systems and methods disclosed herein may be used for interpretable and robust medical image classification, e.g., for polyp classification. For instance, an SCBC classifier model may be trained on available training data to classify polyps imaged during an endoscopy into benign and malignant. Here, the data source may for example be polyp images including the label (benign or malignant) or reference/standard images for polyps of the classes benign and malignant. Aspects of the present disclosure can also be applied to other kinds of image data (besides RGB images from a video camera), such as x-ray images as mentioned above. For instance, the SCBC models disclosed herein may be trained on labeled images to provide an interpretable and provably robust classification of polyp images. To have convincing explanations for experts/doctors, the components might be initialized by the reference/standard images and might be kept fixed during training, so that decisions are based on comparisons with the agreed standard/reference images. The output of the training may be a network model that is fully interpretable and provably robust, so that given a polyp image a robust and interpretable decision can be provided. The trained model can be used to train students by showing them, for reliable (robust) predictions, why a polyp is benign or malignant based on the reference/standard images. For this, visualization techniques as discussed above may be used.
Consequently, such an SCBC system can be used to create reliable teaching systems. Moreover, the system can be used to assist doctors during an endoscopy. Because the predictions are robust and explainable, the system is trustworthy. Moreover, it is also possible to use this system to let robots perform an endoscopy. [0059] In another embodiment, the disclosed systems and methods may be used to train a model on patient electronic health record data to classify diseases in an interpretable way, e.g., based on electronic health record data. Here, the data source may be health record data of patients including labels for certain diseases. Additionally, if available, expert knowledge can be provided and integrated into the SCBC classifier model. The model may be trained on the data to provide an interpretable and provably robust classification of diseases. As the output of the model is not a probability vector that sums up to 1 but a vector where each element is a probability, the network model can be trained to classify multiple diseases per patient (several probabilities for different diseases can be close to 1 simultaneously). This is advantageous since some sick patients have multiple diseases. The output of model training may thus be a network model that is fully interpretable and provably robust, so that given an electronic health record a robust and interpretable decision can be provided. The trained system can then be used to automatically diagnose patients and prescribe medications under supervision of medical experts. [0060] Fig.8 shows an SCBC data processing system 800, comprising data processing circuitry 810 coupled to memory 820 and configured to initialize and execute an SCBC data processing architecture as disclosed herein for processing of input data to be classified into one or more classes.
The memory 820 may be configured for storing the set of detection components, the set of reasoning probabilities, and the set of class-specific prior probabilities for the detection components of the SCBC data processing architecture, as discussed above. Further, the data processing system 800 may be configured to train an SCBC classifier model based on the SCBC data processing architecture disclosed herein, e.g., based on the stored set of detection components, based on the stored set of reasoning probabilities, based on the set of class-specific prior probabilities, and based on a training data set comprising a plurality of class-labeled input data as discussed above. Further, training of the SCBC classifier model may comprise determining a trained set of detection components, a trained set of reasoning probabilities, and / or a trained set of class-specific prior probabilities. As also discussed above, some of these model parameters may also be provided by expert knowledge and may be kept fixed during training. [0061] In some implementations, training may also comprise training the SCBC classifier model end-to-end by optimizing a loss function that includes a probability gap and / or a lower bound for prediction robustness – as discussed in detail for some examples above. Training the SCBC classifier model may also comprise maximizing the class prediction probability for the correct class indicated by the class labels of the class-labeled input data of the training data set. Moreover, maximizing the class prediction probability for the correct class may comprise optimizing a margin loss function depending on a difference between the class prediction probability for the correct class and a class prediction probability for a highest probable incorrect class, as well as on a hyperparameter that controls a probability gap of the SCBC classifier model.
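A minimal training step on the margin loss can be sketched as follows. This is a simplified illustration that updates only the reasoning probabilities with analytic gradients and a projection onto [0,1]; the full method trains components and priors end-to-end, and all concrete values, names, and hyperparameters here are assumptions:

```python
import numpy as np

def scbc_probs(d, r, b):
    """Class probabilities p_c(x) for all classes; r, b have shape (C, K)."""
    return np.sum(b * ((2.0 * r - 1.0) * d - r + 1.0), axis=1)

def margin_loss_and_grad(d, r, b, y, beta=0.3):
    """Margin loss max(p_chat - p_y + beta, 0) and its gradient w.r.t. r."""
    p = scbc_probs(d, r, b)
    wrong = [c for c in range(len(p)) if c != y]
    c_hat = wrong[int(np.argmax(p[wrong]))]   # highest probable incorrect class
    loss = max(p[c_hat] - p[y] + beta, 0.0)
    grad = np.zeros_like(r)
    if loss > 0.0:
        grad[c_hat] = b[c_hat] * (2.0 * d - 1.0)   # d p_chat / d r_chat
        grad[y] = -b[y] * (2.0 * d - 1.0)          # -d p_y / d r_y
    return loss, grad

rng = np.random.default_rng(1)
d = rng.uniform(size=4)                  # fixed detection probabilities
r = rng.uniform(size=(3, 4))             # reasoning probabilities (trained)
b = rng.dirichlet(np.ones(4), size=3)    # class-specific priors (fixed here)
y = 0                                    # correct class label

loss_start, _ = margin_loss_and_grad(d, r, b, y)
for _ in range(200):                     # projected gradient descent on r
    loss, grad = margin_loss_and_grad(d, r, b, y)
    if loss == 0.0:
        break
    r = np.clip(r - 0.5 * grad, 0.0, 1.0)
loss_end, _ = margin_loss_and_grad(d, r, b, y)
assert loss_end <= loss_start            # the probability gap never worsens
```

Each step increases p_y and decreases the currently strongest incorrect class, so the loss is non-increasing; in a full implementation the same gradient flow would also reach the components and priors.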
[0062] The SCBC data processing system disclosed herein may further be configured to determine one or more explanations for the trained set of detection components, the trained set of reasoning probabilities, the trained set of class-specific prior probabilities, a reasoning process, and/or a classification robustness, and to display the one or more explanations to facilitate debugging and understanding of the SCBC classifier model by an expert user – as also discussed in more detail above. [0063] Fig.9 shows a flow chart for an exemplary method for obtaining an SCBC classifier model based on the SCBC architecture disclosed herein. In step 910, a training data set comprising a plurality of class-labeled input data is obtained, e.g., a set of medical images with labels for a certain disease, etc. In step 920, the trained SCBC classifier model is generated by performing a model parameter training algorithm, using the SCBC data processing system discussed above and the obtained training data set. [0064] Fig.10 shows a flow chart for an exemplary method for classifying input data into one or more classes using an SCBC classifier model as disclosed herein. The method comprises initializing, in step 1010, a trained SCBC classifier model generated by a model training method as disclosed herein; classifying, in step 1020, the input data by processing the input data with the initialized trained SCBC classifier model; and outputting, in step 1030, one or more classification probabilities for the one or more classes based on the classifying.
In some implementations, the method may further comprise determining a set of significant detection components and, optionally, a set of corresponding reasoning probabilities and class-specific prior probabilities, and outputting one or more explanations for the learned components, the learned detection probabilities, the reasoning process, and/or the classification robustness to facilitate debugging and understanding of the SCBC classifier model as discussed above. [0065] Fig.11 shows a data processing system 1100 for classifying input data into one or more classes using an SCBC classifier model as disclosed herein. The data processing system 1100 comprises processing circuitry 1110 coupled to memory 1120 and is configured to classify the input data by executing a method for classifying input data into one or more classes using an SCBC classifier model as disclosed herein, e.g., a model trained by the method of Fig.9. The present disclosure also relates to a computer program comprising instructions which, when executed by a data processing system, carry out the steps of any of the methods disclosed herein. [0066] The foregoing disclosure provides illustration and description but is not intended to be exhaustive or to limit the aspects to the precise form disclosed. Modifications and variations may be made in light of the above disclosure or may be acquired from practice of the aspects. As used herein, the term component is intended to be broadly construed as hardware, software, or a combination of hardware and software. As used herein, a processor or processing circuitry is implemented in hardware, software, or a combination of hardware and software. [0067] It will be apparent that systems, models, data processing architectures, and/or methods described herein may be implemented in different forms of hardware, firmware, or a combination of hardware and software.
The actual specialized control hardware or software code used to implement these systems, models, data processing architectures, and/or methods is not limiting of the aspects. Thus, the operation and behavior of the systems, models, data processing architectures, and/or methods were described herein without reference to specific software code—it being understood that software and / or hardware can be designed to implement the systems, models, data processing architectures, and/or methods based on the description herein. [0068] Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of various aspects. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of various aspects includes each dependent claim in combination with every other claim in the claim set. A phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a- c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c). [0069] No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. 
Also, as used herein, the articles “a” and “an” are intended to include one or more items, and may be used interchangeably with “one or more.” Furthermore, as used herein, the terms “set” and “group” are intended to include one or more items (e.g., related items, unrelated items, a combination of related and unrelated items, and/or the like), and may be used interchangeably with “one or more.” Where only one item is intended, the phrase “only one” or similar language is used. Also, as used herein, the terms “has,” “have,” “having,” and/or the like are intended to be open-ended terms. [0070] As used herein, the phrase “based on” shall not be construed as a reference to a closed set of information, one or more conditions, one or more factors, or the like. In other words, the phrase “based on A” (where “A” may be information, a condition, a factor, or the like) shall be construed as “based at least on A” unless specifically recited differently. [0071] As used herein, the term “or” is an inclusive “or” unless limiting language is used relative to the alternatives listed. For example, reference to “X being based on A or B” shall be construed as including within its scope X being based on A, X being based on B, and X being based on A and B. In this regard, reference to “X being based on A or B” refers to “at least one of A or B” or “one or more of A or B” due to “or” being inclusive. Similarly, reference to “X being based on A, B, or C” shall be construed as including within its scope X being based on A, X being based on B, X being based on C, X being based on A and B, X being based on A and C, X being based on B and C, and X being based on A, B, and C. In this regard, reference to “X being based on A, B, or C” refers to “at least one of A, B, or C” or “one or more of A, B, or C” due to “or” being inclusive.
As an example of limiting language, reference to “X being based on only one of A or B” shall be construed as including within its scope X being based on A as well as X being based on B, but not X being based on A and B. [0072] Further, process diagrams such as Fig.9, and Fig.10 do not necessarily indicate a particular order or sequence of steps. For example, steps may also be performed in a different order or, if hardware capabilities allow it, simultaneously, without deviating from the scope of the present disclosure.

Claims

August 12, 2024 NEC Laboratories Europe GmbH N173901WO KAU/BMN

1. A stable classification by components (SCBC) data processing architecture, configured to classify input data (505) into one or more classes, comprising: a component detection module (510) configured to: compare the input data to a set of detection components, representing data patterns relevant for the classification, and determine a detection probability for each detection component based on the comparison; and a probabilistic reasoning module (520) configured to: compute one or more class prediction probabilities (530) for the one or more classes based on the determined detection probabilities, a set of class-specific prior probabilities for the determined detection probabilities, and a set of class-specific reasoning probabilities for the determined detection probabilities.

2. SCBC data processing architecture of claim 1, wherein the component detection module (510) is configured to output a detection probability vector d(x) with dimension K as input to the probabilistic reasoning module (520), wherein K equals the number of detection components; and wherein the probabilistic reasoning module (520) is configured to compute a class prediction probability pc(x) for each class based on the detection probability vector d(x), based on a set of C reasoning probability vectors rc with dimension K, wherein C equals the number of classes, and based on a set of C class-specific prior probability vectors bc with dimension K such that: pc(x) is not invariant to scaling with respect to rc; and the computation of pc(x) does not involve a numerical division operation.
3. SCBC data processing architecture of claim 2, wherein the class prediction probability pc(x) for each class is computed by [formula not reproduced in this text], wherein the operator ∘ is a Hadamard product, the operator ∙ a scalar product, and T a transpose operation.

4. SCBC data processing architecture of any of claims 1 to 3, further comprising a binary gate module (610) for gating of forwarding of one or more of the determined detection probabilities based on a gating threshold value.

5. SCBC data processing architecture of any of claims 1 to 4, further comprising one or more additional component detection modules (620) and one or more corresponding additional probabilistic reasoning modules (630) sequenced in a series of stages, each including a corresponding additional binary gating module (640).

6. An SCBC data processing system (900), comprising: data processing circuitry (910) coupled to memory (920) and configured to initialize and execute the SCBC data processing architecture of any of claims 1 to 5 for processing of input data (505) to be classified into one or more classes; wherein the memory is configured for storing the set of detection components, the set of reasoning probabilities, and the set of class-specific prior probabilities for the detection components of the SCBC data processing architecture; and wherein the data processing system is configured to train an SCBC classifier model based on the SCBC data processing architecture, based on the stored set of detection components, based on the stored set of reasoning probabilities, based on the set of class-specific prior probabilities, and based on a training data set comprising a plurality of class-labeled input data, wherein training the SCBC classifier model comprises determining a trained set of detection components, a trained set of reasoning probabilities, and/or a trained set of class-specific prior probabilities.
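The formula of claim 3 is not reproduced in this text. The sketch below shows one computation that satisfies the constraints stated in claims 2 and 3, using only Hadamard and scalar products, involving no division, and being linear (hence not scale-invariant) in the reasoning vector rc; this concrete convex-combination form is our assumption, not necessarily the claimed formula.

```python
import numpy as np

def class_probability(d, r_c, b_c):
    """One concrete formula meeting the constraints of claims 2 and 3.

    Assumed form (not the published formula): a per-component agreement
    r_c * d + (1 - r_c) * (1 - d), weighted by the class-specific
    priors b_c. It uses only a Hadamard product and a scalar product,
    involves no division, and is linear in r_c.
    """
    # Agreement: r_c rewards detection, (1 - r_c) rewards absence.
    agreement = r_c * d + (1.0 - r_c) * (1.0 - d)
    # Class-specific priors b_c weight the per-component agreement.
    return float(b_c @ agreement)

d = np.array([0.9, 0.2, 0.1])    # detection probabilities d(x)
r_c = np.array([1.0, 0.0, 0.5])  # reasoning probabilities for class c
b_c = np.array([0.5, 0.3, 0.2])  # class-specific priors (sum to 1)
p = class_probability(d, r_c, b_c)
```

Because b_c is a probability vector and the agreement entries lie in [0, 1], the resulting p is itself a probability, which matches the role of pc(x) as a class prediction probability in claim 1.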
7. SCBC data processing system of claim 6, wherein training comprises training the SCBC classifier model end-to-end by optimizing a loss function that includes a probability gap and/or a lower bound for prediction robustness.

8. SCBC data processing system of claim 7, wherein training the SCBC classifier model comprises maximizing the class prediction probability for the correct class indicated by the class labels of the class-labeled input data.

9. SCBC data processing system of claim 8, wherein maximizing the class prediction probability for the correct class comprises optimizing a margin loss function depending on a difference between the class prediction probability for the correct class and a class prediction probability for a highest probable incorrect class, as well as on a hyperparameter that controls a probability gap of the SCBC classifier model.

10. SCBC data processing system of any of the preceding claims 6 to 9, further configured to: determine one or more explanations for the trained set of detection components, the trained set of reasoning probabilities, the trained set of class-specific prior probabilities, a reasoning process, and/or a classification robustness; and display the one or more explanations to facilitate debugging and understanding of the SCBC classifier model by an expert user.

11. A method for generating a trained SCBC classifier model, comprising: obtaining (1010) a training data set comprising a plurality of class-labeled input data; and generating (1020) the trained SCBC classifier model by performing a model parameter training algorithm, using the SCBC data processing system of any of claims 6 to 10 and the obtained training data set.

12.
A method for classifying input data (505) into one or more classes, comprising: initializing (1120) a trained SCBC classifier model; classifying (1130) the input data by processing the input data with the initialized trained SCBC classifier model; and outputting (1140) one or more classification probabilities for the one or more classes based on the classifying.

13. Method of claim 12, further comprising: determining a set of significant detection components and, optionally, a set of corresponding reasoning probabilities and class-specific prior probabilities; and outputting one or more explanations for the learned components, the learned detection probabilities, the reasoning process, and/or the classification robustness to facilitate debugging and understanding of the SCBC classifier model.

14. A data processing system (1200) for classifying input data into one or more classes, comprising processing circuitry (1210) coupled to memory (1220), and being configured to classify the input data by executing the steps of any of the methods of claims 12 or 13.

15. Computer program comprising instructions which, when executed by a data processing system, carry out the steps of the method of claim 11 or of the method of any of claims 12 or 13.
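The margin loss of claim 9 can be sketched as a hinge on the probability gap between the correct class and the highest probable incorrect class; the function name, the hinge form, and the value of the hyperparameter gamma below are illustrative assumptions, not taken from the claims.

```python
import numpy as np

def margin_loss(p, y, gamma=0.3):
    """Margin loss sketched from claim 9 (names and gamma are ours).

    Penalises the case where the correct-class probability does not
    exceed the best incorrect class by at least gamma, the
    hyperparameter controlling the probability gap of the classifier.
    """
    p = np.asarray(p, dtype=float)
    p_true = p[y]
    # Highest class prediction probability among the incorrect classes.
    p_wrong = np.max(np.delete(p, y))
    # Hinge on the probability gap; zero once the gap exceeds gamma.
    return max(0.0, gamma - (p_true - p_wrong))

loss = margin_loss([0.7, 0.2, 0.1], y=0, gamma=0.3)
```

Minimizing this loss drives the gap p_true - p_wrong above gamma, which is one way to realize the maximization of the correct-class prediction probability described in claim 8.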
PCT/EP2024/072755 2024-02-06 2024-08-12 Stable classification by components for interpretable machine learning Pending WO2025168228A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP24156043 2024-02-06
EP24156043.2 2024-02-06

Publications (1)

Publication Number Publication Date
WO2025168228A1 true WO2025168228A1 (en) 2025-08-14

Family

ID=89853452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/072755 Pending WO2025168228A1 (en) 2024-02-06 2024-08-12 Stable classification by components for interpretable machine learning

Country Status (1)

Country Link
WO (1) WO2025168228A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102019119087A1 (en) 2019-07-15 2021-01-21 Dr. Ing. H.C. F. Porsche Aktiengesellschaft COMPONENT-BASED PROCESSING OF INPUT SIZES
WO2024083360A1 (en) 2022-10-21 2024-04-25 NEC Laboratories Europe GmbH Chained classification by components for interpretable machine learning


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Jiang, Liangxiao, et al.: "Class-specific attribute weighted naive Bayes", Pattern Recognition, vol. 88, 1 April 2019, pages 321-330, ISSN 0031-3203, [retrieved on 20241113], DOI: 10.1016/j.patcog.2018.11.032 *
Lisboa, P.J.G., Saralajew, S., et al.: "The coming of age of interpretable and explainable machine learning models", Neurocomputing, 2023, Retrieved from the Internet <URL:https://www.sciencedirect.com/science/article/abs/pii/S0925231223001893>
Marcinkevičs, Ričards, Vogt, Julia E.: "Interpretable and explainable machine learning: a methods-centric overview with concrete examples", Retrieved from the Internet <URL:https://wires.onlinelibrary.wiley.com/doi/epdf/10.1002/widm.1493>
Saralajew et al.: "Classification-by-Components: Probabilistic Modeling of Reasoning over a Set of Components", NeurIPS, 2019
Saralajew et al.: "Fast Adversarial Robustness Certification of Nearest Prototype Classifiers for Arbitrary Seminorms", NeurIPS, 2020
Saralajew, Sascha, et al.: "Classification-by-Components: Probabilistic Modeling of Reasoning over a Set of Components", Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 14 December 2019, ISBN 978-1-7138-0793-3, Retrieved from the Internet <URL:https://proceedings.neurips.cc/paper_files/paper/2019/hash/dca5672ff3444c7e997aa9a2c4eb2094-Abstract.html> [retrieved on 20230629] *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24755272

Country of ref document: EP

Kind code of ref document: A1