Denise Müller


Literature Summary - Explainable AI for Decision Making

This literature summary was written for a university seminar on Explainable AI

Introduction

With the increased use of AI systems, the subfield of Explainable AI has gained significance in recent years. As these systems enter more and more critical fields, machine learning models need to be trustworthy and reliable in order to be applicable. However, many models operate as black boxes whose decision-making is difficult for humans to fully comprehend. As models reach high accuracies, the seemingly ‘high intelligence’ of these systems can be deceiving. One phenomenon related to this behavior is the Clever Hans Effect (see Lapuschkin et al. (2019)).

In the following literature summary, I will first describe the Clever Hans Effect, then explore methods for model explanation, and lastly discuss ways in which systems can be taught valid decision strategies.

Explaining ‘Clever Hans’ behavior

A variety of bugs can occur in a machine learning environment. Adebayo et al. (2020) categorize these into model contamination bugs, data contamination bugs and test-time contamination bugs. Data contamination bugs, or more precisely spurious relations within the data, cause the Clever Hans Effect: through them, models develop invalid decision strategies, which is typically known as Clever Hans behavior. For instance, a dataset in which birds are always shown against a blue sky may lead the model to wrongly associate a blue sky with birds.

To understand bugs and their causes, explanation methods are frequently used tools. However, in order to trust an explanation method, its effectiveness in revealing a bug needs to be established. Adebayo et al. (2020) take on this issue by comparing how effectively explanation methods uncover bugs of all three types. These methods include gradient-based, surrogate, and modified-backpropagation approaches. For the data contamination bugs, the results are ambivalent: while the explanation methods do not prove helpful for mislabelled data, all of them reveal a model’s invalid decision strategy based on spurious correlations.

However, as mentioned, Clever Hans behavior is caused specifically by spurious correlations. Therefore, when Clever Hans behavior is suspected, explanation methods can effectively assist in detecting decision strategies based on spurious correlations.

While there is a range of explanation methods, one popular method is Gradient-weighted Class Activation Mapping (Grad-CAM), presented by Selvaraju et al. (2020). It visualizes explanations of the decision strategies of a large class of Convolutional Neural Network (CNN) based models. Grad-CAM is designed to contribute to the interpretability and transparency of models by producing feature attribution masks in the shape of heatmaps. This allows a user to trace a model’s decision back to the most influential image regions. There have been other visualization methods before; however, Grad-CAM stands out by being class-discriminative (meaning distinct for each class) and specific to a single input image.

At a high level, Grad-CAM works by leveraging the gradients of a target class flowing into the final convolutional layer of a CNN to produce a heatmap that highlights the regions of the input image most relevant for the class prediction. This makes Grad-CAM highly class-discriminative and enables humans to correctly identify the object most relevant to the prediction of a given class. Furthermore, it can be used to explain model biases and failures.
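
To make this concrete, the following minimal sketch shows how such a heatmap could be computed for a PyTorch image classifier. The function name, the hook-based capture of activations, and the argument conv_layer (assumed to be the model’s last convolutional layer) are illustrative assumptions, not the authors’ reference implementation.

    import torch.nn.functional as F

    def grad_cam(model, image, target_class, conv_layer):
        # Capture the feature maps of the chosen layer and the gradients flowing into it.
        activations, gradients = [], []
        fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
        bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

        scores = model(image.unsqueeze(0))      # shape (1, num_classes)
        scores[0, target_class].backward()      # gradient of the target-class score
        fwd.remove()
        bwd.remove()

        acts, grads = activations[0], gradients[0]        # shape (1, C, H, W)
        weights = grads.mean(dim=(2, 3), keepdim=True)    # global-average-pool the gradients
        cam = F.relu((weights * acts).sum(dim=1))         # weighted combination, keep positive evidence
        cam = F.interpolate(cam.unsqueeze(0), size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze()       # normalized heatmap over the input image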

Grad-CAM can be incorporated into frameworks requiring visual explanations of CNN-based model decisions. While it is frequently used to reveal the decision strategies of models, it requires, like other explanation methods, manual comparison of the feature attribution masks it generates, which is labor-intensive and time-inefficient.

Lapuschkin et al. (2019) present a method with which some of this work can be automated without losing, and even while gaining, information. Spectral Relevance Analysis (SpRAy) is a semi-automated analysis technique that embeds the explanation method LRP (Layer-wise Relevance Propagation) and, on the basis of heatmaps, identifies a wide spectrum of learned decision strategies. SpRAy consists of three steps. First, feature attribution masks of multiple data samples and object classes are generated with LRP. Then, a cluster analysis based on eigenvalues reveals and groups similar prediction strategies of the model. Finally, the different clusters can be made accessible by visualizing them using t-SNE.
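
The three steps can be pictured with the following minimal Python sketch. Here lrp_heatmap is a placeholder for any LRP implementation that maps an image to a fixed-size relevance map, and scikit-learn’s spectral clustering stands in for the eigenvalue-based cluster analysis described in the paper.

    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.manifold import TSNE

    def spray(images, lrp_heatmap, n_clusters=5):
        # Step 1: one LRP relevance map per sample, flattened into a feature vector.
        heatmaps = np.stack([lrp_heatmap(img).ravel() for img in images])

        # Step 2: spectral clustering groups samples whose relevance maps, and hence
        # whose decision strategies, look alike.
        labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="nearest_neighbors").fit_predict(heatmaps)

        # Step 3: a two-dimensional t-SNE embedding makes the clusters easy to inspect visually.
        embedding = TSNE(n_components=2).fit_transform(heatmaps)
        return labels, embedding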

With SpRAy, an inspection can be carried out after a large number of images has been evaluated. Overall, the clusters provide a summary of all possible prediction strategies, including those inconspicuous to the human eye. Applying SpRAy to the training process can therefore be considered a useful tool to minimize the risk of overlooking invalid prediction strategies. Altogether, SpRAy facilitates the systematic investigation of classifier behavior on a large scale. Even though there is still manual work involved, SpRAy’s approach to automated and scalable model analysis contributes to the applicability of explanation methods.

Even when invalid decision strategies have been detected and categorized, this solves only part of the problem. Spurious relations in datasets are hard to avoid, as datasets often encompass thousands of samples and are therefore difficult to review and compare manually. Thus, in order for a model to make reasonable predictions, strategies have to be implemented with which a model learns to base its decisions not on the ‘wrong reasons’ but on ‘the right reasons’.

Counteracting invalid strategies

This is the issue tackled by Ross et al. (2017). The developed approach aims to explain and counter Clever Hans behavior efficiently by examining and selectively penalizing a model’s input gradients. The focus on gradient-based methods was driven by the need for scalable methods for differentiable models, for which options to steer classifiers away from decision strategies based on spurious relations were lacking. Previous gradient-based methods aimed to encourage robustness and sparsity but did not tackle confounders.

As stated in Ross et al. (2017), gradients can be used as local explanations because they describe the model’s decision boundary. In other words, a high gradient magnitude shows the importance of an input feature to the model’s prediction. Regularizing the model by constraining its input gradients optimizes it to learn correct explanations. In the “Right for the Right Reasons” loss (RRR), this is done using a binary mask with which high gradient magnitudes in undesired regions are additionally penalized.
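
A minimal sketch of such a loss for a differentiable PyTorch classifier is shown below. The function name, the weighting factor lam and the omission of the paper’s weight-decay term are illustrative choices rather than the reference implementation; mask is assumed to be 1 in regions the model should not rely on and 0 elsewhere.

    import torch
    import torch.nn.functional as F

    def rrr_loss(model, x, y, mask, lam=100.0):
        # "Right answers": the usual cross-entropy between predictions and true labels.
        x = x.clone().requires_grad_(True)
        log_probs = F.log_softmax(model(x), dim=1)
        right_answers = F.nll_loss(log_probs, y)

        # "Right reasons": the gradient of the summed log-probabilities with respect to
        # the input, penalized wherever the binary mask marks a region as irrelevant.
        input_grad = torch.autograd.grad(log_probs.sum(), x, create_graph=True)[0]
        right_reasons = (mask * input_grad).pow(2).sum()

        return right_answers + lam * right_reasons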

Similar to SpRAy, Ross et al. (2017) also propose a semi-automatic method with which, using the RRR loss, alternative decision strategies can be explored in an unsupervised manner. In this process, human annotators create the binary mask typically used for the RRR loss, which is then iteratively self-adapted until the accuracy decreases or the explanations stop changing. New decision strategies are thus discovered by restricting one strategy and thereby forcing the model to use another.

However, while SpRAy clusters explanations, the semi-automatic readjustment of the RRR loss delivers a range of reasonable decision strategies. The two methods can therefore complement, but not substitute, each other. In any case, both improve efficiency by combining careful analysis with fast techniques such as the quickly computed RRR loss. Even though these approaches reduce manual work and human intervention, the paper concludes from a more human-centric perspective: it suggests that models should use interactive human guidance to choose better decision-making strategies.

Involving humans in the loop

One paper further discussing this approach is Selvaraju et al. (2019), which presents HINT (Human Importance-aware Network Training). HINT uses human demonstrations to improve visual grounding in vision-and-language models. Visual grounding describes the task of linking natural language to a corresponding image region. However, vision-and-language models often fail to ground the visual concepts associated with words in the right image regions. For instance, models often incorrectly answer the question “What color is the banana?” with yellow, regardless of the actual color in the image. HINT aims to make Visual Question Answering (VQA) models refer to visual concepts instead of relying on confounding language priors. Using human attention, HINT encourages deep networks to focus on the same input regions as humans by enforcing a ranking loss between human annotations and gradient-based explanations. As discussed by Adebayo et al. (2020), HINT extends the RRR loss by aligning gradient-based explanations with human attention. Human attention is mapped onto a matrix with values from 0 to 1, and network importance is determined by calculating gradients with respect to the ground-truth answer. A ranking loss then penalizes misranked pairs using the absolute difference between the network and human importance scores.
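
One possible reading of such a ranking loss is sketched below in PyTorch. Here network_importance is assumed to hold the gradient-based importance score of each image region and human_importance the corresponding human attention values; the exact penalty used in the paper may differ.

    import torch

    def hint_ranking_loss(network_importance, human_importance):
        # Pairwise score differences according to the network and according to the human map.
        net_diff = network_importance.unsqueeze(0) - network_importance.unsqueeze(1)
        human_diff = human_importance.unsqueeze(0) - human_importance.unsqueeze(1)

        # A pair of regions is misranked when the two orderings disagree in sign.
        misranked = (net_diff * human_diff) < 0

        # Penalize each misranked pair by how strongly the network's scores disagree.
        return net_diff.abs()[misranked].sum()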

HINT was tested on VQA and image captioning tasks and showed a notable improvement in model performance with minimal use of human-importance maps, while enhancing visual grounding. A user study comparing trust in the HINTed model and the base model for image captioning found that users significantly preferred the HINTed model’s predictions and focus regions, demonstrating its increased reliability. HINT underlines the significance of gradient-based penalization for aligning a model’s focus with appropriate image regions. Compared to the ‘Right for the Right Reasons’ loss of Ross et al. (2017), it offers a more human-centered approach to defining attention masks for regularization.

Another approach to including humans in the loop is shown by Schramowski et al. (2024) through ‘explanatory interactive learning’ (XIL). In contrast to HINT, XIL is not a method in which humans contribute to the optimization of the model independently of training. In XIL, the human expert is involved directly in the training process, allowing them to interactively revise the original model by providing feedback on its explanations. The feedback is then used to automatically augment the training data with counterexamples or to modify the model using a revised regularization term.

The foundational concept of XIL is to entrust the human user with the revision of a model’s decision strategies. This is implemented in the framework CAIPI using counterexamples, later supplemented by the RRR loss as an alternative correction method. In CAIPI, an active learner selects an unlabeled instance and asks the explainer to explain the prediction ŷ = f(x). The instance, along with the prediction and explanation (x, ŷ, ẑ), is handed to the user, who can then initiate the correction of the model.
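
The interaction protocol can be pictured with the following Python sketch. Every callable argument (select_query, explain, query_user, make_counterexamples) and the scikit-learn-style fit/predict interface are placeholders for project-specific components, not the actual CAIPI code.

    def caipi_loop(model, select_query, explain, query_user, make_counterexamples,
                   labelled, unlabelled, rounds=10):
        # Each round: query an instance, explain the prediction, collect feedback,
        # and retrain on the data augmented with counterexamples.
        for _ in range(rounds):
            x = select_query(model, unlabelled)       # active learner picks an unlabelled instance
            y_hat = model.predict([x])[0]             # prediction y_hat = f(x)
            z_hat = explain(model, x)                 # explanation of that prediction
            feedback = query_user(x, y_hat, z_hat)    # user inspects (x, y_hat, z_hat)

            # Feedback on wrong reasons becomes counterexamples: variants of x in which
            # the wrongly used features are altered while the correct label is kept.
            labelled.extend(make_counterexamples(x, y_hat, feedback))
            X, y = zip(*labelled)
            model.fit(list(X), list(y))               # retrain on the augmented data
        return model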

XIL demonstrates its effectiveness across various datasets, notably increasing test-phase accuracy with more counterexamples, and confirms that trust in highly accurate machines drops when incorrect behavior is observed. Furthermore, unlike HINT, XIL does not require the user to know exactly what to focus on; this can be found out in the process. It therefore allows a more subtle approximation of the correct decision-making strategy. In comparison to other methods, however, XIL is inefficient and time-consuming: involving a user directly in the training loop and acquiring annotations of explanations naturally slows the process down.

Conclusion

Along with the overall progression of machine learning, a variety of methods has been developed to analyze and correct a model’s behavior. Fortunately, these methods complement each other in various ways, allowing humans to better comprehend a model’s decision-making process and to intervene if necessary. While explanation methods are generally successful at revealing Clever Hans behavior, there remains the possibility that invalid decision strategies have slipped through the cracks (e.g. when the model was not exposed to data that would contradict its invalid decision strategy during test time). Linhardt et al. (2024) provide an approach to tackle this issue. As challenges in the transparency of machine learning models persist, trust studies alongside these technical advancements stress the importance and value of methods involving humans in a model’s training loop. This enables human oversight and helps ensure the reliability and accountability of machine learning systems in the future.


Published: 06/07/24
