\subtitle{Explaining the Predictions of Any Classifier}
\author{Casper Vestergaard Kristensen \and Alexander Munch-Hansen}
\institute{Aarhus University}
\date{\today}
\begin{document}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}
\setbeamertemplate{section in toc}[sections numbered]
\frametitle{Outline}
\setstretch{0.5}
\tableofcontents
\end{frame}
\section{Meta information}
%\subsection{Authors}
\begin{frame}
\frametitle{Authors}
\begin{itemize}
\item Marco Tulio Ribeiro
\item Sameer Singh
\item Carlos Guestrin
\end{itemize}
\end{frame}
%\subsection{Publishing}
\begin{frame}[fragile]
\frametitle{Publishing}
\begin{itemize}
\item Conference Paper, Research
\item KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
\begin{itemize}
\item A premier interdisciplinary conference that brings together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data
\item SIGKDD has the highest h5-index of any conference involving databases or data in general
\item Highly trusted source
\end{itemize}
\end{itemize}
\end{frame}
\section{Article}
%\subsection{Problem}
\begin{frame}
\frametitle{Problem definition}
\begin{itemize}
\item People often use Machine Learning models for predictions
\item Blindly trusting a prediction can lead to poor decision making
\item We seek to understand the reasons behind predictions
% It becomes clear the dataset has issues, as there is a fake correlation between the header information and the class Atheism. It is also clear what the problems are, and the steps that can be taken to fix these issues and train a more trustworthy classifier.
\item LIME explains the predictions of \emph{any} classifier or regressor in a faithful way by approximating it locally with an \emph{interpretable} model.
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Sampling for Local Exploration}
% Note: g acts in d' while f acts in d, so when we have z' in dimension d', it belongs to the model g; we can recover the corresponding z in the original representation, i.e.\ the one f acts on in dimension d.
Goal: minimize $\mathcal{L}(f,g,\pi_x)$ without making any assumptions about $f$
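% Sketch of the full objective from the paper: the explanation trades off the local fidelity term $\mathcal{L}$ against the complexity $\Omega(g)$ of the interpretable model $g \in G$.
\[
\xi(x) = \arg\min_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)
\]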
\begin{itemize}
\item For a sample $x$, we need to draw samples around $x$
\item Accomplished by drawing nonzero elements of $x^\prime$ uniformly at random, resulting in perturbed samples $z^\prime$
\item Given $z^\prime \in \{0,1\}^{d^\prime}$, we recover the sample $z \in \mathbb{R}^d$ in the original representation and compute $f(z)$, which is used as the label for $z^\prime$ (each sample weighted by $\pi_x(z)$, see below).
% Talk through the algorithm, discussing the sampling and K-Lasso (least absolute shrinkage and selection operator), which is used for feature selection
\item The time/patience of humans is modeled by a budget \emph{B}, which denotes the number of explanations a human is willing to sit through.
\item Given a set of instances \textbf{X}, we define the \emph{pick step} as the task of selecting \emph{B} instances for the user to inspect.
\end{itemize}
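% A sketch of the locally weighted square loss used in the paper: perturbed samples are weighted by an exponential kernel over a distance measure $D$ (e.g.\ cosine distance for text) with kernel width $\sigma$.
\[
\mathcal{L}(f, g, \pi_x) = \sum_{z, z^\prime \in \mathcal{Z}} \pi_x(z)\, \bigl(f(z) - g(z^\prime)\bigr)^2
\qquad
\pi_x(z) = \exp\bigl(-D(x, z)^2 / \sigma^2\bigr)
\]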
\end{frame}
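% A sketch of the sampling procedure referred to in the note above (Algorithm 1 in the paper: sparse linear explanations with K-LASSO); $N$ and $K$ follow the paper's notation.
\begin{frame}
\frametitle{Sparse Linear Explanations (K-LASSO)}
\begin{itemize}
\item Draw $N$ perturbed samples $z^\prime_i$ around $x^\prime$
\item For each sample, recover $z_i$, then compute the label $f(z_i)$ and the weight $\pi_x(z_i)$
\item Select $K$ features with LASSO (K-LASSO), then fit a weighted linear model $g$ over those features
\item The weights of $g$ constitute the explanation
\end{itemize}
\end{frame}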
\begin{frame}
\frametitle{The pick step}
The task of selecting \emph{B} instances for the user to inspect
\begin{itemize}
\item Not dependent on the existence of explanations
\item So it should not assist users in selecting instances themselves
\item Looking at raw data is not enough to understand predictions and gain insight
\item Should take into account the explanations that accompany each prediction
\item Should pick a diverse, representative set of explanations to show the user, i.e.\ non-redundant explanations that represent how the model behaves globally (formalized as the \emph{submodular pick} on a later slide)
\end{itemize}
\end{frame}
% This is a matrix of instances and their features, described by a binary indicator s.t. an instance either has a feature or does not. The blue line marks the most inherent feature, which is important, as it is found in most of the instances. The red lines indicate the two samples which are most important in explaining the model. Importance is computed as I_j = sqrt(sum_{i=1}^n W_ij).
% c is a coverage function, which computes the total importance of the features that appear in at least one instance in a set V.
% Note: maximizing a weighted coverage function is NP-hard, but the version used in the algorithm is iteratively greedy, so it just adds the instance with the maximum marginal gain, which offers a constant-factor approximation guarantee of 1-1/e to the optimum.
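% The notes above refer to the explanation matrix W, the per-feature importance I_j, and the coverage function c; a sketch of these quantities in the paper's notation, with budget B:
\begin{frame}
\frametitle{Submodular Pick (SP-LIME)}
\begin{itemize}
\item Global importance of feature $j$: $I_j = \sqrt{\sum_{i=1}^{n} \mathcal{W}_{ij}}$
\item Coverage of a set of instances $V$: $c(V, \mathcal{W}, I) = \sum_{j=1}^{d^\prime} \mathbf{1}_{[\exists i \in V :\, \mathcal{W}_{ij} > 0]}\, I_j$
\item Pick step: choose the set $V$ with $|V| \le B$ that maximizes $c(V, \mathcal{W}, I)$
\item Maximizing $c$ is NP-hard, but greedily adding the instance with the largest marginal gain gives a $1 - 1/e$ approximation guarantee
\end{itemize}
\end{frame}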
\begin{frame}
\frametitle{Experimental setup}
\begin{itemize}
\item Use two datasets, \emph{books} and \emph{DVDs}, each with $2000$ instances.
\begin{itemize}
\item Task is to classify reviews as \emph{positive} or \emph{negative}
\end{itemize}
\item Decision Trees (\textbf{DT}), Logistic Regression (\textbf{LR}), Nearest Neighbours (\textbf{NN}), and SVMs with RBF kernel (\textbf{SVM}) are trained, all using bag-of-words (BoW) features.
\begin{itemize}
% Note, random forest will make no sense without any explanation system, such as LIME
\item Also train a random forest (\textbf{RF}) with $1000$ trees.
\end{itemize}
\item Each dataset is split into $1600$ instances for training and $400$ for testing.
\item Explanations from \textbf{LIME} are compared with those from \textbf{parzen}
\begin{itemize}
\item\textbf{parzen} approximates the black-box classifier globally and explains individual predictions by taking the gradient of the prediction probability function.
\item Both are also compared to a greedy method, where the most contributing features are removed until the prediction changes, as well as a random procedure.
% K explains the amount of words in the BoW model and the complexity of the model. Higher K => More complex but more faithful, lower k => Less complex, potentially less faithful
\item Faithfulness of explanations is measured on classifiers that are themselves interpretable, \textbf{LR} and \textbf{DT}.
\begin{itemize}
\item Both are trained s.t. they can use at most $10$ features; these features are treated as the \emph{gold standard} for which features are important.
\item For each prediction on the test set, explanations are produced and the fraction of gold features recovered is computed.
\end{itemize}
\end{itemize}
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Faithfulness}
% We observe that the greedy approach is comparable to parzen on logistic regression, but is substantially worse on decision trees since changing a single feature at a time often does not have an effect on the prediction. The overall recall by parzen is low, likely due to the difficulty in approximating the original highdimensional classifier. LIME consistently provides > 90% recall for both classifiers on both datasets, demonstrating that LIME explanations are faithful to the models.
% In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
% It seems somewhat unfair that random and greedy are mistrusted simply for having an untrustworthy feature in their explanation, while LIME and parzen only have to remain unchanged when these untrustworthy features are removed.
\begin{itemize}
\item Greedy is comparable to parzen on \textbf{LR}, but substantially worse on \textbf{DT}, since changing a single feature at a time often does not change the prediction
\item The overall recall of parzen is low, likely due to the difficulty of approximating the original high-dimensional classifier
\item LIME consistently provides $> 90\%$ recall for both classifiers on both datasets, i.e.\ its explanations are faithful to the models
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Can I trust this model?}
\begin{itemize}
\item Evaluate if explanations can be used for model selection
\item They add $10$ artificial ``noisy'' features s.t.
\begin{itemize}
\item In the training/validation data, each artificial feature appears in 10\% of the examples in one class and 20\% of the examples in the other class.
\item On the test instances, each artificial feature appears in 10\% of the examples in each class.
\end{itemize}
\item This results in models that use truly informative features, but also features that introduce spurious correlations.
\item Pairs of competing classifiers are created by repeatedly training pairs of random forests with $30$ trees until their validation accuracy is within $0.1\%$ of each other, but their test accuracy differs by at least $5\%$.
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Can I trust this model?}
% They evaluate whether the explanations can be used for model selection, simulating the case where a human has to decide between two competing models with similar accuracy on validation data.
% Accomplished by "marking" the artificial features found within the B instances seen as untrustworthy. We then evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy).
% SP-parzen and RP-parzen are omitted from the figure since they did not produce useful explanations, performing only slightly better than random. Is this ok?
\begin{itemize}
\item Simulates a human choosing between two competing models with similar validation accuracy
\item The artificial features found in the \emph{B} inspected explanations are marked as untrustworthy
\item We then evaluate how many predictions in the validation set should be trusted, treating only the marked features as untrustworthy
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Can we learn something from the explanations?}
% Hand-picked images create the correlation between wolf and snow, s.t. the classifier mispredicts whenever a husky is in snow or a wolf is without snow.
\begin{itemize}
\item A classifier distinguishing wolves from huskies is trained on hand-picked images, creating a correlation between wolves and snow
\item It mispredicts whenever a husky is in snow or a wolf is without snow
\item The explanations reveal that the classifier relies on snow in the background rather than on the animal
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Future work}
\begin{itemize}
\item The paper only uses sparse linear models as explanations, but the framework supports the exploration of a variety of explanation families, such as decision trees (DTs).
\begin{itemize}
\item This estimate of faithfulness can also be used to select an appropriate family of explanations from a set of interpretable model classes, thus adapting to the given dataset and classifier.
\end{itemize}
\item One issue not addressed in this work is how to perform the pick step for images.
\item They would like to investigate potential uses in speech, video, and medical domains, as well as recommendation systems.
\item They would also like to explore theoretical properties (such as the appropriate number of samples) and computational optimizations (such as parallelization and GPU processing).
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Conclusion}
\begin{itemize}
\item LIME is a framework for explaining predictions made by machine learning algorithms
\item It explains a model by intelligently picking individual predictions to present, based on a budget reflecting the amount of time the user is willing to spend
\item Currently only uses sparse linear models as explanations
\item It is shown to make it significantly easier for people, even non-experts, to improve classifiers
\end{itemize}
\end{frame}
\end{document}