Casper V. Kristensen 2019-11-28 05:13:39 +01:00
parent 48dc576076
commit b96cb1016d
2 changed files with 59 additions and 39 deletions

BIN pres.pdf (binary file not shown)


@ -289,7 +289,7 @@
\frametitle{Explaining models}
Idea: We give a global understanding of the model by explaining a set of individual instances
\begin{itemize}
\item Still model agnostic (since the individual explanations are)
\item Instances need to be selected in a clever way, as people won't have time to look through all explanations
\item Some definitions
\begin{itemize}
@ -402,7 +402,7 @@
Interested in three questions:
\begin{itemize}
\item Are the explanations faithful to the model?
\item Can the explanations aid users in ascertaining trust in the individual predictions?
\item Are the explanations useful for evaluating the model as a whole?
\end{itemize}
\end{frame}
@ -423,6 +423,11 @@
\item For each prediction on the test set, an explanation is produced and the fraction of gold features it recovers is computed.
\end{itemize}
\end{frame}
\note[itemize] {
\item Train logistic regression and decision tree classifiers so that they use at most 10 features to classify each instance.
\item These 10 features are the gold set of features that the model actually considers important.
\item The explanations should recover these features (a possible formalisation of this recall is sketched in the note below).
}
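% A possible formalisation of the faithfulness metric (a sketch; the symbols
% $G_x$ for the gold features and $E_x$ for the explanation's features are
% introduced here and are not from the slides):
\note{
Sketch: the per-prediction recall of gold features is
\[
\mathrm{recall}(x) = \frac{|E_x \cap G_x|}{|G_x|},
\]
averaged over all predictions on the test set.
}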
\begin{frame}
\frametitle{Faithfulness}
@ -444,19 +449,24 @@
\begin{frame}
\frametitle{Should I trust this prediction?}
\begin{itemize}
\item Randomly select 25\% of the features as untrustworthy.
\item Simulated users deem a prediction untrustworthy if:
\begin{itemize}
\item LIME \& Parzen: the linear approximation changes when all untrustworthy features are removed from the explanation.
\item Greedy \& Random: the explanation contains any untrustworthy feature.
\end{itemize}
\item Compute the F-measure over the trustworthy predictions.
\end{itemize}
\includegraphics[width=0.60\linewidth]{graphics/F1_trust.png}
\includegraphics[width=0.35\linewidth]{graphics/sample_points.png}
\end{frame}
\note[itemize] {
\item In binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy, i.e.\ whether the simulated user correctly distrusts a prediction based on the explanation given by e.g.\ LIME.
\item It seems somewhat unfair that random and greedy are distrusted simply for having an untrustworthy feature in their explanation, while LIME and Parzen only have to remain unchanged when the untrustworthy features are removed.
\item 2nd experiment: test trust in individual predictions.
\item Test-set predictions are deemed truly (oracle) untrustworthy if the prediction from the black-box classifier changes when these features are removed.
\item The simulated user knows which features to discount.
\item If the linear approximation changes when the untrustworthy features are removed, something is wrong!
\item The table shows that the other methods achieve lower recall (they mistrust too many predictions) or lower precision (they trust too many predictions); the trust rule and F-measure are sketched in the note below.
}
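% A rough sketch of the simulated trust rule and the F-measure (the symbols are
% introduced here, not taken from the slides): $w_j$ are the weights of the sparse
% linear explanation over its features $E_x$, $U$ is the untrustworthy feature set,
% and $\hat{y}(\cdot)$ maps the linear score to a predicted class.
\note{
Sketch: for LIME and Parzen, the simulated user trusts a prediction iff removing the
untrustworthy features does not change the decision of the linear approximation,
\[
\mathrm{trust}(x) \iff \hat{y}\Big(\sum_{j \in E_x} w_j x_j\Big) = \hat{y}\Big(\sum_{j \in E_x \setminus U} w_j x_j\Big),
\]
and agreement with the oracle labels is summarised by
\[
F_1 = \frac{2 \cdot P \cdot R}{P + R},
\]
with precision $P$ and recall $R$ computed over the trustworthy predictions.
}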
@ -473,6 +483,12 @@
\item Pairs of competing classifiers are created by repeatedly training pairs of random forests with 30 trees until their validation accuracies are within 0.1\% of each other, but their test accuracies differ by at least 5\%.
\end{itemize}
\end{frame}
\note[itemize] {
\item 3rd experiment: two models with similar validation accuracy; the simulated user should use the explanations to select the one that generalizes better.
}
\begin{frame}
\frametitle{Can I trust this model?}
@ -483,23 +499,9 @@
\note[itemize]{
\item They evaluate whether the explanations can be used for model selection, simulating the case where a human has to decide between two competing models with similar accuracy on validation data.
\item Accomplished by \say{marking} the artificial features found within the B instances seen as untrustworthy. We then evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy).
\item SP-parzen and RP-parzen are omitted from the figure since they did not produce useful explanations, performing only slightly better than random. Is this ok?
\item As B, the number of explanations seen, increases, the simulated human becomes better at selecting the best model (the selection rule is sketched in the note below).
}
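% A sketch of the pairing criterion and the simulated selection rule (notation
% introduced here: models $m_1, m_2$, validation set $\mathcal{V}$, and the trust
% indicator $\mathrm{trust}_m(x)$ from the previous experiment):
\note{
Sketch: competing pairs satisfy
\[
|\mathrm{acc}_{\mathrm{val}}(m_1) - \mathrm{acc}_{\mathrm{val}}(m_2)| \le 0.1\%,
\qquad
|\mathrm{acc}_{\mathrm{test}}(m_1) - \mathrm{acc}_{\mathrm{test}}(m_2)| \ge 5\%,
\]
and after marking the untrustworthy features found in the $B$ explanations, the
simulated user picks the model with more trusted validation predictions,
\[
m^{*} = \arg\max_{m \in \{m_1, m_2\}} \big|\{x \in \mathcal{V} : \mathrm{trust}_m(x)\}\big|.
\]
}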
\subsection{Human user experiments}
\begin{frame}
\frametitle{Human evaluation setup}
\begin{itemize}
\item Create new dataset, the \emph{religion} set
\begin{itemize}
\item Consists of $819$ Christianity and Atheism websites
\end{itemize}
\item Most experiments use classifiers trained on the \emph{newsgroup} dataset
\begin{itemize}
\item The one containing the emails
\end{itemize}
\end{itemize}
\end{frame}
\begin{frame}
\frametitle{Can humans pick the best classifier?}
\includegraphics[scale=0.35]{graphics/avg_acc_humans.png}
@ -510,8 +512,8 @@
\item Train two classifiers, one on the standard data set and one on a cleaned version of the same data set
\item Use the newsgroup dataset for training, which is the one with the atheism/christianity emails
\item Run the classifiers on a \say{religion} dataset, which the authors create themselves, to test whether the classifiers generalize well
\item The standard one achieves higher validation accuracy - but it is not actually the better classifier!
\item Humans are asked to pick the best classifier when seeing explanations from the two classifiers, with B = K = 6 (they see 6 explanations with 6 features each)
\item Repeated $100$ times
\item SP-LIME clearly outperforms the other options
}
@ -529,11 +531,12 @@
\note[itemize] {
\item Non-expert humans, without any knowledge of machine learning
\item Use newsgroup dataset
\item Ask Mechanical Turk users to select features to be removed (email headers), before the classifier is retrained
\item B = K = 10
\item The accuracy shown in the graph is on the home-brewed religion dataset
\item Without cleaning, the classifiers achieve roughly $58\%$, so it helps a lot!
\item On average it took only 11 minutes to remove all the words over all 3 iterations, so a small time investment for much better accuracy
\item SP-LIME outperforms RP-LIME, suggesting that selection of the instances to show the users is crucial for efficient feature engineering.
}
\begin{frame}
@ -549,7 +552,8 @@
\end{frame}
\note[itemize] {
\item Use graduate students who have taken at least one course in machine learning.
\item Intentionally train a bad classifier by having snow on all wolf images during training.
}
\begin{frame}
@ -572,7 +576,7 @@
\end{frame}
%\subsection{Human Subjects}
\note[itemize] {
\item Clearly shows that seeing the explanations leads to insight, changing their answers consistently.
}
\section{Conclusion}
@ -581,34 +585,50 @@
\begin{itemize}
\item They argue that trust is crucial for effective human interaction with machine learning systems
\item Explaining individual predictions is important in assessing trust
\item They proposed LIME, a modular and extensible approach to faithfully explain the predictions of any model in an interpretable manner
\item They introduced SP-LIME, a method to select representative and non-redundant predictions, providing a global view of the model to users.
\item Experiments demonstrated that explanations are useful for a variety of models in trust-related tasks in the text and image domains
\end{itemize}
\end{frame}
\note[itemize] {
\item Establishing trust in machine learning models requires that the system can explain its behaviour.
\begin{itemize}
\item Both individual predictions.
\item As well as the entire model.
\end{itemize}
\item To this end, they introduce SP-LIME (submodular pick), which selects a small number of explanations that together (hopefully) explain the entire model (the pick objective is sketched in the note below).
\item Experiments show that this is indeed the case.
}
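% A sketch of the submodular-pick objective, roughly as in the LIME paper ($W$ is
% the $n \times d'$ matrix of explanation weights, $I_j$ a global importance score
% for feature $j$, e.g.\ $I_j = \sqrt{\sum_i W_{ij}}$ for text, and $B$ the number
% of explanations the user is willing to inspect):
\note{
Sketch: SP-LIME picks the set of instances $V$ with $|V| \le B$ that maximises the coverage
\[
c(V, W, I) = \sum_{j=1}^{d'} \mathbf{1}\big[\exists\, i \in V : W_{ij} > 0\big]\, I_j,
\]
which is optimised greedily, since $c$ is submodular.
}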
\begin{frame}
\frametitle{Future work}
\begin{itemize}
\item They use only sparse linear models as explanations, but their framework supports the exploration of a variety of explanation families, such as decision trees (DTs).
\begin{itemize}
\item This estimate of faithfulness can also be used for selecting an appropriate family of explanations from a set of multiple interpretable model classes, thus adapting to the given dataset and the classifier.
\end{itemize}
\item Explanation families beyond sparse linear models.
\item One issue that they do not mention in this work is how to perform the pick step for images.
\item They would like to investigate potential uses in speech, video, and medical domains, as well as recommendation systems.
\item They would like to explore theoretical properties (such as the appropriate number of samples) and computational optimizations (such as using parallelization and GPU processing)
\end{itemize}
\end{frame}
\note[itemize] {
\item The paper only describes sparse linear models as explanations, but the framework supports other explanation families, such as decision trees.
\item They envision adapting the explanation family based on the dataset and classifier.
\item Extend the framework to better support images, as well as speech, video, etc.
\item The LIME framework is ready for production and available on GitHub.
\item They would therefore like to optimise computation using parallelisation and GPU processing.
}
\section{Recap}
\begin{frame}
\frametitle{Recap}
\begin{itemize}
\item LIME is a framework for explaining predictions made by machine learning algorithms.
\item It explains models by intelligently picking a limited number of individual explanations.
\item Only uses sparse linear models as explanations at the moment (the LIME objective is sketched in the note below).
\item It is shown to make it significantly easier for people, even non-experts, to improve the classifiers.
\end{itemize}
\end{frame}
\note[itemize] {
\item LIME is able to explain entire ML models by presenting the user with a limited number of individual, non-redundant explanations that describe the model well enough without overwhelming them.
}
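% For reference, a sketch of the LIME objective, roughly as in the paper ($f$ is the
% black-box model, $G$ the class of interpretable models, here sparse linear models,
% $\pi_x$ a proximity kernel around the instance $x$, $\mathcal{L}$ a locality-weighted
% loss, and $\Omega$ a complexity penalty):
\note{
Sketch: the explanation for an instance $x$ is
\[
\xi(x) = \arg\min_{g \in G}\; \mathcal{L}(f, g, \pi_x) + \Omega(g),
\]
i.e.\ the most faithful local approximation among sufficiently simple models.
}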
\begin{frame}
\frametitle{Discussion}