- *Size-invariance:* We can normalise the heat trace signatures, thus making it size-invariant.
*** Experiments
- NetLSD is very scalable.
* Explaining Outliers and Glitches
** Empirical Glitch Explanations
- Data glitches are unusual observations that do not conform to data quality expectations
+ Can be logical, semantic or statistical
- Data integrity constraints can potentially flag large sections of data as being non-compliant, which is not ideal, as ignoring or repairing significant sections of the data could bias the results and conclusions drawn from the analyses
- In the context of big data, large numbers and volumes of feeds from disparate (a lot of different) sources are integrated and as such, it is likely that significant portions of this data seem noncompliant, while actually being legitimate data.
- They introduce Empirical Glitch Explanations, which are concise, multi-dimensional descriptions of subsets of potentially dirty data and propose a scalable method for empirically generating such explanatory characterisations
- These explanations could serve two valuable functions
1) Provide a way of identifying legitimate data and releasing it back into the pool of clean data, reducing cleaning-related statistical distortion of the data
2) Used to refine existing data quality constraints and generate and formalise domain knowledge
*** Introduction
- Much attention has been paid to identifying data quality constraint violations and developing cleaning strategies, while not much focus has been on whether all noncompliant data should be repaired, or whether all data violating a constraint should be treated homogeneously (i.e. all violating data is treated equally)
- By incorrectly repairing noncompliant data (or repairing it when it is not needed), we risk changing the data to such an extent that it is unrecognisable and thus we suffer high statistical distortion.
+ And conclusions drawn from this could be misleading
- Data constraints are usually fairly broad and, as a result, they flag much data as suspect. It is thus critical to study this data for additional, potentially explanatory relationships that could reduce the cost and distortion associated with cleaning and might yield additional knowledge of the data.
- Data quality is highly domain and context dependent and any empirical method that allows for the gathering of domain knowledge is valuable in itself.
- This paper shows that it is possible that significant portions of data violating constraints actually have valid explanations and can thus be released back into the pool of clean data unaltered.
- Identifying empirical explanations for seemingly suspicious data based on attribute patterns is a valuable contribution to the data quality process.
**** An example
- They have the constraint "Any given phone number must have only one record associated with it" on data from some big database.
- They then find several duplicates, where they present three instances where each phone number occurs thrice.
- In this example, the first phone number actually has missing fields as well and is thus likely a result of corrupt data; it should likely be regarded as bad data.
- The other two cases however are likely the result of phone numbers used for some specific purpose and should as such be added back to the clean pool; the constraint should then be extended to allow for these two types of cases, as there might be more in the future. This argument allows putting 66% of the violating data back into the clean data pool.
**** Related Work
- Statistical distortion is the distortion in data caused by well-intentioned data repair efforts and it was introduced as a critical criterion for measuring the utility of data cleaning strategies.
**** Their contributions
- They seek to explain seemingly anomalous data by empirically discovering patterns and characterising subsets that can be returned to the clean data pool, thus reducing statistical distortion as no unnecessary repairs have to be done.
- They introduce the notion of explainable glitches, which are seeming violations that can be collectively described by a succinct empirical description. These descriptions have the potential to explain the glitches, either by consulting subject-matter experts or by other heuristics. They can serve two valuable functions:
1) Provide a way of identifying legitimate data and releasing it back and in doing so reducing statistical distortion
2) Refine the existing data quality constraints and generate and formalise domain knowledge
- They propose a robust and scalable method for empirically generating the explanations by developing the new notion of crossover subsampling, which creates subsets that are similar to the noncompliant set. In doing so, they reduce the redundancy of the resampling procedure caused by the disparity in sizes between the dataset D and the suspicious subset A, ensuring that the results are statistically significant.
- Define two objective metrics, size and merit, for evaluating and ranking explanations. These make the method flexible and customisable depending on the application.
- They evaluate the methodology within a comprehensive experimental framework using real and synthetic data sets and explore the robustness and scalability of explanations. They are able to retain 99% of the data flagged as suspicious.
*** Problem description
1. We are given a dataset D with N rows (records) and d columns (attributes) and a constraint C.
2. Constraints are rules (logical, semantic, statistical) that are imposed on data to ensure conformity to expectations about the data; "Any given phone number must have only one record associated with it".
3. Let subset A consist of all suspicious records in D that violate C. In the example, A would consist of the 9 records violating C, as the three phone numbers each belonged to 3 records.
4. In absence of explanations, the problematic set Q that needs to be cleaned is given by Q = A.
5. The objective is then to reduce the size of Q by identifying portions of A that can in fact be explained as clean, using characteristics derived from the other attributes and data values.
6. This cuts the cost of cleaning and reduces distortion.
7. A cleaning process typically changes the data by making educated guesses about the correct data.
8. We wish to generate empirical explanations E, each of which will describe a set of records $P \subseteq A$. Explanations are of the form $\{s_j\}$, where s_j describes a condition on a value v_j in the suspicious set A.
9. These explanations for the phone number example could be: E_1 = "blank is frequent and occurs in multiple attributes", E_2 = "ID_5 in attributes 1 and 6, new hire, d2300", E_3 = "ID_13, A132, D8000" where E_1 essentially means that the data is bad, as a lot of the attributes are left blank, but E_2 might mean that the "new hires are assigned their supervisor's phone number" and E_3 could mean that "members of the same department are working for the same supervisor and they share the same physical room and thus phone". So only E_1 was essentially problematic.
*** Their approach
- Take a nonparametric approach
+ This ensures a general applicability that is agnostic to any underlying data distributions
- Main steps are:
1) Identify the set A by applying constraint C to D. In the absence of any explanation, the entire set A is deemed suspicious.
2) For each value $v \in A$, generate a propensity signature $s$. This signature is probabilistic and should capture the propensity (an inclination or natural tendency to behave in a particular way) of occurrence of a value $v$ across all records and attributes of $A$.
3) Rank the signatures based on their suspiciousness, using statistical criteria. The significant signatures together constitute an explanation E = {s_j}. These signatures can be used collectively in a conjunctive, disjunctive or some other manner to define the explanation.
4) Apply the explanation E to A, yielding a set of records P of A.
5) Quantify the effectiveness of an explanation using the size and merit in reducing the statistical distortion of impacted records.
**** Suspicious Set
- Given C, this is applied to D to identify A. Identifying A is easy for obvious glitches such as missing values or duplicates. It is however complex in more non-trivial cases, such as disguised missing values (in which unknown, inapplicable, or otherwise nonspecified responses are encoded as valid data values; these can arise from poorly designed questionnaires (e.g., inapplicable or ambiguously worded questions), errors made by the interviewer (e.g., omitted questions), or nonresponse by the interview subject (e.g., the subject can't remember or refuses to answer)) and cases where the glitches are masked or hidden.
- If glitch detection is dependent on thresholds, for example outliers, then determining A is more task dependent.
- Methods for formulating C and determining A are outside the scope of this paper.
- They just assume C and A clearly specified.
- Usually |A| << |D|.
- Let "good" or non-suspicious data $A' = D - A$ (i.e. data not violating the constraint).
- They need to identify the values $v \in A$ that exhibit different statistical behavior in A and A'.
**** Propensity Signatures
- Let $v$ be a value in A. I assume this is any value of any attribute within the records of A, but this is not directly mentioned. To capture the behavior of this value $v$, they propose propensity signatures.
- Let p_k be the probability of v occurring in attribute C_k of A, then:
- /The propensity signature of a value v in the set A/ is a d-dimensional vector given by $s_A(v) = (p_1, \ldots, p_k, \ldots, p_d)$, $k = 1, \ldots, d$, and it captures the propensity of occurrence of v in A.
- Also defined for the values of A', however with the probabilities (capital P's) computed over A'.
- Propensity signatures focus on the occurrence of a value across all records and attributes in a sample.
- We do not know distributions of v a priori, so they use the empirical estimates of propensity signatures $\tilde{s}(v)$ to identify the set of suspicious values $V = \{v\}$ that have statistically different signatures in the suspicious data set A, compared to the good data A'.
- For the phone number example: $\tilde{s_{A}(ID_5)} = (1/3,0,0,0,0,2/3)$ and $\tilde{s_{A}(NewHire)} = (0,2/3,0,0,0,0)$. This is likely due to $ID_5$ being in all three of the records, but one of them had this as their own ID while the other $2/3$ had this value as their supervisor ID. Also, $2/3$ of them are "New Hires".
**** Statistical Significance
- How is it determined if the propensity signature of a value is statistically significant?
+ Could compute distances of propensity signatures of all values in A from the corresponding value signatures in the good A' and then rank values based on signature distances. The ones with the highest distance could be considered "different".
+ However, the signatures of different values are not comparable, nor are the distances between them.
**** Crossover Subsampling
- An alternative approach to the above problem, is to use re-sampling, where samples are drawn repeatedly, compute propensity signatures of a given value in each sample and construct a sampling distribution of the propensity signatures.
- From this sampling distribution, we can infer the expected signature of the given value as well as the expected variability in its signature.
- since all signatures in the sampling distribution pertain to the same value, the question of comparability does not arise.
- As |A| << |D|, simply using random sampling is not good enough, as we need to ensure that our sampled sets are of the same size.
- We would like to construct specialized subsamples that share some characteristics of A, in addition to being like-sized.
- Thus Crossover subsampling is introduced.
- With crossover subsampling, they are guaranteed that every record in A is represented in a specified proportion of the subsamples
- A q-crossover subsample of size B drawn from two sets D and A \subset D where the size of A is |A| = B, is defined to be a set that contains q proportion of samples from A and the rest from D - A, and every record in A occurs in exactly q proportion of the subsamples.
- This is constructed by:
1) partitioning A into b = 1/q chunks of size M = B/b (A has size B), A = A_1 + A_2 + ... + A_b
2) cross each piece A_i with a random piece of size B - M drawn from D - A_i to create a like-sized sample of size B.
+ This process is replicated R times, holding A_i fixed, but drawing randomly without replacement from D - A_i. This yields R samples of size B, corresponding to A_i.
3) The sampling distribution of propensity signatures of each value v in A_i from these R replications corresponding to chunk A_i is denoted $\tilde{F_{A_i}(v)}$.
4) The estimated signature $\tilde{s_{A}(v)}$ is compared against $\tilde{F_{A_i}(v)}$, establishing whether that particular value has a statistically different pattern of occurrences in A_i.
- Each chunk A_i gets to vote on the suspiciousness of value v.
- A value v in a set A is voted to be suspicious with respect to the empirical sampling distribution $\tilde{F_{A_i}(v)}$ corresponding to chunk A_i of A, if it is statistically different with respect to that distribution. The vote is denoted by the indicator function I_{A_i}(v) which takes the value 1 if significant and 0 otherwise.
- The voting is repeated with each of the b pieces of A, each chunk yields a vote I_{A_i}(v) for each value v.
- The informativeness of a value v is measured by the proportion of votes: $K = \sum_{i} I_{A_i}(v)/b$
- So crossover subsampling process results in total of $T = R \cdot b$ samples of size B and a collection of empirical sampling distributions $\{\tilde{F_{A_i}(v)}\}_{i=1}^b$ corresponding to b chunks.
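- A minimal Python sketch (my own illustration, not the authors' code) of the q-crossover construction above; the record layout is an assumption, and the draw from D - A_i is without replacement as in step 2:
#+BEGIN_SRC python
import random

def crossover_subsamples(D, A, q=0.25, R=100, seed=0):
    """Build R like-sized subsamples for each of the b = 1/q chunks of A."""
    rng = random.Random(seed)
    b = int(round(1 / q))                 # number of chunks
    B = len(A)                            # every subsample has the size of A
    M = B // b                            # chunk size (remainder ignored for simplicity)
    chunks = [A[i * M:(i + 1) * M] for i in range(b)]
    samples = {}                          # chunk index -> R subsamples of size B
    for i, A_i in enumerate(chunks):
        rest = [rec for rec in D if rec not in A_i]   # D - A_i
        samples[i] = [A_i + rng.sample(rest, B - len(A_i)) for _ in range(R)]
    return samples  # propensity signatures are then estimated on each subsample
#+END_SRC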
**** Testing for statistical significance
- A value v is flagged if its propensity signature lies outside the chosen error bounds of its corresponding sampling distribution $\tilde{F_{A_i}(v)}$.
+ These bounds are computed for each attribute
- Each element of the signature is compared with the corresponding bootstrap distribution (??) and if any element lies outside the bounds (mean ± 2 standard deviations; the 5% and 95% percentiles) it is deemed significant.
+ For our long-running example, $\tilde{s_{A}(ID_5)} = (1/3,0,0,0,0,2/3)$ is deemed significant if, say, the upper bound is $(0.2,0,0,0,0,1.8)$, since $1/3 > 0.2$.
- These statistically significant propensity signatures are used for explanations.
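- A small sketch (my assumption of the test, using the mean ± 2 standard deviation variant) of one chunk's vote $I_{A_i}(v)$:
#+BEGIN_SRC python
import statistics

def chunk_vote(sig_A, bootstrap_sigs, n_std=2.0):
    """Return 1 if any element of v's signature in A lies outside
    mean +/- n_std standard deviations of the bootstrap distribution, else 0."""
    for k in range(len(sig_A)):
        column = [s[k] for s in bootstrap_sigs]
        mean, std = statistics.mean(column), statistics.pstdev(column)
        if abs(sig_A[k] - mean) > n_std * std:
            return 1
    return 0

# Informativeness: K = (sum of votes over the b chunks) / b, and v enters an
# explanation only if K exceeds the chosen threshold alpha.
#+END_SRC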
*** Glitch Explanations
- sss = Statistically significant signatures
- Let the collection of values v in A with statistically significant signatures be $V = (v_1,v_2,..,v_L)$
- A glitch explanation $E \subseteq V$ is then a collection of values in A that have sss.
- This is why an explanation of one of the phone numbers was $E = (ID_5, NewHire)$.
- The size $S(E)$ is the smallest number of informative non-redundant values in the explanation.
- A threshold on informativeness K can be given, such that $K > \alpha$ for including a value v in an explanation.
*** Evaluating explanations
- They measure the effectiveness of the explanation by the statistical distortion of the data prevented by reclaiming the data instead, rather than changing it.
- They define statistical distortion to be the proportion of records that are touched by any data repair.
+ This is not the best of definitions, but this is not the focus of the article and they deem it is good enough.
- Let S be the reclaimed set from A, then the reduction r in the statistical distortion is given by $r = \frac{|S|}{|A|}$, so they just divide the size of the reclaimed set by the size of the suspicious set.
- The merit r of an explanation E is the reduction in statistical distortion caused by reclaiming the records explained by E.
- So as above, when we reclaimed 2 out of 3: $r = 2/3$.
*** Constructing propensity signatures
- Each of the sets A and D - A contain a collection of distinct values. The empirical estimates of the propensity signatures are constructed within a single pass over the data for each distinct value in A and D - A (Lol this can be a lot).
- Let A have N_A rows (records). Suppose that v occurs n_k times in some column (attribute) C_k in the suspicious set A. Then $\tilde{p}_k = n_k/N_A$ is an empirical estimate of the probability p_k.
+ These estimates of p_k then form the propensity signatures by computing p_k for each attribute of the record.
+ Which again explains the 1/3 and 2/3 within the signature of ID_5.
- This is similar for the estimated propensity signature of v in D - A
- This is the maximum likelihood estimate
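- A minimal sketch of this single-pass estimation (my own code, with made-up records reproducing the ID_5 example):
#+BEGIN_SRC python
from collections import defaultdict

def propensity_signatures(records, num_attributes):
    """For every distinct value v, estimate p_k = n_k / N_A for each attribute k."""
    counts = defaultdict(lambda: [0] * num_attributes)  # value -> per-attribute counts
    n_records = 0
    for record in records:                 # single pass over the data
        n_records += 1
        for k, value in enumerate(record):
            counts[value][k] += 1
    return {v: [n_k / n_records for n_k in per_attr] for v, per_attr in counts.items()}

# Toy version of the phone-number example: ID_5 is the employee ID (attribute 0)
# of one record and the supervisor ID (attribute 5) of the other two.
suspicious = [
    ["ID_5", "NewHire", "x", "y", "z", "ID_9"],
    ["ID_7", "NewHire", "x", "y", "z", "ID_5"],
    ["ID_8", "Staff",   "x", "y", "z", "ID_5"],
]
print(propensity_signatures(suspicious, 6)["ID_5"])  # [1/3, 0, 0, 0, 0, 2/3]
#+END_SRC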
*** Conclusion
- Essentially this works well, but at a high cost in regards to computation and runtime.
- In the slides, they question if the authors subdivide the suspicious set into subsets of sets where each violate in the same way, like the phone number, where they have 3 phone numbers divided over 9 people. I think they do this yes, and it becomes apparent in their experiments. (Unless this is what the presenters mean..)
- Increasing the value of q (q-crossover!) gives fewer bootstrap samples, but increases the amount of suspicious data within each set.
** A framework for outlier description using constraint programming
- Could consider talking about this for the presentation instead.
- We wish to explain why outliers are outliers
- These guys propose a framework based on constraint programming to find an optimal subset of features that most differentiates the outliers and normal instances
- This framework offers great flexibility in incorporating diverse scenarios arising in practice, such as multiple explanations and human in the loop extensions
+ Both things that the empirical glitch thing supposedly support
*** Introduction
- Say an automobile company has a lot of recalls. So the recalled cars are outliers compared to the working cars, but WHY are they outliers, is the important information.
- *The general outlier description problem*: Given a collection of instances that are deemed normal, N (D-A from the other), and another separate collection deemed outliers, O (A in the other), where instances in both N and O are in the feature space S, find a feature mapping t : S -> F that maximises the difference between O and N according to some outlier property measure m : F -> R and an aggregate objective.
+ They don't mention what F is.
+ t and m can be viewed as the description/explanation of the outlying behaviour, where t describes the space where the behaviour is exhibited and m defines what we deem to be an outlier in this space.
+ They focus on specific t which projects the data to a subspace and use an m that measures the local neighborhood density around each point.
+ Their objective obj is the difference between the densities surrounding the outliers and normal instances.
- These guys then use Constraint Programming (CP), which allows the functions t, m and obj to take a wide variety of forms unencumbered by the limitations associated with mathematical programming.
*** A Framework using CP Formulation
- From the general outlier description problem, the essence of outlier description is to search for t and m, that describe what makes the outliers different compared to the inliers.
- They restrict the feature mapping to selecting a subspace of the feature set.
- They use the local density criterion for outliers based upon the assumption that a normal instance should have many other instances in proximity whereas an outlier has much fewer neighbors.
+ Hmm
- A natural objective in this context is to maximise the difference of numbers of neighbors between normal points and outliers.
+ A large gap would substantiate the assumption of local density between outliers and normal points.
- *The Subspace Outlier Description Problem:* Given a set of normal instances N and a set of outliers O in a feature space S, find the tuple $(F, k_N, k_O, r)$ where $k_N - k_O$ is maximised, $F \subset S$ and $\forall x \in N$, $|\mathcal{N}_F(x,r)| \geq k_N$ and $\forall y \in O$, $|\mathcal{N}_F(y,r)| < k_O$, where $\mathcal{N}_F(x,r)$ is the set of instances within radius r of x in subspace F.
+ Sooooo, k_N and k_O are density thresholds and we wish to maximise the difference between them, subject to: for all normal instances, the number of instances within some radius r (looking only at some subset of their attributes, which we need to find) should be at least k_N, while for all outliers that count should be less than k_O.
+ Normal points are locally denser than the outliers and the core of the problem is to find the feature subspace where this actually occurs.
+ t is characterised by F and zeros out the components of the instances not within the subset of components picked, F. $m(x) = |\mathcal{N}(x,r)|$ (The constraints??) and $obj(A,B) = \min A - \max B$.
- They present three different CP optimisation models
1)(Learning a single outlier description) Direct translation from the subspace outlier description problem. F is a binary vector where the bits that are set in the solution correspond to the best subspace. Users only need to supply the bounds of hyperparameters, k_{max}, k_{min} and r_{max} such that $k_{min} \leq k_O \leq k_N \leq k_{max}$ and $0 \leq r \leq r_{max}$.
+ minimising the size of the subspace might also be relevant, as a smaller subspace is easier to interpret.
2) (Outliers in multiple subspaces) Outliers can reside in different (outlying) subspaces. In this setting, there could be multiple reasons/explanations why a point is an outlier. We will then have two sets of feature subspace selectors, F and G. Normal instances must satisfy the dense neighborhood condition in BOTH subspaces whereas an outlier is an outlier if it is outlying in EITHER F or G. As we have two subspace selectors F and G, we also need two radii, r_F and r_G.
3) (Human in the loop) Often known outliers are hand labeled and as a result these labels are considered more accurate. Normal points might however violate the normal density conditions, as there might be not-yet-reported outliers within the normal set. As such, in this formulation, points in the normal set are allowed to violate the constraints and these points are then "contentious" points, which can then be examined by a human expert. An extra binary vector is added, indicating whether a normal point violates the constraints, and we then add a bound w_{max} on the number of 1's within this binary vector.
*** Encoding constraints
- Simply define some distance function and compute it for ALL points (union of N and O).
- We can then check the distances from points in N to all other points and from points in O to all other points, and check the densities. This distance function can be something like the Euclidean distance (the l-2 norm), if the data allows it.
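- A sketch of this encoding for a fixed candidate $(F, k_N, k_O, r)$ (my own illustration; the paper searches over these variables with CP, which is not done here):
#+BEGIN_SRC python
import numpy as np

def neighbour_counts(points, F, r):
    """points: (n, d) array; F: list of selected feature indices; r: radius."""
    sub = points[:, F]
    dists = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)  # pairwise L2
    return (dists <= r).sum(axis=1) - 1   # exclude the point itself

def satisfies(N_pts, O_pts, F, r, k_N, k_O):
    """Check |N_F(x, r)| >= k_N for normals and |N_F(y, r)| < k_O for outliers."""
    pts = np.vstack([N_pts, O_pts])
    counts = neighbour_counts(pts, F, r)
    n = len(N_pts)
    return bool((counts[:n] >= k_N).all() and (counts[n:] < k_O).all())
#+END_SRC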
*** Complexity of the first CP formulation
- F, k_N, k_O and r are explicitly defined variables. r must take values from a discrete set. This can be done by specifying a step size s, such that $r \in \{0,s,2s,...,r_{max}\}$. This discretization does not apply to k_N and k_O, as these are natural integers. There is one $z_{ij}$ (the distance thing) for each pair of data instances and one constraint to set its value (so a lot, specifically $\binom{n}{2}$). Once these are in place, enforcing the number of instances within a neighborhood is a single constraint for each instance, so n of these. We thus have $1 + \binom{n}{2} + n$ constraints and the size of F (p) + k_N, k_O (2) + r (1) + z_{ij} ($\binom{n}{2}$) variables.
*** Conclusion
- Scalability is quite poor due to the combinatorial optimisation in general but also the variables needed to encode the problem.
** Beyond Outlier Detection: LookOut for Pictorial Explanation
- Provide succinct, interpretable and simple pictorial explanations of outlying behavior in multi-dimensional and real-valued datasets while respecting the limited attention of human analysts.
- Propose to output a few pictures (so called focus-plots, pairwise feature plots) from a few, carefully chosen feature sub-spaces.
- Their solution has a plot-selection objective and the algorithm approximates with optimality guarantees
- It scales linearly with the number of input outliers to explain and with the explanation budget.
- Their experiments show that LookOut performs near-ideally in terms of maximising explanation objective on several real datasets, while producing fast, visually interpretable and intuitive results in explaining groundtruth outliers from several real-world datasets.
*** Introduction
- It is extremely beneficial to provide explanations for incidents where outcomes of something raises alert (is an outlier), as these explanations can be used by some expert or analyst, empowering the analysts in sensemaking and reduce their efforts in troubleshooting and recovery. (Also such as the carmaker thing from the previous article)
- LookOut provides interpretable pictorial explanations through simple, easy-to-grasp focus plots which "incriminate" the given outliers the most.
- Given outliers from a dataset with real-valued features (NOTE REAL-VALUED!) they aim to find a few 2D plots on which the total "blame" that the outliers receive is maximised. These should be interpretable and succinct so that only a few plots have to be shown, respecting the human's attention, but these allow the human to quickly interpret the plots, spot the outliers and verify their abnormality given the discovered feature pairs.
- LookOut is an algorithm with a plot selection objective which quantifies the "goodness" of an explanation and lends itself to monotone submodular function optimisation, which is solved efficiently with optimality guarantees.
- It is domain-agnostic and detector-agnostic (outlier detector)
- It requires linear time on the number of plots to choose explanations, the number of outliers to explain and the user-specified budget for explanations.
- These guys believe that the constraint programming guys do not meet several key desiderata for outlier description:
1) Quantifiable explanation quality
2) Budget-consciousness towards analysts (but they do, as you can decide the size of the set F slightly)
3) Visual interpretability (this is true for the CP however, it is a binary vector..)
4) A scalable descriptor (the CP one sucks at this)
*** Prelims and problem statement
- Let V be the set of input data points, v \in V come from R^d and n = |V|. d = |F| is the dimensionality of the dataset and F = (f_1, f_2, ..., f_d) is the set of real-valued features. The set of outlying points is A. |A| = k.
- *Definition of Focus Plots:* Given a dataset of points V, a pair of features $f_x, f_y \in F$ (F is the set of real-valued features) and a set of outliers A, a focus-plot $p \in P$ is a 2d scatter plot of all points with $f_x$ on the x-axis and $f_y$ on the y-axis, drawing attention to the set of outliers $A_p \subseteq A$ best explained by this feature pair.
- So their pictorial outlier explanation is a set of focus-plots, each of which "blames" or "explains away" a subset of the input outliers, whose outlierness is best showcased by the corresponding pair of features. This means they consider $\binom{d}{2}$ spaces, by generating all pairwise feature combinations. Within each 2d space, they score the points in A by their outlierness.
- Let all the $\binom{d}{2}$ focus plots be denoted P.
- The goal is to output a small subset S of P on which points in A receive high outlier scores
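- A sketch of generating and scoring the focus plots (my own code; I use sklearn's IsolationForest as a stand-in for the iForest-style scorer mentioned in the complexity section below):
#+BEGIN_SRC python
from itertools import combinations
import numpy as np
from sklearn.ensemble import IsolationForest

def score_focus_plots(X, outlier_idx, random_state=0):
    """X: (n, d) data matrix; outlier_idx: indices of the k known outliers.
    Returns the feature pairs and a (k, l) score matrix S with entries s_{i,j}."""
    pairs = list(combinations(range(X.shape[1]), 2))   # all C(d, 2) focus plots
    S = np.zeros((len(outlier_idx), len(pairs)))
    for j, (fx, fy) in enumerate(pairs):
        sub = X[:, [fx, fy]]
        iso = IsolationForest(random_state=random_state).fit(sub)
        S[:, j] = -iso.score_samples(sub[outlier_idx])  # higher = more outlying
    return pairs, S
#+END_SRC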
*** Proposed Algo
- Generate focus plots, score them.
- The maxcover problem is NP hard, so they need to approximate it, when trying to select plots to explain all of the outliers.
- Their objective is to maximise the total maximum outlier score of each outlier amongst the selected plots: $f(S) = \sum_{a_i \in A} \max_{p_j \in S} s_{i,j}$, so they need to find the S which maximises this.
- This f function is non-negative, non-decreasing and submodular, so they can use a greedy algorithm with an approximation guarantee. (Submodularity: a set function whose value, informally, has the property that the difference in the incremental value of the function that a single element makes when added to an input set decreases as the size of the input set increases.)
- Thus, we can build a greedy algo, which just starts out with the empty S and greedily adds the plot which yields the largest marginal gain in function value.
- Apparently it has a $(1 - 1/e) \approx 63\%$ approximation guarantee.
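- A sketch of that greedy selection over the score matrix S from the sketch above (my notation, not the paper's code):
#+BEGIN_SRC python
import numpy as np

def greedy_lookout(S, budget):
    """S: (k outliers, l plots) score matrix; returns indices of <= budget plots."""
    k, l = S.shape
    chosen, best_so_far = [], np.zeros(k)   # current max score per outlier
    for _ in range(budget):
        gains = np.maximum(S, best_so_far[:, None]).sum(axis=0) - best_so_far.sum()
        gains[chosen] = -np.inf             # never re-pick a plot
        j = int(np.argmax(gains))
        if gains[j] <= 0:
            break                           # no marginal gain left
        chosen.append(j)
        best_so_far = np.maximum(best_so_far, S[:, j])
    return chosen
#+END_SRC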
*** Complexity
- Total time complexity is $O(l \cdot \log(n') \cdot (k+n') + klb)$ for sample size $n' < n$ and it is sublinear in the total number of input points n.
- The time complexity of finding their outliers using something called iForest takes $O(t \cdot \log(n') \cdot (k+n'))$.
- Computing marginal gain for each unselected plot takes $O(kl)$ time. Finding the maximum among all gains take $O(l)$ via linear scan. This process is repeated $b$ times for a budget of $b$.
- The total number of plots $l = \binom{d}{2}$ is quadratic in the number of features.
*** Discussion
- Scatter plots were picked since they are easy to understand and interpret. They are also universal and they show where outliers lie relative to the normal points.
- Scatter plots were chosen over decision trees, as decision trees become more difficult to interpret at large depths. They are also not budget-conscious.
- Time is linearly scaling.
* Explaining Classification
** "Why Should I Trust You?" - Explaining the Predictions of Any Classifier
- Machine models remain mostly black boxes.
- Understanding WHY some model predicts what it does is important in assessing trust in the prediction, which is fundamental when discussing whether to deploy a new model or not.
+ The understanding can also be important when trying to transform an untrustworthy prediction into a trustworthy one.
- LIME is a novel explanation technique, that explains the predictions of any black-box classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
- Using the previous individual predictions, they also show how to learn a model, by presenting representative individual predictions and their explanations in a non-redundant way, framing this task as a submodular optimisation problem (submodular so it can be greedy)
*** Introduction
- Machine Learning is at the core of many recent advances in scitech
- Whether humans are directly using machine learning classifiers as tools, or are deploying models within other products, a vital concern remains: If the users do not trust a model or a prediction, they will not use it.
- There are two definitions of trust:
1) Trusting a prediction, i.e. whether the user trusts the individual prediction to take action based on it
2) Trusting a model, i.e. whether the user trusts a model to behave in reasonable ways if deployed.
- Trust in predictions is important in decision making, such as when the model is used for medical diagnosis or terrorism detection. In these cases, you can't simply have blind faith in the predictions, as you might not be aware of exactly why the model predicts what it does.
- You also need to be confident that the model will behave well on real-world data, as opposed to the training data. This is often done with cross validation, but even cross validation does not mean that the model does not pick up on weird things not relevant to the problem.
- The authors have three major contributions:
1) LIME - An algorithm that can explain the prediction of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model
2) SP-LIME - A method that selects a set of representative instances of predictions with their explanation, to address the "trusting the model" problem, via submodular optimisation
3) Comprehensive evaluation with simulated and human subjects, where they measure the impact of explanations on trust and associated tasks.
*** The case for explanations
- Explaining a predictions means presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance's components (words in text, patches in an image), and the models predictions.
- A doctor will need to understand specifically why a certain prediction is what it is. It is not enough simply to say "sick", without defining why the model believes the person is sick (things such as coughing, headache, so on).
- Every machine learning application also requires a certain measure of overall trust in the model. Development and evaluation of a classification model often consists of collecting annotated data, of which a held-out subset is used for automated evaluation. This is a useful pipeline for many applications, however evaluation on validation data may not correspond to performance in the wild, as practitioners may overestimate the accuracy of their models, and this accuracy may come from things completely unrelated to the domain.
- An example could be the patient ID being heavily correlated with the target class in the training and validation data, which would result in a model placing heavy impact on the patient ID, when released into the wild, yielding a very bad predictor on real data, but a very accurate one on the training data.
+ This is known as data leakage.
- Dataset shift is when the training and test distributions are different.
+ Face recognition algorithms that are trained predominantly on younger faces, yet the dataset has a much larger proportion of older faces in it.
- Individual predictions can be used to select between models, in conjunction with their accuracy.
- A practitioner may choose to pick a less accurate model, if the user is aware of why the two models made their decisions.
**** Desired Characteristics for Explainers
- An essential criterion for explanations is that they must be interpretable
+ Provide qualitative understanding between the input variables and the response.
+ Interpretability must take into account the user's limitations, so a linear model may or may not be interpretable: if hundreds or thousands of features significantly contribute to a prediction, it is not reasonable to expect any user to comprehend why the prediction was made, even if individual weights can be inspected.
+ This implies that explanations should be easy to understand
- They should also display local fidelity (faithfulness). It is often impossible for an explanation to be completely faithful unless it is the complete description of the model itself. For an explanation to be meaningful it must at least be locally faithful, so it must correspond to how the model behaves in the vicinity of the instance being predicted.
+ Local fidelity does not imply global fidelity. Features that are globally important may not be important in the local context and vice versa.
+ global fidelity does imply local fidelity though.
- An explainer should also be able to explain any model, thus be model-agnostic. So the original model (the predictor) should be treated as a black-box.
- It should also provide a global perspective.
*** Local Interpretable Model-Agnostic Explanations
**** Interpretable Data Representations
- Local Interpretable Model-Agnostic Explanations (LIME)
+ Should identify an interpretable model over the interpretable representation that is locally faithful to the classifier
- Interpretable explanations need to use a representation that is understandable to humans, regardless of the actual features used by the model.
- An interpretable representation for text classification is a binary vector indicating the presence or absence of a word, even though the underlying classifier may use much more complex features such as word embeddings.
+ For image classification, it may be a binary vector indicating whether a super-pixel is present or absent.
- They denote $x \in R^d$ as the original representation of an instance being explained and $x' \in \{0,1\}^{d'}$ as the binary vector for its interpretable representation.
**** Fidelity-Interpretability Trade-off
- An explanation is defined as a model $g \in G$, where G is a class of potentially interpretable models, such as linear models, so a model $g \in G$ can be readily presented to the user with visual or textual artifacts.
- The domain of $g$ is $\{0,1\}^{d'}$, so $g$ acts over absence/presence of the interpretable components. Not every $g \in G$ may be simple enough to be interpretable, so $\Omega(g)$ is a measure of complexity (as opposed to interpretability) of $g$ and it may be the amount of non-zero weights of a linear model.
- Let the model being explained be denoted $f : R^d -> R$. In classification, f(x) is the probability or a binary indicator that x belongs to a certain class.
- $\pi_x(z)$ is defined as a proximity measure between an instance $z$ to $x$, so it defines locality around x.
- $L(f,g,\pi_x)$ is a measure of how unfaithful g is in approximating f in the locality defined by $\pi_x$.
- To ensure both interpretability and local fidelity, we must minimise this $L$ while having $\Omega(g)$ be low enough to be interpretable by humans
+ Naturally, if $\Omega(g)$ is high, we allow g to be complex, thus increasing the fidelity of g in approximating f.
- $E(x) = argmin_{g \in G} L(f,g,\pi_x) + \Omega(g)$
- The authors focus on sparse linear models as explanations and on performing the search via perturbation sampling (next subsection).
**** Sampling for Local Exploration
- Done via perturbations
- The locality-aware loss $L(f,g,\pi_x)$ should be minimised without making any assumptions on $f$, since the explainer should be model-agnostic.
- To learn the local behavior of f as the interpretable inputs vary, they approximate $L(f,g,\pi_x)$ by drawing samples weighted by $\pi_x$.
- Instances are sampled around $x'$ (the binary vector for the interpretable representation), by drawing non-zero elements of x' uniformly at random. Then, given a perturbed sample $z' \in \{0,1\}^{d'}$, they recover the sample in the original representation $z \in R^d$ and obtain $f(z)$, which is then used as a label for the explanation model.
- Given this dataset Z of perturbed samples with the associated labels, they optimise the E(x) function to get an explanation.
- So they sample instances both in the vicinity of x, which will have a high weight due to the proximity measure, as well as far away, which will have low weight. So even if the original model is too complex, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by $\pi_x$.
**** Sparse Linear Explanations
- G is the class of linear models, so $g(z') = w_g \cdot z'$
- The locally weighted square loss is used as L and $\pi_x(z) = exp(-D(x,z)^2 / \sigma^2)$ is an exponential kernel defined on some distance function such as the L2 distance with width \sigma.
+ So $L(f,g,\pi_x) = \sum_{z, z' \in Z} \pi_x(z) (f(z) - g(z'))^2$
- For text classification, they let the interpretable representation be a bag of words, so by setting a limit K on the number of words, $\Omega(g)$ describes the number of words and this has to be less than K.
+ This article leaves K as a constant value
- \Omega(g) is the same for image classification, where they use "super-pixels" instead of words, so the interpretable representation on an image is a binary vector where 1 indicates the original super pixel is present and 0 indicates a grayed out super-pixel.
+ Note that this choice of $\Omega(g)$ makes solving the E(x) function intractable, but it is approximated by selecting K features using Lasso and then learning the weights via least squares (K-Lasso)
- Their individual prediction algorithm produces an explanation for an individual prediction and as such, the complexity does not depend on the size of the dataset, but rather on the time to compute f(x), as this is done for each perturbed sample (of which the amount is N).
+ Explaining random forests of 1000 trees and N=5000 samples takes 3 seconds. Explaining each prediction of some image classification network takes around 10 minutes.
- Any choice of interpretable representation and G will have inherent drawbacks
1) The underlying model can be treated as a black-box, but certain interpretable representations will not be powerful enough to explain certain behaviours.
+ A model predicting sepia-toned images to be retro cannot be explained by presence or absence of super pixels as it's the entire tone of the image!
2) The choice of G, sparse linear models, means that if the underlying model is highly non-linear even in local predictions, there may not exist a faithful explanation. To remedy this, the faithfulness of the explanation on Z can be estimated and presented to the user
*** Two examples
- Won't really cover, but it's a text classification using SVMs and deep networks for images.
- We note, for text classification dumb email header information was used to make a classification, which is nonsense in the context.
*** Submodular Pick for Explaining Models
- An explanation for a single prediction does give some understanding into the reliability of the classifier to the user, but it is not sufficient to evaluate the model as a whole.
- Thus, they propose to give a global understanding of the model by explaining a set of individual instances. This process is still model agnostic, as the individual explanations are.
- These individual explanations need to be selected in a clever way, as users do not have time to sift through a large number of them.
- Define a time/patience budget B, that denotes the number of explanations humans are willing to look at.
- Given some set of instances X, they define a pick step as the task of selecting B instances for the user to inspect.
- This pick step should take into account the explanations that accompany predictions and it should pick a diverse, representative set of explanations to show the user, rather than just help the user pick these themselves.
- Given the set of explanations for a set of instances X (|X| = n), an n x d' explanation matrix W is constructed, that represents the local importance of the interpretable components for each instance. When using linear models as explanations, for an instance x_i and explanation g_i = E(x_i), they set $W_{ij} = |w_{g_i j}|$. For each component (column) j in W, they denote $I_j$ to be the global importance of that component in the explanation space. We want I such that features that explain many different instances have higher importance scores; for text applications they set $I_j$ to be the square root of the number of instances having this feature. The pick should also avoid selecting instances with similar explanations.
- Thus, the final picking function should seek to pick the instances displaying the most important features, while avoiding redundancy and picking as few as possible to cover all features.
- This is NP Hard as it is a weighted coverage function. But as their coverage function is submodular, they can use it to approximate greedily.
+ It approximates with an approximation guarantee of $1 - 1/e$.
*** Individual prediction algo
- Requires a classifier f, number of samples N, some instance x as well as its interpretable version x', the similarity kernel \pi_x and the length of the explanation K.
1) Z <- {}
2) for $i \in [1,N]$ do:
3) $z'_i <- sample_around(x')$
4) $Z <- Z \cup (z'_i, f(z_i), \pi_x(z_i))$
5) end for
6) w <- K-Lasso(Z,K) (with z'_i as features and f(z_i) as target)
7) return w
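- A runnable sketch of the algorithm above (my reading of it; the real implementation is the open-source lime package). It assumes tabular data, that f accepts a batch of rows and returns a vector of predictions, and that perturbation simply zeroes out random features; N, K and sigma correspond to the sample count, explanation length and kernel width:
#+BEGIN_SRC python
import numpy as np
from sklearn.linear_model import Lars, LinearRegression

def lime_explain(f, x, N=5000, K=5, sigma=0.75, seed=0):
    """Explain f's prediction at x with a K-sparse local linear model."""
    rng = np.random.default_rng(seed)
    d = len(x)
    Z_bin = rng.integers(0, 2, size=(N, d))        # perturbed interpretable samples z'
    Z_bin[0] = 1                                   # keep x itself in the sample set
    Z = Z_bin * np.asarray(x)                      # map back to the original representation
    labels = f(Z)                                  # black-box predictions f(z)
    dist = np.sqrt(((1 - Z_bin) ** 2).sum(axis=1)) # distance of z' from x' (all ones)
    weights = np.exp(-dist ** 2 / sigma ** 2)      # exponential kernel pi_x
    # K-Lasso: select K features with LARS (unweighted here for simplicity),
    # then fit weighted least squares on the selected features.
    selected = np.flatnonzero(Lars(n_nonzero_coefs=K).fit(Z_bin, labels).coef_)[:K]
    g = LinearRegression().fit(Z_bin[:, selected], labels, sample_weight=weights)
    return dict(zip(selected.tolist(), g.coef_.tolist()))  # feature index -> local weight
#+END_SRC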
*** Explaining Models
- Requires instances X and budget B
- For each instance x_i, it computes the individual prediction explanation, obtaining weights as in the matrix W, indicating whether features are present or not
- It then, for each interpretable feature, computes its importance over the explanations of the instances
- It then greedily adds instances to the covering set using the covering function c.
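- A sketch of the pick step (my reading of the notes, not the paper's code), using $I_j$ = square root of the number of instances in which component j appears:
#+BEGIN_SRC python
import numpy as np

def submodular_pick(W, budget):
    """Greedily pick <= budget instances maximising weighted component coverage.
    W: (n instances, d' components) matrix of absolute local weights."""
    n, d = W.shape
    I = np.sqrt((np.abs(W) > 0).sum(axis=0))   # global importance of each component
    covered = np.zeros(d, dtype=bool)
    picked = []
    for _ in range(budget):
        newly = (np.abs(W) > 0) & ~covered     # components each instance would add
        gains = (newly * I).sum(axis=1)        # marginal coverage gain per instance
        gains[picked] = -1.0                   # never re-pick an instance
        i = int(np.argmax(gains))
        if gains[i] <= 0:
            break
        picked.append(i)
        covered |= np.abs(W[i]) > 0
    return picked
#+END_SRC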
*** Experiments
- They both simulate human experiments and perform actual human experiments.
- These all just show that their algorithm works well.
- It is a bit wonky that they use decision trees and such within their experiments, but never explain how this is achieved.
- Furthermore it is a bit wonky how they compare their algorithm to others, as they engineer all the data.
** Learning Credible Models
- A model should be capable of providing reasons for its predictions, so it must be interpretable.
- If the models reasoning does not conform with well-established knowledge, then the model may be interpretable, but lack credibility.
- These guys define credibility in the linear setting and focus on techniques for learning models that are both accurate and credible.
- They propose a regularisation penalty called expert yielded estimates (EYE), that incorporates expert knowledge about well known relationships among covariates and the outcome of interest.
- Models learned using the EYE penalty are significantly more credible than those learned using other penalties.
*** Introduction
- In health care, decision trees are preferred among physicians because of their high level of interpretability.
- Interpretability might not be enough: if the reasons provided by the model do not agree, at least in part, with well-established domain knowledge, practitioners may be less likely to trust and adopt the model.
- LASSO encourages sparsity in the learned feature weights, but in doing so may end up selecting features that are merely associated with the outcome rather than those that are known to affect the outcome.
- A credible model is an interpretable model that:
1) Provides reasons for its predictions that are, partly, in line with well-established domain knowledge
2) Does no worse than other models in terms of predictive performance
- The model should only agree with the well-established knowledge if it is consistent with the data.
+ This is because relying on domain expertise alone would defeat the purpose of data-driven algorithms and could result in worse performance.
- Definition of credibility is subjective, but these guys try to formalise it.
- Their proposed approach leverages domain expertise regarding known relationships between the set of covariates and the outcome. This domain expertise is used to guide the model in selecting among highly correlated features, while encouraging sparsity.
- They propose a general regularisation technique that aims to increase credibility without decreasing performance.
*** Proposed Approach
**** Definition and Notation
- Intepretability is a prerequisite for credibility.
- For linear models, interpretability is often defined as sparsity in the feature weights.
- The set of features is defined as D.
- Some domain expertise identifies $K \subseteq D$, a subset of the features as known or believed to be important.
- So among a group of correlated features, a credible model will select those in K if the relationship is consistent with the data.
- Consider the following toy example with two correlated features (|C| = 2), where one of the features has been identified as being in K by the expert, while the other has not. One could arbitrarily select among these two correlated features, including only one in the model. To increase credibility, they encourage the model to select the known feature. This is mentioned in the formal definition of when a linear model is credible.
- A credible model is assumed to be sparse, as the expert knowledge is assumed to be sparse. Credible models will result in dense weights among the known features, if the expert knowledge provided is indeed supported by the data.
**** The expert yielded estimates (EYE) penalty
- The naive approach would be to constrain weights for known relevant factors with the L2 norm, which maintains a dense structure, and then use the L1 norm for the D - K features, which maintains a sparse structure (see the sketch after this list).
- Due to sensitivity to the choice of hyperparameters for the naive solution, they propose the EYE penalty, which is obtained by fixing a level curve of q and scaling it for different contour levels.
- Which essentially just have to make sure that they do not bias the expert knowledge heavily
- EYE is beta free, where beta controls the trade-off between known and unknown features (and is a part of the naive way)
- EYE is a generalisation of both LASSO and the l_2 norm. Setting r = 1 and r = 0 (r being an indicator array used in the minimisation problem $\tilde{\Theta} = argmin_{\Theta} L(\Theta, X, y) + n \lambda J(\Theta, r)$, which indicates which features are selected by the domain expert), they recover the l_2 norm and LASSO penalties, respectively.
- EYE promotes sparse models
- EYE favors a solution whose weights $\tilde{\Theta}$ are sparse for attributes in D but not in K, and dense for attributes in K.
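- A minimal sketch of the *naive* penalty described above (L2 on the known set K, L1 elsewhere), not the EYE penalty itself, which removes the beta trade-off:
#+BEGIN_SRC python
import numpy as np

def naive_penalty(theta, r, beta=0.5):
    """theta: weight vector; r: 1 for features the expert marked as known, else 0."""
    known, unknown = theta[r == 1], theta[r == 0]
    return beta * np.sqrt((known ** 2).sum()) + (1 - beta) * np.abs(unknown).sum()
#+END_SRC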
*** Experiments
- Measuring credibility criterion: density in the set of known relevant features (K) and sparsity in the set of unknown features (D \ K). Also: maintained classification performance.
*** Conclusion
- Their incorporation of expert knowledge results in increased credibility, encouraging model adoption, while maintaining model performance.
- Through experiments on synthetic data (lol), they showed that sparsity-inducing regularisation such as LASSO does not always produce credible models. In contrast, EYE produces a model that is provably credible in the least squares regression setting.
- EYE produced a model that was significantly better at highlighting known important factors while being comparable in terms of predictive performance with other regularisation techniques, when applied to two large-scale patient risk tasks.
- EYE does not lead to worse performance when the expert is wrong (this is ensured by them not biasing features heavily when they come from K)
- They focused on the linear setting and one form of expert knowledge, which is a limitation!
- They do not claim that EYE is the optimal approach to yield credibility.
** Towards Explanation of DNN-based Prediction with Guided Feature Inversion
- Deep neural networks (DNN) have become an effective computational tool, but their prediction results are often criticised for a lack of interpretability, which is essential.
- Existing attempts based on local interpretations aim to identify relevant features contributing the most to the prediction of the DNN by monitoring the neighborhood of a given input (LIME!)
- These usually ignore the intermediate layers of the DNN (yes, they just focus on input/output, as they wish to be completely model agnostic). These might however contain rich information for interpretation.
- These guys propose to investigate a guided feature inversion framework for taking advantage of the deep architectures towards effective interpretation.
- Their proposed framework does not only determine the contribution of each feature in the input, but also provides insights into the decision-making process of DNN models.
- They further interact with the neuron of the target category at the output layer of the DNN, enforcing the interpretation result to be class-discriminative.
*** Introduction
- DNN models may learn biases from the training data.