* Graph alignment
** REGAL
*** Intro
- network alignment, or the task of identifying corresponding nodes in different networks, has applications across the social and natural sciences.
- REGAL (REpresentation learning-based Graph ALignment) + Motivated by recent advancements in node representation learning for single-graph tasks
+ a framework that leverages the power of automatically learned node representations to match nodes across different graphs.
- xNetMF, an elegant and principled node embedding formulation that uniquely generalizes to multi-network problems.
- network alignment or matching, which is the problem of finding corresponding nodes in different networks. + Crucial for identifying similar users in different social networks or analysing chemical compounds
- Many existing methods try to relax the computationally hard optimization problem, as designing features that can be directly compared for nodes in different networks is not an easy task.
- we propose network alignment via matching latent, learned node representations.
- *Problem:* Given two graphs G_1 and G_2 with nodesets V_1 and V_2 and possibly node attributes A_1 and A_2 resp., devise an efficient network alignment method that aligns nodes by learning directly comparable node representations Y_1 and Y_2, from which a node mapping $\phi: V_1 \rightarrow V_2$ between the networks can be inferred.
- REGAL is a framework that efficiently identifies node matchings by greedily aligning their latent feature representations.
- They use Cross-Network Matrix Factorization (xNetMF) to learn the representations
+ xNetMF preserves structural similarities rather than proximity-based similarities, allowing for generalization beyond a single network. + xNetMF is formulated as matrix factorization over a similarity matrix which incorporates structural similarity and attribute agreement between nodes in disjoint graphs.
+ Constructing the similarity matrix is expensive, as it requires computing the similarities between all pairs of nodes across the networks, so they extend the Nyström low-rank approximation, which is commonly used for large-scale kernel machines. + This makes xNetMF a principled and efficient implicit matrix factorization-based approach.
- our approach can be applied to attributed and unattributed graphs with virtually no change in formulation, and is unsupervised: it does not require prior alignment information to find high-quality matchings.
- Many well-known node embedding methods based on shallow architectures such as the popular skip-gram with negative sampling (SGNS) have been cast in matrix factorization frameworks. However, ours is the first to cast node embedding using SGNS to capture structural identity in such a framework
- we consider the significantly harder problem of learning embeddings that may be individually matched to infer node-level alignments.
*** REGAL Description
- Let G_1(V_1, E_1) and G_2(V_2, E_2) be two unweighted and undirected graphs (described in the setting of two graphs, but can be extended to more), with node sets V_1 and V_2 and edge sets E_1 and E_2; and possible node attribute sets A_1 and A_2.
+ Graphs do not have to be the same size
- Let n = |V_1| + |V_2|, so the amount of nodes across the two graphs.
- The steps are then:
1) *Node Identity Extraction:* Extract structure and attribute-related info from all n nodes
2) *Efficient Similarity-based Representation:* Obtains node embeddings, conceptually by factorising a similarity matrix of the node identities from step 1. However, the computation of this similarity matrix and the factorisation of it is expensive, so they extend the Nystrom Method for low-rank matrix approximation to perform an implicit similarity matrix factorisation by *(a)* comparing similarity of each node only to a sample of p << n so-called "landmark nodes" and *(b)* using these node-to-landmark similarities to construct the representations from a decomposition of its low-rank approximation.
3) *Fast Node Representation Alignment:* Align nodes between the two graphs by greedily matching the embeddings with an efficient data structure (KD-tree) that allows for fast identification of the top-a most similar embeddings from the other graph.
- The first two steps are the xNetMF method
**** Step 1
- The goal of REGAL’s representation learning module, xNetMF, is to define node “identity” in a way that generalizes to multi-network problems.
- As nodes in multi-network problems have no direct connections to each other, their proximity can't be sampled by random walks on separate graphs. This is overcome by instead focusing on more broadly comparable, generalisable quantities: Structural Identity which relates to structural roles and Attribute-Based Identity.
- *Structural Identity*: In network alignment, the well-established assumption is that aligned nodes have similar structural connectivity or degrees. Thus, we can use the degrees of the neighbours of a node as structural identity. They also consider neighbors up to k hops from the original node. + For some node $u \in V$, $R_u^k$ is then the set of nodes at exactly k hops from $u$. We could capture the degrees of these nodes in a vector $d_u^k$ of length $D$, the maximum degree in the graph, where the i'th entry $d_u^k(i)$ denotes the number of nodes in $R_u^k$ of degree $i$. This vector would however potentially be very long and very sparse, since a single high-degree node forces up the length of $d_u^k$. Instead, nodes are binned together into $b = \lceil \log_2 D \rceil$ logarithmically scaled buckets, so that entry $i$ of $d_u^k$ contains the number of nodes $v \in R_u^k$ such that $\lfloor \log_2(\deg(v)) \rfloor = i$. This is both much shorter (length $\lceil \log_2 D \rceil$) and more robust to noise (see the sketch at the end of this subsection).
- *Attribute-Based Identity*: Given $F$ node attributes, they create for each node $u$ an $F$-dimensional vector $f_u$ representing the values of $u$. So $f_u(i)$ = the i'th attribute of $u$.
- *Cross-Network Node Similarity*: Relies on the structural and attribute information rather than direct proximity: $sim(u,v) = exp[-\gamma_s \cdot \left\lVert d_u - d_v \right\rVert_2^2 - \gamma_a \cdot dist(f_u, f_v)]$ where $\gamma_s, \gamma_a$ are scalar params controlling the effect of structural and attribute-based identity, $dist(f_u, f_v)$ is the attribute-based distance of nodes $u$ and $v$, and $d_u = \sum_{k=1}^{K} \delta^{k-1} d_u^k$ is the neighbor degree vector for $u$ aggregated over $K$ different hops, where $\delta$ is a discount factor for greater hop distances and $K$ is the maximum hop distance to consider. So they compare structural identities at several levels by combining the neighborhood degree distributions at several hop distances. The distance between attribute-based identities depends on the type of node attributes: real-valued, categorical, and so on. For categorical attributes, the number of disagreeing features can be used as an attribute-based distance measure.
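- A minimal sketch of the identity extraction and similarity above, assuming networkx/numpy; the function names (=degree_histogram=, =node_identity=, =sim=) and the categorical attribute distance are my own illustration, not code from the paper:

#+begin_src python
import math
import numpy as np
import networkx as nx

def degree_histogram(G, nodes, b):
    """Log-binned degree counts: entry i counts nodes whose floor(log2(degree)) == i."""
    h = np.zeros(b)
    for v in nodes:
        d = G.degree(v)
        if d > 0:
            h[min(int(math.floor(math.log2(d))), b - 1)] += 1
    return h

def node_identity(G, u, K, delta, b):
    """d_u = sum_k delta^(k-1) * d_u^k, where d_u^k is the binned degree histogram
    of the nodes exactly k hops away from u."""
    hops = nx.single_source_shortest_path_length(G, u, cutoff=K)
    d_u = np.zeros(b)
    for k in range(1, K + 1):
        R_uk = [v for v, dist in hops.items() if dist == k]
        d_u += delta ** (k - 1) * degree_histogram(G, R_uk, b)
    return d_u

def sim(d_u, d_v, f_u=None, f_v=None, gamma_s=1.0, gamma_a=1.0):
    """sim(u, v) = exp(-gamma_s * ||d_u - d_v||^2 - gamma_a * dist(f_u, f_v))."""
    dist_attr = 0.0 if f_u is None else float(np.sum(f_u != f_v))  # categorical disagreement count
    return math.exp(-gamma_s * float(np.sum((d_u - d_v) ** 2)) - gamma_a * dist_attr)
#+end_src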
**** Step 2
- Avoids random walks due to two reasons:
1) The variance they introduce in the representation learning often makes embeddings across different networks non-comparable
2) they can add to the computational expense. For example, node2vec’s total runtime is dominated by its sampling time.
- Use an implicit matrix factorisation-based approach that leverages a combined structural and attribute-based similarity matrix S, which is a result of the sim function from step 1, and considers similarities at different neighborhoods.
- We need to find $n \times p$ matrices $Y$ and $Z$ such that $S \approx YZ^T$ where $Y$ is the node embedding matrix and $Z$ is irrelevant. Thus, we need to find these node embeddings $Y$ WITHOUT actually computing $S$.
+ Finding Y can naturally be done by computing S via sim() and then factorising it (e.g. by minimising the Frobenius norm of the reconstruction error). This is very expensive though. + It can also be done by creating a sparse matrix by computing only the "most important" similarities for each node, choosing only a small number of comparisons, for instance by looking at similarity of node degree. This is fragile to noise though.
- We will approximate S with a low-rank matrix $\tilde{S}$ which is never explicitly computed. We randomly select p << n "landmark" nodes chosen across both graphs G_1 and G_2 and then compute their similarities to all $n$ nodes in these graphs using the sim() function. This yields an $n \times p$ similarity matrix $C$. (Note that we only compute it for the $p$ landmark nodes, yielding the $n \times p$ matrix). From $C$ we can extract a $p \times p$ "landmark-to-landmark" matrix, which is called $W$. $C$ and $W$ can be used to approximate the full similarity matrix, which then allows us to obtain the node embeddings without ever computing and factorising the approximate similarity matrix $\tilde{S}$. To accomplish this, they extend the Nystrom method such that the low-rank matrix $\tilde{S}$ is given as: $\tilde{S} = CW^{\dag}C^T$. $C$ is the landmark-to-all similarity matrix and $W^\dag$ is the Moore-Penrose pseudoinverse of $W$, the landmark-to-landmark similarity matrix. The landmark nodes are chosen randomly, as more elaborate selection methods such as looking at node centrality are much less efficient and offer little to no improvement. Since $\tilde{S}$ contains an estimate for all similarities within the graphs, it would still take $n^2$ space, but luckily we never have to compute it.
- We can actually get the node embeddings $Y$ from a decomposition of the equation for \tilde{S}.
- Given graphs G_1(V_1, E_1) and G_2(V_2, E_2) with $n \times n$ joint combined structural and attribute-based similarity matrix $S \approx YZ^T$, its node embeddings $Y$ can then be approximated as: $\tilde{Y} \approx CU\Sigma^{1/2}$, where $C$ is the $n \times p$ landmark-to-all matrix and $W^\dag = U\Sigma V^T$ is the full rank singular value decomposition of the pseudoinverse of the small $p \times p$ landmark-to-landmark sim matrix W.
+ Given the full rank SVD of the $p \times p$ matrix $W^\dag$ as $U\Sigma V^T$, we can then write $S \approx \tilde{S} = C(U\Sigma V^T) C^T = (CU\Sigma^{1/2})(\Sigma^{1/2}V^T C^T) = \tilde{Y} \tilde{Z}^T$. + So we can compute $\tilde{Y}$ based only on the SVD of the small $p \times p$ matrix $W^\dag$ (cheap compared to factorising the full similarity matrix) and the matrix $C$. The p-dimensional node embeddings of the two graphs are then subsets of the rows of $\tilde{Y}$.
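- A sketch of this implicit factorisation under the equations above, reusing =sim()= from the previous sketch; the landmark selection and all names are illustrative, not the reference implementation:

#+begin_src python
import numpy as np

def xnetmf_embeddings(identities, p, seed=0):
    """identities: list of the n node identity vectors d_u (both graphs stacked).
    Returns Y_tilde = C U Sigma^{1/2}, whose rows are the node embeddings."""
    rng = np.random.default_rng(seed)
    n = len(identities)
    landmarks = rng.choice(n, size=p, replace=False)          # random landmark nodes
    C = np.array([[sim(identities[i], identities[l]) for l in landmarks]
                  for i in range(n)])                          # n x p landmark-to-all similarities
    W = C[landmarks, :]                                        # p x p landmark-to-landmark block
    U, s, Vt = np.linalg.svd(np.linalg.pinv(W))                # full SVD of W's pseudoinverse
    return C @ U @ np.diag(np.sqrt(s))                         # rows 0..|V1|-1 -> G1, the rest -> G2
#+end_src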
**** Step 3
- We have to efficiently align nodes, assuming $u \in V_1$, $v \in V_2$ may match if their xNetMF embeddings are similar. Let \tilde{Y}_1 and \tilde{Y}_2 denote the matrices of the p-dimensional embeddings of G_1 and G_2.
- We take the likeliness of (soft) alignment to be proportional to the similarity between the nodes’ embeddings. Thus, we greedily align nodes to their closest match in the other graph based on embedding similarity.
- A naive way of finding alignments for each node would be to compute similarities of all pairs between node embeddings (The rows of \tilde{Y}_1 and \tilde{Y}_2) and then choose the top-1 for each node. This is inefficient though.
- Instead, we store the embeddings of \tilde{Y}_2 in a k-d tree, which accelerates exact similarity search for nearest neighbor algorithms. For each node in G_1 we then query this tree with its embeddings to find the $a << n$ closest embeddings from nodes in G_2. This allows us to compute "soft" alignments where we return one or more nodes with the most similar embeddings. The similarity between the p-dimensional embeddings of $u$ and $v$ are defined as: $sim_{emb}(\tilde{Y}_1[u], \tilde{Y}_2[v]) = e^{-\left\lVert \tilde{Y}_1[u] - \tilde{Y}_2[v] \right\rVert_2^2}$, converting the euclidean distance to similarity.
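- A sketch of the alignment step with SciPy's k-d tree, assuming =Y1= and =Y2= are the embedding matrices produced above (variable names are mine):

#+begin_src python
import numpy as np
from scipy.spatial import cKDTree

def align(Y1, Y2, a=5):
    """Greedy soft alignment: for each node of G1, return the top-a most similar G2 embeddings."""
    tree = cKDTree(Y2)
    dists, idx = tree.query(Y1, k=a)      # Euclidean nearest neighbours in G2
    sims = np.exp(-dists ** 2)            # convert distances to the similarity defined above
    return idx, sims
#+end_src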
*** Complexity Analysis
- We assume both graphs have $n_1 = n_2 = n$ nodes.
1) *Extracting Node Identity:* Takes approximately $O(nKd_{avg}^2)$ time finding neighborhoods up to distance $K$, by joining the neighborhoods of neighbors at the previous hop. We can construct $R_u^k = \bigcup_{v \in R_u^{k-1}} R_v^1 - \bigcup_{i=1}^{k-1} R_u^i$ for node $u$. Could also be solved using breadth-first-search in time $O(n^3)$.
2) *Computing Similarities*: Similarities are computed of the length-b features (weighted counts of node degrees in the k-hop neighborhoods split into b buckets) between each node and the p landmark nodes in time: $O(npb)$
3) *Obtaining Representations*: Constructing the pseudoinverse $W^\dag$ and computing the SVD of this $p \times p$ matrix takes time $O(p^3)$ and then multiplying it with $C$ in time $O(np^2)$. Since $p << n$, total time is $O(np^2)$.
4) *Aligning Embeddings*: Constructing k-d tree and using it to find the top alignments in G_2 for each of the n nodes in G_1 is average-case time complexity $O(nlog(n))$.
- Total time complexity is then: $O(n \cdot \max(pb, p^2, Kd_{avg}^2, \log n))$
- It suffices to pick small values of $K$ and $p$ and to pick $b$ logarithmic in $n$. $d_{avg}$ denotes the average node degree, which is often small in practice.
*** Experiments
- They test on networks where they find a real network dataset with some adjacency matrix $A$. They then generate a new network with adjacency matrix $A' = P*A*P^T$ where $P$ is some randomly generated permutation matrix. Structural noise is added to $A'$ by removing edges with probability $p_s$ without disconnecting any nodes.
- For experiments with attributes, they generate synthetic attributes for nodes, if the graph does not have any. Noise is added by flipping binary values or choosing values randomly with probability $p_a$.
- In accuracy when noise is added, Regal using xNetMF and Regal using Struct2Vec (which is some other form of computing the embeddings) far outperform any other algorithms. Apparently struct2vec adds some noise as it samples something called contexts, which might add variance to it, which is why xNetMF likely wins with low noise. As noise grows however, struct2vec wins in accuracy, but not speed.
- When looking at attribute-based noise, REGAL outperforms FINAL(which uses a proximity embedding but handles attributes) in both accuracy and runtime, mostly. FINAL achieves slightly higher accuracy with small noise, due to it's reliance on attributes. FINAL incurs significant runtime increases as it uses extra attribute information.
- The sensitivity to changes in parameters is shown to be quite significant and they conclude that the discount factor \delta should be between 0.01 and 0.1. The hop distance K should be less than 3. Setting structural and attributed similarity to 1 does fairly well and that the top-a (using more than 1, such as 5 or 10) accuracy is significantly better than the top-1. Higher number of landmarks means higher accuracy and it should be $p = t*log_2(n)$ for $t \approx 10$.
- So it's highly scalable, it's suitable for cross-network analysis, it leverages the power of structural identity, does not require any prior alignment information, it's robust to different settings and datasets and it's very fast and quite accurate.
** Low Rank Spectral Network Alignment
*** Intro
- EigenAlign requires a memory which is linear in the size of the graphs, whereas most other methods require quadratic memory.
- The key step to this insight is identifying low-rank structure in the node-similarity matrix used by EigenAlign for determining matches.
- With an exact, closed-form low-rank structure, we then solve a maximum weight bipartite matching problem on that low-rank matrix to produce the matching between the graphs.
- For this task, we show a new, a-posteriori, approximation bound for a simple algorithm to approximate a maximum weight bipartite matching problem on a low-rank matrix.
- There are two major approaches to network alignment problems: local network alignment, where the goal is to find local regions of the graph that are similar to any given node, and global network alignment, where the goal is to understand how two large graphs would align to each other.
- The EigenAlign method uses the dominant eigenvector of a matrix related to the product-graph between the two networks in order to estimate the similarity. The eigenvector information is rounded into a matching between the vertices of the graphs by solving a maximum-weight bipartite matching problem on a dense bipartite graph
- a key innovation of EigenAlign is that it explicitly models nodes that may not have a match in the network. In this way, it is able to provably align many simple graph models such as Erdős-Rényi when the graphs do not have too much noise.
+ Even though it still suffers from the quadratic memory requirement.
*** Network Alignment formulations
**** The Canonical Network Alignment problem
- In some cases we additionally receive information about which nodes in one network can be paired with nodes in the other. This additional information is presented in the form of a bipartite graph whose edge weights are stored in a matrix L; if L_uv > 0, this indicates outside evidence that node u in G_A should be matched to node v in G_B.
**** Objective Functions for Network Alignment
- Describes the problem as seeking to find a matrix P which has 1 in index u,v, if u is matched with (only) v in the other graph.
- We then seek a matrix P which maximises the number of overlapping edges between G_A and G_B, so the number of adjacent node pairs should be mapped to adjacent node pairs in the other graph. We get an integer quadratic program.
- There is no downside to matches that do not produce an overlap, i.e. edges in G_A which are mapped to non-edges (node pairs with no edge between them) in G_B, or vice versa.
- They define the $AlignmentScore(P) = s_O(#overlaps) + s_N(#non-informative) + s_C(#conflicts)$, where the different $s$ are weights, such that $s_O > s_N > s_C$. This score defines a matrix $M$, which is massive. This matrix M is used to define a quadratic assignment problem, which is equivalent to maximising the AlignmentScore.
- One can however solve an eigenvector equation instead of the quadratic program, which is what EigenAlign does.
1) Find the eigenvector $x$ of M that corresponds to the eigenvalue of largest magnitude. M is of dimension $n_A n_B \times n_A n_B$, where $n_A$ and $n_B$ are the number of nodes in G_A and G_B, so the eigenvector is of dimension $n_A n_B$ and can thus be reshaped into a matrix X of size $n_A \times n_B$ where each entry represents a score for every pair of nodes between the two graphs. This is the similarity matrix, as it reflects the topological similarity between vertices of G_A and G_B.
2) Run bipartite matching on the similarity matrix X, that maximises the total weight of the final alignment.
- The authors show that the similarity matrix X can be represented through an exact low-rank factorisation. This allows them to avoid quadratic storage requirement of EigenAlign. They also present new fast techniques for bipartite matching problems on low-rank matrices. Together, this yields a far more scalable algorithm.
*** Low Rank Factors of EigenAlign
- Use power iteration (an iterative algorithm that computes the dominant eigenvector and eigenvalue of a diagonalisable matrix) on M to find the dominant eigenvector, which can then be reshaped into the sim matrix X (see the sketch at the end of this subsection). This can also be solved as an optimisation problem.
- If matrix X is estimated with the power-method starting from a rank 1 matrix, then the kth iteration of the power method results in a rank k+1 matrix that can be explicitly and exactly computed.
- We wish to show that the matrix X can be factorised via a two-factor decomposition: $X_k = UV^T$ for X of rank k.
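- For intuition, a sketch of the matrix-form power iteration that avoids ever materialising the $n_A n_B \times n_A n_B$ matrix M (the paper's contribution is the explicit low-rank factorisation of these iterates, which makes this cheaper still); the score weights below are placeholders chosen so that $s_O > s_N > s_C$:

#+begin_src python
import numpy as np

def eigenalign_similarity(A, B, sO=3.0, sN=2.0, sC=1.0, iters=50):
    """Power iteration for the dominant eigenvector of M, kept in matrix form X (n_A x n_B).
    A and B are symmetric 0/1 adjacency matrices of G_A and G_B."""
    Abar, Bbar = 1.0 - A, 1.0 - B            # non-edge indicator matrices
    X = np.ones((A.shape[0], B.shape[0]))
    for _ in range(iters):
        X = (sO * A @ X @ B                  # overlaps: edge mapped to edge
             + sC * (A @ X @ Bbar + Abar @ X @ B)   # conflicts: edge mapped to non-edge
             + sN * Abar @ X @ Bbar)         # non-informative: non-edge to non-edge
        X /= np.linalg.norm(X)               # normalise, as in standard power iteration
    return X                                 # reshaped dominant eigenvector of M
#+end_src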
** Cross-Network Embedding for Multi-Network Alignment
*** Intro
- Recently, data mining through analyzing the complex structure and diverse relationships on multi-network has attracted much attention in both academia and industry. One crucial prerequisite for this kind of multi-network mining is to map the nodes across different networks, i.e., so-called network alignment.
- CrossMNA is for multi-network alignment through investigating structural information only.
- Uses two types of node embedding vectors:
1) Inter-vector for network alignment
2) Intra-vector for other downstream network analysis tasks
- A crucial prerequisite for mining multi-networks is to map the nodes/participants among these related networks, network alignment.
- The shared participants among the networks are defined as anchor nodes, so they act like anchors aligning the networks they participate in, and the relationship among anchor nodes across networks are called anchor links. + In many cases, a few anchor links can be known beforehand (such as when people link their twitter or such on facebook)
+ Network alignment seeks to infer these unknown or potential anchor links.
- Most previous work assumes topology consistency such that a node tends to have a consistent connectivity structure across networks
- Attribute based methods are not applicable in many realistic scenarios, as the attribute information may be unreliable or incomplete. (such as usernames, gender or other profile information) + REGAL supports using attribute based information
- CrossMNA uses an additional vector named the "network vector", which is proposed to extract the semantic meaning of the network, which can reflect the difference of global structure among the networks.
- Also uses two kinds of embedding:
1) The inter-vector, which reflects the common features of the anchor nodes in different networks and is shared among the known anchor nodes
2) The intra-vector, which preserves the specific structural feature for a node in its selected network and is generated through the combination of network vector and inter-vector.
- REGAL proposes an embedding-based method based on the assumption that nodes with similar structural connectivity or degrees have a high probability to be aligned (The whole point of that degree vector)
+ Although one node may share some similar features in related networks, its local structural connections can be entirely different in each network due to the distinctiveness in network semantic meanings (You are likely to connect to very different people on facebook vs linkedin)
- There are the following challenges for network alignment algos:
1) Semantic diversity: Diversities in network semantics lead to different interactional behaviours of the same node in each network, which adds inaccuracies to the alignment. This problem is further worsened when considering more than two networks.
2) Data imbalance: Has two aspects; first, the size of each network may vary (which is OK within REGAL), second, the number of anchor links between each pair of networks can be unequal. Any pair-wise learning method will suffer a lot from this, as they only consider pairs of networks. So how to make full use of the information across ALL the networks to deal with the data imbalance problem is also a challenge.
3) Model storage: Network embedding is a practical approach to extract structural information of the node and has been applied in some network alignment methods (such as REGAL). However. In large-scale multi-networks, it is essential to take into account the space overhead of the methods (which is why REGAL never compute the similarity matrix, but comes up with the embedding in a clever way). REGAL still has to compute the embedding vector for most nodes though, which still takes up a lot of space. It does have the landmark nodes though, which alleviates this slightly.
- There is a network vector for each network. This reflects the difference of global structure among the networks. Thus, if the global structure of two networks is similar, their network vectors will be close in vector space.
- Each node has an inter-vector and an intra-vector. The inter-vector depicts the commonness of anchor nodes in different networks and is shared among the anchor nodes. The intra-vector reflects the specific structural feature of this node in its selected network, but is generated through a combination of the network vector and inter-vector
*** CrossMNA
- We suppose networks are unweighted and all the edges are directed, as an undirected edge can be divided into two directed edges.
- A set of networks is defined: G = ((G^1, G^2, ..., G^n), (A^{(1,2)}, A^{(1,3)}, ..., A^{(n-1, n)})), where each G^i represents a network and A^{(i,j)} represents the anchor nodes between G^i and G^j. Each network is defined from its nodes and edges.
- An anchor link between G^i and G^j is defined as (v_k^i, v_k^j) \in A^{(i,j)}. Anchor links follow the transitivity law
**** The Cross-Network Embedding
- The inter-vector u preserves the common features among the anchor nodes. Through training, the inter-vector of an unknown anchor node should get close to its counterparts in vector space. This inter-vector is difficult to learn directly, as there is no direct correlation between the unknown anchor nodes. They expect some anchor node in one network to both show similar structural features with its counterparts, but also distinctive connection relationships, due to the semantic meaning of a network.
- The intra-vector is straightforward, as it is easy to extract structural features of nodes into network embedding, named the intra-vector v. This contains both the commonness among counterparts and the specific local connections in its selected network due to the semantics, so it can NOT be applied to node matching, unless the impact of the network semantics can be removed.
- Thus, the authors are to present an equation to build a correlation among intra-vector, inter-vector and network semantics: v_i^k = u_i + r^k where r^k is the network vector which can extract the unique characteristics of G^k. Thus, we can learn the inter-vector of the anchor nodes indirectly, by training the combining-based intra-vectors. Thus, the network vector can be used to mitigate the "noise" added to the intra-vector. (u_4 = v^3_4 - r^3)
- The inter-vector and intra-vector can be in different vector spaces, which is solved by a transformation matrix W which is used to align them with different dimensions. So, v^k_i = Wu_i + r^k
- Like REGAL, they use a k-d tree to find the top-a most likely nodes in other networks. We need to compare the inter-vectors of nodes, to find the alignments.
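- A small sketch of the stated relation and the inter-vector lookup; the actual training objective for u, r and W (CrossMNA's loss) is not reproduced here, and all names are illustrative:

#+begin_src python
import numpy as np
from scipy.spatial import cKDTree

def intra_vector(u, r, W):
    """CrossMNA's combination: v_i^k = W u_i + r^k (u: inter-vector, r: network vector)."""
    return W @ u + r          # W has shape (d2, d1), u shape (d1,), r shape (d2,)

def top_a_candidates(U_query, U_other, a=5):
    """Soft alignment by comparing inter-vectors across networks with a k-d tree."""
    dists, idx = cKDTree(U_other).query(U_query, k=a)
    return idx
#+end_src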
*** Experiments
- CrossMNA outperforms REGAL in their tests, as REGAL assumes topology consistency (at least somewhat), so it performs poorly when they use datasets not having this.
- The dimension of the inter-vector and intra-vector (d1 and d2) should be set d1 = 200 or 300 and d2 = 30 or 50 ish to save memory in practice. When d1 grows, so does the performance
- With no prior known anchor links, CrossMNA uses as much space as REGAL
- CrossMNA is dramatically space efficient for large-scale multi-network applications though.
- The time complexity of learning embeddings is approximately $O(tN(d_1 d_2 |V| + d_2 |E|))$ where $t$ is the number of iterations, $N$ denotes the number of networks and $|V|$ and $|E|$ the number of nodes and edges in each network.
- Time complexity of finding soft alignments between each pair of networks is $O(|V| \log |V|)$.
* Graph Similarities
** Deep Graph Kernels
- In domains such as social networks, bioinformatics and robotics we are often interested in computing similarities between structured objects. Graphs offer a natural way to represent structured data.
- Consider the problem of identifying a subreddit on reddit. To tackle this problem, one can represent an online discussion thread as a graph where nodes represent users and edges represent whether two users interact. The task is then to predict which sub-community a certain discussion belongs to, based on its communication graph.
- One of the increasingly popular approaches to measure similarity between structured objects is that of kernel methods.
- A kernel method measures the similarity between two objects with a kernel function, an inner product in a reproducing kernel Hilbert space (RKHS). The challenge for using kernel functions is to pick a suitable kernel that captures the semantics of the structure while being computationally tractable.
+ Roughly speaking, this means that if two functions f and g in the RKHS are close in norm, i.e., ||f − g|| is small, then f and g are also pointwise close, i.e., |f(x) − g(x)| is small for all x.
- R-convolution is a general framework for handling discrete objects where the key idea is to recursively decompose structured objects into "atomic" (likely non-decomposable) sub-structures and define valid local kernels between them. Given a graph G, let \phi(G) denote a vector which contains counts of atomic sub-structures and (\cdot, \cdot)_H denote a dot product in RHKS H, then the kernel between two graphs G and G' is given by: K(G,G') = (\phi(G), \phi(G'))_H.
- This representation does however not take a number of important observations into account.
1) Sub-structures that are used to compute the kernel matrix are not independent. A popular substructure, graphlets, which is used for decomposing graphs, are defined as induced, non-isomorphic sub-graphs of size k. Graphlets exhibit strong dependence relationships, size k+1 graphlets can be derived from size k graphlets by addition of nodes or edges.
2) By increasing the size k, the number of unique graphlets increases exponentially, so when the number of features grows there is a sparsity problem: each graph contains only a tiny fraction of the exponentially many possible substructures, so two graphs rarely share many of them and their feature vectors become nearly orthogonal. This leads to the diagonal dominance problem, which is when a given graph is likely to be similar to itself, but not to any other.
- We would like a kernel matrix where all entries belonging to a class are similar to each other and dissimilar to everything else. Thus. Consider an alternative kernel between two graphs G and G': K(G,G') = \phi(G)^T * M * \phi(G') where M represents a |V| \times |V| positive semi-definite matrix that encodes relationship between sub-structures and V represents the "vocabulary" of sub-structures obtained from the training data. + This allows for one to design M in a clever way that respects the similarity within the given sub-structure space. This could be the edit-distance in spaces where there is a strong mathmatical relationship between sub-structures, such that one could design an M that respects the geometry of the space.
+ This geometry assumption can also be fulfilled by "learning" the geometry of the space through data.
- This paper proposes recipes for designing such M matrices for graph kernels.
- They propose two recipes:
1) They exploit an edit-distance relationship between sub-structures and directly compute M
2) They propose a framework that computes an M matrix by learning latent representations of substructures
- Their contributions are:
1) They propose a general framework that learns hidden representations of sub-structures used in graph kernels
2) They demonstrate their framework on three popular graph kernels: Graphlet kernel, Weisfeiler-Lehman subtree kernels, Shortest-Path kernels
3) They apply their framework to derive deep variants of string kernels which are a result of the R-convolution kernels
*** Graph Kernels
- The three graph kernel families are: those based on limited-size subgraphs, those based on subtree patterns, and those based on walks and paths.
- Let GG be a set of n graphs G_1, .., G_n. Let Y represent a set of labels associated with each graph in GG, where Y = y_{G_1}, ..., y_{G_n}
- Given some G = (V, E) and H = (V_H, E_H), H is a subgraph of G iff there exists an injective mapping a : V_H -> V such that (v,w) \in E_H iff (a(v), a(w)) \in E.
- A graph G is labeled if there is a function l : V -> \Sigma that assigns labels from some alphabet \Sigma to vertices in G. Likewise, a graph is unlabeled if there is nothing to distinguish between nodes, apart from their interconnectiveness.
- K(G,G') is a kernel function which measures similarity between G and G'.
- The graph classification problem is to map graphs into two or more categories. Given a set of graphs GG and labels Y, we should learn to map graphs to labels within Y.
**** Graph Kernels based on subgraphs
- A graphlet Gr is an induced and non-isomorphic sub-graph of size k. Let V_k = (Gr_1, Gr_2, ..., Gr_{n_k}) be the set of size k graphlets where n_k denotes the number of unique graphlets of size k. Given two unlabeled graphs G and G', the graphlet kernel is defined: K_{GK}(G, G') = (f^G, f^{G'}) where f^G and f^{G'} are vectors of normalised counts, that is, the i'th component of f^G denotes the frequency of graphlet Gr_i occuring as a subgraph of G and (\cdot, \cdot) is the euclidean dot product.
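- A toy sketch for k = 3, where a size-3 graphlet's isomorphism class is fully determined by its number of edges (0, 1, 2 or 3); real implementations sample graphlets rather than enumerating all triples:

#+begin_src python
import itertools
import numpy as np
import networkx as nx

def graphlet3_features(G):
    """Normalised counts of the four size-3 graphlets over all induced node triples."""
    counts = np.zeros(4)
    for trio in itertools.combinations(G.nodes(), 3):
        counts[G.subgraph(trio).number_of_edges()] += 1   # 0..3 edges -> graphlet class
    return counts / counts.sum()

def graphlet_kernel(G1, G2):
    """K_GK(G, G') = dot product of the normalised graphlet frequency vectors."""
    return float(np.dot(graphlet3_features(G1), graphlet3_features(G2)))
#+end_src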
**** Graph kernels based on subtree patterns
- These decompose the graph into its subtree patterns; the Weisfeiler-Lehman subtree kernel is in this family.
- Requires a labeled graph in which we can iterate over each vertex and its neighbors in order to create a multiset label.
- The multiset at every iteration consists of the label of the vertex and the sorted labels of its neighbors, this multiset is then given a new label, which can be used for the next iteration.
- To compare graphs, we then simply count the co-occurrences of each label.
- Given G and G', the Weisfeiler-Lehman subtree kernel is then: K_{WL}(G,G') = (1^G, 1^{G'}), where (.,.) denotes the euclidean dot product. If we assume h iterations of relabeling, then 1^G consists of h blocks s.t. the i'th component in the j'th block of 1^G contains the frequency of which the i'th label was assigned to a node in the j'th iteration.
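- A compact sketch of the relabeling and the resulting kernel value; using the full multiset string as the new label is an illustrative shortcut for the usual label-compression step (equal strings correspond to equal compressed labels):

#+begin_src python
from collections import Counter

def wl_histogram(G, labels, h):
    """Weisfeiler-Lehman relabeling: a node's new label is its old label plus the sorted
    labels of its neighbours; returns counts of all labels seen over h iterations."""
    hist = Counter(labels.values())
    for _ in range(h):
        labels = {v: str(labels[v]) + "|" + ",".join(sorted(str(labels[u]) for u in G.neighbors(v)))
                  for v in G.nodes()}
        hist.update(labels.values())
    return hist

def wl_kernel(G1, l1, G2, l2, h=3):
    """K_WL(G, G') = dot product of the label-count vectors."""
    h1, h2 = wl_histogram(G1, l1, h), wl_histogram(G2, l2, h)
    return sum(h1[lab] * h2[lab] for lab in h1.keys() & h2.keys())
#+end_src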
**** Graphs Kernels based on random-walks
- Decomposes graphs into random walks or paths and then counts the co-occurence of random walks or paths in two graphs.
- Let P_G represent the set of all shortest paths in G and p_i \in P_G denote a triplet (l_s^i, l_e^i, n_k) where n_k is the length of the path and l_s^i and l_e^i are the labels of the starting and ending vertices. The shortest-path kernel for LABELED graphs G and G' is then: K_{SP}(G,G') = (P^G, P^{G'}) where the i'th component of P^G contains the frequency of the i'th triplet occurring in graph G.
- Does this still use the Euclidean dot product? (Yes — as with the other kernels above, the base formulation takes the Euclidean dot product between the frequency vectors.)
**** General
- All graphs kernels mentioned are instances of the R-convolution framework.
- The recipe for defining graph kernels is as follows:
1) Recursively decompose a graph into its subgraphs (Graphlet kernel decomposes into sub-graphs, Weisfeiler-Lehman decomposes into subtrees and shortest-path decomposes into shortest-paths (lol))
2) The decomposed sub-structures are then represented as a vector of frequencies where each item in the vector represents how many times a given sub-structure occurs in the graph
3) The euclidean space or some other domain specific RKHS is used to define the dot product between the vectors of frequencies
*** Methodology
**** Sub-structure similarity via edit distance
- How to compute an M matrix by using the edit-distance relationship between sub-structures
- When substructures exhibit a clear mathematical relationship, one can exploit the underlying similarities between substructures to compute a matrix M.
- For graphlet kernels, one can use the edit-distance relationship to encode how similar graphlets are.
- Given graphlet Gr_i of size k and a graphlet Gr_j of size k+1, we can build an undirected edit-distance graph UED-Graph by adding an undirected edge from G_i to G_j iff G_i can be obtained from G_j by deleting a node from G_j or vice versa. Given such a UED-G, one can compute the shortest path between G_i and G_j in order to compute their edit distance. Now, we can simply compute the matrix M directly. However, the cost of computing the shortest-path distances on UED-G becomes very expensive as a function of k. Thus, one can instead of creating the matrix M of size |V| x |V|, create a much smaller one of size |V'| x |V'| for V' << V, but only taking the observed sub-structures into account.
**** Sub-Structure Similarity via Learning
- The second approach is to LEARN the latent representation of sub-structures, by using language modeling and deep learning techniques. These learned representations are then utilised to compute an M that respects similarities between sub-structures.
- *Neural Language Models*: Traditional language models estimate the likelihood of a sequence of words appearing in a corpus. Given some sequence of training words (w_1, w_2, ..., w_T) n-gram based language models then aim to maximise the following probability: $Pr(w_t\ |\ w_1, ..., w_{t-1})$, so they estimate the likelihood of seeing some word, given all the prior.
- Recent work in language modeling focus on distributed vector representations of words, word embeddings. Neural language models improve classic n-gram language models, by using continuous vector representations for words.
- Note: Word embeddings are words mapped into a d-dimensional embedding space such that similar words are mapped to similar positions in that space.
- Unlike traditional n-gram models, these neural language models take advantage of the notion of context
+ A context is defined as a fixed number of preceding words
- The objective of word embedding models is to maximise $\sum_{t=1}^T \log Pr(w_t\ |\ w_1, \dots, w_{t-1})$ where $w_{t-n+1}, \dots, w_{t-1}$ is the context of $w_t$.
- *Continuous Bag-of-words*: Used to approximate the above objective. Predicts the current word given the surrounding words within a given window. Similar to feed-forward neural network language models where the non-linear hidden layer is removed and the projection layer is shared for all words.
- Tries to maximise the objective: $\sum_{t=1}^T \log Pr(w_t\ |\ w_{t-c}, \dots, w_{t+c})$ where c is the length of the mentioned context. This objective is computed using softmax.
- *Skip-gram model*: Maximises co-occurrence probability among the words that appear within a given window. So instead of predicting the current word based on surrounding words, we predict the surrounding words given the current word. So the objective of skip-gram is: $\sum_{t=1}^T \log Pr(w_{t-c}, \dots, w_{t+c}\ |\ w_t)$ where the probability is computed as: $\prod_{-c \leq j \leq c, j \neq 0} Pr(w_{t+j}|w_t)$. This probability is again computed sort of like the softmax.
- Hierarchical softmax and negative sampling are used in training the skip-gram and CBOW models.
- Hierarchical softmax uses a binary huffman tree
- Negative sampling selects the contexts at random instead of considering all words in the vocabulary. If a word w appears in the context of another word w', then the vector representation of the word w is closer to the vector representation of w'.
- Once training converges, similar words are mapped to similar positions in the vector space. The learned word vectors are empirically shown to preserve semantics. Word vectors can be used to answer analogy question susing simple vector algebra where the result of a vector calculation v("Madrid") - v("Spain") + v("France") is closer to v("Paris") than any other word vector. + So we view sub-structures in graph kernels as words that are generated from a special language. So different sub-structures compose graphs in a similar way that words compose a sentence when used together.
**** Deep Graph Kernels
- The framework takes list of graphs GG and decomposes each into substructures.
- List of substructures for each graph is treated as a sentence which is generated from some vocabulary V, where V is the unique set of observed sub-structures in the training data (that whole V' << V thing)
- We need to generate a corpus where the co-occurrence relationship is meaningful
- *Corpus generation for graphlet kernels*: Exhausting all graphlets is very expensive. Instead one can perform random sampling: Random sampling of graphlets of size k for a graph G involves placing a randomly generated window of size k x k on the adjacency matrix of size G and collecting the observed graphlet within this window. This is done n times, if we want n graphlets. As random sampling preserves no notion of co-occurence, the scheme is slightly altered by using the notion of neighborhoods. Whenever we sample a graphlet, we also sample its immediate neighbors. The graphlet and its neighbors are then interpreted as co-occured. Thus, graphlets with similar neighborhoods will acquire similar representations.
- *Corpus Gen for shortest path*: For every shortest path, take every sub-path as co-occured shortest path
- *Corpus gen for weisfeiler-lehman*: Not clear. I suppose it is all multiset labels for any given iteration h are co-occured.
**** Algorithm
1) Choose a graph kernel
2) Construct similarity matrix M:
- Build substructure vocabulary V
- Construct the co-occurences
- Apply CBOW or Skip-gram to get the embeddings
- Calculate sim matrix M
3) Decompose graph into substructures
4) Build histogram vector (the frequencies of the substructures) \phi(G)
5) Compute graph kernel as K(G, G') = \phi(G)^T * M*\phi(G')
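- A sketch of the whole pipeline, assuming the corpus of co-occurring sub-structure ids has been generated as described above and that gensim (>= 4) is available for the skip-gram step; taking M as the Gram matrix of the learned embeddings is one natural choice and may differ in detail from the paper's exact construction:

#+begin_src python
import numpy as np
from gensim.models import Word2Vec   # assumed dependency for the skip-gram step

def deep_kernel(sentences, phi1, phi2, vocab):
    """sentences: lists of co-occurring sub-structure ids (the generated corpus);
    phi1, phi2: frequency (histogram) vectors of the two graphs over `vocab`."""
    model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, sg=1)  # skip-gram
    E = np.array([model.wv[s] for s in vocab])   # one learned embedding per sub-structure
    M = E @ E.T                                  # sub-structure similarity matrix
    return float(phi1 @ M @ phi2)                # K(G, G') = phi(G)^T M phi(G')
#+end_src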
*** Experiments
- Under noise, the edit-distance graphlet kernel beats the base kernel of all the datasets except for one. Likely due to EGK only using a mathematical relationship between sub-structures rather than learning a sophisticated relationship. The Deep learning graphlet kernel thing which learns (DGK) outperformed all base kernels significantly, except for one, which is a different one from which one beat EGK.
- In regards to accuracy, DGK slightly outperforms EGK on all datasets.
- Running time is measured in seconds. What.
** Matching Node Embeddings for Graph Similarity
- Most graph kernels focus on local properties of graphs and ignore global structure (Not really the case with DGK)
- In the heart of graph kernels lies a positive semidefinite kernel function k. Once such a function k : X x X -> R is defined for a set X, it is known that there exists a map $\phi : X -> H$ into a hilbert space H s.t. $k(x,x') = (\phi(x), \phi(x'))$ for all $x, x' \in X$ where (.,.) is the inner product in H.
- Most existing graph kernels compare specific substructures of graphs (this is what DGK does! Or at least the kernels they use)
+ So these algos focus on local properties of graphs and ignore global structure
- The goal of this paper is to fix the problems related to kernels focusing on local substructures. This is accomplished by using algos that utilise features describing global properties of graphs.
- They present two algos designed to compare pairs (pairs!) of graphs based on their global properties. They are applicable to both labeled and unlabeled graphs.
1) Each graph is represented as a collection of the embeddings of its vertices.
2) The vertices of the graphs are embedded in the euclidean space using the eigenvectors of the corresponding adjacency matrices
3) The similarity between pairs of graphs is measured by computing a matching between their sets of embeddings
- Two algos are employed.
1) One casts the problem as an instance of the Earth Mover's Distance and uses it, for a set of graphs, to build a similarity matrix. This sim matrix is however not always positive semidefinite, so an SVM classification algorithm using indefinite kernels is applied, which treats the indefinite similarity matrix as a noisy observation of the true positive semidefinite kernel.
2) Corresponds to a technique adapted from the pyramid match kernel and yields a positive semidefinite matrix. This method is called the Pyramid Match graph Kernel.
*** Prelims
- Graphs are defined as usual: $G = (V,E)$.
- A labeling function $l : V -> L$ assigns labels from a set of labels L to the vertices.
- Given a graph G, its vertices can be represented as points in a vector space using a node embedding algorithm. In this paper, embeddings are generated for vertices of a graph using the eigenvectors of its adjacency matrix A. + Given the eigenvalue decomposition of A, $A = U \Lambda U^T$, the i'th row $u_i$ of $U$ corresponds to the embedding of vertex $v_i \in V$.
+ These capture global properties of graphs and offer a powerful and flexible mechanism for performing machine learning tasks on them. + A is real and symmetric -> its eigenvalues $\lambda_1, ..., \lambda_n$ are real.
+ The graphs contain no self-loops, so the diagonal of A is all zeros and Tr(A) = 0; since the trace equals the sum of the eigenvalues, the eigenvalues sum to zero.
- The eigenvectors with the largest eigenvalues share some interesting properties.
- The eigenvector with the largest eigenvalue is special, as the i'th component of this vector gives the eigenvector centrality score of vertex v_i in the graph. + Eigenvector centrality is a measure of global connectivity of a graph which is captured by the spectrum of the adjacency matrix.
- We can also use the next to largest eigenvalues. Note that we work with the magnitude, so the sign of the eigenvalue is irrelevant.
- A graph G can be represented as a bag-of-vectors. E.g. graph G can be represented as the set: $(u_1, u_2, ..., u_n)$, where each vector of the set corresponds to the representation of each vertex $u_i \in V$.
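- A minimal sketch of this bag-of-vectors construction (names are mine):

#+begin_src python
import numpy as np

def eigen_embeddings(A, d):
    """Embed the vertices using the d eigenvectors of the (symmetric) adjacency matrix A
    with largest |eigenvalue|; row i is the embedding of vertex i."""
    w, U = np.linalg.eigh(A)            # real spectrum since A is symmetric
    top = np.argsort(-np.abs(w))[:d]    # magnitude matters, the sign does not
    return U[:, top]
#+end_src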
*** Earth Mover's Distance
- The similarity of two graphs G_1 and G_2 is formulated as the minimum travel cost between the two graphs, obtained from a linear program (a transportation problem) that minimises the total cost of moving the "mass" of one set of node embeddings onto the other, where the cost between two vertices is the Euclidean distance of their embeddings (see the sketch at the end of this subsection).
- Note these guys work on labeled data and we want the distance between pairs of vertices with different labels to be large, thus it is just set to the largest possible value (which is apparently $\sqrt{d}$.)
- Let D be the matrix of all pair-wise distances between the vertices of graphs G_1 and G_2. If D is a Euclidean Distance Matrix (EDM), then we could use D to define a positive semi-definite kernel matrix K: $K = -\frac{1}{2} JDJ$, where $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix.
- However, in our setting, K is not positive semi definite, as D is not euclidean, we can use the SVM trick to convert it into such.
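- A sketch of the EMD computation as a transportation LP with uniform node masses and Euclidean costs; =scipy.optimize.linprog= stands in for whatever solver the authors actually used:

#+begin_src python
import numpy as np
from scipy.optimize import linprog

def emd(X1, X2):
    """Earth Mover's Distance between two uniform bags of node embeddings (rows of X1, X2)."""
    n1, n2 = len(X1), len(X2)
    D = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)   # n1 x n2 cost matrix
    # flow variables f_ij flattened row-major; each row ships 1/n1, each column receives 1/n2
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        A_eq[i, i * n2:(i + 1) * n2] = 1
    for j in range(n2):
        A_eq[n1 + j, j::n2] = 1
    b_eq = np.concatenate([np.full(n1, 1 / n1), np.full(n2, 1 / n2)])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
#+end_src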
*** Pyramid Match Graph Kernel
- Based on the Pyramid Match Kernel
- Generates PSD (positive semi definite) kernel matrices
- The basic idea is to map the bag-of-vector representations of graphs to multi-resolution histograms and then compare these histograms with a weighted histogram intersection measure in order to find an approximate correspondence between the two sets of vectors.
- The algorithm works by partitioning the feature space into regions of increasingly larger size and taking a weighted sum of the matches that occur at each level.
+ Two points are said to match if they fall into the same region. + The number of matches at a given level is measured by histogram intersection (summing, over all bins, the minimum of the two bin counts); a sketch follows at the end of this subsection.
- Apparently the pyramid match kernel is a Mercer kernel, so by computing it for all pairs of graphs, a PSD kernel matrix can be built.
- Complexity is $O(dnL)$ for n nodes.
- Works for unlabeled graphs
- Can be modified to work for labeled graphs.
+ Only vertices that share label should be able to be matched + Instead of representing each graph as a set of vectors, they can be represented as a set of sets of vectors, where each internal set corresponds to a specific label and contains embeddings of the vertices with that label.
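- A rough sketch of the pyramid match computation for unlabeled graphs, assuming embeddings rescaled to the unit hypercube and per-dimension partitions; the bin layout and level weights follow the standard pyramid match scheme and may differ in detail from the paper:

#+begin_src python
import numpy as np

def pyramid_match(U1, U2, L=4):
    """At level l each dimension is split into 2^l intervals; matches are counted by
    histogram intersection, and new matches at level l are weighted by 1/2^(L-l)."""
    def hist(U, l):
        k = 2 ** l
        bins = np.clip((U * k).astype(int), 0, k - 1)          # interval index per dimension
        return np.array([[np.sum(bins[:, j] == c) for c in range(k)]
                         for j in range(U.shape[1])])          # shape: (d, 2^l)
    I = [np.minimum(hist(U1, l), hist(U2, l)).sum() for l in range(L + 1)]
    return I[L] + sum((1.0 / 2 ** (L - l)) * (I[l] - I[l + 1]) for l in range(L))
#+end_src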
*** Experiments
- They test on all the graph kernels used in DGK (Weisfeiler-Lehman, Graphlet, Shortest path and random walk)
- The EMD and Pyramid Match (PM) methods works well for unlabeled graphs
- The PM is in general very good
- They also did very well on labeled graphs.
- The EMD and PM are slow though
** NetLSD: Hearing the Shape of a Graph
- Ideally graph comparisons should be invariant to the order of nodes and the sizes of compared graphs, adaptive to the scale of graph patterns and scalable
- They present the Network Laplacian Spectral Descriptor (NetLSD)
- Is a permutation, size-invariant, scale-adaptive and efficiently computable graph representation method that allows for straightforward comparisons of large graphs.
+ Permutation-invariance refers to the node order and formally states that if two graphs are isomorphic, they should have distance 0. + Scale-adaptive means it can handle comparisons both at a local level and at a global level, and that the representation should contain both local and global features.
+ Size-invariance means that if two graphs essentially show the same thing but at different sizes, their distance from each other should be 0.
- NetLSD extracts a compact signature that inherits the formal properties of the Laplacian spectrum, specifically its heat or wave kernel.
- Grounded in spectral graph theory, NetLSD allows for constant time similarity computations at several scales
- So essentially, the graph of nodes and edges is summarised as a curve over an x and y axis, where the x axis is the scale (time) parameter and the y axis the corresponding trace value.
*** Related work
**** Direct methods
- Stuff like graph edit distance is heavy
**** Kernel methods
- No graph kernel achieves both scale-adaptive and size-invariant graph comparisons.
- Kernels are expensive to compute
**** Statistical representations
- Quadratic time complexity
**** Spectral representations
- Spectral graph theory is effective in the comparison of 3D objects
- Apparently it's clever
*** Problem statement
- G = (V,E)
- A representation is a function $\sigma : G -> R^N$ from any graph G within a collection of graphs to an infinite-dimensional real vector. Element j of the representation is denoted as $\sigma_j(G)$.
- A representation-based distance is a function $d^\sigma : R^N \times R^N -> R_0^+$ on the representations of two graphs $G_1, G_2 \in G$, that returns a positive real number.
- The distance should be pseudometric, so it should be symmetric and support the triangle inequality.
*** NetLSD
- A useful metaphor is that of heating the graph's nodes and observing the heat diffusion as time passes. Another is that of a system of masses corresponding to the graph's nodes and springs corresponding to its edges. The propagation of mechanical waves through the graph is another way to capture its structural invariants. In both cases, the overall process describes the graph in a permutation-invariant manner, and embodies more global information as time elapses. Their representation employs a trace signature encoding such a heat diffusion or wave propagation process over time.
- Two graphs are compared via the L_2 distance among trace signatures sampled at selected time scales.
**** Spectra as representations
- The spectrum of a graph is defined as the set of eigenvalues of its Laplacian matrix (derived from the adjacency matrix, e.g. L = D - A or its normalised variant).
- The laplacian spectrum encodes important graph properties such as the normalised cut size used in spectral clustering. Likewise, the normalised laplacian spectrum can determine whether a graph is bipartite, but not the number of its edges.
- Thus, rather than consider the laplacian spectrum per se, they consider an associated heat diffusion process on the graph to obtain a more expressive representation in a manner reminiscent of random walk models.
- So the main idea is that we consider the heat equation based on the graph Laplacian: $\frac{\partial u_t}{\partial t} = - L u_t$ where $u_t$ are scalar values on vertices representing the heat of each vertex at time t. The solution provides the heat at each vertex at time t, when the initial heat $u_0$ is initialised with a fixed value on one of the vertices. + Its closed-form solution is the $n \times n$ heat kernel matrix $H_t = e^{-tL}$, which can be computed directly by exponentiating the Laplacian eigenspectrum ($H_t = \Phi e^{-t\Lambda} \Phi^T$).
- However, as the heat kernel involves pairs of nodes, it is not directly usable to compare graphs, so they consider instead the heat trace at time t: $h_t = tr(H_t) = \sum_j e^{-t\lambda_j}$ (see the sketch at the end of this subsection).
- The NetLSD representation consists then of a heat trace signature of graph G, i.e. a collection of heat traces at different time scales, $h(G) = \{h_t\}_{t>0}$.
- An alternative is the wave equation, which is pretty much the same: $\frac{\partial^2 u_t}{\partial t^2} = - L u_t$, and the wave trace signature is then: $w_t = tr(W_t) = \sum_j e^{-it \lambda_j}$.
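- A minimal sketch of the heat trace signature and the resulting distance, assuming the normalised Laplacian and a log-spaced time grid (the grid endpoints are an assumption):

#+begin_src python
import numpy as np
import networkx as nx

def heat_trace_signature(G, times=np.logspace(-2, 2, 250)):
    """h_t = sum_j exp(-t * lambda_j) over the normalised Laplacian spectrum."""
    lam = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(G).toarray())
    return np.array([np.exp(-t * lam).sum() for t in times])

def netlsd_distance(G1, G2):
    """Compare two graphs via the L2 distance between their heat trace signatures."""
    return float(np.linalg.norm(heat_trace_signature(G1) - heat_trace_signature(G2)))
#+end_src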
**** Scaling to large graphs
- Full eigendecomposition of the Laplacian, $L = \Phi \Lambda \Phi^T$, takes $O(n^3)$ time and $\Theta(n^2)$ memory.
- This would let them compute the signatures exactly, but such direct computation is infeasible for large graphs, so they need to approximate the heat trace signatures.
- The first proposal is to use a Taylor expansion, which allows them to compare two graphs locally in $O(m)$ time, where m is the number of edges.
+ Is useful on very large graphs on which eigendecomposition is prohibitive, however, for manageable graph sizes we adopt a more accurate strategy based on approximating the eigenvalue growth rate.
- They compute k eigenvalues on both ends of the spectrum (the k smallest and the k largest) and interpolate the intervening eigenvalues assuming linear growth.
**** Properties of heat trace
- *Permutation invariance:* Isomorphic graphs are isospectral, hence their respective heat trace signatures are equal
- *Scale-adaptivity:* The value of t (the time parameter) can be tuned to capture either local connectivity (at low values) or global connectivity (at large values).
- *Size-invariance:* We can normalise the heat trace signatures, thus making it size-invariant.
*** Experiments
- NetLSD is very scalable.
* Explaining Outliers and Glitches
** Empirical Glitch Explanations
- Data glitches are unusual observations that do not conform to data quality expectations + Can be both logical, semantic or statistical
- Data integrity constraints can potentially flag large sections of data as being non-compliant, which is not ideal, as ignoring or repairing significant sections of the data could bias the results and conclusions drawn from the analyses
- In the context of big data, large numbers and volumes of feeds from disparate (a lot of different) sources are integrated and as such, it is likely that significant portions of this data seems noncompliant, while it is actually legitimate data.
- They introduce Empirical Glitch Explanations, which are concise, multi-dimensional descriptions of subsets of potentially dirty data and propose a scalable method for empirically generating such explanatory characterisations
- These explanations could serve two valuable functions
1) Provide a way of identifying legitimate data and releasing it back into the pool of clean data, reducing cleaning-related statistical distortion of the data
2) Used to refine existing data quality constraints and generate and formalise domain knowledge
*** Introduction
- Much attention has been paid to identifying data quality constraint violations and developing cleaning strategies, while not much focus has been on whether all data that is noncompliant should be repaired or all data which violates a constraint should be treated homogeneously (i.e. all data violating is treated equally)
- By incorrectly (or when it's not needed) repairing noncompliant data, we risk changing the data to such an extent that it is unrecognisable and thus we suffer high statistical distortion.
+ And conclusions drawn from this could be misleading
- Data constraints are usually fairly broad and as a result, they flag much data as suspect. It is thus critical to study this data for additional, potentially explanatory relationships in the data that could reduce the cost and distortion associated with cleaning and it might yield additional knowledge of the data.
|
||
- Data quality is highly domain and context dependent and any empirical method that allows for the gathering of domain knowledge is valuable in itself.
|
||
- This paper shows that it is possible that significant portions of data violating constraints, actually have valid explanations and can thus be released back into the pool of clean data unaltered.
|
||
- Identifying empirical explanations for seemingly suspicious data based on attribute patterns is a valuable contribution to the data quality process.
|
||
**** An example
|
||
- They have the constraint "Any given phone number must have only one record associated with it" on data from some big database.
|
||
- They then find several duplicates, where they present three instances where each phone number occurs thrice.
|
||
- In this example, the first phone number actually has missing fields as well and is thus likely a result of corrupt data and this should likely be regarded as bad data.
|
||
- The other two cases, however, are likely results of phone numbers used for some specific purpose and should as such be added back to the clean pool; the constraint should then be extended to allow for these two types of cases, as there might be more in the future. This reasoning allows 66% of the violating data to be returned to the clean data pool.
|
||
**** Related Work
|
||
- Statistical distortion is the distortion in data caused by well-intentioned data repair efforts and it was introduced as a critical criterion for measuring the utility of data cleaning strategies.
|
||
**** Their contributions
|
||
- They seek to explain seemingly anomalous data by empirically discovering patterns and characterising subsets that can be returned to the clean data pool, thus reducing statistical distortion as no unnecessary repairs have to be done.
|
||
- They introduce the notion of explainable glitches, which are seeming violations that can be collectively described by a succinct empirical description. These descriptions can potentially explain the glitches, either by consulting subject matter experts or via other heuristics. They can serve two valuable functions:
|
||
1) Provide a way of identifying legitimate data and releasing it back and in doing so reducing statistical distortion
|
||
2) Refine the existing data quality constraints and generate and formalise domain knowledge
|
||
- They propose a robust and scalable method for empirically generating the explanations by developing the new notion of cross subsampling, which creates subsets that are similar to the noncompliant set. In doing so, they reduce the redundancy of the resampling procedure caused by the disparity in sizes between dataset D and the suspicious subset A, ensuring that the results are statistically significant.
|
||
- Define two objective metrics, size and merit, for evaluating and ranking explanations. These make the method flexible and customisable depending on the application.
|
||
- They evaluate the methodology within a comprehensive experimental framework using real and synthetic data sets and explore the robustness and scalability of explanations. They are able to retain 99% of the data flagged as suspicious.
|
||
*** Problem description
|
||
1. We are given a dataset D with N rows (records) and d columns (attributes) and a constraint C.
|
||
2. Constraints are rules (logical, semantic, statistical) that are imposed on data to ensure conformity to expectations about the data; "Any given phone number must have only one record associated with it".
|
||
3. Let subset A consist of all suspicious records in D that violate C. In the example, A would consist of the 9 records violating C, as the three phone numbers each belonged to 3 records.
|
||
4. In absence of explanations, the problematic set Q that needs to be cleaned is given by Q = A.
|
||
5. The objective is then to reduce the size of Q by identifying portions of A that can in fact be explained as clean, using characteristics derived from the other attributes and data values.
|
||
6. This cuts the cost of cleaning and reduces distortion.
|
||
7. A cleaning process typically changes the data by making educated guesses about the correct values.
|
||
8. We wish to generate empirical explanations E, each of which will describe a set of records $P \subseteq A$. Explanations are of the form $\{s_j\}$, where s_j describes a condition on a value v_j in the suspicious set A.
|
||
9. These explanations for the phone number example could be: E_1 = "blank is frequent and occurs in multiple attributes", E_2 = "ID_5 in attributes 1 and 6, new hire, d2300", E_3 = "ID_13, A132, D8000" where E_1 essentially means that the data is bad, as a lot of the attributes are left blank, but E_2 might mean that the "new hires are assigned their supervisor's phone number" and E_3 could mean that "members of the same department are working for the same supervisor and they share the same physical room and thus phone". So only E_1 was essentially problematic.
|
||
*** Their approach
|
||
- They take a nonparametric approach + This ensures general applicability that is agnostic to any underlying data distribution
|
||
- Main steps are:
|
||
1) Identify the set A by applying constraint C to D. In the absence of any explanation, the entire set A is deemed suspicious.
|
||
2) For each value $v \in A$, generate a propensity signature $s$. This signature is probabilistic and should capture the propensity (an inclination or natural tendency to behave in a particular way) of occurrence of a value $v$ across all records and attributes of $A$.
|
||
3) Rank the signatures based on their suspiciousness, using statistical criteria. The significant signatures together constitute an explanation E = {s_j}. These signatures can be used collectively in a conjunctive, disjunctive or some other manner to define the explanation.
|
||
4) Apply the explanation E to A, yielding a set of records P of A.
|
||
5) Quantify the effectiveness of an explanation using the size and merit in reducing the statistical distortion of impacted records.
|
||
**** Suspicious Set
|
||
- Given C, this is applied to D to identify A. Identifying A is easy for obvious glitches such as missing values or duplicates. It is, however, complex in more non-trivial cases where the glitches are masked or hidden, such as disguised missing values (in which unknown, inapplicable, or otherwise nonspecified responses are encoded as valid data values; these can arise from poorly designed questionnaires (e.g., inapplicable or ambiguously worded questions), errors made by the interviewer (e.g., omitted questions), or nonresponse by the interview subject (e.g., the subject can't remember or refuses to answer)).
|
||
- If glitch detection is dependent on thresholds, for example outliers, then determining A is more task dependent.
|
||
- Methods for formulating C and determining A are outside the scope of this paper.
|
||
- They just assume C and A clearly specified.
|
||
- Usually |A| << |D|.
|
||
- Let "good" or non-suspicious data $A' = D - A$ (i.e. data not violating the constraint).
|
||
- They need to identify the values $v \in A$ that exhibit different statistical behavior in A and A'.
|
||
**** Propensity Signatures
|
||
- Let $v$ be a value in A. I assume this is any value of any attribute within the records of A, but this is not directly mentioned. To capture the behavior of this value $v$, they propose propensity signatures.
|
||
- Let p_k be the probability of v occurring in attribute C_k of A, then:
|
||
- /The propensity signature of a value v in the set A is the d-dimensional vector $s_A(v) = (p_1, ..., p_k, ..., p_d)$, $k = 1, ..., d$; it captures the propensity of occurrence of v in A./
|
||
- Propensity signatures are also defined for the values of A'; the corresponding probabilities are written with capital P's.
|
||
- Propensity signatures focus on the occurrence of a value across all records and attributes in a sample.
|
||
- We do not know the distributions of v a priori, so they use the empirical estimates of the propensity signatures $\tilde{s}(v)$ to identify the set of suspicious values $V = \{v\}$ whose signatures in the suspicious set A are statistically different from those in the good data A'.
|
||
- For the phone number example: $\tilde{s_{A}(ID_5)} = (1/3,0,0,0,0,2/3)$ and $\tilde{s_{A}(NewHire)} = (0,2/3,0,0,0,0)$. This is likely due to $ID_5$ being in all three of the records, but one of them had this as their own ID while the other $2/3$ had this value as their supervisor ID. Also, $2/3$ of them are "New Hires".
|
||
**** Statistical Significance
|
||
- How is it determined if the propensity signature of a value is statistically significant?
|
||
+ One could compute distances of the propensity signatures of all values in A from the corresponding value signatures in the good set A' and then rank values based on signature distances; the ones with the highest distance could be considered "different". + However, the signatures of different values are not comparable, nor are the distances between them.
|
||
**** Crossover Subsampling
|
||
- An alternative approach to the above problem, is to use re-sampling, where samples are drawn repeatedly, compute propensity signatures of a given value in each sample and construct a sampling distribution of the propensity signatures.
|
||
- From this sampling distribution, we can infer the expected signature of the given value as well as the expected variability in its signature.
|
||
- since all signatures in the sampling distribution pertain to the same value, the question of comparability does not arise.
|
||
- As |A| << |D|, simply using random sampling is not good enough, as we need to ensure that our sampled sets are of the same size.
|
||
- We would like to construct specialised subsamples that share some characteristics of A, in addition to being like-sized
|
||
- Thus Crossover subsampling is introduced.
|
||
- With crossover subsampling, they are guaranteed that every record in A is represented in a specified proportion of the subsamples
|
||
- A q-crossover subsample of size B drawn from two sets D and A \subset D where the size of A is |A| = B, is defined to be a set that contains q proportion of samples from A and the rest from D - A, and every record in A occurs in exactly q proportion of the subsamples.
|
||
- This is constructed as follows (a sketch follows the steps):
|
||
1) partitioning A into b = 1/q chunks of size M = B/b (A has size B), A = A_1 + A_2 + ... + A_b
|
||
2) cross each piece A_i with a random piece of size B - M drawn from D - A_i to create a like-sized sample of size B. + This process is replicated R times, holding A_i fixed, but drawing randomly without replacement from D - A_i. This yields R samples of size B, corresponding to A_i.
|
||
3) The sampling distribution of propensity signatures of each value v in A_i from these R replications corresponding to chunk A_i is denoted $\tilde{F_{A_i}(v)}$.
|
||
4) The estimated signature $\tilde{s_{A}(v)}$ is compared against $\tilde{F_{A_i}(v)}$, establishing whether that particular value has a statistically different pattern of occurrences in A_i.
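- A minimal NumPy sketch of the subsampling scheme above (index-array arguments and the use of D - A as the fill pool are assumptions; the replicates would then feed the per-chunk signature distributions $\tilde{F_{A_i}(v)}$):

#+begin_src python
import numpy as np

def crossover_subsamples(D_idx, A_idx, q=0.25, R=100, seed=None):
    """Yield (chunk_id, sample_indices) pairs for q-crossover subsampling.

    A is split into b = 1/q chunks; each chunk is crossed R times with records
    drawn without replacement from the good pool, giving samples of size B = |A|.
    """
    rng = np.random.default_rng(seed)
    A_idx = np.asarray(A_idx)
    good = np.setdiff1d(np.asarray(D_idx), A_idx)        # the pool D - A
    B = len(A_idx)
    b = int(round(1.0 / q))                              # number of chunks
    for i, chunk in enumerate(np.array_split(rng.permutation(A_idx), b)):
        for _ in range(R):
            fill = rng.choice(good, size=B - len(chunk), replace=False)
            yield i, np.concatenate([chunk, fill])
#+end_src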
|
||
- Each chunk A_i gets to vote on the suspiciousness of value v.
|
||
- A value v in a set A is voted to be suspicious with respect to the empirical sampling distribution $\tilde{F_{A_i}(v)}$ corresponding to chunk A_i of A, if it is statistically different with respect to that distribution. The vote is denoted by the indicator function I_{A_i}(v) which takes the value 1 if significant and 0 otherwise.
|
||
- The voting is repeated with each of the b pieces of A, each chunk yields a vote I_{A_i}(v) for each value v.
|
||
- The informativeness of a value v is measured by the proportion of votes: $K = \sum_{i} I_{A_i}(v)/b$
|
||
- So crossover subsampling process results in total of $T = R \cdot b$ samples of size B and a collection of empirical sampling distributions $\{\tilde{F_{A_i}(v)}\}_{i=1}^b$ corresponding to b chunks.
|
||
**** Testing for statistical significance
|
||
- A value v is flagged if its propensity signature lies outside the chosen error bounds of its corresponding sampling distribution $\tilde{F_{A_i}(v)}$.
|
||
+ These bounds are computed for each attribute
|
||
- Each element of the signature is compared with the corresponding sampling (bootstrap) distribution from the crossover subsamples, and if any element lies outside the bounds (mean ± 2 standard deviations; the 5% and 95% percentiles), it is deemed significant. + For the running example, $\tilde{s_{A}(ID_5)} = (1/3,0,0,0,0,2/3)$ is deemed significant, as the upper bound is $(0.2,0,0,0,0,1.8)$ and $1/3 > 0.2$.
|
||
- These statistically significant propensity signatures are used for explanations.
|
||
*** Glitch Explanations
|
||
- sss = Statistically significant signatures
|
||
- Let the collection of values v in A with statistically significant signatures be $V = (v_1,v_2,..,v_L)$
|
||
- A glitch explanation $E \subseteq V$ is then a collection of values in A that have sss.
|
||
- This is why an explanation of one of the phone numbers was $E = (ID_5, NewHire)$.
|
||
- The size $S(E)$ is the smallest number of informative non-redundant values in the explanation.
|
||
- A threshold on informativeness K can be given, such that $K > \alpha$ for including a value v in an explanation.
|
||
*** Evaluating explanations
|
||
- They measure the effectiveness of an explanation by the statistical distortion that is avoided when the explained data is reclaimed rather than repaired.
|
||
- They define statistical distortion to be the proportion of records that are touched by any data repair.
|
||
+ This is not the best of definitions, but this is not the focus of the article and they deem it is good enough.
|
||
- Let S be the set reclaimed from A; the reduction r in statistical distortion is then given by $r = \frac{|S|}{|A|}$, i.e. the size of the reclaimed set divided by the size of the suspicious set.
|
||
- The merit r of an explanation E is the reduction in statistical distortion caused by reclaiming the records explained by E.
|
||
- So as above, when we reclaimed 2 out of 3: $r = 2/3$.
|
||
*** Constructing propensity signatures
|
||
- Each of the sets A and D - A contains a collection of distinct values. The empirical estimates of the propensity signatures are constructed within a single pass over the data for each distinct value in A and D - A (which can be a very large number of values).
|
||
- Let A have N_A rows (records). Suppose that v occurs n_k times in some column (attribute) C_k of the suspicious set A. Then $\tilde{p_k} = n_k/N_A$ is an empirical estimate of the probability p_k. + These estimates of p_k form the propensity signature, computed for each attribute of the record (see the sketch at the end of this subsection).
|
||
+ Which again explains the 1/3 and 2/3 within the signature of ID_5.
|
||
- This is similar for the estimated propensity signature of v in D - A
|
||
- This is the maximum likelihood estimate
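- A minimal pandas sketch of this estimate (assuming the suspicious set A is available as a DataFrame; the function name is illustrative). For the running example, propensity_signature(A, "ID_5") would reproduce the (1/3, 0, ..., 2/3)-style vector:

#+begin_src python
import pandas as pd

def propensity_signature(records: pd.DataFrame, value) -> pd.Series:
    """Empirical propensity signature: p_k = n_k / N per attribute C_k.

    n_k is the number of records in which `value` occurs in attribute C_k,
    N is the total number of records (maximum likelihood estimate).
    """
    return (records == value).sum(axis=0) / len(records)
#+end_src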
|
||
*** Conclusion
|
||
- Essentially this works well, but at a high cost in terms of computation and runtime.
|
||
- In the slides, the presenters question whether the authors subdivide the suspicious set into subsets that each violate the constraint in the same way (as with the phone numbers, where 3 numbers are divided over 9 records). I think they do, and it becomes apparent in their experiments. (Unless this is what the presenters mean.)
|
||
- Increasing the value of q (q-crossover!) gives fewer bootstrap samples, but increases the proportion of suspicious data within each sample.
|
||
** A framework for outlier description using constraint programming
|
||
- Could consider talking about this for the presentation instead.
|
||
- We wish to explain why outliers are outliers
|
||
- These guys propose a framework based on constraint programming to find an optimal subset of features that most differentiates the outliers and normal instances
|
||
- This framework offers great flexibility in incorporating diverse scenarios arising in practice, such as multiple explanations and human-in-the-loop extensions + Both things that the empirical glitch approach supposedly supports
|
||
*** Introduction
|
||
- Say an automobile company has a lot of recalls. The recalled cars are outliers compared to the working cars, but the important information is WHY they are outliers.
|
||
- *The general outlier description problem*: Given a collection of instances that are deemed normal, N (D-A from the other), and another separate collection deemed outliers, O (A in the other), where instances in both N and O are in the feature space S, find a feature mapping t : S -> F that maximises the difference between O and N according to some outlier property measure m : F -> R and an aggregate objective.
|
||
+ They don't mention what F is. + t and m can be viewed as the description/explanation of the outlying behaviour, where t describes the space where the behaviour is exhibited and m defines what we deem to be an outlier in this space.
|
||
+ They focus on specific t which projects the data to a subspace and use an m that measures the local neighborhood density around each point. + Their objective obj is the difference between the densities surrounding the outliers and normal instances.
|
||
- These guys then use Constraint Programming (CP), which allows the functions t, m and obj to take a wide variety of forms, unencumbered by the limitations associated with mathematical programming.
|
||
*** A Framework using CP Formulation
|
||
- From the general outlier description problem, the essence of outlier description is to search for t and m, that describe what makes the outliers different compared to the inliers.
|
||
- They restrict the feature mapping to selecting a subspace of the feature set.
|
||
- They use the local density criterion for outliers based upon the assumption that a normal instance should have many other instances in proximity whereas an outlier has much fewer neighbors.
|
||
+ Hmm
|
||
- A natural objective in this context is to maximise the difference of numbers of neighbors between normal points and outliers. + A large gap would substantiate the assumption of local density between outliers and normal points.
|
||
- *The Subspace Outlier Description Problem:* Given a set of normal instances N and a set of outliers O in a feature space S, find the tuple $(F, k_N, k_O, r)$ where $k_N - k_O$ is maximised, $F \subset S$ and $\forall x \in N$, $|\mathcal{N}_F(x,r)| \geq k_N$ and $\forall y \in O$, $|\mathcal{N}_F(y,r)| < k_O$, where $\mathcal{N}_F(x,r)$ is the set of instances within radius r of x in subspace F.
|
||
+ So k_N and k_O are neighbour-count thresholds whose difference we wish to maximise: in some subspace of attributes F (which we need to find) and within some radius r, every normal instance must have at least k_N neighbours, while every outlier must have fewer than k_O.
|
||
+ Normal points are locally denser than the outliers and the core of the problem is to find the feature subspace where this actually occurs.
|
||
+ t is characterised by F and zeros out the components of the instances not within the chosen subset of components F. The measure is $m(x) = |\mathcal{N}_F(x,r)|$ (the neighbourhood counts from the constraints) and obj(A,B) = min A - max B
|
||
- They present three different CP optimisation models
|
||
1)(Learning a single outlier description) Direct translation from the subspace outlier description problem. F is a binary vector where the bits that are set in the solution correspond to the best subspace. Users only need to supply the bounds of hyperparameters, k_{max}, k_{min} and r_{max} such that $k_{min} \leq k_O \leq k_N \leq k_{max}$ and $0 \leq r \leq r_{max}$. + minimising the size of the subspace might also be relevant, as a smaller subspace is easier to interpret.
|
||
2)(Outliers in multiple subspaces) Outliers can reside in different (outlying) subspaces. In this setting, there could be multiple reasons/explanations why a point is an outlier. We will then have two sets of feature subspace selectors, F and G. Normal instances must satisfy the dense neighborhood condition in BOTH subspaces, whereas an outlier is an outlier if it is outlying in EITHER F or G. As we have two subspace selectors F and G, we also need two radii, r_F and r_G.
|
||
3)(Human in the loop) Often known outliers are hand labeled, and as a result these labels are considered more accurate. Normal points might, however, violate the normal density conditions, as there might be not-yet-reported outliers within the normal set. As such, in this formulation, points in the normal set are allowed to violate the constraints; these points are then "contentious" points, which can be examined by a human expert. An extra binary vector is added, indicating whether a normal point violates the constraints, and a bound w_{max} is imposed on the number of 1's within this binary vector.
|
||
*** Encoding constraints
|
||
- Simply define some distance function and compute it for ALL points (union of N and O).
|
||
- We can then check the distances from each point in N and each point in O to all other points, and count neighbours to check the densities. The distance function can be something like the Euclidean distance (the L2 norm), if the data allows it.
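- A minimal NumPy/SciPy sketch of this encoding (Euclidean distance assumed, names illustrative); the resulting counts are what the $|\mathcal{N}_F(x,r)| \geq k_N$ and $< k_O$ constraints are checked against:

#+begin_src python
import numpy as np
from scipy.spatial.distance import cdist

def neighbour_counts(X, F_mask, r):
    """|N_F(x, r)| for every point in X, in the subspace selected by F_mask.

    X stacks the normal instances N and the outliers O; F_mask is a 0/1 vector
    over features. Pairwise distances play the role of the z_ij variables.
    """
    Xf = X[:, np.asarray(F_mask, dtype=bool)]    # project onto subspace F
    within = cdist(Xf, Xf) <= r
    np.fill_diagonal(within, False)              # a point is not its own neighbour
    return within.sum(axis=1)
#+end_src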
|
||
*** Complexity of the first CP formulation
|
||
- F, k_N, k_O and r are explicitly defined variables. r must take values from a discrete set; this can be done by specifying a step size s such that $r \in \{0,s,2s,...,r_{max}\}$. This discretisation does not apply to k_N and k_O, as these are natural integers. There is one $z_{ij}$ (the pairwise distance indicator) for each pair of data instances and one constraint to set its value, so $\binom{n}{2}$ of them. Once these are in place, enforcing the number of instances within a neighborhood is a single constraint for each instance, so n of these. We thus have $1 + \binom{n}{2} + n$ constraints, and $p$ (the size of F) $+ 2$ ($k_N$, $k_O$) $+ 1$ ($r$) $+ \binom{n}{2}$ ($z_{ij}$) variables.
|
||
*** Conclusion
|
||
- Scalability is quite poor due to the combinatorial optimisation in general but also the variables needed to encode the problem.
|
||
** Beyond Outlier Detection: LookOut for Pictorial Explanation
|
||
- Provide succinct, interpretable and simple pictorial explanations of outlying behavior in multi-dimensional, real-valued datasets while respecting the limited attention of human analysts.
|
||
- Propose to output a few pictures (so called focus-plots, pairwise feature plots) from a few, carefully chosen feature sub-spaces.
|
||
- Their solution has a plot-selection objective, and their algorithm approximates it with optimality guarantees
|
||
- It scales linearly with the number of input outliers to explain and with the explanation budget
|
||
- Their experiments show that LookOut performs near-ideally in terms of maximising explanation objective on several real datasets, while producing fast, visually interpretable and intuitive results in explaining groundtruth outliers from several real-world datasets.
|
||
*** Introduction
|
||
- It is extremely beneficial to provide explanations for incidents that raise an alert (i.e. are outliers), as these explanations can be used by an expert or analyst, empowering them in sensemaking and reducing their effort in troubleshooting and recovery. (Such as the carmaker example from the previous article)
|
||
- LookOut provides interpretable pictorial explanations through simple, easy-to-grasp focus plots which "incriminate" the given outliers the most.
|
||
- Given outliers from a dataset with real-valued features (NOTE REAL-VALUED!), they aim to find a few 2D plots on which the total "blame" that the outliers receive is maximised. These should be interpretable and succinct: only a few plots are shown to respect the human's attention, but they allow the human to quickly interpret the plots, spot the outliers and verify their abnormality given the discovered feature pairs.
|
||
- LookOut is an algorithm with a plot selection objective which quantifies "goodness" of an explanations and lends itself to monotone submodular function optimisation which is solved efficiently with optimality guarantees.
|
||
- It is domain-agnostic and detector-agnostic (outlier detector)
|
||
- It requires linear time on the number of plots to choose explanations, the number of outliers to explain and the user-specified budget for explanations.
|
||
- These guys believe that the constraint programming guys do not meet several key desiderata for outlier description:
|
||
1) Quantifiable explanation quality
|
||
2) Budget-consciousness towards analysts (but they do, as you can decide the size of the set F slightly)
|
||
3) Visual interpretability (this is true for the CP however, it is a binary vector..)
|
||
4) A scalable descriptor (the CP approach scales poorly here)
|
||
*** Prelims and problem statement
|
||
- Let V be the set of input data points, v \in V come from R^d and n = |V|. d = |F| is the dimensionality of the dataset and F = (f_1, f_2, ..., f_d) is the set of real-valued features. The set of outlying points is A. |A| = k.
|
||
- *Definition of Focus Plots:* Given a dataset of points V, a pair of features $f_x, f_y \in F$ (F is the set of real-valued features) and a set of outliers A, a focus-plot $p \in P$ is a 2d scatter plot of all points with $f_x$ on the x-axis and $f_y$ on the y-axis, drawing attention to the set of maximally explained outliers $A_p \subseteq A$ best explained by this feature pair.
|
||
- So their pictorial outlier explanation is a set of focus-plots, each of which "blames" or "explains away" a subset of the input outliers, whose outlierness is best showcased by the corresponding pair of features. This means they consider $\binom{d}{2}$ spaces, by generating all pairwise feature combinations. Within each 2d space, they score the points in A by their outlierness.
|
||
- Let all the $\binom{d}{2}$ focus plots be denoted P.
|
||
- The goal is to output a small subset S of P on which points in A receive high outlier scores
|
||
*** Proposed Algo
|
||
- Generate focus plots, score them.
|
||
- The maxcover problem is NP hard, so they need to approximate it, when trying to select plots to explain all of the outliers.
|
||
- Their objective is to maximise the total maximum outlier score of each outlier amongst the selected plots: $f(S) = \sum_{a_i \in A} \max_{p_j \in S} s_{i,j}$, so they need to find the S which maximises this.
|
||
- This f function is non-negative, non-decreasing and submodular, so they can use a greedy algorithm with an approximation guarantee. (Submodularity: a set function whose value, informally, has the property that the difference in the incremental value of the function that a single element makes when added to an input set decreases as the size of the input set increases.)
|
||
- Thus, they build a greedy algorithm, which starts with an empty S and greedily adds the plot yielding the largest marginal gain in function value (a sketch follows below).
|
||
- It has a $(1 - 1/e) \approx 63\%$ approximation guarantee.
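- A minimal sketch of that greedy selection (assuming the outlier-by-plot score matrix s_{i,j} has already been computed; names are illustrative):

#+begin_src python
import numpy as np

def lookout_greedy(scores, budget):
    """Greedily pick focus plots maximising f(S) = sum_i max_{j in S} s_ij.

    scores: (num_outliers, num_plots) matrix of outlier scores s_ij.
    The (1 - 1/e) guarantee follows from f being monotone and submodular.
    """
    scores = np.asarray(scores, dtype=float)
    best = np.zeros(scores.shape[0])             # current max score per outlier
    chosen = []
    for _ in range(budget):
        gains = np.maximum(scores, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf                  # never re-pick a plot
        j = int(np.argmax(gains))
        chosen.append(j)
        best = np.maximum(best, scores[:, j])
    return chosen
#+end_src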
|
||
*** Complexity
|
||
- Total time complexity is $O(l \log(n') (k+n') + klb)$ for sample size $n' < n$, and it is sublinear in the total number of input points n.
|
||
- The time complexity of finding the outliers using Isolation Forest (iForest) is $O(t \log(n') (k+n'))$.
|
||
- Computing marginal gain for each unselected plot takes $O(kl)$ time. Finding the maximum among all gains take $O(l)$ via linear scan. This process is repeated $b$ times for a budget of $b$.
|
||
- The total number of plots, $l = \binom{d}{2}$, is quadratic in the number of features.
|
||
*** Discussion
|
||
- Scatter plots were picked since they are easy to understand and interpret. They are also universal, and they show where outliers lie relative to the normal points.
|
||
- Scatter plots were chosen over decision trees, as decision trees become more difficult to interpret at large depths. Decision trees are also not budget conscious.
|
||
- Runtime scales linearly.
|
||
* Explaining Classification
|
||
** "Why Should I Trust You?" - Explaining the Predictions of Any Classifier
|
||
- Machine learning models remain mostly black boxes.
|
||
- Understanding WHY some model predicts what it does is important in assessing trust in the prediction, which is fundamental when discussing whether to deploy a new model or not. + The understanding can also be important when trying to transform an untrustworthy prediction into a trustworthy one.
|
||
- LIME is a novel explanation technique, that explains the predictions of any black-box classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
|
||
- Building on the individual predictions, they also show how to explain a model by presenting representative individual predictions and their explanations in a non-redundant way, framing this task as a submodular optimisation problem (submodular, so it can be solved greedily)
|
||
*** Introduction
|
||
- Machine learning is at the core of many recent advances in science and technology
|
||
- Whether humans are directly using machine learning classifiers as tools, or are deploying models within other products, a vital concern remains: If the users do not trust a model or a prediction, they will not use it.
|
||
- There are two definitions of trust:
|
||
1) Trusting a prediction, i.e. whether the user trusts the individual prediction to take action based on it
|
||
2) Trusting a model, i.e. whether the user trusts a model to behave in reasonable ways if deployed.
|
||
- Trust in predictions is important in decision making, such as when the model is used for medical diagnosis or terrorism detection. In these cases, you can't simply have blind faith in the predictions, as you might not be aware of exactly why the model predicts what it does.
|
||
- You also need to be confident that the model will behave well on real-world data, as opposed to the training data. This is often done with cross validation, but even cross validation does not mean that the model does not pick up on weird things not relevant to the problem.
|
||
- The authors have three major contributions:
|
||
1) LIME - An algorithm that can explain the prediction of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model
|
||
2) SP-LIME - A method that selects a set of representative instances of predictions with their explanation, to address the "trusting the model" problem, via submodular optimisation
|
||
3) Comprehensive evaluation with simulated and human subjects, where they measure the impact of explanations on trust and associated tasks.
|
||
*** The case for explanations
|
||
- Explaining a prediction means presenting textual or visual artifacts that provide a qualitative understanding of the relationship between the instance's components (words in text, patches in an image) and the model's prediction.
|
||
- A doctor will need to understand specifically why a certain prediction is what it is. It is not enough simply to say "sick", without defining why the model believes the person is sick (things such as coughing, headache, so on).
|
||
- Every machine learning application also requires a certain measure of overall trust in the model. Development and evaluation of a classification model often consists of collecting annotated data, of which a held-out subset is used for automated evaluation. This is a useful pipeline for many applications; however, evaluation on validation data may not correspond to performance in the wild, so practitioners may overestimate the accuracy of their models, and this accuracy may come from artifacts completely unrelated to the domain.
|
||
- An example could be the patient ID being heavily correlated with the target class in the training and validation data, which would result in a model placing heavy impact on the patient ID, when released into the wild, yielding a very bad predictor on real data, but a very accurate one on the training data.
|
||
+ This is known as data leakage.
|
||
- Dataset shift is when the training and test distributions are different. + Face recognition algorithms that are trained predominantly on younger faces, yet the dataset has a much larger proportion of older faces in it.
|
||
- Individual predictions can be used to select between models, in conjunction with their accuracy.
|
||
- A practitioner may choose to pick a less accurate model, if the user is aware of why the two models made their decisions.
|
||
**** Desired Characteristics for Explainers
|
||
- An essential criterion for explanations is that they must be interpretable
|
||
+ Provide qualitative understanding between the input variables and the response. + Interpretability must take into account the user's limitations, so a linear model may or may not be interpretable: if hundreds or thousands of features contribute significantly to a prediction, it is not reasonable to expect any user to comprehend why the prediction was made, even if individual weights can be inspected.
|
||
+ This implies that explanations should be easy to understand
|
||
- They should also display local fidelity (faithfulness). It is often impossible for an explanation to be completely faithful unless it is the complete description of the model itself. For an explanation to be meaningful it must at least be locally faithful, so it must correspond to how the model behaves in the vicinity of the instance being predicted. + Local fidelity does not imply global fidelity. Features that are globally important may not be important in the local context and vice versa.
|
||
+ global fidelity does imply local fidelity though.
|
||
- An explainer should also be able to explain any model, thus be model-agnostic. So the original model (the predictor) should be treated as a black-box.
|
||
- It should also provide a global perspective.
|
||
*** Local Interpretable Model-Agnostic Explanations
|
||
**** Interpretable Data Representations
|
||
- Local Interpretable Model-Agnostic Explanations (LIME) + Should identify an interpretable model over the interpretable representation that is locally faithful to the classifier
|
||
- Interpretable explanations need to use a representation that is understandable to humans, regardless of the actual features used by the model.
|
||
- An interpretable representation for text classification is a binary vector indicating the presence or absence of a word, even though the underlying classifier may use much more complex features such as word embeddings.
|
||
+ For image classification, it may be a binary vector indicating whether or not a super pixel is present or absent.
|
||
- They denote $x \in R^d$ as the original representation of an instance being explained and $x' \in \{0,1\}^{d'}$ as the binary vector for its interpretable representation.
|
||
**** Fidelity-Interpretability Trade-off
|
||
- An explanation is defined as a model $g \in G$, where G is a class of potentially interpretable models, such as linear models, so a model $g \in G$ can be readily presented to the user with visual or textual artifacts.
|
||
- The domain of $g$ is $\{0,1\}^{d'}$, so $g$ acts over absence/presence of the interpretable components. Not every $g \in G$ may be simple enough to be interpretable, so $\Omega(g)$ is a measure of complexity (as opposed to interpretability) of $g$ and it may be the amount of non-zero weights of a linear model.
|
||
- Let the model being explained be denoted $f : R^d -> R$. In classification, f(x) is the probability or a binary indicator that x belongs to a certain class.
|
||
- $\pi_x(z)$ is defined as a proximity measure between an instance $z$ to $x$, so it defines locality around x.
|
||
- $L(f,g,\pi_x)$ is a measure of how unfaithful g is in approximating f in the locality defined by $\pi_x$.
|
||
- To ensure both interpretability and local fidelity, we must minimise this $L$ while keeping $\Omega(g)$ low enough to be interpretable by humans + Naturally, if $\Omega(g)$ is allowed to be high, g can be complex, increasing the fidelity of g in approximating f.
|
||
- $E(x) = argmin_{g \in G} L(f,g,\pi_x) + \Omega(g)$
|
||
- The authors focus on sparse linear models as explanations, and on performing the search for an explanation via perturbations.
|
||
**** Sampling for Local Exploration
|
||
- Done via perturbations
|
||
- The locality-aware loss $L(f,g,\pi_x)$ should be minimised without making any assumptions on $f$, since the explainer should be model-agnostic.
|
||
- To learn the local behavior of f as the interpretable inputs vary, they approximate $L(f,g,\pi_x)$ by drawing samples weighted by $\pi_x$.
|
||
- Instances are sampled around $x'$ (the binary vector for the interpretable representation) by drawing non-zero elements of x' uniformly at random. Then, given a perturbed sample $z' \in \{0,1\}^{d'}$, they recover the sample in the original representation $z \in R^d$ and obtain $f(z)$, which is then used as a label for the explanation model.
|
||
- Given this dataset Z of perturbed samples with the associated labels, they optimise the E(x) function to get an explanation.
|
||
- So they sample instances both in the vicinity of x, which will have high weight due to the proximity measure $\pi_x$, as well as far away, which will have low weight. So even if the original model is too complex, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by $\pi_x$.
|
||
**** Sparse Linear Explanations
|
||
- G is the class of linear models, so $g(z') = w_g \cdot z'$
|
||
- The locally weighted square loss is used as L and $\pi_x(z) = exp(-D(x,z)^2 / \sigma^2)$ is an exponential kernel defined on some distance function, such as the L2 distance, with width \sigma. + So $L(f,g,\pi_x) = \sum_{z, z' \in Z} \pi_x(z) (f(z) - g(z'))^2$
|
||
- For text classification, they let the interpretable representation be a bag of words; by setting a limit K on the number of words, $\Omega(g)$ describes the number of words, and this has to be less than K.
|
||
+ This article leaves K as a constant value
|
||
- \Omega(g) is the same for image classification, where they use "super-pixels" instead of words, so the interpretable representation of an image is a binary vector where 1 indicates the original super-pixel is present and 0 indicates a grayed-out super-pixel. + Note that this choice of \Omega(g) makes solving the E(x) function intractable, but it is approximated by selecting K features using Lasso and then learning the weights via least squares (K-Lasso)
|
||
- Their individual prediction algorithm produces an explanation for an individual prediction, and as such the complexity does not depend on the size of the dataset, but rather on the time to compute f(x), as this is done for each perturbed sample (of which there are N).
|
||
+ Explaining random forests of 1000 trees and N=5000 samples takes 3 seconds. Explaining each prediction of some image classification network takes around 10 minutes.
|
||
- Any choice of interpretable representation and G will have inherent drawbacks
|
||
1) The underlying model can be treated as a black-box, but certain interpretable representations will not be powerful enough to explain certain behaviours. + A model predicting sepia-toned images to be retro cannot be explained by presence or absence of super pixels as it's the entire tone of the image!
|
||
2) The choice of G, sparse linear models, means that if the underlying model is highly non-linear even in local predictions, there may not exist a faithful explanation. To remedy this, the faithfulness of the explanation on Z can be estimated and presented to the user
|
||
*** Two examples
|
||
- Won't really cover, but it's a text classification using SVMs and deep networks for images.
|
||
- We note that, for text classification, irrelevant email header information was used to make the classification, which is nonsense in that context.
|
||
*** Submodular Pick for Explaining Models
|
||
- An explanation for a single prediction does give some understanding into the reliability of the classifier to the user, but it is not sufficient to evaluate the model as a whole.
|
||
- Thus, they propose to give a global understanding of the model by explaining a set of individual instances. This process is still model agnostic, as the individual explanations are.
|
||
- These individual explanations need to be selected in a clever way, as users do not have time to sift through a large number of them.
|
||
- Define a time/patience budget B, that denotes the number of explanations humans are willing to look at.
|
||
- Given some set of instances X, they define a pick step as the task of selecting B instances for the user to inspect.
|
||
- This pick step should take into account the explanations that accompany predictions and it should pick a diverse, representative set of explanations to show the user, rather than just help the user pick these themselves.
|
||
- Given the set of explanations for a set of instances X (|X| = n), an n x d' explanation matrix W is constructed that represents the local importance of the interpretable components for each instance. When using linear models as explanations, for an instance x_i and explanation g_i = E(x_i), they set $W_{ij} = |w_{g_i j}|$. For each component (column) j in W, they denote by $I_j$ the global importance of that component in the explanation space. We want I such that features that explain many different instances have higher importance scores; for text applications they set $I_j$ to be the square root of the number of instances containing that feature. The pick should also avoid instances with similar explanations.
|
||
- Thus, the final picking function should seek to pick the instances displaying the most important features, while avoiding redundancy and picking as few as possible to explain all features.
|
||
- This is NP Hard as it is a weighted coverage function. But as their coverage function is submodular, they can use it to approximate greedily.
|
||
+ It approximates with an approximation guarantee of $1 - 1/e$.
|
||
*** Individual prediction algo
|
||
- Requires a classifier f, number of samples N, some instance x as well as its interpretable version x', the similarity kernel \pi_x and the length of the explanation K.
|
||
1) Z <- {}
|
||
2) for $i \in [1,N]$ do:
|
||
3) $z'_i <- sample_around(x')$
|
||
4) $Z <- Z \cup \{(z'_i, f(z_i), \pi_x(z_i))\}$, where $z_i$ is $z'_i$ mapped back to the original representation
|
||
5) end for
|
||
6) w <- K-Lasso(Z,K) (with the $z'_i$ as features and $f(z_i)$ as target)
|
||
7) return w
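- A compact Python sketch of the loop above (the helper to_original, which maps a perturbed binary vector back to the model's input space, and the Lasso regularisation strength are assumptions; scikit-learn is used for the K-Lasso step):

#+begin_src python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def lime_explain(f, x_prime, to_original, sigma=0.75, N=5000, K=5, seed=None):
    """Sparse linear explanation of f around x (K-Lasso sketch)."""
    rng = np.random.default_rng(seed)
    x_prime = np.asarray(x_prime)
    # Perturb: randomly switch off interpretable components of x'.
    Z_prime = rng.integers(0, 2, size=(N, len(x_prime))) * x_prime
    labels = np.array([f(to_original(z)) for z in Z_prime])        # f(z_i)
    dist = np.linalg.norm(Z_prime - x_prime, axis=1)
    weights = np.exp(-dist ** 2 / sigma ** 2)                      # pi_x(z_i)
    # K-Lasso: select K features with Lasso, refit weighted least squares on them.
    lasso = Lasso(alpha=0.01).fit(Z_prime, labels, sample_weight=weights)
    top_k = np.argsort(np.abs(lasso.coef_))[-K:]
    lin = LinearRegression().fit(Z_prime[:, top_k], labels, sample_weight=weights)
    return dict(zip(top_k.tolist(), lin.coef_))                    # feature -> weight
#+end_src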
|
||
*** Explaining Models
|
||
- Requires instances X and budget B
|
||
- For each instance x_i, it computes the individual explanation, obtaining the weights that form the explanation matrix and indicate which features are present or not
|
||
- For each interpretable feature, it then computes that feature's importance over the explanations of the instances
|
||
- It then greedily adds instances to the covering set using the covering function c.
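- A minimal sketch of this coverage-based pick step (assuming W holds non-negative feature weights and the global importance is the square root of the summed column weights, as for the text case above):

#+begin_src python
import numpy as np

def submodular_pick(W, budget):
    """Greedily pick up to `budget` instances covering important features.

    W: n x d' explanation matrix; the gain of an instance is the total global
    importance I_j of the features it covers that are not yet covered.
    """
    W = np.asarray(W, dtype=float)
    importance = np.sqrt(W.sum(axis=0))                  # I_j
    covered = np.zeros(W.shape[1], dtype=bool)
    picked = []
    for _ in range(min(budget, len(W))):
        gains = np.array([importance[(row > 0) & ~covered].sum() for row in W])
        gains[picked] = -np.inf                          # never re-pick an instance
        i = int(np.argmax(gains))
        picked.append(i)
        covered |= W[i] > 0
    return picked
#+end_src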
|
||
*** Experiments
|
||
- They both simulate human experiments and perform actual human experiments.
|
||
- These all just show that their algorithm works well.
|
||
- It is a bit wonky that they use decision trees and such within their experiments, but never explain how this is achieved.
|
||
- Furthermore it is a bit wonky how they compare their algorithm to others, as they engineer all the data.
|
||
** Learning Credible Models
|
||
- A model should be capable of providing reasons for its predictions, so it must be interpretable.
|
||
- If the model's reasoning does not conform with well-established knowledge, then the model may be interpretable, but lack credibility.
|
||
- These guys define credibility in the linear setting and focus on techniques for learning models that are both accurate and credible.
|
||
- They propose a regularisation penalty called expert yielded estimates (EYE), that incorporates expert knowledge about well-known relationships among covariates and the outcome of interest.
|
||
- Models learned using the EYE penalty are significantly more credible than those learned using other penalties.
|
||
*** Introduction
|
||
- In health care, decision trees are preferred among physicians because of their high level of interpretability.
|
||
- Interpretability might not be enough: if the reasons provided by the model do not agree, at least in part, with well-established domain knowledge, practitioners may be less likely to trust and adopt the model.
|
||
- LASSO encourages sparsity in the learned feature weights, but in doing so it may end up selecting features that are merely associated with the outcome rather than those that are known to affect the outcome.
|
||
- A credible model is an interpretable model that:
|
||
1) Provides reasons for its predictions that are, partly, inline with well-established domain knowledge
|
||
2) Does no worse than other models in terms of predictive performance
|
||
- The model should only agree with the well-established knowledge if it is consistent with the data. + This is because relying on domain expertise alone would defeat the purpose of data-driven algorithms and could result in worse performance.
|
||
- Definition of credibility is subjective, but these guys try to formalise it.
|
||
- Their proposed approach leverages domain expertise regarding known relationships between the set of covariates and the outcome. This domain expertise is used to guide the model in selecting among highly correlated features, while encouraging sparsity.
|
||
- They propose a general regularisation technique that aims to increase credibility without decreasing performance.
|
||
*** Proposed Approach
|
||
**** Definition and Notation
|
||
- Interpretability is a prerequisite for credibility.
|
||
- For linear models, interpretability is often defined as sparsity in the feature weights.
|
||
- The set of features is defined as D.
|
||
- Some domain expertise identifies $K \subseteq D$, a subset of the features as known or believed to be important.
|
||
- So among a group of correlated features, a credible model will select those in K if the relationship is consistent with the data.
|
||
- Consider the following toy example with a group C of two correlated features (|C| = 2), where one of the features has been identified by the expert as being in K, while the other has not. One could arbitrarily select among these two correlated features, including only one in the model. To increase credibility, they encourage the model to select the known feature. This is captured in the formal definition of when a linear model is credible.
|
||
- A credible model is assumed to be sparse, as the expert knowledge is assumed to be sparse. Credible models will result in dense weights among the known features, if the expert knowledge provided is indeed supported by the data.
|
||
**** The expert yielded estimates (EYE) penalty
|
||
- The naive approach would be to constrain the weights of known relevant factors with an L2 norm, which maintains a dense structure, and to use an L1 norm for the D - K features, which maintains a sparse structure (a sketch of this naive penalty follows at the end of this subsection).
|
||
- Due to sensitivity to the choice of hyperparameters for the naive solution, they propose the EYE penalty, which is obtained by fixing a level curve of q and scaling it for different contour levels.
|
||
- Which essentially just has to make sure that the expert knowledge is not weighted too heavily
|
||
- EYE is beta free, where beta controls the trade-off between known and unknown features (and is a part of the naive way)
|
||
- EYE is a generalisation of both LASSO and the l_2 norm. Setting r = 1 and r = 0 (r being an indicator vector, used in the minimisation problem $\tilde{\Theta} = argmin_{\Theta} L(\Theta, X, y) + n \lambda J(\Theta, r)$, that marks which features are selected by the domain expert), they recover the l2 norm and LASSO penalties, respectively.
|
||
- EYE promotes sparse models
|
||
- EYE favors solutions $\tilde{\Theta}$ that are sparse in the weights of features in D but not in K, and dense in the weights of features in K.
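- A minimal sketch of the naive penalty mentioned above (not the EYE penalty itself, whose exact form the notes do not give); beta is exactly the trade-off hyperparameter that EYE removes:

#+begin_src python
import numpy as np

def naive_expert_penalty(theta, r, beta=0.5):
    """L2 on expert-known features (dense), L1 on the rest (sparse).

    r is the 0/1 indicator vector over features: r_j = 1 if feature j is in K.
    """
    theta, r = np.asarray(theta, dtype=float), np.asarray(r)
    known, unknown = theta[r == 1], theta[r == 0]
    return beta * np.sum(known ** 2) + (1 - beta) * np.sum(np.abs(unknown))
#+end_src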
|
||
*** Experiments
|
||
- Measuring credibility: the criteria are density in the set of known relevant features (K) and sparsity in the set of unknown features (D \ K), together with maintained classification performance.
|
||
*** Conclusion
|
||
- Their incorporation of expert knowledge results in increased credibility, encouraging model adoption, while maintaining model performance.
|
||
- Through experiments on synthetic data, they showed that sparsity-inducing regularisation such as LASSO does not always produce credible models. In contrast, EYE produces a model that is provably credible in the least squares regression setting.
|
||
- EYE produced a model that was significantly better at highlighting known important factors while being comparable in terms of predictive performance with other regularisation techniques, when applied to two large-scale patient risk tasks.
|
||
- EYE does not lead to worse performance when the expert is wrong (this is ensured by not weighting the features from K too heavily)
|
||
- They focused on the linear setting and one form of expert knowledge, which is a limitation!
|
||
- They do not claim that EYE is the optimal approach to yield credibility.
|
||
|
||
** Towards Explanation of DNN-based Prediction with Guided Feature Inversion
|
||
- Deep neural networks (DNN) have become an effective computational tool, but their prediction results are often criticised for a lack of interpretability, which is essential.
|
||
- Existing attempts based on local interpretations aim to identify relevant features contributing the most to the prediction of the DNN by monitoring the neighborhood of a given input (LIME!)
|
||
- These usually ignore the intermediate layers of the DNN (yes, they just focus on input/output, as they wish to be completely model agnostic). These might however contain rich information for interpretation.
|
||
- These guys propose to investigate a guided feature inversion framework for taking advantage of the deep architectures towards effective interpretation.
|
||
- Their proposed framework does not only determine the contribution of each feature in the input, but also provides insights into the decision-making process of DNN models.
|
||
- They further interact with the neuron of the target category at the output layer of the DNN, enforcing the interpretation result to be class-discriminative.
|
||
*** Introduction
|
||
- DNN models may learn biases from the training data, in which case intepretability can be used to debug (LIME does this)
|
||
- Existing interpretation methods focus on two types of interpretations, model-level and instance-level.
|
||
1) Model-level focus on finding a good prototype in the input domain that is interpretable and can represent the abstract concept learned by a neuron or a group of neurons of the DNN.
|
||
2) Instance-level targets answering which features of an input cause it to activate the DNN neurons to make a specific prediction (LIME)
|
||
- Instance-level methods usually follow the idea of local interpretation. + Let x be an input to a DNN; the prediction of the DNN is denoted as a function f(x). By monitoring the prediction response of f around the neighborhood of a given point x, the features in x which cause a larger change of f are treated as more relevant to the final prediction.
|
||
- These methods tend not to use the intermediate layers.
|
||
- It has been shown that some inputs can trick the DNN into making unexpected outputs, which undermines a purely input/output black-box view, so looking at intermediate layers helps here.
|
||
- Feature inversion (feature inversion aims to map the feature generated at any layer of a DNN back to a plausible input; each layer in a DNN maps an input feature to an output feature, and in the process ignores the input content that does not seem relevant to the classification task) has been studied for visualising and understanding intermediate feature representations of DNNs.
|
||
- the inversion results indicate that as the information propagates from the input layer to the output layer, the DNN classifier gradually compresses the input information while discarding information irrelevant to the prediction task.
|
||
- Inversion results from a specific layer also reveal the amount of information contained in that layer.
|
||
- These guys propose an instance-level DNN interpretation model by performing guided image feature inversion, leveraging the observation from their preliminary experiments that the higher layers of a DNN do capture the high-level content of the input as well as its spatial arrangement. They present guided feature reconstructions to explicitly preserve the object localisation information in a "mask", so as to provide insight into what information is actually employed by the DNN for the prediction.
|
||
- They establish connections between input and the target object by fine-tuning the interpretation result obtained from guided feature inversions.
|
||
- They show that the intermediate activation values at higher convolutional layers of DNN are able to behave as a stronger regulariser.
|
||
*** Interpretation of DNN-Based Prediction
|
||
- The main idea of their proposed framework is to identify the image regions that simultaneously encode the location information of the target object and match the feature representation of the original image.
|
||
- Let c be the target object (in classification, the target output, the one we want to hit), that they want to interpret, and x_i corresponds to the i'th feature, then the interpretation for x is encoded by a score vector s \in R^d where each score element s_i \in [0,1] represents how relevant that feature is for explaining f_c(x). The input vector x_a corresponds to the pixels of an image and the score s will be a saliency map (or attribution map) where the pixels with higher scores represents higher relevance for the classification task.
|
||
- It has been studied that the deep image representation extracted from a layer in a CNN can be inverted to a reconstructed image which captures the property and invariance encoded in that layer.
|
||
- Feature inversion can reveal how much information is preserved in the feature at a specific layer.
|
||
- These inversions reveal that at the early layers much of the information from the original image is still preserved, but at the last layers only the rough shapes are, so CNNs gradually filter out unrelated information for the classification task, so we are interested in looking at the early layers, to provide explanation for classification results.
|
||
- Given a pre-trained CNN model with L layers, let the intermediate feature representation at layer $l \in \{1,2,...,L\}$ be denoted as a function $f^l(x_a)$ of the input image x_a. They then need to compute the approximated inversion $f^{-1}$ of the representation $f^{l_0}(x_a)$.
|
||
- So we know early on which things are somewhat focused on, but we won't know until the later layers, with confidence, which part of the input information is ultimately preserved for final prediction. They use regularisation to ensure this.
|
||
- they can compute the contributing factors in the input as the pixel-wise difference between x_a and x^*, the optimal inversion result which is obtained from gradient descent, getting a saliency map s. This is not feasible however, as the saliency map is noisy.
|
||
+ Note: A saliency map is an image that shows each pixel's unique quality. So it should simplify and/or change the representation of an image into something that is more meaningful and easier to analyse.
|
||
- To tackle the problem of the noisy saliency map, they instead propose the guided feature inversion method, where the expected inversion image representation is reformulated as the weighted sum of the original image x_a and another noise background image p.
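- In symbols, the candidate input is roughly $m \odot x_a + (1 - m) \odot p$; a tiny NumPy sketch of that blend (names illustrative, the optimisation of m and the regularisers are omitted):

#+begin_src python
import numpy as np

def guided_input(x, p, m):
    """Blend the original image x with a noise/background image p via mask m.

    Mask entries near 1 keep the original pixels, entries near 0 fall back to p;
    optimising m (with regularisation) yields the saliency mask.
    """
    m = np.clip(m, 0.0, 1.0)
    return m * x + (1.0 - m) * p
#+end_src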
|
||
- They generate the mask m to illustrate which objects are of importance. It is largely generated from the early layers, and as a result it may capture multiple objects if multiple objects are in the foreground. + This is fixed by strongly activating the softmax probability value at the last layer L of the CNN for a given target c and reducing the activation for other classes. They can then filter out irrelevant information with respect to the target class c, including image background and foreground objects of other classes.
|
||
- This might still generate undesirable artifacts without regularisation imposed on the optimisation process. They propose to carefully design the regularisation term of the mask m to overcome the artifact problem, imposing a stronger natural image prior by utilising the intermediate activation features of the CNN. (Essentially, the issue is that the mask m might highlight nonsense and random artifacts. If they use the early layers, which are very responsive to the different target objects, they have a higher chance of highlighting the proper things.)
|
||
*** Their algo
- So they first generate the mask. This mask is used to blur out anything not relevant to what we wish to predict (the mask is a heatmap, so anything not covered by the heatmap is blurred out). The blurred-out image is then run through the predictor CNN, using the weights obtained in the mask step together with the class-discriminative loss.
- They use class-discriminative interpretation (the softmax term) such that when an image contains two objects, say an elephant and a zebra, they can run the prediction network twice, once per detected class.
- To make it more robust and avoid the mentioned artifacts, the activations of intermediate layers are included in the feature inversion method, by defining the mask m as a weighted sum of the channels at a specific layer l_1.
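A minimal sketch of that last step (assuming the activation at layer l_1 has shape (C, H, W); the sigmoid squashing is an assumption to keep the mask in [0, 1]):
#+begin_src python
import torch

def mask_from_channels(feats_l1, channel_weights):
    """Build the mask m as a weighted sum over the channels of the layer-l_1
    activation. feats_l1: (C, H, W); channel_weights: (C,), learnable."""
    m = (channel_weights.view(-1, 1, 1) * feats_l1).sum(dim=0)  # (H, W)
    # squash to [0, 1]; in practice the mask would typically be upsampled
    # to the input resolution before being applied to the image
    return torch.sigmoid(m)
#+end_src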
* Indexing Methods
** Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores
- Adaptive indexing is characterised by the partial creation and refinement of the index as side effects of query execution.
- Dynamic or shifting workloads may benefit from preliminary index structures focused on the columns and specific key ranges actually queried - without incurring the cost of full index construction.
- The costs and benefits of adaptive indexing techniques should be compared in terms of initialisation costs, the overhead imposed upon queries and the rate at which the index converges to a state that is fully refined for a particular workload component.
- These guys seek a hybrid technique between database cracking and adaptive merging, which are two techniques for adaptive indexing.
- Adaptive merging has a relatively high init cost, but converges rapidly, while database cracking has a low init cost but converges rather slowly.
*** Introduction
- Current index selection tools rely on monitoring database requests and their execution plans, then invoking creation or removal of indexes on tables and views.
- In the context of dynamic workloads, such tools tend to suffer from the following three weaknesses:
1) The interval between monitoring and index creation can exceed the duration of a specific request pattern, so there is no benefit from the change
2) Even if it does not exceed this duration, there is no index support during the interval: data access during the monitoring interval neither benefits from nor aids index creation efforts, and the eventual index creation imposes an additional load that interferes with query execution
3) Traditional indexes on tables cover all rows equally, even if some rows are needed often and some never
- The goal is to enable incremental, efficient adaptive indexing, i.e. index creation and optimisation as side effects of query execution, with the implicit benefit that only the tables, columns and key ranges truly queried are optimised.
- Use two measures to characterise how quickly and efficiently a technique adapts index structures to a dynamic workload:
1) The init cost incurred by the first query
2) The number of queries that must be processed before a random query benefits from the index structure without incurring any overhead
- The first query captures the worst-case costs and benefits of adaptive merging, which is why it is focused on.
- The more often a key range is queried, the more its representation is optimised. Columns that are never queried are not indexed, and key ranges that are not queried are not optimised.
- Overhead for incremental index creation is minimal and disappears once a range has been fully optimised.
- Draw the graph showing where adaptive merging is expensive to begin with but converges quickly, where database cracking is cheap to begin with but converges slowly, and where the two hybrids sit. (The good and the bad)
- This paper provides the first detailed comparison between these two techniques (merging and cracking).
- Most previous approaches to runtime index tuning are non-adaptive: index tuning and query processing operations are independent of each other. They monitor the running workload and then decide which indexes to create or drop based on the observations, both of which have an impact on the database workload. Once a decision is made, it affects ALL key ranges in an index. Since some data items are more heavily queried than others, the concept of partial indexes arose.
- Soft indexes can be seen as adaptive indexing. They continually collect statistics for recommended indexes and then periodically and automatically solve the index selection problem. Like adaptive indexing, indexes are picked based on query processing; unlike adaptive indexing, they are not incremental, so each recommended index is created and optimised to completion.
- Adaptive indexing and approaches that monitor queries and then build indexes (the previous approaches mentioned above) are mutually compatible. Policies established by the monitor-and-tune techniques could provide information about the benefit and importance of different indexes, and adaptive indexing could then create and refine the recommended index structures while minimising the additional workload.
**** Database Cracking
- Combines features of automatic index selection and partial indexes.
- Reorganises data within the query operators, integrating the reorganisation effort into query execution. When a new column is queried by a predicate for the first time, a new cracker index is initialised. As the column is used in the predicates of further queries, the cracker index is refined by range partitioning until sequentially searching a partition is faster than binary searching in the AVL tree that guides a search to the appropriate partition.
- Keys in a cracker index are partitioned into disjoint key ranges, but left unsorted within each partition. Each range query analyses the cracker index, scans the key ranges that fall entirely within the query range and uses the two end points of the query range to further partition the appropriate two key ranges. So each partitioning step will create two new sub-partitions; a range is partitioned into 3 if both end points fall into the same key range, which happens in the first partitioning step.
- So essentially, for each query, the data is partitioned into subsets of ranges, without any of them ever being sorted. If you keep track of which subsets hold which key ranges, you can easily answer queries, as you can skip checking most of them.
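A minimal Python sketch of the cracking idea on a single integer column (simplified and in-memory; illustrative only, not the paper's MonetDB implementation):
#+begin_src python
import bisect

class CrackedColumn:
    """The column is one array; `keys`/`offsets` record crack boundaries so
    that all values < keys[i] are stored before offsets[i]. Each range query
    cracks only the pieces containing the query's end points, so the column
    becomes incrementally more ordered as a side effect of querying."""

    def __init__(self, values):
        self.data = list(values)
        self.keys = []      # sorted crack keys
        self.offsets = []   # offsets[i]: first position holding values >= keys[i]

    def _piece(self, key):
        """Return (start, end) of the piece that would contain `key`."""
        i = bisect.bisect_right(self.keys, key)
        start = self.offsets[i - 1] if i > 0 else 0
        end = self.offsets[i] if i < len(self.offsets) else len(self.data)
        return start, end

    def _crack_on(self, key):
        """Partition the piece containing `key` so that values < key come first."""
        if key in self.keys:
            return
        start, end = self._piece(key)
        piece = self.data[start:end]
        lo = [v for v in piece if v < key]
        hi = [v for v in piece if v >= key]
        self.data[start:end] = lo + hi
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.offsets.insert(i, start + len(lo))

    def range_query(self, low, high):
        """Answer low <= v < high; cracking on both bounds is the side effect."""
        self._crack_on(low)
        self._crack_on(high)
        start, _ = self._piece(low)   # first position with values >= low
        end = self.offsets[bisect.bisect_left(self.keys, high)]
        return self.data[start:end]   # after cracking, the result is contiguous
#+end_src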
**** Adaptive Merging
- Database cracking functions like an incremental quicksort, with each query forcing one or two partitioning steps.
- Adaptive merging functions as an incremental merge sort, where one merge step is applied to all key ranges in a query's result.
- The first query to use a given column in a predicate produces sorted runs, and each subsequent query upon that same column applies at most one additional merge step.
- Each merge step only affects those key ranges that are relevant to actual queries, leaving records in all other key ranges in their initial places.
- This merge logic takes place as a side effect of queries.
- So the first query triggers the creation of the sorted runs, loading the data into equally sized partitions and sorting each in memory. It then retrieves the relevant values (via index lookup, because the runs are sorted) and merges them out of the runs and into a final partition. The same happens for a second query, where the query's results are merged out of the runs and into the final partition. Subsequent queries continue to merge results out of the runs until the final partition has been fully optimised for the current workload. (The final partition being the one containing the relevant values in sorted order.)
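A matching sketch of adaptive merging under the same simplified setting (illustrative; run size and the in-memory layout are assumptions, not the paper's code):
#+begin_src python
import bisect

class AdaptiveMerging:
    """Toy adaptive merging on one integer column: the first query pays for
    sorted runs, later queries merge only the queried key ranges out of the
    runs and into a sorted final partition."""

    def __init__(self, values, run_size=1024):
        # first query's up-front cost: cut the data into runs and sort each one
        self.runs = [sorted(values[i:i + run_size])
                     for i in range(0, len(values), run_size)]
        self.final = []   # sorted final partition: every key range merged so far

    def range_query(self, low, high):
        """Answer low <= v < high; merging that range out of the runs and into
        the final partition is the side effect."""
        moved = []
        for run in self.runs:
            lo = bisect.bisect_left(run, low)
            hi = bisect.bisect_left(run, high)
            moved.extend(run[lo:hi])
            del run[lo:hi]               # these values leave the run for good
        if moved:
            self.final = sorted(self.final + moved)
        lo = bisect.bisect_left(self.final, low)
        hi = bisect.bisect_left(self.final, high)
        return self.final[lo:hi]
#+end_src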
*** Hybrid Algos
- Database cracking converges slowly, since at most two new partition boundaries are generated per query, meaning that the technique can require thousands of queries to converge on an index for the focus range.
- The first query is very expensive for adaptive merging though, as it has to pay for creating the initial sorted runs.
- The small number of queries adaptive merging needs to have a key range fully optimised is due to:
1) Merging with a high fan-in rather than partitioning with a low fan-out of two or three
2) Merging a query's entire key range rather than only dividing the two partitions holding the query's boundary keys
- The difference in the cost of the first query is simply due to the cost of sorting the initial runs.
- So the goal is to merge the best qualities of adaptive merging and database cracking.
- They strive to maintain the lightweight footprint of cracking, which imposes a minimal overhead on queries, and at the same time quickly achieve query performance comparable to fully sorted arrays or indexes, as adaptive merging manages to achieve.
**** Data Structures
- Each logical column in their model is represented by multiple pairs of arrays containing row identifiers and key values. Two data structures organise these pairs of arrays. All tuples are initially assigned to arbitrary unsorted "initial partitions". As a side effect of query processing, tuples are then moved into "final partitions" representing merged ranges of key values. Once all data has been consumed from an initial partition P, P is dropped. These are like adaptive merging's run and merge partitions, except that the key values are not necessarily sorted, plus the whole architecture has been redesigned for column-stores.
- Each initial partition uses a table of contents to keep track of the key ranges it contains, and a single master table of contents - the adaptive index itself - keeps track of the contents of both the initial and the final partitions. Both tables are updated as key value ranges are moved from the initial to the final partitions.
- The data structure and physical organisation used are those of partial sideways cracking. The final partitions respect the architecture of the article defining partial sideways cracking, so that its techniques for complex queries, updates and partial materialisation can be reused.
**** Select Operator
- As with database cracking, each of the hybrids here results in a new select operator (??)
- The input for a select operator is a single column and a filtering predicate, while the output is a set of rowIDs.
- In the case of adaptive indexing, all techniques collect all qualifying tuples for a given predicate in a contiguous area. Thus, they can return a view of the result of a select operator over the adaptive index. (A view is a database concept.)
**** Complex queries
- The qualifying rowIDs can be used by subsequent operators in a query plan.
- Their hybrid maintains the same interface and architecture as sideways cracking, which enables complex queries.
- The main idea is that the query plan uses a new set of operators that include steps for adaptive tuple reconstruction, to avoid the random access caused by the reorganisation steps of adaptive indexing.
- As such, these guys just focus on the select operator.
*** Strategies for organising partitions
- Hybrid algos follow the same general strategy as the implementation of adaptive merging, while trying to mimic cracking-like physical reorganisation steps that result in crack columns in the sideways cracking form.
- The first query on each column splits the column's data into initial partitions that each fit in memory. As queries are then processed, qualifying key values are moved into the final partitions.
- Tables of contents and the adaptive index are updated to reflect which key ranges have been moved into the final partitions, so that subsequent queries know which parts of the requested key ranges to retrieve from final partitions and which from initial partitions.
- The hybrid algos differ from original adaptive indexing and from each other in how and when they incrementally sort the tuples in the initial and final partitions. They consider three different ways of physically reordering tuples in a partition:
1) Sorting
2) Cracking
3) Radix clustering (?)
**** Sorting
- Fully sorting initial partitions upon creation comes at a high up-front investment (this is what adaptive merging does).
- Fully sorting a final partition is typically less expensive, as the amount of data to be sorted at a single time is limited to the query's result.
- The gain from exploiting sorting is fast convergence to the optimal state.
- Adaptive merging uses sorting for both the initial and final partitions.
**** Cracking
- Database cracking comes at a minimal investment, as it performs at most two partitioning steps in order to isolate the requested key range for a given query.
- Subsequent queries exploit past partitioning steps and need to crack progressively smaller and smaller pieces to refine the ordering.
- Contrary to sorting, this has very low overhead but very slow convergence.
- In the hybrids, if the query's result is contained in the final partitions, then the overhead is small, as only one or at most two partitions need to be cracked. If the query requires new values from the initial partitions, then potentially every initial partition needs to be cracked, causing overhead.
- For the hybrids, they have redesigned the cracking algorithms such that the first query in a hybrid that cracks the initial partitions performs the cracking and the creation of the initial partitions in a single monolithic step, as opposed to a copy step first and then a crack step.
**** Radix Clustering
- A light-weight single-pass "best effort" (they do not require equally sized clusters) radix-like range-clustering into 2^k clusters, as follows:
+ Given the smallest (v_) and largest (v^) value in the partition, they assume an order-preserving injective function f : [v_, v^] -> N_0, with f(v_) = 0, that assigns each value v \in [v_, v^] a numeric code c \in N_0. They use f(v) = v - v_ for v \in [v_, v^] \subseteq Z, and f(v) = A(v) - A(v_) for characters v_, v, v^ \in {'A', ..., 'Z', 'a', ..., 'z'}, where A() yields the character's ASCII code.
- With this they perform a single radix-sort step on c = f(v) using the k most significant bits of c^ = f(v^), i.e. the result cluster of a value v is determined by those k bits of its code c that sit at the positions of the k most significant bits of the largest code c^. By investing in an extra initial scan over the partition to count the actual bucket sizes, they are able to create a contiguous range-clustered partition in one single pass. With a table of contents that keeps track of the cluster boundaries, the result is identical to that of a sequence of cracking operations covering all 2^k - 1 cluster boundaries.
- TODO: I don't remember radix sort... (see the sketch of the single radix-like step below)
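A small Python sketch of that single range-clustering step for integer keys (illustrative; the in-memory lists and the way k is aligned to the largest code are assumptions):
#+begin_src python
def radix_range_cluster(values, k):
    """Single-pass 'best effort' range clustering into at most 2^k clusters.
    Each value's cluster is taken from the k most significant bits of its
    code c = v - v_min, aligned to the bit length of the largest code."""
    v_min = min(values)
    c_max = max(values) - v_min
    shift = max(c_max.bit_length() - k, 0)   # keep only the top k bits

    # extra initial scan: count bucket sizes so output can be built in one pass
    counts = [0] * (1 << k)
    for v in values:
        counts[(v - v_min) >> shift] += 1
    starts, total = [], 0
    for c in counts:
        starts.append(total)
        total += c

    # single pass: scatter values into their clusters (a contiguous array)
    out = [None] * len(values)
    cursor = list(starts)
    for v in values:
        b = (v - v_min) >> shift
        out[cursor[b]] = v
        cursor[b] += 1
    # `starts`/`counts` act as the table of contents:
    # cluster b occupies out[starts[b] : starts[b] + counts[b]]
    return out, starts, counts
#+end_src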
*** Hybrid Algos
- They apply sorting, cracking and radix clustering on both the initial and final partitions and combine them arbitrarily, yielding 9 hybrid algorithms (named H<initial><final> after the treatment of the initial and final partitions):
+ Sort-Sort (HSS)
+ Sort-Radix (HSR)
+ Sort-Crack (HSC)
+ Radix-Sort (HRS)
+ Radix-Radix (HRR)
+ Radix-Crack (HRC)
+ Crack-Sort (HCS)
+ Crack-Radix (HCR)
+ Crack-Crack (HCC)
- As HSS is simply the original adaptive merging and they want to avoid its high up-front investment, the HS* variants won't be considered.
- An example of HCC:
+ With the first query, the data is loaded into four initial (unsorted) partitions that hold disjoint row ID ranges. Each initial partition is then cracked on the given key range, "d - i" in this case, and the qualifying key values are moved into the final partition, which also forms the result of the first query. (Note here that they crack the initial partitions, creating A BUNCH OF small partitions, some of which contain the values needed in the final partition. Only the partitions they need are merged into the final partition, leaving all the other, also potentially tiny, partitions as they are, still unsorted.)
+ The second query's key range "f - m" partly overlaps with the first query's key range, so the previous final partition holding keys from "d - i" is cracked on "f" to isolate the overlapping range "f - i". Then all initial partitions are cracked on "m" to isolate keys from "j - m" (note that they naturally do not have any keys prior to "i" anymore, and "j" follows "i"), and these are moved into a new value partition. The result of the second query is then available and can be merged into the final partitions.
- An example of HRR:
+ Data is loaded into four initial partitions and radix-clustered on the k = 1 most significant bits of the codes given in the appendix. The clusters that hold the requested key range boundaries (as the four initial clusters have been split into "subclusters") are cracked on the range query (which is "d - i" again), creating a whole bunch of small clusters. The important ones are then merged together to form the final partition. The newly formed partitions are again radix-clustered on the k = 1 most significant bits of another table given in the appendix, to help future queries. For future queries we do the same: crack when needed, merge, then radix-cluster.
- Compared to HCC and HRR, the variations HCR and HRC swap the treatment of the final partitions during the merge step, i.e. HCR uses cracking for initial partitions but radix clustering for final partitions, while HRC uses radix clustering for initial partitions but cracking for final partitions.
- HCS and HRS invest in sorting each final partition on creation, just like the original adaptive merging (HSS).
- An adaptive indexing technique is characterised by how lightweight its adaption is (the cost of the first few queries representing a workload change) and by how fast, in terms of time and queries needed, it converges to the performance of a perfect index.
- Several parameters may affect performance, e.g. query selectivity, data skew, updates, concurrent queries, disk-based processing, etc.
- A hybrid variation using sorting will have the edge in an environment with concurrent queries or with limited memory, as fewer queries will require physical reorganisation. + Such a hybrid will suffer in an environment with many updates.
*** Experiments
- They implemented everything in MonetDB (an open-source column-store DBMS).
- They use simple queries only, using basic selects and ranges.
- Original cracking is at first a bit slower than scanning the entire column. It quickly becomes faster though. It also has a very "smooth" adaption, but never gets to the level of full indexing and adaptive merging (although it looks like it might get there).
- Original adaptive merging is much slower than cracking to begin with, only beating the full indexing method. It also takes quite a while to adapt, as the first 6 queries are slower than the scan. After this however, it starts beating everything except full indexing, so it converges quickly once it gets over the starting issues.
- Each hybrid version occupies a different spot between adaptive merging and cracking, which can be seen as the two extremes of adaptive indexing.
**** The cracking ones
- The HCC variation improves heavily over plain cracking by reducing the cost of the first query to the level of a scan, so the overhead of adaptive indexing disappears. In contrast to original cracking, the hybrid operates on batches of the column at a time, and it uses the new cracking that creates and cracks initial partitions in one go. HCC maintains the smooth behaviour of cracking, but it does not achieve the fast convergence of adaptive merging.
- HCS uses sorting to speed up adaption, and this can be seen to work, as it mimics the adaptive merging method and actually reaches the best case quickly. Compared to adaptive merging, it has a significantly lower init cost. HCS is however slightly slower than HCC for the first query, and it is still slower than a scan for the first 10 queries, while HCC is never slower than a scan. This is likely due to the investment in sorting the final partitions, which is also why it adapts to the best case more quickly.
- HCR achieves a nice balance between HCC and HCS: this hybrid invests in clustering as opposed to sorting the final partitions, so it beats the scan sooner, but seems to converge to something short of the best case because of this. So although it does not match adaptive merging's best-case speed, its performance is several orders of magnitude faster than original cracking and also faster than a scan. It also converges very quickly. All of this at zero cost, since no overhead is imposed for the first part of the workload sequence. Clustering partitions is more eager than cracking but lazier than full sorting, which is how the balance is achieved.
**** The radix ones
- All use radix clustering for the initial partitions as opposed to cracking.
- So all these hybrid variations become more eager during the first query, but this also means that they are all slightly more expensive than the HC* hybrids.
**** Selectivity
- I assume they mean how large the query ranges are?
- With smaller selectivity, it takes more queries to reach optimal performance, likely because the chance of requiring merging actions is higher with smaller selectivity, since less data is merged by any given query.
- With smaller selectivity, the difference in convergence is less significant between the lazy HCC and the more eager HCS. The lazy algorithms maintain their lightweight init advantage. Original cracking and adaptive merging show similar behaviour: cracking resembles HCC and adaptive merging resembles HCS.
- One could artificially force more active merging steps, which would increase the convergence speed (i.e. decrease the number of queries before convergence).
**** Summary
- HCS gets very close to the ideal hybrid.
- HCR has the lightweight footprint of a scan query and can still reach optimal performance quickly.
- HCS is only two times slower than a scan for the first query, but reaches optimal performance very quickly.
- HCR provides a smooth adaption, never being slower than a scan.
- HCS and HCR are both valid choices: HCR is to be used for the most lightweight adaption, while HCS is to be used when we want the fastest adaption.
*** Conclusion
- Their initial experiments yielded an insight about adaptive merging. Data moves out of initial partitions and into final partitions. Thus, the more times an initial partition has already been searched, the less likely it is to be searched again. A final partition, on the other hand, is searched by every query, either because it contains the result or because results are moved into it. So effort spent on refining an initial partition is less likely to pay off than effort invested in the final partitions.
- Their hybrid algos exploit this distinction and apply different refinement strategies to initial versus final partitions.
- This enables an entirely new approach to physical database design.
- The initial database contains no indexes (or only indexes for primary keys and uniqueness constraints). Query processing initially relies on large scans, yet all scans contribute to index optimisation in the key ranges of actual interest.
** Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
- All mainstream LSM-tree (Log-Structured Merge-tree) based key-value stores in the literature and in industry trade suboptimally between the I/O cost of updates on one hand and the I/O cost of lookups and storage space on the other. This is because they perform equally expensive merge operations across all levels of the LSM-tree to bound the number of runs that a lookup has to probe and to remove obsolete entries to reclaim storage space.
- With state-of-the-art designs, however, merge operations at all levels of the LSM-tree but the largest (i.e. most merge operations) reduce point lookup cost, long range lookup cost and storage space by a negligible amount, while significantly adding to the amortised cost of updates.
- To address this problem (the suboptimal trade-off just described), they propose a new design that removes merge operations from all levels of the LSM-tree but the largest.
- Lazy leveling improves the worst-case complexity of update cost while maintaining the same bounds on point lookup cost, long range lookup cost and storage space.
- They introduce the Fluid LSM-tree: a generalisation of the entire LSM-tree design space that can be parameterised to assume any existing design.
- Relative to lazy leveling, the Fluid LSM-tree can optimise more for updates by merging less at the largest level, or it can optimise more for short range lookups by merging more at all other levels.
*** Introduction
- A key-value store is a database that maps search keys to their corresponding values.
- To persist key-value entries in storage, most key-value stores today use LSM-trees.
**** LSM-Trees
- LSM-trees buffer inserted/updated entries in main memory and flush the buffer as a sorted run (a "run" is a sorted batch of entries written out to storage in one go) to secondary storage every time it fills up. The LSM-tree's main gimmick is exploiting how secondary storage works, i.e. that sequential reads and writes are much, much faster than random ones.
- LSM-trees later sort-merge these runs to bound the number of runs that a lookup has to probe and to remove obsolete entries, i.e. entries for which there exists a more recent entry with the same key.
- LSM-trees organise runs into levels of exponentially increasing capacities, whereby larger levels contain older runs.
- As entries are updated, a point lookup finds the most recent version of an entry by probing the levels from smallest to largest, thus encountering the newest version first and terminating once it is found. I suspect it also checks the buffer first?
- A range lookup is more tricky and has to access the relevant key range across all runs at all levels and eliminate obsolete entries from the result set.
- To speed up lookups on individual runs, modern designs use fence pointers for every run. So for every run there is a set of fence pointers containing the first key of every block of the run, which allows lookups to access a particular key within a run with just one I/O.
- Furthermore, for every run there exists a Bloom filter, which allows point lookups to skip runs that do not contain the target key.
- The problem is: the frequency of merge operations in LSM-trees controls an intrinsic trade-off between the I/O cost of updates on one hand and the I/O cost of lookups and storage space amplification (caused by the presence of obsolete entries) on the other. Existing designs trade suboptimally among these metrics.
- By analysing the design space of state-of-the-art LSM-trees, they pinpoint the problem to the fact that the worst-case update cost, point lookup cost, range lookup cost and space amplification derive differently from the different levels:
+ Updates derive their I/O cost equally from merge operations across all levels
+ Point lookup I/Os mostly target the largest level (due to how the Bloom filters are constructed)
+ The majority of I/Os caused by long range lookups target the largest level, due to the exponentially increasing level sizes
+ Short range lookups derive their I/O cost equally from across all levels
+ The highest fraction of obsolete entries is at the largest level, since the newer versions sit at the smaller levels
- Since the worst-case point lookup cost, long range lookup cost and space amplification derive mostly from the largest level, merge operations at all levels of the LSM-tree but the largest (i.e. most merge operations) hardly improve these metrics while significantly adding to the amortised cost of updates. This leads to suboptimal trade-offs.
**** Solution
- They expand the LSM-tree design space with lazy leveling, a new design that removes merging from all but the largest level of the LSM-tree (so that merging is only paid for where it actually helps).
- They introduce the Fluid LSM-tree as a generalisation of the LSM-tree that enables transitioning fluidly across the whole LSM-tree design space. It controls the frequency of merge operations separately for the largest level and for all other levels. This means it can optimise more for updates by merging less at the largest level, or it can optimise more for short range lookups by merging more at all other levels.
- Everything is put together in Dostoevsky: a space-time optimised, evolvable, scalable key-value store. Dostoevsky analytically finds the tuning of the Fluid LSM-tree that maximises throughput for a particular application workload and hardware, subject to a user constraint on space amplification. This is done by pruning the search space to quickly find the best tuning and physically adapting to it during runtime, so it can adapt to either faster lookups or faster updates, depending on the application. (And yes, "sort-merging" the runs just means merging the sorted runs together as in merge sort.)
- They show that state-of-the-art LSM-trees all perform equally expensive merge operations across all levels, yet merge operations at all but the largest level improve point lookup cost, long range lookup cost and space amplification by a negligible amount while adding significantly to the amortised cost of updates. So it is wasteful to run these merge operations on any level but the largest, as they have little benefit and are expensive. This is the problem lazy leveling solves: it improves the cost of updates while leaving the other metrics unaffected.
*** More LSM tree background
- LSM-trees are optimised for writing.
- They initially buffer all updates, insertions and deletes in main memory. When this buffer fills up, the LSM-tree flushes the buffer to secondary storage as a sorted run. LSM-trees merge-sort runs in order to bound the number of runs that a lookup has to access in secondary storage and to remove obsolete entries to reclaim space. So the deeper the level, the bigger it is as well, since runs are merged together and then moved down a level.
- Runs are conceptually organised into L levels of exponentially increasing sizes. Level 0 is the buffer in main memory, and runs belonging to all other levels are in secondary storage.
- The balance between the I/O cost of merging and the cost of lookups and space amplification can be tuned in two ways. First, there is the size ratio T between the capacities of adjacent levels; T controls the number of levels of the LSM-tree and thus the overall number of times that an entry gets merged across all levels. Second, there is the merge policy, which controls the number of times an entry gets merged within a level. This can be either tiering or leveling (see the sketch below):
+ In tiering, runs are merged within a level only when the level reaches capacity (the merged run is then moved to the next level).
+ In leveling, runs are merged within a level whenever a new run comes in.
- In both cases, a merge is triggered by the buffer flushing and causing level 1 to reach capacity. With tiering, all runs at level 1 get merged and the result is placed at level 2. With leveling, the merge also includes the preexisting run at level 2 (as a new run comes into level 2).
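A toy model of the two merge policies (a sketch of the behaviour described above, not a real LSM-tree; sizes are counted in "buffer units" and all other details are assumptions):
#+begin_src python
def lsm_insert_batch(levels, policy, T, buffer_size=1):
    """One buffer flush arriving at level 1. `levels[i]` is a list of run
    sizes at level i+1; level i+1 has capacity buffer_size * T**(i+1)."""
    incoming = buffer_size
    for i in range(len(levels)):
        capacity = buffer_size * T ** (i + 1)
        if policy == "leveling":
            # merge the incoming run into the single run kept at this level
            merged = sum(levels[i]) + incoming
            if merged <= capacity:
                levels[i] = [merged]
                return
            levels[i] = []            # level overflows: push everything down
            incoming = merged
        else:  # tiering
            levels[i].append(incoming)
            if len(levels[i]) < T:    # merge only once T runs have gathered
                return
            incoming = sum(levels[i])
            levels[i] = []
    levels.append([incoming])         # grow a new, larger level if needed
#+end_src
Calling this repeatedly starting from =levels = []= leaves up to T-1 runs per level under tiering, and a single growing run per level under leveling.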
*** General background stuff
- Updates are performed out-of-place (meaning the original entry is not changed, it is immutable; you simply write a new entry), so multiple versions of an entry with the same key may exist across levels. LSM-trees handle this: if an entry is inserted into the buffer and the buffer already contains an entry with the same key, the new one is considered correct and replaces the old. Also, when two runs that contain an entry with the same key are merged, only the entry from the newer run is kept, as it is more recent. To keep this consistent, runs can only be merged with the next-older or the next-younger run.
- A point lookup just traverses the levels from smallest to largest.
- A range lookup has to find the most recent versions of all entries within the target key range. This is done by merge-sorting the relevant key range across all runs at all levels; while merge-sorting, it identifies entries with the same key across different runs and discards the older versions.
- Deletes are supported by a one-bit flag. When a deleted entry is merged, it is removed.
- Fence pointers: all major LSM-tree based key-value stores index the first key of every block of every run in main memory. These are called fence pointers, and they take up a fair amount of space, O(N/B) (N = total number of entries; B = number of entries that fit into a storage block). They allow a lookup to find the relevant key range within every run with one I/O.
- Bloom filters are used to speed up point lookups. Each run has a Bloom filter in main memory. A Bloom filter can return false positives but never false negatives. A point lookup probes a run's Bloom filter before accessing the corresponding run in storage. If the filter returns a true positive, the lookup accesses the run with one I/O using the fence pointers, finds the matching entry and terminates. If the filter returns a negative, the lookup skips the run, thereby saving one I/O. If it returns a false positive, the lookup wastes one I/O by accessing the run, not finding a matching entry and having to continue searching for the target key in the next run.
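A sketch of a point lookup using these two structures (all names are illustrative assumptions: =run.bloom= with a =might_contain= method, =run.fences= as the sorted list of first keys per block, =run.blocks= as sorted lists of (key, value) pairs):
#+begin_src python
import bisect

def point_lookup(levels, key):
    """Probe levels from smallest to largest; within a level, runs are assumed
    ordered newest first. Returns the newest value for `key`, or None."""
    for level in levels:
        for run in level:
            if not run.bloom.might_contain(key):
                continue                        # negative: skip run, save an I/O
            # fence pointers tell us which single block to fetch
            b = bisect.bisect_right(run.fences, key) - 1
            if b < 0:
                continue
            block = run.blocks[b]               # the one storage I/O
            i = bisect.bisect_left(block, (key,))
            if i < len(block) and block[i][0] == key:
                return block[i][1]              # newest version found: terminate
            # false positive: one wasted I/O, keep searching older runs
    return None
#+end_src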
*** Design Space and Problem Analysis
- Entry updates (inserts) are paid for through the subsequent merge operations that the updated entry participates in. We assume that the now-obsolete entry does not get removed until the updated version reaches the last level. With tiering, an entry is merged O(1) times per level across all O(L) levels; with a block size of B (each I/O moves B entries), this amortises to O(L/B) I/Os per update. With leveling it is slightly more tricky, O((L*T)/B): an entry gets merged on average T/2, i.e. O(T), times per level (since merges now happen even before the level is full enough to spill to the next one), giving O(L*T) merges across O(L) levels, which is then divided by the block size.
- When analysing point lookups, they assume the lookup does not find the entry, so everything has to be looked through, and that the Bloom filters only return false positives. In this case, leveling issues O(L) wasted I/Os and tiering issues O(T*L) wasted I/Os. (I suspect this is because leveling always keeps the runs at each level merged into one, so there is one run per level to look through, while tiering may have up to T runs at a level. In practice this worst case won't happen, as the Bloom filters rarely give false positives.)
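The complexities above, written out as a small cost model (a sketch, constants ignored; the level-count formula is my rough assumption, with N the total number of entries, B the entries per block and T the size ratio):
#+begin_src python
import math

def worst_case_costs(N, B, T, policy):
    """Asymptotic worst-case costs from the analysis above (up to constants).
    Returns (update_io, point_lookup_io_without_bloom_help) per operation."""
    L = max(1, math.ceil(math.log(max(N / B, 2), T)))  # number of levels, roughly
    if policy == "tiering":
        update = L / B          # O(L/B): merged O(1) times per level
        lookup = T * L          # O(T*L): up to T runs per level to probe
    else:  # leveling
        update = T * L / B      # O(T*L/B): merged O(T) times per level
        lookup = L              # O(L): one run per level
    return update, lookup
#+end_src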
- The analysis of space amplification boils down to the worst case being when all levels 1 .. L-1 contain updates to entries at level L.
- When the size ratio between levels (T) is 2, leveling and tiering become the same, and as T increases, lookup cost and space amplification decrease for leveling and increase for tiering, while update cost increases for leveling and decreases for tiering. So the trade-off space is partitioned: leveling has strictly better lookup cost and space amplification, and strictly worse update cost, than tiering.
*** Lazy leveling
- A merge policy that eliminates merging at all but the largest level of the LSM-tree.
- Relative to leveling, lazy leveling improves the cost complexity of updates, maintains the same complexity for point lookups, long range lookups and space amplification, and provides competitive complexity for the rest.
- Lazy leveling is at its core a hybrid of leveling and tiering: it applies leveling at the largest level and tiering at all other levels. As a result, the number of runs at the largest level is 1 and the number of runs at each other level is at most T-1, since the merge operation takes place when the T'th run arrives.
- They change the Bloom filters slightly to accommodate having more runs at the smaller levels.
*** Fluid LSM-Tree
- It controls the frequency of merge operations separately for the largest level and for all other levels.
- There are at most Z runs at the largest level and at most K runs at each of the smaller levels.
- So K = T-1 and Z = 1 gives lazy leveling, K = 1 and Z = 1 gives leveling, and K = T-1 and Z = T-1 gives tiering (see the sketch below).
- So it can transition back and forth between these designs.
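The K/Z parameterisation above, spelled out as a tiny helper (illustrative only):
#+begin_src python
def fluid_lsm_policy(K, Z, T):
    """Name the design a Fluid LSM-tree parameterisation corresponds to
    (K = max runs per smaller level, Z = max runs at the largest level,
    T = size ratio), per the bounds listed above."""
    if K == 1 and Z == 1:
        return "leveling"
    if K == T - 1 and Z == 1:
        return "lazy leveling"
    if K == T - 1 and Z == T - 1:
        return "tiering"
    return "intermediate design (somewhere between these three)"
#+end_src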
*** Dostoevsky
- Models and optimises throughput with respect to update cost W, zero-result point lookup cost R, non-zero-result point lookup cost V and range lookup cost Q. The proportions of these in the workload are monitored, and their costs are weighted using coefficients w, r, v and q. This weighted cost is multiplied by the time to read a block from storage, and taking the inverse gives the weighted worst-case throughput.
- Dostoevsky maximises the resulting throughput expression by iterating over different values of the parameters T, K and Z. It prunes the search space using two insights (see the sketch below):
1) The LSM-tree has at most L_{max} levels, each of which has a corresponding size ratio T, so there are only ... meaningful values of T to test
2) The lookup costs R, Q and V increase monotonically with respect to K and Z, whereas the update cost W decreases monotonically with respect to them. As a result, the optimisation equation is convex with respect to K and Z, and they can divide and conquer their value spaces, converging to the optimum with logarithmic runtime complexity.
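A sketch of that tuning search (the weighted cost function is passed in as an argument since the notes do not spell it out; the ternary searches over K and Z lean on the convexity from insight 2, and searching the two dimensions one after the other is my simplification):
#+begin_src python
def tune_fluid_lsm(weighted_cost, T_candidates, K_max, Z_max):
    """Find (T, K, Z) minimising the weighted worst-case cost
    w*W + r*R + v*V + q*Q, assumed to be wrapped up in weighted_cost(T, K, Z).
    Maximising throughput is equivalent to minimising this cost."""

    def ternary_search(f, lo, hi):
        # f is assumed convex on the integer range [lo, hi]
        while hi - lo > 2:
            m1 = lo + (hi - lo) // 3
            m2 = hi - (hi - lo) // 3
            if f(m1) < f(m2):
                hi = m2
            else:
                lo = m1
        return min(range(lo, hi + 1), key=f)

    best = None
    for T in T_candidates:                     # insight 1: few values of T
        # insight 2: convex in K and Z, so probe each dimension with a
        # logarithmic number of evaluations instead of a full grid scan
        K = ternary_search(lambda k: weighted_cost(T, k, 1), 1, K_max)
        Z = ternary_search(lambda z: weighted_cost(T, K, z), 1, Z_max)
        cand = (weighted_cost(T, K, Z), T, K, Z)
        if best is None or cand < best:
            best = cand
    return best[1:]   # the (T, K, Z) tuning with the lowest weighted cost
#+end_src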
*** Conclusion
- Dostoevsky dominates everything.