- Their proposed framework not only determines the contribution of each feature in the input, but also provides insight into the decision-making process of DNN models.
- They further interact with the neuron of the target category at the output layer of the DNN, enforcing the interpretation result to be class-discriminative.
*** Introduction
- DNN models may learn biases from the training data, in which case interpretability can be used to debug them (LIME does this)
- Existing interpretation methods focus on two types of interpretations, model-level and instance-level.
1) Model-level interpretation focuses on finding a good prototype in the input domain that is interpretable and can represent the abstract concept learned by a neuron or a group of neurons of the DNN.
2) Instance-level interpretation aims to answer what features of an input cause it to activate the DNN neurons to make a specific prediction (LIME)
- Instance-level methods usually follow the idea of local interpretation.
+ Let x be an input for a DNN; the prediction of the DNN is denoted as a function f(x). Through monitoring the prediction response of f around the neighborhood of a given point x, the features in x which cause a larger change of f are treated as more relevant to the final prediction (a minimal sketch of this perturbation idea follows at the end of this subsection).
- These methods tend not to use the intermediate layers.
- It has been shown that some inputs trigger weird behaviour, tricking the DNN into making unexpected outputs, which ruins the whole input/output black-box philosophy, so looking at intermediate layers will help here.
- Feature inversion (feature inversion aims to map the feature generated at any layer of a DNN back to a plausible input; each layer in a DNN maps an input feature to an output feature and in the process ignores the input content that does not seem relevant to the classification task) has been studied for visualising and understanding intermediate feature representations of DNNs.
- The inversion results indicate that as the information propagates from the input layer to the output layer, the DNN classifier gradually compresses the input information while discarding information irrelevant to the prediction task.
- Inversion results from a specific layer also reveal the amount of information contained in that layer.
- These guys propose an instance-level DNN interpretation model by performing guided image feature inversion, leveraging the observation from their preliminary experiments that the higher layers of a DNN do capture the high-level content of the input as well as its spatial arrangement. They present guided feature reconstructions to explicitly preserve the object localisation information in a "mask", so as to provide insight into what information is actually employed by the DNN for the prediction.
- They establish connections between input and the target object by fine-tuning the interpretation result obtained from guided feature inversions.
- They show that the intermediate activation values at higher convolutional layers of the DNN can act as a stronger regulariser.
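A minimal sketch of the local-interpretation idea above (monitoring how f responds to perturbations around x); the model f, the occlusion-by-baseline scheme and all names here are my own illustration, not the paper's method:

#+begin_src python
import numpy as np

def local_relevance(f, x, baseline=0.0):
    """Score each feature of x by how much the prediction f(x) changes
    when that feature alone is replaced by a baseline value."""
    base = f(x)
    scores = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        x_pert = x.copy()
        x_pert[i] = baseline            # perturb a single feature
        scores[i] = abs(base - f(x_pert))
    return scores                        # larger change => more relevant feature

# Toy usage: a "model" that only depends on features 0 and 2.
f = lambda x: 2.0 * x[0] - 0.5 * x[2]
print(local_relevance(f, np.array([1.0, 1.0, 1.0])))   # [2.0, 0.0, 0.5]
#+end_src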
*** Interpretation of DNN-Based Prediction
- The main idea of their proposed framework is to identify the image regions that simultaneously encode the location information of the target object and match the feature representation of the original image.
- Let c be the target object (in classification, the target output, the one we want to hit) that they want to interpret, and let x_i correspond to the i'th feature; then the interpretation for x is encoded by a score vector s \in R^d where each score element s_i \in [0,1] represents how relevant that feature is for explaining f_c(x). The input vector x_a corresponds to the pixels of an image and the score s will be a saliency map (or attribution map) where pixels with higher scores represent higher relevance for the classification task.
- It has been shown that the deep image representation extracted from a layer in a CNN can be inverted to a reconstructed image which captures the properties and invariances encoded in that layer.
- Feature inversion can reveal how much information is preserved in the feature at a specific layer.
- These inversions reveal that at the early layers much of the information from the original image is still preserved, while at the last layers only the rough shapes are, so CNNs gradually filter out information unrelated to the classification task. This is why we are interested in looking at the early layers to provide explanations for classification results.
- Given a pre-trained L-layer CNN model, let the intermediate feature representation at layer l \in {1,2,...,L} be denoted as a function f^l(x_a) of the input image x_a. We then need to compute the approximate inversion f^{-1} of the representation f^{l_0}(x_a).
- So we know early on roughly which things are focused on, but not until the later layers do we know, with confidence, which part of the input information is ultimately preserved for the final prediction. They use regularisation to ensure this.
- They can compute the contributing factors in the input as the pixel-wise difference between x_a and x^*, the optimal inversion result obtained from gradient descent, yielding a saliency map s. This is not feasible in practice, however, as the resulting saliency map is noisy.
+ Note: A saliency map is an image that shows each pixel's unique quality. So it should simplify and/or change the representation of an image into something that is more meaningful and easier to analyse.
- To tackle the problem of the noisy saliency map, they instead propose the guided feature inversion method, where the expected inversion image representation is reformulated as the weighted sum of the original image x_a and another noise background image p.
- They generate the mask m to illustrate which objects are of importance. It is largely generated from the early layers and as a result it may capture multiple things if multiple things are in the foreground.
+ This is fixed by strongly activating the softmax probability value at the last hidden layer L of the CNN for the given target c and reducing the activation for other classes. They can then filter out irrelevant information with respect to target class c, including image background and other classes of foreground objects.
- This might still generate undesirable artifacts without regularisation imposed on the optimisation process. They propose to carefully design the regularisation term of the mask m to overcome the artifact problem, imposing a stronger natural-image prior by utilising the intermediate activation features of the CNN. (Essentially, the issue is that the mask m might highlight nonsense and random artifacts. If they use the early layers, which are very responsive to the different target objects, they have a higher chance of highlighting the proper things.)
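A rough sketch of the guided feature inversion objective described above: blend the original image x_a with a noise background p via a mask m, and optimise m so that the blended image's layer-l_0 features match those of x_a, with a simple regulariser keeping the mask small. Here `features` (a handle on the layer-l_0 activations), the optimiser settings and the loss weights are my own placeholders, not the authors' exact formulation:

#+begin_src python
import torch

def guided_inversion_mask(features, x_a, steps=200, lam=1e-2, lr=0.1):
    """features: callable mapping an image batch (1, C, H, W) to layer-l_0
    activations. Returns a per-pixel mask m in [0, 1]."""
    p = torch.randn_like(x_a)                                  # noise background image
    m = torch.full(x_a.shape[-2:], 0.5, requires_grad=True)    # one weight per pixel
    target = features(x_a).detach()                            # representation to match
    opt = torch.optim.Adam([m], lr=lr)
    for _ in range(steps):
        mask = m.clamp(0, 1)
        x_blend = mask * x_a + (1 - mask) * p                  # weighted sum of image and noise
        loss = ((features(x_blend) - target) ** 2).mean() + lam * mask.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return m.detach().clamp(0, 1)
#+end_src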
*** Their algo
- So they first generate the mask. This mask is used to blur out anything not relevant to what we wish to predict (the mask is a heatmap, so anything not represented by this heatmap is blurred out). This blurred-out image is then run through the predictor CNN, using the weights obtained in the mask step together with the class-discriminative loss.
- They use class-discriminative interpretation (the softmax thing) such that when they find two things, such as an elephant and a zebra, they can run the prediction network twice, as they know there are two things.
- To make it more robust and avoid the mentioned artifacts, the activations of intermediate layers can be included in the feature inversion method. This is done by defining the mask m as the weighted sum of channels at a specific layer l_1.
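Continuing the sketch above for the class-discriminative variant: here the mask is parameterised as a weighted sum of the channel activations at an intermediate layer l_1, and the channel weights are optimised so that the masked image keeps the softmax score of the target class high. `model`, `features_l1` and the optimisation details are again placeholders for illustration, not the authors' exact loss:

#+begin_src python
import torch
import torch.nn.functional as F

def class_discriminative_mask(model, features_l1, x_a, target_class,
                              steps=200, lr=0.1):
    acts = features_l1(x_a).detach()[0]                   # (C, H', W') activations at layer l_1
    w = torch.zeros(acts.shape[0], requires_grad=True)    # one weight per channel
    p = torch.randn_like(x_a)                             # noise background
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        m = torch.einsum('c,chw->hw', torch.sigmoid(w), acts)
        m = m.clamp(min=0)
        m = m / (m.max() + 1e-8)                          # normalise mask to [0, 1]
        m = F.interpolate(m[None, None], size=x_a.shape[-2:],
                          mode='bilinear', align_corners=False)[0, 0]
        x_blend = m * x_a + (1 - m) * p                   # keep only the masked content
        logp = F.log_softmax(model(x_blend), dim=1)
        loss = -logp[0, target_class]                     # favour the target class only
        opt.zero_grad(); loss.backward(); opt.step()
    return m.detach()
#+end_src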
* Indexing Methods
** Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores
- Adaptive indexing is characterised by the partial creation and refinement of the index as side effects of query execution.
- Dynamic or shifting workloads may benefit from preliminary index structures focused on the columns and specific key ranges actually queried - without incurring the cost of full index construction.
- The costs and benefits of adaptive indexing techniques should be compared in terms of initialisation costs, the overhead imposed upon queries and the rate at which the index converges to a state that is fully-refined for a particular workload component.
- These guys seek a hybrid technique between database cracking and adaptive merging, which are two techniques for adaptive indexing.
- Adaptive merging has a relatively high init cost, but converges rapidly, while database cracking has a low init cost but converges rather slowly.
*** Introduction
- Current index selection tools rely on monitoring database requests and their execution plans, then invoking creation or removal of indexes on tables and views.
- In the context of dynamic workloads, such tools tend to suffer from the following three weaknesses:
1) The interval between monitoring and index creation can exceed the duration of a specific request pattern so there is no benefit from the change
2) Even if it does not exceed this duration, there is no index support during the interval, as data access during the monitoring interval neither benefits from nor aids index creation efforts and the eventual index creation imposes an additional load that interferes with the query execution
3) Traditional indexes on tables cover all rows equally, even if some rows are needed often and some never.
- The goal is to enable incremental, efficient adaptive indexing, i.e. index creation and optimisation as side effects of query execution, with the implicit benefit that only tables, columns and key ranges truly queried are optimised.
- Use two measures to characterise how quickly and efficiently a technique adapts index structures to a dynamic workload:
1) The init cost incurred by the first query
2) The number of queries that must be processed before a random query benefits from the index structure without incurring any overhead.
- The first query captures the worst-case costs and benefits of adaptive merging, which is why it is focused on.
- The more often a key range is queried, the more its representation is optimised. Columns that are never queried are not indexed and key ranges that are not queried are not optimised.
- Overhead for incremental index creation is minimal and disappears when a range has been fully-optimised.
- Draw the graph showing where adaptive merging is expensive to begin with but converges quickly, database cracking is slow and where the two hybrids are. (The good and the bad)
- This paper provides the first detailed comparison between these two techniques (merging and cracking)
- Most previous approaches to runtime index tuning are non-adaptive, so index tuning and query processing operations are independent of each other. They monitor the running workload and then decide which indexes to create or drop based on the observations. Both having an impact on the database workload. Once a decision is made, it affects ALL KEY RANGES in an index. Since some data items are more heavily queried than others, the concept of partial indexes arose.
- Soft-indexes can be seen as adaptive indexing. They continually collect statistics for recommended indexes and then periodically and automatically solve the index selection problem. Like adaptive, they are picked based on query processing, but unlike adaptive, they are not incremental, so each recommended index is created and optimised to completion (so fully?).
- Adaptive indexing and approaches that monitor queries and then build indexes (The ones mentioned as previous approaches) are mutually compatible. Policies established by the monitor-and-tune techniques could provide information about the benefit and importance of different indexes and then adaptive indexing could create and refine recommended index structures while minimising the additional workload.
**** Database Cracking
- Combines features of automatic index selection and partial indexes.
- Reorganises data within the query operators, integrating the re-organisation effort into query execution. When a new column is queried by a predicate for the first time, a new cracker index is initialised. As the column is used in the predicates of further queries, the cracker index is refined by range partitioning until sequentially searching a partition is faster than binary searching in the AVL tree guiding a search to the appropriate partition.
- Keys in a cracker index are partitioned into disjoint key ranges, but left unsorted within each partition. Each range query analyses the cracker index, scans key ranges that fall entirely within the query range and uses the two end points of the query range to further partition the appropriate two key ranges. So, each partitioning step will create two new sub-partitions. A range is partitioned into 3 if both end points fall into the same key range. This will happen in the first partitioning step.
- So essentially, for each query, the data will be partitioned into subsets of ranges, without any of them ever being sorted. If you keep track of which subsets are for which keys, you can easily answer queries, as you can skip checking most of them.
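A toy sketch of one cracking step on an unsorted integer column, with my own simplifications (list comprehensions instead of in-place swaps, and a plain dict instead of the AVL tree):

#+begin_src python
def crack_in_three(values, low, high):
    """Partition an unsorted column in place into (< low | [low, high) | >= high)
    and return the two new partition boundaries."""
    smaller  = [v for v in values if v < low]
    in_range = [v for v in values if low <= v < high]
    larger   = [v for v in values if v >= high]
    values[:] = smaller + in_range + larger
    return len(smaller), len(smaller) + len(in_range)

# First range query on a fresh column: one crack-in-three step.
column = [13, 4, 55, 9, 27, 2, 41, 8]
lo_pos, hi_pos = crack_in_three(column, 5, 30)
cracker_index = {5: lo_pos, 30: hi_pos}   # key value -> first position with key >= value
print(column[lo_pos:hi_pos])              # [13, 9, 27, 8]: all keys in [5, 30), still unsorted
#+end_src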
**** Adaptive Merging
- Database cracking functions like an incremental quicksort with each query forcing one or two partitioning steps.
- Adaptive Merging functions as an incremental merge sort, where one merge step is applied to all key ranges in a query's result.
- The first query to use a given column in a predicate produces sorted runs, and each subsequent query upon that same column applies at most one additional merge step.
- Each merge step only affects those key ranges that are relevant to actual queries, leaving records in all other key ranges in their initial places.
- This merge logic takes place as a side effect of queries.
- So the first query triggers the creation of some sorted runs, loading the data into equally sized partitions and sorting each in memory. It then retrieves the relevant values (via index lookup, because the runs are sorted) and merges them out of the runs and into a final partition. Similar things happen for a second query, where the query's results are merged out of the runs and into the final partitions. Subsequent queries continue to merge results from the runs until the final partition has been fully optimised for the current workload. (The final partition being the one containing the relevant values in sorted order.)
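A toy sketch of this adaptive-merging flow for a single integer column, under my own simplifications (the final partition is just an append-only list here, not organised by key ranges): the first query builds sorted runs, and each query then moves its qualifying keys out of the runs and merges them into the final partition:

#+begin_src python
import bisect
import heapq

def make_runs(values, run_size):
    """First query: load the data into memory-sized runs and sort each."""
    return [sorted(values[i:i + run_size]) for i in range(0, len(values), run_size)]

def query(runs, final, low, high):
    """Answer [low, high): keys already moved to the final partition are found
    there; missing keys are merged out of the sorted runs as a side effect."""
    hit = [v for v in final if low <= v < high]        # already-optimised part
    moved = []
    for run in runs:
        lo = bisect.bisect_left(run, low)              # index lookup: runs are sorted
        hi = bisect.bisect_left(run, high)
        moved.append(run[lo:hi])
        del run[lo:hi]                                 # keys leave the runs for good
    merged = list(heapq.merge(*moved))                 # one merge step per query
    final.extend(merged)
    return sorted(hit + merged)

runs = make_runs([13, 4, 55, 9, 27, 2, 41, 8], run_size=3)
final = []                                             # grows as a side effect of queries
print(query(runs, final, 5, 30))                       # [8, 9, 13, 27]
print(query(runs, final, 10, 45))                      # [13, 27, 41] - 13 and 27 come from `final`
#+end_src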
*** Hybrid Algos
- Database cracking converges so slowly, since at most two new partition boundaries are generated per query, meaning that the technique will require thousands of queries to converge on an index for the focus range.
- The first query is very expensive for adaptive merging though, as it has to pay for the initial runs.
- The difference in the number of queries required to have a key range fully optimised is due to:
1) Merging with a high fan-in rather than partitioning with a low fanout of two or three
2) Merging a query's entire key range rather than only dividing the two partitions with the query's boundary keys.
- The difference in the cost of the first query is just due to the cost of sorting the initial runs.
- So the goal is to merge the best qualities of adaptive merging and database cracking.
- They strive to maintain the lightweight footprint of cracking, which imposes a minimal overhead on queries, and at the same time quickly achieve query performance comparable to the fully sorted arrays or indexes that adaptive merging manages to achieve.
**** Data Structures
- Each logical column in their model is represented by multiple pairs of arrays containing row identifiers and key values. Two data structures organise these pairs of arrays. All tuples are initially assigned to arbitrary unsorted "initial partitions". As a side effect of query processing, tuples are then moved into "final partitions" representing merged ranges of key values. Once all data is consumed from an initial partition P, then P is dropped. These are like adaptive merging's run and merge partitions, except that they do not necessarily sort the key values, plus the whole architecture has been redesigned for column-stores.
- Each init partition uses a table of contents to keep track of the key ranges it contains and a single master table of contents - the adaptive index itself - keeps track of the content of both the init and final partitions. Both tables are updated as key value ranges are moved from the initial to the final partitions.
- The data structure and physical organisation used is that of partial sideways cracking. The final partitions respect the architecture of the article defining partial sideways cracking, such that they can reuse its techniques for complex queries, updates and partial materialisation.
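A minimal sketch of the bookkeeping described above for a single column, with my own names and simplified types: each partition carries its table of contents, and a master table of contents (the adaptive index) knows which key ranges already live in final partitions:

#+begin_src python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class InitialPartition:
    rowids: List[int]
    keys:   List[int]                                   # unsorted key values
    toc:    List[Tuple[int, int]] = field(default_factory=list)  # (key, offset) boundaries from cracking

@dataclass
class FinalPartition:
    low:    int                                         # key range merged into this partition
    high:   int
    rowids: List[int] = field(default_factory=list)
    keys:   List[int] = field(default_factory=list)

@dataclass
class AdaptiveIndex:                                    # the master table of contents
    initial: List[InitialPartition] = field(default_factory=list)
    final:   List[FinalPartition]   = field(default_factory=list)

    def covered(self, low, high):
        """Is the key range [low, high) already fully held by final partitions?"""
        return any(f.low <= low and high <= f.high for f in self.final)
#+end_src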
**** Select Operator
- As with database cracking, the hybrids here result in a new select operator each (??)
- The input for a select operator is a single column and a filtering predicate while the output is a set of rowIDs.
- In the case of adaptive indexing, all techniques collect all qualifying tuples for a given predicate in a contiguous area. Thus, they can return a view of the result of a select operator over the adaptive index. (View is something database related)
**** Complex queries
- The qualifying rowIDs can be used by subsequent operators in a query plan.
- Their hybrids maintain the same interface and architecture as sideways cracking, which enables complex queries.
- The main idea is that the query plan uses a new set of operators that include steps for adaptive tuple reconstruction, to avoid the random access caused by the reorganisation steps of adaptive indexing.
- As such, these guys just focus on the select operator.
*** Strategies for organising partitions
- Hybrid algos follow the same general strategy as the implementation of adaptive merging, while trying to mimic cracking-like physical re-organisation steps that result in crack columns in the sideways cracking form.
- The first query of each column splits the column's data into initial partitions that each fit in memory. As queries are then processed, qualifying key values are then moved into the final partitions.
- Tables of contents and the adaptive index are updated to reflect which key ranges have been moved into the final partitions, so that subsequent queries know which parts of the requested key ranges to retrieve from final partitions and which from initial partitions.
- The hybrid algos differ from original adaptive indexing and from each other in how and when they incrementally sort the tuples in the initial and final partitions. They consider three different ways of physically reordering tuples in a partition:
1) Sorting
2) Cracking
3) Radix clustering (?)
**** Sorting
- Fully sorting initial partitions upon creation comes at a high up-front investment (as this is what adaptive merging does..)
- Fully sorting a final partition is typically less expensive, as the amount of data to be sorted at a single time is limited to the query's result.
- The gain of exploiting sort is fast convergence to the optimal state.
- Adaptive merging uses sorting for both the initial and final partitions.
**** Cracking
- Database cracking comes at a minimal investment, as it only performs at most two partitioning steps in order to isolate the requested key range for a given query.
- Subsequent queries exploit past partitioning steps and need to crack progressively smaller and smaller pieces to refine the ordering.
- Contrary to sorting, this has very low overhead but very slow convergence.
- In the hybrids, if the query's result is contained in the final partitions, then the overhead is small, as only a single or at most two partitions need to be cracked. If the query requires new values from the initial partitions, then potentially every initial partition needs to be cracked, causing overhead.
- For the hybrids, they have redesigned the cracking algorithms such that the first query in a hybrid that cracks the initial partitions is able to perform the cracking and the creation of the initial partitions in a single monolithic step, as opposed to a copy step first and then a crack step.
**** Radix Clustering
- A light-weight single-pass "best effort" (they do not require equally sized clusters) radix-like range-clustering into 2^k clusters as follows:
+ Given the smallest (v_) and largest (v^) value in the partition, they assume an order-preserving injective function f : [v_, v^] -> N_0, with f(v_) = 0, that assigns each value v \in [v_, v^] a numeric code c \in N_0. They use f(v) = v - v_ for v \in [v_, v^] \subseteq Z and f(v) = A(v) - A(v_) for characters v_, v, v^ \in {'A', ..., 'Z', 'a', ..., 'z'}, where A() yields the character's ASCII code.
- With this they perform a single radix-sort step on c = f(v) using the k most significant bits of c^ = f(v^), i.e. the result cluster of value v is determined by those k bits of its code c that sit at the positions of the k most significant bits of the largest code c^. Investing in an extra initial scan over the partition to count the actual bucket sizes, they are able to create a continuous range-clustered partition in one single pass. With a table of contents that keeps track of the cluster boundaries, the result is identical to that of a sequence of cracking operations that cover all 2^k - 1 cluster boundaries.
- TODO: I don't remember radix-sort ..
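A sketch of that single radix-sort step (i.e. a counting sort on the top bits) for integer keys, so f(v) = v - v_; one pass counts the cluster sizes, a second pass places each value in the cluster given by the k most significant bits of its code. Variable names are mine:

#+begin_src python
def radix_range_cluster(values, k):
    """Range-cluster `values` into at most 2^k clusters in two passes, using
    the k most significant bits of the code c = v - v_min."""
    v_min, v_max = min(values), max(values)
    shift = max((v_max - v_min).bit_length() - k, 0)   # keep only the k most significant bits
    cluster = lambda v: (v - v_min) >> shift

    counts = [0] * (2 ** k)
    for v in values:                                   # pass 1: count cluster sizes
        counts[cluster(v)] += 1
    starts, total = [], 0
    for c in counts:                                   # prefix sums = cluster boundaries
        starts.append(total)
        total += c
    out, pos = [None] * len(values), list(starts)
    for v in values:                                   # pass 2: place values into clusters
        b = cluster(v)
        out[pos[b]] = v
        pos[b] += 1
    return out, starts                                 # `starts` doubles as a table of contents

clustered, toc = radix_range_cluster([13, 4, 55, 9, 27, 2, 41, 8], k=1)
print(clustered)   # [13, 4, 9, 27, 2, 8, 55, 41]: codes with MSB 0 first, then MSB 1
#+end_src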
*** Hybrid Algos
- They apply sorting, cracking and radix clustering on both the initial and final partitions and combine them arbitrarily, yielding 9 hybrid algorithms.
- Hybrid variants (initial partition strategy / final partition strategy):
+ Sort Sort (HSS)
+ Sort Radix (HSR)
+ Sort Crack (HSC)
+ Radix Sort (HRS)
+ Radix Radix (HRR)
+ Radix Crack (HRC)
+ Crack Sort (HCS)
+ Crack Radix (HCR)
+ Crack Crack (HCC)
- As HSS is simply the original adaptive merging and they want to avoid its high up-front investment, none of the HS* variants are considered.
- An example of HCC:
+ With the first query, the data is loaded into four initial (unsorted) partitions that hold disjoint row ID ranges. Then each initial partition is cracked on the given key range, "d - i" in this case, and the qualifying key values are moved into the final partition, which also forms the result of the first query. (Note here that they crack the initial partitions, creating A BUNCH OF small partitions, some of which contain the values they need in the final partition. Only the partitions they need are then merged into the final partition, leaving all the other, also potentially tiny, partitions as they are, still unsorted.) A toy sketch of this flow follows at the end of this section.
+ The second query's key range "f - m" partly overlaps with the first query's key range, so the previous final partition holding keys from "d - i" is cracked on "f" to isolate the overlapping range "f - i". Then all initial partitions are cracked on "m" to isolate keys from "j - m" (note that they naturally do not have any keys prior to "i" anymore, and "j" follows "i"), and these are moved into a new value partition. The result of the second query is then available and can be merged into the final partitions.
- An example of HRR:
+ Data is loaded into four initial partitions and radix-clustered on the k = 1 most significant bits of the codes given in the appendix. The clusters that hold the requested key range boundaries (as the four initial clusters have been split into "subclusters") are cracked on the range query (which is "d - i" again), creating a whole bunch of small clusters. The relevant ones are then merged together to form the final partition. The newly formed partitions are again radix-clustered on the k = 1 most significant bits of another table given in the appendix, to help future queries. For future queries we do the same: crack when needed, merge, then radix-cluster.
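A toy end-to-end sketch of the HCC flow in the first example above, for single-character keys and with my own simplifications (the final store is scanned rather than cracked, and cracking uses list comprehensions): each query cracks every initial partition on its range and moves the qualifying keys into the final store, so overlapping later queries find part of their answer already there:

#+begin_src python
def crack_out(part, low, high):
    """Crack one unsorted initial partition on [low, high); return (kept, matching)."""
    matching = [v for v in part if low <= v < high]
    kept = [v for v in part if not (low <= v < high)]
    return kept, matching

def hcc_query(initial, final, low, high):
    """HCC-style select: crack the initial partitions on the query range and
    move qualifying keys to the final store, then answer from the final store."""
    for i, part in enumerate(initial):
        initial[i], hits = crack_out(part, low, high)   # crack each initial partition
        final.extend(hits)                              # moved keys now live in the final store
    return [v for v in final if low <= v < high]        # a real HCC would crack, not scan, here

data = list("qgdkfbmhacrieopn")
initial = [data[i:i + 4] for i in range(0, len(data), 4)]   # four unsorted initial partitions
final = []
print(sorted(hcc_query(initial, final, 'd', 'j')))   # ['d', 'e', 'f', 'g', 'h', 'i']
print(sorted(hcc_query(initial, final, 'f', 'n')))   # ['f', 'g', 'h', 'i', 'k', 'm'] - f..i reused from final
#+end_src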