diff --git a/notes.org b/notes.org
index 0ccff73..67a0bdf 100644
--- a/notes.org
+++ b/notes.org
@@ -733,6 +733,7 @@
 - The costs and benefits of adaptive indexing techniques should be compared in terms of initialisation costs, the overhead imposed upon queries, and the rate at which the index converges to a state that is fully refined for a particular workload component.
 - These guys seek a hybrid technique between database cracking and adaptive merging, which are two techniques for adaptive indexing.
 - Adaptive merging has a relatively high init cost, but converges rapidly, while database cracking has a low init cost but converges rather slowly.
+- By storing data in columns rather than rows, the database can more precisely access the data it needs to answer a query, rather than scanning and discarding unwanted data in rows. Query performance is increased for certain workloads.
 *** Introduction
 - Current index selection tools rely on monitoring database requests and their execution plans, then invoking creation or removal of indexes on tables and views.
 - In the context of dynamic workloads, such tools tend to suffer from the following three weaknesses:
@@ -930,3 +931,72 @@
 2) The lookup costs R, Q and V increase monotonically with respect to K and Z, whereas the update cost W decreases monotonically with respect to them. As a result, the optimisation equation is convex with respect to K and Z, so they can divide and conquer their value spaces and converge to the optimum with logarithmic runtime complexity.
 *** Conclusion
 - Dostoevsky dominates everything.
+** Indexing for Interactive Exploration of Big Data Series
+- Might actually use this over the merging and cracking paper, although the algorithms look gnarly.
+- In several time-critical scenarios, analysts need to be able to query the data produced by applications as soon as they become available, which is not currently possible with state-of-the-art indexing methods for very large data series collections.
+- These guys produce the first adaptive indexing mechanism (Dostoevsky is from 2018) specifically tailored to the problem of indexing and querying very large data series collections.
+- The main idea is that instead of building the complete index over the complete data set up-front and only querying later, they interactively and adaptively build parts of the index, only for the parts of the data on which users actually query.
+- Instead of waiting for extended periods of time for index creation, users can immediately start exploring the data.
+- Their approach gracefully handles large data series collections while drastically reducing the data-to-query delay.
+*** Introduction
+- These guys are working in a big data scenario: lots of data from sensing, networking, data processing and so on.
+- A data series T = (p_1, ..., p_n) is defined as a sequence of points p_i = (v_i, t_i), where each point is associated with a value v_i and a time t_i at which the recording was made.
+- Analysts need to examine the sequence of values, rather than the individual points independently (so perhaps this is like a range query, as solved by both Dostoevsky and the merging thing?).
+- You can't rely on full sequential scans for every single query; this is simply too slow. You have to use indexing. (A brute-force scan sketch follows below for contrast.)
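+- For contrast, a minimal sketch of the brute-force full-scan baseline (my own code, not from the paper; it assumes the whole collection fits in memory as a numpy array):
+#+begin_src python
+# Exact 1-NN similarity search by scanning every series: the baseline
+# that is too slow to run for every exploratory query.
+import numpy as np
+
+def full_scan_1nn(collection: np.ndarray, query: np.ndarray) -> int:
+    """Index of the series closest to `query` under Euclidean distance.
+    collection: (num_series, series_length); query: (series_length,).
+    Cost is O(num_series * series_length) for every single query."""
+    dists = np.linalg.norm(collection - query, axis=1)
+    return int(np.argmin(dists))
+
+# Usage: ten random series of length 256, one random query.
+rng = np.random.default_rng(0)
+data = rng.standard_normal((10, 256))
+print(full_scan_1nn(data, rng.standard_normal(256)))
+#+end_src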
+- The target of indexing techniques is to make query processing efficient enough that analysts can repeatedly run several exploratory queries with quick response times.
+- The time it takes to build a data series index is just too long though.
+- It takes more than a full day to build a state-of-the-art index (iSAX 2.0) over a data set of 1 billion data series.
+  + The main costs come from reading the data to be indexed, spilling the indexed data and structures to disk, and incurring the computation costs of figuring out where each new data entry belongs in the index structure.
+- This becomes infeasible for such large data sets.
+- It boils down to this: analysts shouldn't have to wait several days before being able to query the data.
+- These guys propose adaptive data series indexing to solve the problem: it minimises index creation time, allowing users to query the data soon after its generation, several times faster compared to state-of-the-art indexing approaches (which are super slow).
+- As more queries are posed, the index is refined and subsequent queries enjoy even better execution times.
+- A data series index is a tree-based index tailored to answer similarity search queries over data series collections, thus requiring very different techniques from column-based methods such as database cracking; it must be able to index multiple arrays simultaneously.
+- No special set-up is required for their method regarding critical low-level details such as leaf size and tree depth.
+  + In fact, their method starts off with a rather big leaf size and a shallow tree in order to minimise the init cost for new data, but then, as queries are processed, it adapts, automatically expanding hot subtrees and adjusting leaf sizes in the hot branches of the index to minimise query cost (I assume by hot they mean frequently accessed).
+*** Prelims
+- Similarity searches are of the form "find me the data series in the database which is most similar to X", for a data series X.
+- The common approach is to apply a dimensionality reduction technique and then use the reduced representation for indexing.
+- Their work follows the same high-level principles, but it is the first to introduce an adaptive indexing mechanism for data series in order to assist exploratory similarity search in big data sets.
+- The intuition behind adaptive indexing is that instead of building database indexes up-front, indexes are built during query processing, adapting to the workload. In particular, existing algorithms focus on how to incrementally sort columns in main-memory column-stores (the cracking paper, which is from 2011!). Each index refinement step performed during a single query can be seen as a single step of an incremental quick-sort. As more queries touch a column, that column gets closer to a fully sorted state. The benefit is that adaptive indexing avoids fully sorting columns up-front at a high init cost (a one-step cracking sketch follows below).
+- Contrary to working with arrays as in a column-store, as in the case of cracking, these guys' work is based on tree structures, which are suited to data series indexing, where they index more than one column at a time.
+- One could consider storing a data series as a row in a column-store, with each point being a separate attribute, and then use adaptive indexing; however, you then lose the locality property, as accessing one data series would require accessing several different files.
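+- A minimal sketch of one database-cracking step, as referenced above (my own code, based on the general cracking idea rather than either paper's implementation):
+#+begin_src python
+def crack_in_two(column: list, pivot: float) -> int:
+    """In-place partition: values below `pivot` move to the front, the rest
+    to the back; returns the split position. Each query's pivot performs
+    one quick-sort-like step, so the column converges towards sorted."""
+    lo, hi = 0, len(column) - 1
+    while lo <= hi:
+        if column[lo] < pivot:
+            lo += 1
+        else:
+            column[lo], column[hi] = column[hi], column[lo]
+            hi -= 1
+    return lo  # everything in column[:lo] is < pivot
+
+# Usage: a range query for values below 30 cracks the column once.
+col = [57, 3, 92, 30, 14, 71, 8]
+split = crack_in_two(col, 30)
+print(col[:split])  # the query answer: all values below 30
+#+end_src
+- In a real cracker, the pivot and split position would also be recorded in a separate cracker index (e.g. an AVL tree), so later queries know which pieces are already partitioned.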
+- Sideways cracking (which is what the cracking paper uses) has been proposed in order to handle multiple columns in a column-store, but this is a completely different paradigm, indexing a single relational table across one dimension at a time and essentially relying on replication to align columns.
+- Also, contrary to indexing relational data, where a global ordering can be imposed (as in the cracking paper, where we can have a global ordering within the partitions), i.e. incrementally creating a range index, in these guys' case a global ordering is not possible and they are answering nearest-neighbour queries.
+- Their index introduces several novel techniques for adaptive data series indexing, such as creating only a partial tree structure, deep enough not to penalise the first queries with a lot of splits, and filling it in on demand, as well as adapting leaf sizes on the fly, with varying leaf sizes across the index.
+- The Piecewise Aggregate Approximation (PAA) representation is the idea of segmented means. This representation allows for dimensionality reduction in the time domain, and it is what is used in SAX, or Symbolic Aggregate approXimation. SAX works by partitioning the value space into segments whose sizes follow the normal distribution. Each PAA value can then be represented by a character that corresponds to the segment it falls into.
+- SAX was later extended to iSAX, which was mentioned in the intro. This considers variable cardinality for each character of a SAX representation. An iSAX representation is composed of a set of characters that form a word. Each character in a word is accompanied by a number that denotes its cardinality (the number of bits that describe this character). Thus, 00_2 10_2 01_2 and 00_2 11_2 01_2 each denote exactly one word, but 00_2 1_1 01_2 can be either of them, as the second character 1_1 can represent both 10 and 11. So by starting with cardinality 1 for each character in the root node and then gradually performing splits that increase the cardinality of one character at a time, one can build a tree index.
+- iSAX 2.0 is also based on this property. It also implements fast bulk loading.
+- These guys build adaptive indexing on top of iSAX representations (they say state of the art, so it's likely iSAX 2.0??).
+*** The Adaptive Data Series Index
+- They present ADS, a design which introduces the concept of adaptively and incrementally loading data series into the index. Then they present ADS+, which introduces the concept of adaptive splits and adaptive leaf sizes. They then present PADS+, an aggressive variation of ADS+, which is tailored for even better performance on skewed workloads.
+**** ADS
+- ADS shifts the index construction bottleneck (materialising the leaf nodes) to query time.
+- During the index creation phase, ADS creates a tree which contains only the iSAX representation for each data series; the actual data series remain in the raw files and are only loaded in an adaptive way if a relevant query arrives.
+- iSAX 2.0, by contrast, loads all raw data series into the leaves of the tree a priori, in order to reduce random I/O during query processing.
+- The index creation phase takes place before queries can be processed, but is kept very lightweight.
+- ADS builds a minimal tree during this phase, a tree which does not contain any data series; it contains only iSAX representations. The process starts with a full scan of the raw file to create an iSAX representation for each data series entry (sketched below).
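+- A rough sketch of that per-series conversion, tying together the PAA/iSAX ideas above (my own simplification; the segment count is illustrative, and the breakpoints are the standard-normal quartiles used for a 4-symbol SAX alphabet):
+#+begin_src python
+import numpy as np
+
+def paa(series: np.ndarray, n_segments: int) -> np.ndarray:
+    """Piecewise Aggregate Approximation: the mean of each segment."""
+    return series.reshape(n_segments, -1).mean(axis=1)
+
+# Breakpoints splitting N(0, 1) into 4 equiprobable regions (cardinality 4).
+BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])
+
+def isax_word(series: np.ndarray, n_segments: int = 4) -> list:
+    """Map each PAA mean to a (symbol, cardinality) pair at cardinality 4,
+    i.e. 2 bits per character. Lower cardinalities (as used near the root
+    of the tree) are obtained by dropping trailing bits of each symbol."""
+    symbols = np.searchsorted(BREAKPOINTS, paa(series, n_segments))
+    return [(int(s), 4) for s in symbols]
+
+# Usage: a length-16 series becomes a 4-character iSAX word.
+rng = np.random.default_rng(1)
+print(isax_word(rng.standard_normal(16)))
+#+end_src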
+- For each data series, it also records the series' offset in the raw data file, so future queries can easily retrieve the raw values. To minimise random memory accesses and random I/O, they use a set of buffers in main memory to temporarily hold the data to be added to the index. Essentially, they index all the raw data up-front, but only by its iSAX representation.
+- The actual data series are only needed at query time, to give a correct answer. During index creation, the iSAX representations are sufficient to build the index tree.
+- ADS avoids dealing with the raw data series, other than the single scan over the raw file to create the iSAX representations. Furthermore, it does not move the raw data series through the tree, and it does not place the raw data series in the leaf nodes. The data stays in the raw file.
+**** Querying and Refining ADS
+- In addition to answering a query q, the query process refines the index during the processing steps of q. These extra index refinement steps do not take place after the query is answered; they happen completely on the fly and are necessary in order to answer q.
+- When new queries arrive which do not follow the patterns of previous requests, ADS needs to enrich the index with more information.
+- When a query arrives, it is converted to an iSAX representation. Then the index tree is traversed, searching for a leaf with an iSAX rep similar to that of the query. Whether such a leaf already exists depends not only on the data, but also on past queries.
+- To enrich a partial leaf (one that contains only iSAX reps but no data series; the missing data series are then fetched from the raw file), ADS fetches the partial leaf from disk and reads all the raw-file positions of the data series that belong in this leaf. A partial leaf DOES hold the raw-file positions of its series.
+*** The ADS+ Index (Adaptive Leaf Size)
+- ADS reduces index creation time by avoiding the insertion of raw data series into the index until a relevant query arrives (so it doesn't fetch data for the leaves until required).
+- There is opportunity for significant further optimisation: by studying the operations that get executed during adaptive index building and refinement, they found that the time spent on split operations in the index tree is a major cost component.
+- Splits are expensive, as they cause data transfers to and from disk to update node data. The main parameter that affects split cost is the leaf size: a tree with a big leaf size has fewer nodes overall, causing fewer splits, so a big leaf size reduces index creation time. But big leaves also penalise query costs, and vice versa: when reaching a big leaf during a search, they have to scan more data series than with a small leaf (as you scan the whole leaf, I suppose).
+- The main intuition is that one can quickly build the index tree using a large leaf size, saving time on very expensive split operations, and then rely on queries to force splits that reduce the leaf sizes in hot areas of the index.
+- ADS+ uses two different leaf sizes: a large one for index creation and a small one for query time.
+- When a query arrives that needs to search a partial leaf, ADS+ refines its index structure by recursively splitting the target leaf until the target sub-leaf becomes smaller than or equal to the query-time leaf size (see the sketch below).
+- When it performs these splits, it only materialises the leaf they are actually interested in. Only the iSAX representations are needed to perform the splits, so the other leaves don't have to be materialised.
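+- A heavily simplified sketch of the recursive split just described (my own code; the round-robin choice of which character to refine next is an assumption, not necessarily the paper's split policy):
+#+begin_src python
+QUERY_LEAF_SIZE = 2  # illustrative query-time leaf size
+
+def split_bit(word, char_idx, depth):
+    """The depth-th bit (0 = most significant) of character char_idx,
+    with each character stored at full cardinality 8, i.e. 3 bits."""
+    return (word[char_idx] >> (2 - depth)) & 1
+
+def adaptive_split(leaf, query_word, n_splits=0):
+    """Recursively split a partial leaf of (isax_word, raw_offset) entries
+    along the query's path until it fits the query-time leaf size. Only
+    this target leaf would be materialised; siblings stay partial.
+    Assumes the available bits suffice to shrink the leaf."""
+    if len(leaf) <= QUERY_LEAF_SIZE:
+        return leaf
+    char_idx = n_splits % len(query_word)   # round-robin over characters
+    depth = n_splits // len(query_word)     # next bit of that character
+    side = split_bit(query_word, char_idx, depth)
+    target = [e for e in leaf if split_bit(e[0], char_idx, depth) == side]
+    return adaptive_split(target, query_word, n_splits + 1)
+
+# Usage: four iSAX words (cardinality 8) with raw-file offsets; the query
+# word steers which half gets refined and eventually materialised.
+leaf = [((0b010, 0b110), 0), ((0b011, 0b001), 128),
+        ((0b100, 0b101), 256), ((0b110, 0b000), 384)]
+print(adaptive_split(leaf, query_word=(0b011, 0b000)))
+#+end_src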
+*** Partial ADS+
+- While ADS+ reduces the indexing cost by omitting the raw data from the index creation process, ADS and ADS+ still need to spend time creating the basic index structure, so users still have to wait for the creation of all of those iSAX representations.
+- The main intuition is to gradually build parts of the index tree, and only for small subsets of the data, as queries arrive.
+- PADS+ does not build an index tree up-front at all. There is only a root node with a set of buffers that contain the iSAX representations.
+  + So PADS+ still creates the iSAX representations, it just doesn't build the tree.
+- It then creates the tree as it receives queries, by looking through the buffers to find matching iSAX representations.
+*** Updates
+- Handling inserts simply requires appending the new data series to the raw file, while only its iSAX rep and its position in the raw file are pushed through the index tree (a sketch of this insert path follows the conclusion).
+*** Conclusion
+- Their new adaptive indexing approach copes significantly better with ever-growing data series collections and can answer several thousand queries in the time that state-of-the-art indexing approaches are still in the indexing phase.
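+- As promised above, a minimal sketch of the append-only insert path from the Updates section (my own code, not the paper's; the crude 2-segment sign-bit word stands in for a real iSAX representation):
+#+begin_src python
+import io
+import numpy as np
+
+def tiny_isax(series: np.ndarray) -> tuple:
+    """Cardinality-2 word over 2 segments: 1 iff the segment mean is >= 0."""
+    return tuple(int(m >= 0) for m in series.reshape(2, -1).mean(axis=1))
+
+def insert_series(raw_file, root_buffers: dict, series: np.ndarray) -> None:
+    """Append the raw values to the raw file; push only the iSAX word and
+    the raw-file offset into the index-side buffers."""
+    offset = raw_file.tell()
+    raw_file.write(series.astype(np.float64).tobytes())
+    root_buffers.setdefault(tiny_isax(series), []).append(offset)
+
+# Usage: two inserts land in per-word buffers; the raw data stays in the file.
+buffers, f = {}, io.BytesIO()
+insert_series(f, buffers, np.array([1.0, 2.0, -3.0, -4.0]))
+insert_series(f, buffers, np.array([-1.0, -2.0, 3.0, 4.0]))
+print(buffers)  # {(1, 0): [0], (0, 1): [32]}
+#+end_src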