diff --git a/notes.org b/notes.org
index 4d51fd3..0ccff73 100644
--- a/notes.org
+++ b/notes.org
@@ -862,3 +862,71 @@
 - Their hybrid algos exploit this distinction and apply different refinement strategies to initial versus final partitions.
 - This enables an entirely new approach to physical database design.
 - The init database contains no indexes (or only indexes for primary keys and uniqueness constraints). Query processing initially relies on large scans, yet all scans contribute to index optimisation in the key ranges of actual interest.
+** Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
+- All mainstream LSM-tree (Log Structured Merge Tree) based key-value stores, in the literature and in industry, trade suboptimally between the I/O cost of updates on one hand and the I/O cost of lookups and storage space on the other. This is because they perform equally expensive merge operations across all levels of the LSM-tree to bound the number of runs that a lookup has to probe and to remove obsolete entries to reclaim storage space.
+- With state-of-the-art designs, however, merge operations at all levels of the LSM-tree but the largest (i.e. most merge operations) reduce point lookup cost, long range lookup cost, and storage space by a negligible amount, while adding significantly to the amortised cost of updates.
+- To address this suboptimal trade-off, they propose a new design, lazy leveling, that removes merge operations from all levels of the LSM-tree but the largest.
+- Lazy leveling improves the worst-case complexity of update cost while maintaining the same bounds on point lookup cost, long range lookup cost and storage space.
+- They introduce fluid LSM-trees: a generalisation of the entire LSM-tree design space that can be parameterised to assume any existing design.
+- Relative to lazy leveling, a fluid LSM-tree can optimise more for updates by merging less at the largest level, or more for short range lookups by merging more at all other levels.
+*** Introduction
+- A key-value store is a database that maps search keys to their corresponding values.
+- To persist key-value entries in storage, most key-value stores today use LSM-trees.
+**** LSM-Trees
+- LSM-trees buffer inserted/updated entries in main memory and flush the buffer to secondary storage as a sorted run (a run is an immutable sorted sequence of key-value entries) every time it fills up. The main trick of LSM-trees is exploiting how secondary storage works, i.e. that sequential reads and writes are much faster than random ones.
+- LSM-trees later sort-merge these runs to bound the number of runs that a lookup has to probe and to remove obsolete entries, i.e. entries for which there exists a more recent entry with the same key.
+- LSM-trees organise runs into levels of exponentially increasing capacities, whereby larger levels contain older runs.
+- As entries are updated, a point lookup finds the most recent version of an entry by first checking the in-memory buffer and then probing the levels from smallest to largest, thus encountering the newest version first and terminating once it is found.
+- A range lookup is more tricky: it has to access the relevant key range across all runs at all levels and eliminate obsolete entries from the result set.
+- To speed up lookups on individual runs, modern designs maintain fence pointers for every run: a set of pointers in main memory that contain the first key of every block of the run, which allows a lookup to access a particular key within a run with just one I/O.
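+- A minimal sketch (mine, not the paper's; =find_block= is a made-up helper name) of the fence-pointer idea: binary search over the in-memory first keys to pick the single block worth one I/O.
+#+BEGIN_SRC python
+import bisect
+
+def find_block(fence_keys, key):
+    """fence_keys[i] is the first key of block i of a sorted run.
+    Return the index of the only block that could contain `key`,
+    or None if `key` precedes the whole run."""
+    i = bisect.bisect_right(fence_keys, key) - 1
+    return i if i >= 0 else None
+
+# A run of four blocks; a lookup for key 23 reads only block 1.
+assert find_block([10, 20, 30, 40], 23) == 1
+assert find_block([10, 20, 30, 40], 5) is None
+#+END_SRC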
+- Furthermore, for every run there exists a bloom filter in main memory, which allows point lookups to skip runs that do not contain the target key.
+- The problem: the frequency of merge operations in LSM-trees controls an intrinsic trade-off between the I/O cost of updates on one hand and the I/O cost of lookups and storage space amplification (caused by the presence of obsolete entries) on the other. Existing designs trade suboptimally among these metrics.
+- By analysing the design space of state-of-the-art LSM-trees, they pinpoint the problem to the fact that the worst-case update cost, point lookup cost, range lookup cost, and space-amplification derive differently from the different levels:
+  + Updates derive their I/O cost equally from merge operations across all levels
+  + Point lookup I/Os mostly target the largest level (with optimally allocated bloom filters, smaller levels get lower false-positive rates, so most wasted I/Os come from the largest level)
+  + The majority of I/Os caused by long range lookups target the largest level, because level capacities increase exponentially
+  + Short range lookups derive their I/O cost equally from across all levels
+  + The highest fraction of obsolete entries is in the largest level, since the smaller levels hold the newest entries, i.e. the updates that obsolete older versions below
+- So the worst-case point lookup cost, long range lookup cost and space-amplification derive mostly from the largest level, while merge operations at all levels of the LSM-tree but the largest (i.e. most merge operations) hardly improve these metrics yet add significantly to the amortised cost of updates. This leads to suboptimal trade-offs.
+**** Solution
+- They expand the LSM-tree design space with lazy leveling, a new design that removes merging from all but the largest level of LSM-trees (which is exactly where merging pays off, per the analysis above).
+- They introduce the fluid LSM-tree as a generalisation of the LSM-tree that enables transitioning fluidly across the whole LSM-tree design space, i.e. moving between designs by re-tuning two parameters. It controls the frequency of merge operations separately for the largest level and for all other levels, so it can optimise more for updates by merging less at the largest level, or more for short range lookups by merging more at all other levels.
+- Everything is put together in Dostoevsky: a space-time optimised, evolvable, scalable key-value store. Dostoevsky analytically finds the tuning of the fluid LSM-tree that maximises throughput for a particular application workload and hardware, subject to a user constraint on space-amplification. It prunes the search space to quickly find the best tuning and physically adapts to it during runtime, so it can shift towards faster lookups or faster updates depending on the application. (On terminology: sort-merging the runs does mean the merge step of merge sort, applied to already-sorted runs; see the sketch below.)
+- They show that state-of-the-art LSM-trees all perform equally expensive merge operations across all levels, yet merge operations at all but the largest level improve point lookup cost, long range lookup cost and space-amplification by a negligible amount while adding significantly to the amortised cost of updates. Running these merges on any level but the largest is therefore wasteful: little benefit at high cost. Lazy leveling removes exactly this waste, improving update cost without hurting the other metrics.
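+- A sketch (my own, not the paper's code) of what sort-merging runs means: a k-way merge of already-sorted runs where, on a key tie, the version from the newer run survives.
+#+BEGIN_SRC python
+import heapq
+
+def sort_merge(runs):
+    """runs[0] is the newest run, runs[-1] the oldest; each run is a
+    sorted list of (key, value) pairs with unique keys.  heapq.merge
+    performs the merge step of merge sort over the sorted runs; ties on
+    key are broken towards the newer run (lower age), so keeping only
+    the first occurrence of each key discards obsolete versions."""
+    tagged = ([(key, age, value) for key, value in run]
+              for age, run in enumerate(runs))
+    merged, last_key = [], object()
+    for key, _, value in heapq.merge(*tagged):
+        if key != last_key:   # first occurrence = newest version
+            merged.append((key, value))
+            last_key = key
+    return merged
+
+new = [(1, "a2"), (3, "c1")]
+old = [(1, "a1"), (2, "b1")]
+assert sort_merge([new, old]) == [(1, "a2"), (2, "b1"), (3, "c1")]
+#+END_SRC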
+*** More LSM-tree background
+- LSM-trees are optimised for writing.
+- An LSM-tree initially buffers all updates, insertions and deletes in main memory. When this buffer fills up, the LSM-tree flushes it to secondary storage as a sorted run. LSM-trees sort-merge runs in order to bound the number of runs that a lookup has to access in secondary storage and to remove obsolete entries to reclaim space. So the deeper the level, the bigger it is as well, since runs are merged together and move to larger levels as they grow.
+- Runs are conceptually organised into L levels of exponentially increasing sizes. Level 0 is the buffer in main memory; runs belonging to all other levels are in secondary storage.
+- The balance between the I/O cost of merging and the cost of lookups and space-amplification can be tuned in two ways. First, there is the size ratio T between the capacities of adjacent levels; T controls the number of levels of the LSM-tree and thus the overall number of times that an entry gets merged across levels. Second, there is the merge policy, which controls the number of times an entry gets merged within a level. This is either tiering or leveling:
+  + In tiering, runs are merged within a level only when the level reaches capacity (at which point they get merged into a single run that moves to the next level).
+  + In leveling, runs are merged within a level whenever a new run comes in.
+- In both cases, a merge is triggered by the buffer flushing and causing level 1 to reach capacity. With tiering, all runs at level 1 get merged and the result is placed at level 2. With leveling, the merge also includes the preexisting run at level 2 (since a new run is coming in to level 2).
+*** General background stuff
+- Updates are performed out-of-place: the original entry is immutable, so an update simply creates a new entry, and multiple versions of an entry with the same key may exist across levels. LSM-trees resolve this as follows: if an entry is inserted into the buffer and the buffer already contains an entry with the same key, the new one replaces the old. Likewise, when two runs that contain an entry with the same key are merged, only the entry from the newer run is kept, as it is more recent. To keep this consistent, a run can only be merged with the next-older or next-younger run.
+- A point lookup just probes the levels from smallest to largest and stops at the first match.
+- Range lookups have to find the most recent versions of all entries within the target key range. This is done by sort-merging the relevant key range across all runs at all levels; while merging, entries with the same key across different runs are identified and the older versions discarded.
+- Deletes are supported by a one-bit tombstone flag. When a deleted entry reaches the last level through merging, it is removed.
+- Fence pointers: all major LSM-tree based key-value stores index the first key of every block of every run in main memory. These are called fence pointers, and they take O(N/B) space (N = total number of entries; B = number of entries that fit into a storage block). This allows a lookup to find the relevant key range in every run with one I/O.
+- Bloom filters are used to speed up point lookups. Each run has a bloom filter in main memory. A bloom filter can return a false positive (claiming a key might be present when it is not) but never a false negative. A point lookup probes a run's bloom filter before accessing the run in storage. If the filter returns a true positive, the lookup accesses the run with one I/O using the fence pointers, finds the matching entry and terminates. If the filter returns a negative, the lookup skips the run, saving one I/O. If it returns a false positive, the lookup wastes one I/O by accessing the run, finding no matching entry, and must continue searching for the target key in the next run.
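+- The full point-lookup path as a self-contained sketch (mine; sets stand in for bloom filters and dicts for on-disk runs, so there are no false positives here):
+#+BEGIN_SRC python
+def point_lookup(levels, key):
+    """levels[0] is the smallest level; each level is a list of
+    (bloom, run) pairs ordered newest-first."""
+    ios = 0
+    for level in levels:              # probe smallest to largest
+        for bloom, run in level:
+            if key not in bloom:      # filter negative: skip run, no I/O
+                continue
+            ios += 1                  # filter positive: one I/O via fence pointers
+            if key in run:            # newest version found: terminate
+                return run[key], ios
+            # a false positive would land here: wasted I/O, keep searching
+    return None, ios
+
+levels = [[({"k1"}, {"k1": "v1-new"})],                    # level 1
+          [({"k1", "k2"}, {"k1": "v1-old", "k2": "v2"})]]  # level 2 (older)
+assert point_lookup(levels, "k1") == ("v1-new", 1)  # newest version, one I/O
+assert point_lookup(levels, "k3") == (None, 0)      # filters skip both runs
+#+END_SRC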
+*** Design Space and Problem Analysis
+- Entry updates (inserts) are paid for through the subsequent merge operations that the updated entry participates in. Assume the now-obsolete entry does not get removed until the updated version reaches the last level. With tiering, an entry is merged O(1) times per level across all O(L) levels, and since each merge I/O moves B entries, the amortised cost is O(L/B) I/Os per update. With leveling it is slightly more tricky: an entry gets merged on average T/2, i.e. O(T), times per level, because the level's single run is re-merged every time a new run arrives rather than only when the level fills up; with O(L) levels this gives O(L*T) merges, again divided by the block size, for O((L*T)/B) I/Os per update.
+- For point lookups they analyse the worst case: the lookup does not find the entry, so every run has to be considered, and all bloom filters return false positives. In this case leveling issues O(L) wasted I/Os (it keeps each level merged into a single run) and tiering issues O(T*L) (up to T-1 runs per level). In practice the cost is far lower, since the bloom filters only rarely return false positives.
+- The analysis of space-amplification boils down to the worst case: with leveling this is when all entries at levels 1 through L-1 are updates to entries at level L, giving O(1/T); with tiering the largest level can hold up to T runs full of versions of the same entries, giving O(T).
+- When the size ratio T is 2, leveling and tiering behave the same. As T increases, leveling's lookup cost and space-amplification decrease while its update cost increases; for tiering it is the opposite. So the trade-off space is partitioned: leveling has strictly better lookup cost and space-amplification and strictly worse update cost than tiering.
+*** Lazy leveling
+- A merge policy that eliminates merging at all but the largest level of the LSM-tree.
+- Relative to leveling, lazy leveling improves the cost complexity of updates, maintains the same complexity for point lookups, long range lookups and space-amplification, and provides competitive complexity for the rest.
+- Lazy leveling at its core is a hybrid of leveling and tiering: it applies leveling at the largest level and tiering at all other levels. As a result, the number of runs at the largest level is 1 and the number of runs at each other level is at most T-1, since a merge takes place when the T'th run arrives.
+- They re-tune the bloom filters slightly (reallocating the main-memory budget across levels) to accommodate having more runs at the smaller levels.
+*** Fluid LSM-Tree
+- It controls the frequency of merge operations separately for the largest level and for all other levels.
+- There are at most Z runs at the largest level and at most K runs at each of the smaller levels.
+- K = T-1 and Z = 1 gives lazy leveling; K = 1 and Z = 1 gives leveling; K = T-1 and Z = T-1 gives tiering. See the sketch below.
+- So it can transition back and forth between all of these designs by re-tuning K and Z.
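+- A tiny sketch (mine) of how the two knobs K and Z instantiate the named designs:
+#+BEGIN_SRC python
+def policy(T, K, Z):
+    """Classify a fluid LSM-tree tuning.  T = size ratio between levels,
+    K = max runs per smaller level, Z = max runs at the largest level."""
+    if K == 1 and Z == 1:
+        return "leveling"
+    if K == T - 1 and Z == 1:
+        return "lazy leveling"
+    if K == T - 1 and Z == T - 1:
+        return "tiering"
+    return "hybrid (somewhere in between)"
+
+assert policy(T=10, K=1, Z=1) == "leveling"
+assert policy(T=10, K=9, Z=1) == "lazy leveling"
+assert policy(T=10, K=9, Z=9) == "tiering"
+#+END_SRC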
+*** Dostoevsky
+- Models and optimises throughput with respect to update cost W, zero-result point lookup cost R, non-zero-result point lookup cost V and range lookup cost Q. The proportions of these operations in the workload are monitored and their costs are weighted using coefficients w, r, v and q. The weighted cost (w*W + r*R + v*V + q*Q) is multiplied by the time to read a block from storage, and its inverse gives the weighted worst-case throughput.
+- Dostoevsky maximises this throughput expression by iterating over different values of the parameters T, K and Z. It prunes the search space using two insights:
+  1) The LSM-tree has at most L_{max} levels, each of which has a corresponding size ratio T, so there are only L_{max} meaningful values of T to test.
+  2) The lookup costs R, Q and V increase monotonically with respect to K and Z, whereas the update cost W decreases monotonically with respect to them. As a result, the optimisation equation is convex with respect to K and Z, so their value spaces can be divided and conquered, converging to the optimum with logarithmic runtime complexity.
+*** Conclusion
+- Dostoevsky dominates existing LSM-tree based designs: since it generalises them all, its tuned configuration is at least as good for any of the evaluated workloads.
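+- A toy sketch (mine) of the pruned tuning search described under *** Dostoevsky above. The constants B, N and the weights w, r, q are made-up numbers (V is omitted for brevity), and the cost formulas are stand-ins that only mimic the monotonicity the search relies on, not the paper's model; the search structure is the point: one size ratio per possible level count (insight 1), then convex one-dimensional searches over K and Z (insight 2, done one coordinate at a time here rather than the paper's two-dimensional divide and conquer).
+#+BEGIN_SRC python
+import math
+
+B = 1024                   # entries per storage block (assumed)
+N = 2**30                  # total entries (assumed)
+w, r, q = 0.5, 0.3, 0.2    # workload weights for W, R, Q (assumed)
+
+def weighted_cost(T, K, Z):
+    """Stand-in costs: lookups grow with K and Z, updates shrink."""
+    L = max(1, math.ceil(math.log(N / B, T)))    # levels implied by T
+    W = (L / B) * (T / (K + 1) + T / (Z + 1))    # update cost
+    R = (L - 1) * K + Z                          # zero-result point lookups
+    Q = (L - 1) * K + Z + 1                      # range lookups
+    return w * W + r * R + q * Q
+
+def argmin_convex(f, lo, hi):
+    """Ternary search for the integer minimiser of a convex f on [lo, hi]."""
+    while hi - lo > 2:
+        m1, m2 = lo + (hi - lo) // 3, hi - (hi - lo) // 3
+        if f(m1) < f(m2):
+            hi = m2 - 1
+        else:
+            lo = m1 + 1
+    return min(range(lo, hi + 1), key=f)
+
+def tune():
+    best = None
+    L_max = math.ceil(math.log2(N / B))          # T = 2 maximises level count
+    for L in range(1, L_max + 1):                # one meaningful T per level count
+        T = max(2, round((N / B) ** (1 / L)))    # size ratio giving ~L levels
+        K = argmin_convex(lambda k: weighted_cost(T, k, 1), 1, T - 1)
+        Z = argmin_convex(lambda z: weighted_cost(T, K, z), 1, T - 1)
+        cand = (weighted_cost(T, K, Z), T, K, Z)
+        best = cand if best is None else min(best, cand)
+    return best                                  # (cost, T, K, Z)
+
+print(tune())   # best tuning under this toy model
+#+END_SRC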