Fifth article I want to prepare

Alexander Munch-Hansen 2020-01-12 17:02:00 +01:00
- The second query's key range "f - m" partly overlaps with the first query's key range, so the previous final partition holding keys from "d - i" is cracked on "f" to isolate the overlapping range "f - i". Then all initial partitions are cracked on "m" to isolate keys from "j - m" (note that they naturally no longer hold any keys prior to "i", and "j" follows "i"), and these are moved into a new value partition. The result of the second query is then available and can be merged into the final partitions.
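The crack step described above can be sketched in a few lines. This is a minimal illustration with assumed names (not the paper's implementation): a single pass splits a partition around a pivot key, as when the final partition "d - i" is cracked on "f".

```python
def crack(values, pivot):
    """Split values into (keys below pivot, keys at or above pivot) in one pass."""
    lower, upper = [], []
    for v in values:
        # cracking only partitions; it imposes no order inside the two pieces
        (lower if v < pivot else upper).append(v)
    return lower, upper

final_partition = list("defghi")      # keys "d" - "i" isolated by the first query
below_f, f_to_i = crack(final_partition, "f")
# f_to_i isolates the overlapping range "f - i" for the second query
```

Each crack is a single linear pass, which is why cracking stays close to the cost of a scan while still refining the index a little on every query.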
- An example of HRR:
- Data is loaded into four initial partitions and radix-clustered on the k = 1 most significant bits of the key codes given in the appendix. The clusters that hold the requested key-range boundaries (since the four initial partitions have been split into "subclusters") are cracked on the range query's bounds (again "d" and "i"), creating many small clusters. The relevant ones are then merged together to form the final partition. The newly formed partitions are again radix-clustered on the k = 1 most significant bits using another code table given in the appendix, preparing for future queries. Future queries proceed the same way: crack when needed, merge, then radix-cluster.
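Radix clustering on the k most significant bits can be sketched as follows. The integer codes and the 4-bit code width are assumptions for the example, not values from the paper; with k = 1 each partition splits into two subclusters.

```python
def radix_cluster(keys, code, k, bits):
    """Group keys into 2**k clusters by the k most significant bits of their code."""
    clusters = {c: [] for c in range(2 ** k)}
    for key in keys:
        # shift away all but the k most significant bits to pick a cluster
        clusters[code[key] >> (bits - k)].append(key)
    return clusters

# toy 4-bit codes: letters "a".."p" mapped to 0..15 (an illustrative assumption)
code = {ch: i for i, ch in enumerate("abcdefghijklmnop")}
clusters = radix_cluster(list("adikpb"), code, k=1, bits=4)
# cluster 0 holds keys with top bit 0 ("a" - "h"), cluster 1 the rest ("i" - "p")
```

Like cracking, one clustering pass is linear, but it refines every key it touches into a cluster rather than only isolating the queried bounds, making it the more eager of the two.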
- Compared to HCC and HRR, the variations HCR and HRC swap the treatment of final partitions during the merge step, i.e. HCR uses cracking for initial partitions but radix clustering for final partitions, while HRC uses radix clustering for initial partitions but cracking for final partitions.
- HCS and HRS invest in sorting each final partition on creation, just as the original adaptive merging (HSS).
- An adaptive indexing technique is characterised by how lightweight its adaption is (the cost of the first few queries after a workload change) and by how fast, in terms of both time and queries needed, it converges to the performance of a perfect index.
- Several parameters may affect performance, e.g. query selectivity, data skew, updates, concurrent queries, disk-based processing etc.
- A hybrid variation using sorting will have the edge in an environment with concurrent queries or with limited memory, as fewer queries will require physical re-organisation.
- Such a hybrid will suffer in an environment with many updates.
*** Experiments
- They implemented everything in MonetDB (an open-source column-store DBMS).
- They use only simple queries with basic selections and range predicates.
- Original cracking is initially a bit slower than scanning the entire column (though much cheaper than fully sorting it). It quickly becomes faster though. It also has a very "smooth" adaption, but never reaches the level of full indexing and adaptive merging (although it looks like it might get there).
- Original adaptive merging is much slower than cracking to begin with, beating only the full-indexing method. It also takes quite a while to adapt, as the first 6 queries are slower than a scan. After this, however, it starts beating everything except full indexing, so it converges quickly once it gets past the starting issues.
- Each hybrid version occupies a different spot between adaptive merging and cracking, which can be seen as the two extremes of adaptive indexing.
**** The cracking ones
- The HCC variation improves heavily over plain cracking by reducing the cost of the first query to the level of a scan, so the overhead of adaptive indexing disappears. In contrast to original cracking, the hybrid operates on batches of the column at a time, and it uses the new cracking algorithm that creates and cracks initial partitions in one go. HCC maintains the smooth behaviour of cracking, but it does not achieve the fast convergence of adaptive merging.
- HCS uses sorting to speed up adaption, and this can be seen to work, as it mimics adaptive merging and actually reaches the best case quickly. Compared to adaptive merging, it has a significantly lower initialisation cost. HCS is however slightly slower than HCC for the first query, and it is still slower than a scan for the first 10 queries, while HCC is never slower than a scan. This is likely due to the investment in sorting the final partitions, which is also why it adapts to the best case quicker.
- HCR achieves a nice balance between HCC and HCS: the hybrid invests in clustering rather than sorting the final partitions, so it beats the scan sooner, but seems to converge to something short of the best case because of this. So although it does not match adaptive merging's best-case speed, its performance is several orders of magnitude faster than original cracking and also faster than a scan. It also converges very quickly. All of this at zero cost, since no overhead is imposed for the first part of the workload sequence. Clustering partitions is more eager than cracking but lazier than full sorting, hence the balance.
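The eagerness spectrum for refining a freshly merged final partition can be sketched as two extremes (assumed names, not the paper's implementation): HCC-style cracking isolates only the query bounds, while HCS-style sorting pays the full ordering cost up front; HCR's radix clustering sits between the two.

```python
def refine_crack(values, low, high):
    # HCC-style laziness: one pass, three pieces, no order inside a piece
    return ([v for v in values if v < low],
            [v for v in values if low <= v <= high],
            [v for v in values if v > high])

def refine_sort(values):
    # HCS-style eagerness: pay the full sorting cost on creation,
    # so every later lookup in this partition is a binary search
    return sorted(values)

merged = list("ihgfed")                    # a freshly merged final partition
pieces = refine_crack(merged, "f", "h")    # only the query bounds are isolated
ordered = refine_sort(merged)              # fully ordered for all future queries
```

The trade-off mirrors the experiments: the lazy policy keeps the first queries near scan cost, while the eager one converges to best-case lookups sooner.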
**** The radix ones
- All use radix clustering for the initial partitions as opposed to cracking
- So all hybrid variations become more eager during the first query, but this also means that they are all slightly more expensive than the HC* hybrids.
**** Selectivity
- I assume they mean how large the ranges are?
- With smaller selectivity, it takes more queries to reach optimal performance, likely because the chances of requiring merging actions are higher, since less data is merged with any given query.
- With smaller selectivity, the difference in convergence between the lazy HCC and the more eager HCS is less significant. The lazy algorithms maintain their lightweight initialisation advantage. Original cracking and adaptive merging show similar behaviour: cracking resembles HCC and adaptive merging resembles HCS.
- One could artificially force more active merging steps, which would increase convergence speed (i.e. decrease the number of queries needed before convergence).
**** Summary
- The HCS gets very close to the ideal hybrid.
- HCR has the lightweight footprint of a scan query and it can still reach optimal performance quickly.
- HCS is only two times slower than a scan for the first query, but reaches optimal performance very quickly.
- HCR provides a smooth adaption, never being slower than a scan.
- HCS and HCR are both valid choices: HCR is to be used for the most lightweight adaption, while HCS is to be used when we want the fastest adaption.
*** Conclusion
- Their initial experiments yielded an insight about adaptive merging: data moves out of initial partitions and into final partitions. Thus, the more times an initial partition has already been searched, the less likely it is to be searched again. A final partition is searched by every query, either because it contains the result or because results are moved into it. So effort spent on refining an initial partition is less likely to pay off than effort invested in final partitions.
- Their hybrid algos exploit this distinction and apply different refinement strategies to initial versus final partitions.
- This enables an entirely new approach to physical database design.
- The init database contains no indexes (or only indexes for primary keys and uniqueness constraints). Query processing initially relies on large scans, yet all scans contribute to index optimisation in the key ranges of actual interest.