Let's cover two more pros:
3. More efficient page placement
4. Faster random lookups for large indexes
- Can be fragmented, i.e. its pages aren't always written in sequential order. So sequential page read isn't always sequential - i.e. there can be random seeks on HDD with ease.
- Has less then 100% fill ratio: usually it varies from 75 to 85%.
- Has fixed record size, and thus - pages with fixed size. Usually - 4 to 16 Kb. This makes its compression problematic; moreover, usage of such small pages imply there will be >3 layers in the B+tree. Later I'll explain why this isn't good.
Our index slice:
- Isn't fragmented. They're always written in sequential order - from the leftmost page to the rightmost page. This means sequential index reading is always fast; although usually we should read several slices.
- Always has 100% fill ratio: when B+trees are serialized to disk or merged, a special page builder is used. Basically, we can build a B+tree-like (I'll explain later, why "like") structure of arbitrary size in stream with almost zero RAM usage by an ordered enumerable providing its content (acts as the data source - such ordered enumerables can be built e.g. for slice merge operations). "Like" - because all the pages of this structure except the ones on the right "edge" of the tree will be 100% filled; pages on the right "edge" will be 0...100% filled (unpredictable) - that's actually a violation of B+tree definition. Although from the point of its performance it is even better, since the numer of such pages is 3 in our worst case.
- May have arbitrary page size. We're targeting on ~ 16..32Kb compressed page size, which in turn mean uncompressed page will be ~ 64...128Kb at least, and will have ~ 1000...2000 nodes.
Let's discuss why we're going to use bigger pages.
- Random read from HDD is quite slow. 100...200 read operations \ sec.
- There should be almost no difference between random reads of 4Kb blocks and 64Kb blocks, if the total amount of blocks is huge (i.e. big index): cache doesn't help; any random read means seek (~ 5...10ms) + read. HDD rotation speed is 7500...15000 RPM now, so one rotation takes at least 4ms. So to get some random byte on track read after positioning the head to it, we need ~ 2ms in average (rotate ~ a half of track). But how many bytes are on track? ~ 500Kb. So e.g. 32Kb of data is just 6% of track, although in average we rotate 50% of track. Ok, I want to say it doesn't matter much if we read 4Kb blocks or 32Kb blocks at random - the difference in [reads count per second] is just 10%.
- In case with sequential read the size of page doesn't matter much at all.
- If the index is large, most of page fetch requests will be requests to almost random pages. Since on-disk indexes are read-only, subsequent access to the same pages will almost always hit the cache.
- Page decompression & deserialization time doesn't matter much, since we anyway read ~ 100-200 pages per second in case with mostly random access. Ok, we fetch 1000+ pages per second (taking into account there are some sequential readers) - it anyway seems 1ms is more then enough to decompress and deserialize 16Kb of data.
- Compression seems simply "must be", since normally the data in a single page is quite similar - all the keys there are close to each other by their values. So we expect ~ 4x compression factor in average. Google reports they have compression factor of ~ 10 for their indexes.
- We shouldn't care about write behavior much, since as I've mentioned, any write operation implies writing of big file (i.e. index slice) sequentially.
- Having bigger pages (16...32Kb) seems cheap (just ~ 10% of performance on read)
- With such pages we can reduce the amount of seeks per random index lookup to just one. How? If we maintain 2K nodes in a single page, 3-level B+tree may store up to 8G items. Ok, 40 bytes per item => 320Gb index size. Enough to fill a single HDD. But we can easily keep (1+2K) pages (root + all level 2 pages) in memory cache. So in this case we really need just one page read at max to access any index item.
Imagine the same for regular index with 4Kb pages: there are just 100 items on each page; we can cache just 3 above levels of such index in RAM; but we need 5-level tree to store ~ the same 10G items; thus accessing a random item requires 2 page fetches from HDD, or approximately twice more time.
[To be continued...]
Friday, April 04, 2008