Monday, March 31, 2008

Xtensive.Indexing architecture: pros and cons

Cons:

1. Too many slices.

This is the most annoying con. Imagine how an index can work if it is a kind of "diff tower" with up to 100 floors (slices). Let's think about the operations that must run fast, and about how many slices we'll really have.

Let's count the slices first:

A new slice is added to TransactionalIndex on each successful transaction commit, so initially we have ~1 slice per transaction. But there are background slice merge processes, which merge a slice with the subsequent one if their sizes are nearly equal. Basically, they prefer to merge such slices first - simply because their "merge score" is higher. If there are no slices of similar size, they can try to merge "badly matching" ones, as long as their merge score isn't too small.
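
A minimal sketch of such a selection rule (a hypothetical helper, not the actual Xtensive.Indexing code): adjacent slices get a "merge score" close to 1.0 when their sizes are nearly equal, and the merger picks the best-scoring pair, skipping the merge entirely if even that score is too small.

    using System;
    using System.Collections.Generic;

    public static class MergeSelectionSketch
    {
      // Returns the index of the best adjacent pair to merge, or -1 if no pair
      // scores above minScore. sliceSizes is ordered from the newest (smallest)
      // slice to the oldest (biggest) one.
      public static int FindBestMergePair(IList<long> sliceSizes, double minScore)
      {
        int bestIndex = -1;
        double bestScore = minScore;
        for (int i = 0; i + 1 < sliceSizes.Count; i++) {
          long a = sliceSizes[i], b = sliceSizes[i + 1];
          double score = (double) Math.Min(a, b) / Math.Max(a, b); // 1.0 for equal sizes
          if (score > bestScore) {
            bestScore = score;
            bestIndex = i;
          }
        }
        return bestIndex; // the merger replaces slices [bestIndex, bestIndex + 1] with a single one
      }
    }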

Ideally, we merge 2 equally sized slices and get 1 slice of double size. Ok, let's assume:
- The biggest slice we have (the lowest one) is 256Gb.
- The smallest slice is 16Kb (i.e. ~ the size of a single compressed index page - we'll talk about this later).
We can estimate the number of slices by doubling the size of the smallest slice until we reach the biggest one: 16K*2^24 = 256G. Just 24 slices in total... From now on let's assume we're dealing with 4 times more slices (i.e. ~100) - just to switch from the "ideal" case to the "average or worst" one.

As I've mentioned, these slices can be located either on disk or simply stored in memory. Let's count how many on-disk slices we'll have - this is important, since slices in memory are quite fast, but slices on disk are quite slow from the point of view of random access (~100-200 pages/sec.; the estimate is based on the average random seek time of existing hard disks). Let's assume we store any slice with size < 256Mb in memory - in this case there are just 10 on-disk slices (2^10 * 256Mb = 256Gb).
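
Just to make this arithmetic explicit, a tiny sketch of the estimate (numbers only, not real code from the engine):

    using System;

    public static class SliceCountEstimate
    {
      public static void Main()
      {
        long smallest = 16L * 1024;                 // 16Kb - the smallest slice
        long biggest = 256L * 1024 * 1024 * 1024;   // 256Gb - the biggest slice
        long onDiskFrom = 256L * 1024 * 1024;       // slices below 256Mb stay in memory
        int total = 0, onDisk = 0;
        for (long size = smallest; size < biggest; size *= 2) {
          total++;                                  // one "floor" per doubling
          if (size >= onDiskFrom)
            onDisk++;
        }
        Console.WriteLine($"{total} slices in the ideal case, {onDisk} of them on disk");
        // Prints: 24 slices in the ideal case, 10 of them on disk
      }
    }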

Now - the most frequent operations:

a) Index lookup (seek). Much like fetching a value by its key from a dictionary. This is probably the most frequent operation, since almost any join of a relatively big table implies one such lookup per joined row. The complexity is logarithmic per slice, so the total complexity is ~ CountOfSlices*log(AverageKeyCountPerSlice). The worst thing here is that the lowest (biggest) slices can't be efficiently cached in memory, so there will be ~ one disk seek per on-disk slice. Too much.

How can this be optimized? Ok, let's imagine there is a way to answer the following question immediately (i.e. without any IO and in constant time): does index slice S contain any change related to key K at all? If this is possible, we can run this test for each slice (~ CountOfSlices complexity, so almost constant time) and choose the topmost one for which the result is positive - this is the only slice we should do a lookup in (it contains the latest change made to the key K). So the total complexity in this case is ~ CountOfSlices + log(AverageKeyCountPerSlice), with ~ just one IO operation (reading a slice page from disk) - that's much better.
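
A minimal sketch of this lookup path (hypothetical types, not the actual Xtensive.Indexing API; deletion markers are ignored for brevity): walk the slices from the newest to the oldest, test each slice's in-memory filter, and seek only in the slices that answer positively - normally just the first one.

    using System.Collections.Generic;

    public interface IIndexSlice<TKey, TItem>
    {
      bool MayContain(TKey key);               // filter test: no IO, ~constant time
      bool TrySeek(TKey key, out TItem item);  // real lookup: ~1 IO for an on-disk slice
    }

    public static class SliceLookupSketch
    {
      public static bool TrySeek<TKey, TItem>(
        IReadOnlyList<IIndexSlice<TKey, TItem>> slicesNewestFirst, TKey key, out TItem item)
      {
        foreach (var slice in slicesNewestFirst) {  // ~CountOfSlices cheap filter tests
          if (!slice.MayContain(key))
            continue;                               // this slice definitely didn't touch the key
          if (slice.TrySeek(key, out item))         // first real hit is the latest change to the key
            return true;
          // a filter false positive: the slice didn't actually contain the key, keep looking
        }
        item = default(TItem);
        return false;
      }
    }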

How can we achieve this? Using Bloom filters. If you're interested in the details, please read an article about them, e.g. on Wikipedia. Some facts (a small sizing sketch follows after the list):
- Bloom filters may give false positive responses (i.e. a filter may report that key K is in slice S although there is actually no such key). The probability of such responses depends on the filter size and the number of hash functions.
- An example: a 1% FPR requires ~9.6 bits per element of the set and 6 hash functions. So to keep the Bloom filter for an index with 1G keys in memory, we need just 1.2Gb of RAM! (1G keys means ~0.1...1Tb of data.)
- If a 1% FPR seems too high: each additional ~4.8 bits per element decreases it tenfold.
- A 32-bit hash isn't enough here - e.g. a Bloom filter for a 1G-key set must address ~10G bits, which is more than the 32-bit integer range. That's why we developed Xtensive.Core.Hashing - a part of our framework allowing us to generate an arbitrary number of 64-bit hashes for practically any data type, with performance comparable to regular GetHashCode() calls.
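
Here is a minimal Bloom filter sketch tied to those numbers. It is not Xtensive.Core.Hashing: it simply assumes the caller supplies two independent 64-bit hashes per key and derives the probe positions as h1 + i*h2 (a standard combination trick); a real implementation would also use a compact bit array rather than bool[].

    using System;

    public sealed class BloomFilterSketch
    {
      private readonly bool[] bits;   // illustration only; a real filter packs bits tightly
      private readonly int hashCount;

      public BloomFilterSketch(long expectedKeys, double falsePositiveRate)
      {
        // m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes:
        // ~9.6 bits per key and 6-7 hashes for p = 1%; every extra ~4.8 bits/key buys another 10x.
        long bitCount = (long) Math.Ceiling(
          -expectedKeys * Math.Log(falsePositiveRate) / (Math.Log(2) * Math.Log(2)));
        bits = new bool[bitCount];
        hashCount = Math.Max(1, (int) Math.Round(Math.Log(2) * bitCount / expectedKeys));
      }

      public void Add(ulong h1, ulong h2)
      {
        for (uint i = 0; i < hashCount; i++)
          bits[(long) ((h1 + i * h2) % (ulong) bits.LongLength)] = true;
      }

      public bool MayContain(ulong h1, ulong h2)
      {
        for (uint i = 0; i < hashCount; i++)
          if (!bits[(long) ((h1 + i * h2) % (ulong) bits.LongLength)])
            return false;               // definitely not added
        return true;                    // added, or a (rare) false positive
      }
    }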

b) Index range lookup. Much like "Select Customer objects where Name>="A" and Name<="B"". Again, nearly the same problem: we have to fetch the same range from every index slice.

It seems we can't reduce the number of seeks here. Ok, we can get up to CountOfSlicesOnDisk times more seeks on such operations... But let's think about whether that's painful or not. The most frequent range fetch occurs when e.g. all the items of a collection bound to some object are required, and here we really must check all the slices for this range of items. But:
- Usually such operations aren't as random as lookups, i.e. there is a high probability that the same operation will be repeated within a user session, and the repeated one has a good chance of executing without any IO.
- Frequently we don't simply fetch a range, but join some other table(s) to it. Obviously the number of seeks in this case mainly depends on the number of joined rows, which is usually far larger than CountOfSlicesOnDisk (~10-15). I.e. in this case we lose almost nothing.
- Finally, drilling down to any leaf page always needs just one IO operation (i.e. disk seek) in our case, since we're dealing with compressed pages with a 100% fill rate. With a regular index this requires up to 3 disk seeks if the index is large. This partially compensates for the multiplier.

Conclusion: it seems there are no vital obstacles here.

2. Nothing else.

Really, I can't identify any other con here.


Pros:

1. Relative simplicity

As you've seen, this indexing layer architecture is quite well layered - i.e. we can split it into an almost independent set of layers "wrapping" (i.e. transforming the functionality of) the others. This means we can write, test and debug any layer independently of the others. E.g. we used SortedIndex for testing the upper layers while BPlusTree wasn't working yet. Each layer is relatively simple and well-defined, which brings nearly the same benefits.
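
Purely as an illustration (hypothetical interfaces, not the actual Xtensive.Indexing types), the layering idea looks roughly like this: every layer implements the same ordered-index contract and wraps another implementation of it, so any layer can be developed and tested against a trivial in-memory index such as SortedIndex.

    using System.Collections.Generic;

    public interface IOrderedIndexSketch<TKey, TItem>
    {
      bool TrySeek(TKey key, out TItem item);
      IEnumerable<TItem> GetRange(TKey from, TKey to);
    }

    // A wrapping layer delegates to whatever sits below it (SortedIndex, BPlusTree, ...),
    // transforming keys, pages or results on the way through.
    public abstract class IndexLayerSketch<TKey, TItem> : IOrderedIndexSketch<TKey, TItem>
    {
      protected readonly IOrderedIndexSketch<TKey, TItem> Underlying;
      protected IndexLayerSketch(IOrderedIndexSketch<TKey, TItem> underlying) { Underlying = underlying; }
      public virtual bool TrySeek(TKey key, out TItem item) => Underlying.TrySeek(key, out item);
      public virtual IEnumerable<TItem> GetRange(TKey from, TKey to) => Underlying.GetRange(from, to);
    }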

Later you'll see that we deal with concurrency using quite a simple approach - there are almost no locks (and thus no locking engine) at all. A simple approach to handling concurrency is quite important, since concurrency issues are probably the most complex ones to debug.

It probably seems rather difficult to produce the current view of an index from a sequence of slices, but actually it isn't. We just have a set of wrappers, each adding a new slice over the previous ones; all the actual "merge" work is handled by their range enumerators.
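
A rough sketch of that merge (hypothetical and non-streaming; deletion markers are again ignored): each slice contributes its own sorted sequence of changes, and for any key the change from the newest slice wins. The real wrappers do the same thing lazily through range enumerators instead of materializing a dictionary.

    using System.Collections.Generic;

    public static class MergedViewSketch
    {
      // slicesOldestFirst: per-slice sorted sequences of (key, item) changes,
      // ordered from the oldest slice to the newest one.
      public static IEnumerable<KeyValuePair<TKey, TItem>> GetMergedView<TKey, TItem>(
        IEnumerable<IEnumerable<KeyValuePair<TKey, TItem>>> slicesOldestFirst,
        IComparer<TKey> comparer)
      {
        var merged = new SortedDictionary<TKey, TItem>(comparer);
        foreach (var slice in slicesOldestFirst)
          foreach (var change in slice)
            merged[change.Key] = change.Value;   // a newer slice overrides an older one
        return merged;                           // enumerates in key order
      }
    }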

Transaction rollback is quite simple: you just forget about the slice related to that transaction.

Commit is simple as well: the slice we're going to commit is "promoted" to the top of the slice tower (i.e. above the most recently committed transaction), passing over each slice from its own creation point up to the topmost one. On each promotion step conflicts are detected (the complexity is roughly that of a join operation). Since most of these slices are still in-memory slices, this requires almost no IO. When the slice reaches the top of the tower, we just write a mark to the transaction log - something like "New slice #Xxx is added" (all operations related to the construction of this slice are already logged).
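
A rough sketch of that promotion loop (hypothetical types; the real conflict rules are certainly richer): the committing slice is checked, step by step, against every slice committed after its creation point, and any shared key is treated as a conflict.

    using System.Collections.Generic;

    public static class CommitPromotionSketch
    {
      // committingKeys: keys written by the committing transaction's own slice.
      // committedKeysSinceCreation: key sets of the slices committed after this
      // transaction's creation point, ordered bottom-up.
      public static bool TryPromote<TKey>(
        ISet<TKey> committingKeys,
        IEnumerable<ISet<TKey>> committedKeysSinceCreation)
      {
        foreach (var committedKeys in committedKeysSinceCreation)  // one promotion step per slice
          if (committingKeys.Overlaps(committedKeys))              // ~ a join of two sorted key sets
            return false;                                          // conflict: the transaction must roll back
        return true;  // reached the top: log "new slice added" and publish the slice
      }
    }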

2. Unmatched degree of parallelism

Let's state some facts first:
- Almost all the slices are always read-only. The only exceptions are the slices describing changes made by transactions that are still running. But each of these slices is accessed only by the thread controlling its transaction. So any data we access concurrently is read-only (i.e. it doesn't need any access synchronization); any data which isn't read-only is always "bound" exclusively to the only transaction that may see it.
- The slice merge process doesn't modify the existing slices either - it always creates new ones and reports each merge completion to TransactionalIndex. Merging two slices is actually almost as fast as deserializing them.

The only "concurrency point" we have here is TransactionalIndex structure itself - i.e. basically, a sequence of slices forming its current state. Not a tree, since TransactionalIndex itself isn't interested in transactions with "running now" state - it is interested in committed transactions only. Basically, there are just 3 atomic operations on index structure:
- Get all slices: returns a list containing all the slices in the index. Executed at the start of any transaction, and then (for long-running transactions) once per index structure update interval (~ several seconds). So generally it's a rare operation - it runs < 1000 times per second.
- Add a new slice on top. Executed on transaction commit. Again, quite rare.
- Replace a contiguous sequence of slices with a single one. Executed on completion of a slice merge operation. Again, quite rare - we've already estimated how many slices and merges there are.

So as you can see, this one and only "concurrency point" is not a concurrency point at all: all atomic operations on it are blazingly fast (any of them can be executed > 1M times per second), while the expected rate of these operations is ~ 1K / second - that's simply zero concurrency!
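
A minimal sketch of such a structure (hypothetical, not the actual TransactionalIndex): readers grab an immutable snapshot of the slice list, and the three rare mutations replace that snapshot atomically under a short lock.

    using System.Collections.Generic;

    public sealed class SliceListSketch<TSlice>
    {
      private readonly object gate = new object();
      private volatile IReadOnlyList<TSlice> slices = new List<TSlice>().AsReadOnly();

      // "Get all slices": executed at transaction start - just reads the current snapshot.
      public IReadOnlyList<TSlice> GetAllSlices() => slices;

      // "Add a new slice on top": executed on transaction commit.
      public void AddOnTop(TSlice slice)
      {
        lock (gate)
          slices = new List<TSlice>(slices) { slice }.AsReadOnly();
      }

      // "Replace a contiguous sequence of slices with a single one": executed on merge completion.
      public void ReplaceRange(int first, int count, TSlice merged)
      {
        lock (gate) {
          var copy = new List<TSlice>(slices);
          copy.RemoveRange(first, count);
          copy.Insert(first, merged);
          slices = copy.AsReadOnly();
        }
      }
    }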

Ok, so let's enumerate what can be parallelized with this "zero concurrency" (i.e. parallelized at least up to the number of CPUs / CPU cores / PCs):
- Transaction processing: each transaction can be served by an independent thread. Zero concurrency with other such threads! Btw, this doesn't mean there should be just one such thread per transaction - any data-reading query can be parallelized as well, but currently we aren't thinking about parallel query execution (it won't bring much if the number of concurrent transactions is high and there is no locking at all, since server load will be well distributed anyway).
- Slice merging: yes, we can run as many slice merge processes as we wish.

I.e. everything can be parallelized - as intensively as we wish!

Now compare this to what existing databases do:
- Modifying a B+tree page requires X-locking it and S-locking the path to the root (approximately).
- Reading a B+tree page requires S-locking the path to the root.
This means that e.g. if some change propagates to the root page, every index reader waits. Nice :( Btw, this isn't the only possible, and definitely not the best, approach to making a B+tree concurrently accessible, but even the best lock-based approaches (and that's what is used everywhere now) behave similarly.

So what we're doing is moving all possible concurrency to read-only data (which needs no synchronization) instead of synchronizing access to read-write data with locks. We think this is quite a promising approach even for the near future - parallelism is becoming more and more precious. When there are 16+ cores in a single CPU (this moment seems very close now), anything that can't be parallelized simply won't be interesting. Would you prefer some old fuel that limits your car to e.g. 20 km/h, if there is a fuel that imposes no speed limit at all? I doubt it.

[To be continued...]
