Monday, March 31, 2008

Xtensive.Indexing architecture: pros and cons


1. Too many slices.

This is the most annoying con. Imagine how an index can work if it is a kind of "diff tower" with up to 100 floors (slices). Let's think about the operations we need to run fast, and how many slices we'll really have.

Let's count the slices first:

A new slice is added to TransactionalIndex on each successful transaction commit, so initially we have ~ 1 slice per transaction. But there are background slice merge processes, which merge a slice with a subsequent one if their sizes are nearly equal. Basically, they prefer to merge such slices first - just because their "merge score" is higher. If there are no slices with close sizes, they can try to merge "badly matching" ones, as long as their merge score isn't too small.

Ideally, we merge 2 equally sized slices and get 1 slice with double size. Ok, let's assume:
- The biggest slice we have (the lowest one) is 256Gb.
- The smallest slice size is 16Kb (i.e. ~ the size of a single compressed index page - we'll talk about this further).
We can estimate the number of slices we expect to have by doubling the size of the smallest slice until we reach the biggest one: 16K*2^24 = 256G, i.e. just 24 slices in total... Further on, let's assume we're dealing with 4 times more slices (i.e. 100) - just to switch from the "ideal" to the "average or worst" case.
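The doubling argument above can be checked with a few lines of Python (the sizes are the assumptions stated above):

```python
# Check the doubling argument: how many times must the smallest slice
# (16 KB, ~ one compressed page) double before reaching the biggest
# one (256 GB)?
smallest = 16 * 1024
biggest = 256 * 1024 ** 3

doublings = 0
size = smallest
while size < biggest:
    size *= 2
    doublings += 1

print(doublings)  # 24 - so ~ 24 slices in the ideal case
```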

As I've mentioned, these slices can be located either on disk, or simply stored in memory. Let's count how many on-disk slices we'll have - this is important, since slices in memory are quite fast, but slices on disk are quite slow from the point of random access (~ 100-200 pages \ sec. - the estimate is based on the average random seek time of existing hard disks). Let's assume we keep any slice with size < 256Mb in memory - so there will be just 10 on-disk slices (2^10 * 256Mb = 256Gb).

Now - the most frequent operations:

a) Index lookup (seek). Much like fetching a value by its key from a dictionary. This is probably the most frequent operation, since almost any join of a relatively big table implies one such lookup per joined row. The complexity is logarithmic per slice, so the total complexity is ~ CountOfSlices*log(AverageKeyCountPerSlice). The worst thing here is that low-level slices can't be efficiently cached in memory, so there will be ~ one disk seek per slice. Too much.

How can this be optimized? Imagine there is a way to answer the following question immediately (i.e. w\o any IO and within constant time): does index slice S contain any change related to key K at all? If this is possible, we can immediately run this test for each slice (~ CountOfSlices complexity, so almost constant time), and choose the topmost one for which the result is positive - this is the only slice we should do a lookup in (i.e. it contains the latest difference made to the key K). So the total complexity in this case is ~ CountOfSlices + log(AverageKeyCountPerSlice), and ~ just one IO operation (reading a slice page from the disk) - that's much better.

How can we achieve this? Using Bloom filters. If you're interested in the details, please read an article about them, e.g. on Wikipedia. Some facts:
- Bloom filters may give false positive responses (i.e. a filter may report there is a key K in slice S, although there is actually no such key). The probability of these responses depends on the filter size and the number of hash functions.
- An example: a 1% FPR probability requires ~ 9.6 bits per element in the set and 6 hash functions. So to keep the Bloom filter for an index with 1G keys in memory, we need just 1.2Gb of RAM! (1G keys means ~ 0.1...1Tb of data).
- If a 1% FPR probability seems too high, each additional ~ 4.8 bits per element decreases it by a factor of ten.
- A 32-bit hash isn't enough here - e.g. a Bloom filter for a 1G key set needs to address ~ 10G bits, which is more than the 32-bit integer range. That's why we developed Xtensive.Core.Hashing - a part of our framework allowing to generate an arbitrary number of 64-bit hashes for generally any data type, with performance comparable to regular GetHashCode() calls.
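To make the idea concrete, here is a minimal Bloom filter sketch in Python. This is illustrative only - our real implementation is based on Xtensive.Core.Hashing; the SHA-256-derived hashes below merely stand in for its 64-bit hash sequences, and the slice structure is simplified to a dict:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: ~ 10 bits/element and 6 hashes give ~ 1% FPR."""
    def __init__(self, expected_keys, bits_per_key=10, hash_count=6):
        self.size = max(1, expected_keys * bits_per_key)
        self.hash_count = hash_count
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key):
        # Derive independent 64-bit hashes from salted SHA-256 digests.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def slice_for_lookup(slices, key):
    """Test the filters top-down (newest slice first) and return the first
    slice whose filter answers "maybe" - the only one we must read from disk."""
    for s in slices:
        if s["filter"].may_contain(key):
            return s
    return None
```

A false positive here only costs one wasted page read, which is why a ~ 1% rate is acceptable.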

b) Index range lookup. Much like "select Customer objects where Name >= 'A' and Name <= 'B'". Again, nearly the same problem: we have to fetch the same range from every index slice.

It seems we can't reduce the number of seeks here, so we can get up to CountOfSlicesOnDisk times more seeks on such operations... But let's think about whether that's painful or not. The most frequent range fetch occurs when e.g. all the items of a collection bound to some object are required, and here we really must check all the slices for this range of items. But:
- Usually such operations aren't as random as lookups. I.e. there is a high probability of the same subsequent operation within each user session, and the second one has high chances of being executed w\o any IO.
- Frequently we don't simply fetch a range, but join some other table(s) to it. Obviously the number of seeks in this case will mainly depend on the number of joined rows, which usually isn't really comparable to CountOfSlicesOnDisk (~ 10-15). I.e. in this case we lose almost nothing.
- Finally, drilling down to any leaf page always needs just one IO operation (i.e. disk seek) in our case, since we're dealing with compressed pages with a 100% fill rate. With a regular index this requires up to 3 disk seeks if the index is large. This partially compensates for the multiplier.

Conclusion: it seems there are no vital obstacles here.

2. Nothing else.

Really, I can't identify any other con here.


1. Relative simplicity

As you've seen, this indexing layer architecture is quite well layered - i.e. we can split it into an almost independent set of layers "wrapping" (i.e. transforming the functionality of) each other. This means we can write, test and debug any layer independently of the others. E.g. we used SortedIndex for testing the upper layers while BPlusTree was not yet working. Each layer is relatively simple and well-defined, which brings nearly the same benefits.

Later you'll see that we handle concurrency using a quite simple approach - there are almost no locks (and thus no locking engine) at all. A simple approach to concurrency is quite important, since concurrency issues are probably the hardest ones to debug.

It probably seems rather difficult to produce the current view of an index based on a sequence of slices, but actually it isn't. We just have a set of wrappers adding a new slice over the previous ones; all the actual "merge" work is handled by their range enumerators.
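For illustration, the "current view from a stack of slices" trick can be sketched with plain sorted lists standing in for slices (deletion marks are omitted for brevity; heapq.merge plays the role of the range enumerators, and for a duplicated key the newest slice wins):

```python
import heapq

def merged_range(slices):
    """slices: oldest-to-newest lists of (key, value) pairs, each sorted by key.
    Yields the current view: for each key, the value from the newest slice."""
    # Tag each item so that, for equal keys, the newest slice's item
    # sorts first and shadows the older ones.
    tagged = (
        [(key, -i, value) for key, value in s]  # newer slice => smaller tag
        for i, s in enumerate(slices)
    )
    last_key = object()  # sentinel: matches no real key
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key
```

Note that each input stays untouched - the merge is computed lazily while enumerating, which is exactly why read-only slices are enough.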

Transaction rollback is quite simple: you should just forget about the slice related to such transaction.

Commit is simple as well: the slice we're going to commit is "promoted" to the top of the slice tower (i.e. to the most recent committed transaction), passing over each slice from its own creation point up to the topmost one. On each promotion step conflicts are detected (the complexity is ~ equal to a join operation). Since most such slices are still in-memory slices, this requires no IO. When the slice reaches the top of the tower, we just write a mark to the transaction log - something like "New slice #Xxx is added" (since all operations related to the construction of this slice are already logged).

2. Unmatched degree of parallelism

Let's state some facts first:
- Almost all the slices are always read-only. The only exceptions are the slices describing changes made by transactions that are still running. But each of these slices is accessed only by the thread controlling its transaction. So any data we access concurrently is read-only (i.e. it doesn't need any access synchronization); any data which isn't read-only is always "bound" exclusively to the only transaction that may see it.
- The slice merge process doesn't modify the existing slices either - it always creates new ones and reports each merge completion to TransactionalIndex. Merging two slices is actually almost as fast as deserializing them.

The only "concurrency point" we have here is the TransactionalIndex structure itself - i.e. basically, the sequence of slices forming its current state. Not a tree, since TransactionalIndex itself isn't interested in transactions in the "running now" state - it is interested in committed transactions only. Basically, there are just 3 atomic operations on the index structure:
- Get all slices: returns a list containing all the slices in the index. Executed on the start of any transaction, and further (for long-running transactions) - after each index structure update interval (~ several seconds). So generally it's a rare operation - i.e. it runs < 1000 times per second.
- Add a new slice on the top. Executed on transaction commit. Again, quite rare.
- Replace a contiguous sequence of slices with a single one. Executed on completion of a slice merge operation. Again, quite rare - we've already estimated how many slices and merges there are.
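A minimal sketch of such a "concurrency point" (the names are mine, not the real Xtensive API): readers take an immutable snapshot, and the three rare write operations swap in a fresh tuple under a short lock:

```python
import threading

class SliceChain:
    """Copy-on-write slice list: readers grab an immutable snapshot;
    the rare writers build a new tuple and swap it in."""
    def __init__(self):
        self._slices = ()          # immutable snapshot, safe to hand out
        self._lock = threading.Lock()

    def get_all_slices(self):      # executed on transaction start
        # No lock needed: reading a single reference is atomic (in CPython),
        # and the tuple it points to never changes.
        return self._slices

    def add_slice(self, s):        # executed on transaction commit
        with self._lock:
            self._slices = self._slices + (s,)

    def replace_range(self, start, count, merged):  # merge completion
        with self._lock:
            old = self._slices
            self._slices = old[:start] + (merged,) + old[start + count:]
```

A transaction that obtained a snapshot before a merge simply keeps using the old slices - which is safe, since they're read-only and garbage-collected only later.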

So as you see, this one and only "concurrency point" is not a concurrency point at all: each atomic operation on it is blazingly fast (i.e. any of them can be executed > 1M times per second), while the expected rate of these operations is ~ 1K / second - that's simply zero concurrency!

Ok, so let's enumerate what can be parallelized with this "zero concurrency" (i.e. parallelized at least to the number of CPUs \ CPU cores \ PCs):
- Transaction processing: each transaction can be served by an independent thread. Zero concurrency with other such threads! Btw, this doesn't mean there should be just one such thread per transaction - any data-reading query can be parallelized without any problems as well, but currently we aren't thinking about parallel query execution (it won't bring a lot if the number of concurrent transactions is high and there is no locking at all, since the server load will be distributed well anyway).
- Slice merging: yes, we can run as many slice merge processes as we wish.

I.e. everything can be parallelized - as intensively as we wish!

Now compare this to what existing databases do:
- Modifying B+tree page requires X-locking it, and S-locking the path to the root (approximately).
- Reading B+tree page requires S-locking the path to the root.
This means that e.g. if some change propagates to the root page, every index reader waits. Nice :( Btw, this isn't the only possible, and definitely not the best, approach to making a B+tree concurrently accessible, but even the best lock-based approaches (that's what is used everywhere now) behave similarly.

So what we're doing is moving all the possible concurrency to read-only data (which needs no synchronization) instead of synchronizing access to read-write data with locks. We think this is a quite promising approach even for the near future - parallelism is becoming more and more precious. Once there are 16+ cores in a single CPU (this moment seems very close now), anything that can't be parallelized simply won't be interesting. Would you prefer some old fuel limiting your car to e.g. 20Km/h, if there is a fuel applying no speed limit at all? I doubt it.

[To be continued...]

Wednesday, March 19, 2008

Xtensive.Indexing architecture: physical view

Physical view

As I've briefly described in the previous post, any logical index is actually represented by a stack of physical indexes (slices) lying one over another. Each slice "contains" just the changes made to the logical index state described by the previous slice, and so on - down to the "ground" slice. The ground slice contains the changes made to an empty index. Let's try to show this in a table:

Logical index: PK_Customers

Visible slices:
- Slice 0 (ground): size: 10Gb, transactions: 1...13352, status: on disk, read-only // All the changes made to an empty index in transactions 1...13352
- Slice 1: size: 4Gb, transactions: 13353...21249, status: on disk, read-only // All the changes made in transactions 13353...21249
- Slice 2: size: 3Gb, transactions: 21250...27498, status: on disk, read-only
- Slice 10: size: 100Mb, transactions: ..., status: on disk, read-only
- Slice 100: size: 100Kb, transactions: 32453...32460, status: in memory, read-only
- Slice 101: size: 10Kb, transactions: 32461...32462, status: in memory, read-only
- Slice 102: size: 1Kb, transactions: 32464...32464, status: in memory, read-only

Index merge process slices (not yet visible, but managed):
- Slice 1.M: size: 6Gb, transactions: 13353...27498, status: on disk, serializing // A result of merging Slice 1 & Slice 2
- Slice 99.M: size: 150Kb, transactions: ... ...32460, status: in memory, serializing // A result of merging Slice 99 & Slice 100

Active transaction slices (visible to a particular transaction)
- Slice 101.32463: size: 1Kb, transactions: 32463...32463, status: in memory, read-write // All the changes made in transaction 32463, which was started after the commit of transaction 32462 (Slice 101) and isn't completed yet
- Slice 102.32465: size: 1Kb, transactions: 32465...32465, status: in memory, read-write // All the changes made in transaction 32465, which was started after the commit of transaction 32464 (Slice 102) and isn't completed yet

So that's how everything should look. Slice names here are just readable examples (real names will differ). Now let's study what benefits such an index representation may give us, and what problems we may get with it, compared to the regular indexing approach (concurrent access to a single B+tree based index).

Pros. and Cons.

[To be continued...]

Monday, March 17, 2008

Xtensive.Indexing architecture: logical view, part 2

4. DifferentialIndex

That's what we're using to describe index changes. A differential index is nothing more than a wrapper over a regular index adding "+" (added), "-" (removed) and "*" (changed) marks to each of its items (they aren't included into the key, of course). Any differential index knows the slice with its own changes, as well as the DifferentialIndex below it (the index to which the changes are applied).
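As an illustration (plain dicts standing in for indexes), the state visible through a differential index is simply the state of the index below it with the marked changes applied:

```python
# '+' added, '-' removed, '*' changed - the marks live next to the item,
# not inside the key, matching the description above.
def apply_diff(base, diff):
    """base: dict key -> value (state described by the index below);
    diff: dict key -> (mark, value). Returns the resulting visible state."""
    view = dict(base)
    for key, (mark, value) in diff.items():
        if mark == "-":
            view.pop(key, None)
        else:                      # '+' or '*'
            view[key] = value
    return view
```

The real implementation never materializes the merged view like this - it merges lazily via range enumerators - but the visible result is the same.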

As I've mentioned, we can't change the indexes on disk - once an index (i.e. a slice of a differential index) is persisted, it can only be read, and finally - disposed of by the index garbage collector.

5. TransactionalIndexSet and TransactionalIndex

TransactionalIndexSet manages the index representation for a particular transaction. It allows one to:
- Create a TransactionalIndex for the specified Transaction - the changeable index view (TransactionalIndex) you'll get will look exactly like a snapshot-isolated index for this transaction should look (i.e. no changes made after its start will be visible)
- Manage the index slice sequence - the chain of slices forming the index. It provides any part of this chain (from T1 to T2) and supports two chain modification operations: pushing a new slice on top of it (that's what happens on transaction commit) and replacing two old slices with a new one (that's what the background index merge process does). Its garbage collector removes the useless slices (e.g. as the result of a merge or rollback), but any slice lives at least for the duration of the chain cache period (~ 1 min.) after it was actually released from the TransactionalIndexSet chain - this allows TransactionalIndexes to fetch chain updates much more rarely.
- Handle Transaction commits and rollbacks - by detecting conflicts on commits and pushing the committed slice into the common chain, and by releasing the newly created slices (differential indexes) on rollbacks.
- Handle recovery after a failure.

Finally, it logs everything into the TransactionLog it is bound to (a redo log), ensures the log is flushed on commits, and "resurrects" the index state after the last successful commit on recovery.

TransactionalIndex is actually a rather simple wrapper over DifferentialIndex - it pushes all the actions made on it into the TransactionLog and takes care of the size \ location of the underlying index slice(s). The topmost slice it works with is always an in-memory index; but when it reaches a certain limit, it creates a new slice and serializes the old one to disk in the background (this is safe, since the old one is already a read-only slice at this point).

6. NonUniqueIndex

It is a wrapper transforming some unique index into a non-unique one - by rejecting a part of its key.

Note: any relational index is internally described as a unique index: if it is a primary index, its key is unique by itself; if it is a secondary index, it can be exposed as an index with the following key: (NonUniqueKey, ForeignKey). Since ForeignKey is unique (as a PrimaryKey), the whole key is unique. So NonUniqueIndex just "strips" the ForeignKey part of the key and exposes the index as a non-unique one with just the NonUniqueKey key.
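The note above can be sketched in a few lines (Python tuples standing in for Tuple-based keys; the function names are illustrative, not the real API):

```python
def to_unique_key(non_unique_key, foreign_key):
    """Secondary-index entries are stored under (NonUniqueKey, ForeignKey) -
    unique, because ForeignKey alone already is."""
    return (non_unique_key, foreign_key)

def strip_foreign_key(stored_key):
    """What NonUniqueIndex exposes: just the non-unique part of the key."""
    return stored_key[0]

def non_unique_range(sorted_keys, non_unique_key):
    """A lookup by a non-unique key is just a range over the unique index:
    all stored keys sharing the given first component."""
    return [k for k in sorted_keys if k[0] == non_unique_key]
```

Because the unique index is ordered, all entries sharing a NonUniqueKey are adjacent, so this "range" is exactly what the underlying ordered index already provides cheaply.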

Physical view: to be continued...

Wednesday, March 12, 2008

Xtensive.Indexing architecture: logical view

We're using a layered index architecture: each layer adds some features to \ transforms the result of the layer it wraps. Any layer is exposed either as an index itself (i.e. it implements IUniqueOrderedIndex or INonUniqueIndex), or exposes a set of such indexes, acting nearly as a collection \ provider of them. So it's probably better to think of each layer as an editable index view of the index below it.

Note: all the transforms happen "on the fly" - i.e. no additional time is required to get the layer itself; moreover, the additional time a layer spends on the transform is constant per accessed item.

Brief description: What is Index?

An index is a set of items ordered by their keys. Quite similar to a Dictionary, but with the following differences:
- An index is always ordered, while a Dictionary isn't. So it's possible to get a range of index items (~ from key1 to key2). Getting a range enumerator requires logarithmic time; getting the next item in a range requires constant time.
- The key is always a part of the item. I.e. there is always a way to extract it from the item (this is usually done by some delegate). Quite frequently such an extractor does nothing - it just returns the item itself: in the end an item is always a Tuple whose first N fields form the key, and the comparers used by the index don't consider the other fields at all - so it's safe to return the item as the key.
- Indexes support our IRangeMeasureable interface. This means you can get a measurement for any pre-defined measure (pre-defined - because the index must maintain its measurements for each of its pages) - not just for the whole index, but for an arbitrary range of its items. The best thing here is that such an operation requires logarithmic time, rather than linear (that's what you have in any database now). Certainly, measurement maintenance requires some additional time, but it doesn't change the asymptotic time (in O(n) notation) of any index modification operation - i.e. Add(TItem item) still requires logarithmic time in both cases.

Why are measures so important? Actually, they allow bundling OLAP cube support right into the database. Moreover, they're better than OLAP cubes, since their dimensions aren't quantized (i.e. you can get a measure value not just for Q1\2007, Q2\2007, ..., but for any time range). Some examples of queries measures may shorten to ~ constant time:

- Select count(*) from ... where X>=@XMin and X<@XMax
- Select sum(...), avg(...), min(...), max(...) from ... where X>=@XMin and X<@XMax
- etc...

All the above queries will require nearly logarithmic time to complete, if all the mentioned measures are defined for an index indexing the X value. Any existing database (at least we didn't find any that behaves differently) will require linear time for the same query. Let's show what this means:
- We need to fetch ~ 2 pages multiplied by the number of on-disk transactional layers (I'll describe the physical index structure in another post shortly; for now just believe this is correct - even for a quite large index) to accomplish such a query. All the calculations are relatively fast in comparison to the page fetch time; let's assume there are 10 transactional layers (a good enough estimate even for a ~ 1Tb index). So we should fetch just 20 pages to get the query done, or 200ms. In the worst case! I.e. even if there are 1G items!
- A traditional database will spend time proportional to itemSize * itemCount / fillFactor - the total amount of data to scan. So if itemSize is 100b, itemCount is 10M and fillFactor is 1 (ok, imagine the index is ideal ;) ), it will spend at least the time necessary to read 1Gb from the disk, i.e. ~ 10 sec. on rather fast HDDs; if the number of items is 1G, it will spend at least 100 times more (1000 sec.). Why "at least": normally such large indexes are quite fragmented, so even sequential page access requires many seeks. This may easily increase the time by a factor of 10 (ok, you may get 3 hours ;) ). Compare this to our 0.2 sec. in the worst case ;)
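One classic structure giving such log-time range aggregates is sketched below - a Fenwick tree over per-item values, a toy stand-in for the per-page measurements described above (not our actual implementation, which keeps measurements in B+tree pages):

```python
class FenwickSum:
    """Log-time prefix sums - each node covers a power-of-two span of items,
    much like a page-level measurement covers all items in its subtree."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta):       # point update, O(log n)
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix(self, i):           # sum of items [0, i), O(log n)
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def range_sum(self, lo, hi):   # ~ "select sum(...) where lo <= X < hi"
        return self.prefix(hi) - self.prefix(lo)
```

Each update touches O(log n) nodes, mirroring the claim above that measure maintenance doesn't change the asymptotic cost of Add.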

Some of our indexes are generic (BPlusTree, SortedListIndex) - i.e. they can deal with generally any type of TItem and TKey. Others explicitly require certain types - actually, this is always Tuple (certainly, with an arbitrary structure). Some layers transform their keys and values (i.e. add \ remove some fields they need) - normally they don't copy the Tuples during these operations, but return TupleView objects (also Tuples) handling these transforms on the fly.

Ok, let's return back to logical structure now. There are the following layers:

1. Physical layer

There are two different index types: BPlusTree and SortedListIndex. The second one is used when the number of items is relatively small (<1000...2000). Both indexes provide the same features, but BPlusTree internally uses a much more complex approach to get them working - e.g. measurements are stored in each of its pages, rather than for the whole index.

Note: both physical indexes may operate in two modes:
- In-memory mode, read-write access
- In-stream mode, read-only access (SortedListIndex just deserializes its content in this mode; BPlusTree fetches pages from the disk and utilizes an in-memory least-recently-used page cache).

Later I'll explain why this is not just enough to implement fully transactional indexes, but is even much better in many respects than the traditional index architecture (just in-stream mode with read-write access).

2. Physical layer selector (optional)

AutoIndex works here. It automatically chooses the best index implementation on the layer below - BPlusTree or SortedListIndex - and switches between them on demand, based on the number of items.

Note: this part isn't implemented yet; we're just considering this optimization. For now we're using just BPlusTrees as physical indexes, although using SortedListIndex (it exists, but isn't used here) seems attractive: almost all index changes made by a single transaction are quite small (<100 items), so SortedListIndex may act as a good optimization for this common case.

3. CompoundIndexSet (optional)

A compound index merges several indexes having a common part of the key into a single physical index. Basically, it transparently "injects" an index identifier right after the common key part.

So if the indexes it exposes have the following structure:

- Index1: long, string
- Index2: long, int, string
- Index3: long, guid

The structure of the physical index will be the following:

- Index: long, int (index identifier), {string | int, string | guid }

Why is this index useful? Imagine that the first key value is e.g. a user profile key, and most of the data we're dealing with is bound to a user profile - i.e. normally it's required to search just inside one user's data. Such a transform ensures that the user's data will be located close together in the underlying physical index - i.e. in a contiguous set of pages, or in a single page. Moreover, if distributed partitioning happens (ok, some day even this will work ;) ), the data will never be split between two different partition servers. Having the data closer together ensures it will be accessed much faster: access time mainly depends on the number of pages we fetch, at least on the primary storage we have now (hard drives) - a random page fetch normally implies a random read, and the smallest possible time for an HDD here is ~ 5ms now, so a single HDD can serve just 100-200 random page fetches per second. CompoundIndex reduces the number of random requests by using the assumption that the data we access is always related to some entity or data type.
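A sketch of the key transform (illustrative names; Python tuples stand in for Tuples):

```python
def compound_key(common_prefix, index_id, rest):
    """Shared key prefix, then the index identifier, then the rest of the key."""
    return (common_prefix, index_id) + tuple(rest)

# One user's data for one logical index is contiguous in the physical order,
# because those keys share the same (common_prefix, index_id) prefix.
keys = sorted([
    compound_key(42, 1, ["Alice"]),    # Index1: long, string
    compound_key(42, 2, [7, "Bob"]),   # Index2: long, int, string
    compound_key(7, 1, ["Carol"]),
])
print(keys)  # [(7, 1, 'Carol'), (42, 1, 'Alice'), (42, 2, 7, 'Bob')]
```

All of user 42's entries sort next to each other, so fetching them touches adjacent pages instead of scattered ones.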

Another possible scenario here is efficient date \ time handling for several indexes. Again, we can merge a set of indexes having some date in the first position into a single physical one, saying "the data we access is mainly the data related to the same time period".

Why do we think such scenarios are common now? The number of SAAS and web applications is growing quite fast; the number of users and the data they're maintaining are growing even faster. So a generic framework like DataObjects.Net should be fully ready to deal with huge amounts of data. The age of desktop databases of modest sizes has passed - lots of companies are trying to provide a shared service for a huge number of users now.

[To be continued tomorrow...]

Last week changes

We're still working mainly on indexing. Today I'll describe our indexing layer architecture here - at least this should give some idea of why everything is so complex and why we're taking more and more time.

Updated plans for the nearest three weeks:
- Launch the new web site. Initially we planned to launch a mainly HTML-based web site, but last week we decided to build a much more featured one. It became obvious this week that it will take some additional time to get all the features we want implemented, so I assigned additional people from our "research team" (working on v4.0) to this task. Earlier we've spent ~ 2 man-weeks on its "soft" part (i.e. everything besides design \ HTML code), so it will be a perfect case to check if it's really easy to build a full-featured software company web site on DO v3.9 + some of our new stuff (mainly - Xtensive.Core) in some reasonable time frame (~ 1 month in total). For now we're going to deliver all the essential features mentioned before (downloads \ my downloads, news, orders, license \ subscription tracking, help desk (questions, support, search), indexing & search of the whole group of web sites (primary website, blogs, forum)). Later the simple help desk will be "exploded" into Wiki + Knowledge Base + Help Desk.
- Complete the following parts of Xtensive.Core: fast & strongly typed binary serialization (Xtensive.Core.Serialization namespace) and the size estimation layer (Xtensive.Core.SizeCalculators, which should shortly be used by indexes).
- Implement Xtensive.Xml.Serialization (mainly - deserialization of \ applying changes to generally anything from XML files). Right now we need this to get the "default" content (common dictionaries, such as Currency, Sex, etc...) imported into the web site database, plus to handle download \ product definitions (Product, Edition, Branch, Version, etc. in the web site model). As you may expect, this will work even with the v3.9 version of DO.
- Continue work on Xtensive.Indexing. Still lots of things to do, although basic indexes and wrappers pass most of the tests now. Further details will follow today.

Concerning the DataObjects.Net 4.0 CTP: it's still postponed. We're targeting the end of April now - the codebase is still growing fast...

Sunday, March 02, 2008

Upcoming website update: new features

I've mentioned we're going to launch a new web site shortly. The planned date has already passed - there is still some work to do; plus, we can't launch it until it works perfectly. In any case, it is now a question of a week or two. You can get some idea of its design here (it's the original design version, the current one looks much better, but certainly it can't be shown until launch).

So what you can expect from it (except better design):
- The current web site actually runs on Apache \ PHP (ok, don't laugh - it was developed 5 years ago). The new one runs on ASP.Net & DataObjects.Net.
- User profiles: generally any download action will require you to login \ register. Earlier we were using private URLs, which is certainly much less convenient. So shortly it will be enough to log in to get access to all the products you've ordered, as well as to nightly builds.
- Much better Downloads section: the new web site "understands" such types as Product, Edition, Version, etc., and binds all the downloadable materials (documentation, help files, license agreements, etc.) to them, so it will be much easier to find the appropriate file \ a particular version of it, or related materials.
- Integration with blogs: first of all, this blog will replace old News section. Secondly, we're going to add some other blogs related to some particular projects \ technologies, or simply maintained by some of our developers. So there will be much more news ;)
- Integration with Support Forum: new posts on it will be displayed on the first page of our web site. Further there will be common authentication \ accounts as well.
- Much faster order processing: it should take just a few minutes from the moment of purchase until the ordered product is listed in your Downloads section.
- The content will be updated (there are some pages on the current site we're ashamed of, e.g. this one).
- Finally, Russians have some sense of humor as well, and there should be some proof of it ;). Ok, I'm announcing this at least to make you browse all the pages after the launch ;)

Further plans (before the official DataObjects.Net v4.0 launch) include support for product activation (ok, it seems most of our customers know we don't validate the number of ordered licenses now - and actually that's what we dislike), Help Desk and Knowledge Base.