Wednesday, March 12, 2008

Xtensive.Indexing architecture: logical view

We use a layered index architecture: each layer adds features to, or transforms the result of, the layer it wraps. Every layer is either exposed as an index itself (i.e. it implements IUniqueOrderedIndex or INonUniqueIndex), or exposes a set of such indexes, acting roughly as a collection or provider of them. So it's probably best to think of each layer as an editable index view over the index below it.

Note: all the transforms happen "on the fly", i.e. no additional time is required to build the layer itself; moreover, the additional time a layer spends on its transform is constant per accessed item.
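To give a feeling of what such a layer could look like, here is a purely illustrative sketch (hypothetical names, not the actual Xtensive.Indexing API) of a layer that exposes an editable view over the index below it and transforms each accessed item on the fly at constant cost per item:

using System;
using System.Collections.Generic;

// Illustrative only: a minimal index view contract and a layer wrapping another index.
public interface IOrderedIndexView<TKey, TItem>
{
    void Add(TItem item);
    IEnumerable<TItem> GetRange(TKey from, TKey to);
}

public sealed class TransformingLayer<TKey, TInner, TOuter> : IOrderedIndexView<TKey, TOuter>
{
    private readonly IOrderedIndexView<TKey, TInner> inner;
    private readonly Func<TInner, TOuter> toOuter; // applied on read, O(1) per item
    private readonly Func<TOuter, TInner> toInner; // applied on write, O(1) per item

    public TransformingLayer(IOrderedIndexView<TKey, TInner> inner,
        Func<TInner, TOuter> toOuter, Func<TOuter, TInner> toInner)
    {
        this.inner = inner;
        this.toOuter = toOuter;
        this.toInner = toInner;
    }

    public void Add(TOuter item) => inner.Add(toInner(item));

    public IEnumerable<TOuter> GetRange(TKey from, TKey to)
    {
        // No work is needed to "build" this layer; each returned item costs O(1) extra.
        foreach (var item in inner.GetRange(from, to))
            yield return toOuter(item);
    }
}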

Brief description: What is an Index?

An index is a set of items ordered by their keys. It is quite similar to a Dictionary, but with the following differences:
- An index is always ordered, while a Dictionary isn't. So it's possible to get a range of index items (e.g. from key1 to key2). Getting a range enumerator requires logarithmic time; getting each subsequent item in the range requires constant time.
- The key is always a part of the item, i.e. there is always a way to extract it from the item (usually via a delegate). Quite frequently such an extractor does nothing - it just returns the item itself: in the end an item is always a Tuple whose first N fields form the key, and the comparers used by the index don't consider the other fields at all, so it's safe to return the item as its own key.
- Indexes support our IRangeMeasureable interface. This means you can get a measurement for any pre-defined measure (pre-defined, because the index must maintain its measurements for each of its pages) - not just for the whole index, but for an arbitrary range of its items. The best thing here is that such an operation requires logarithmic time rather than linear (which is what you get in any database now). Certainly, measurement maintenance requires some additional time, but it doesn't change the asymptotic complexity (in O(n) notation) of any index modification operation - i.e. Add(TItem item) still requires logarithmic time in both cases.
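A hypothetical shape of such an index, pulling the points above together (names and signatures here are illustrative only, not the real Xtensive.Indexing interfaces):

using System;
using System.Collections.Generic;

public interface IMeasurableOrderedIndex<TKey, TItem>
{
    // The key is always extractable from the item, typically via a delegate;
    // often the extractor just returns the item itself (the item is a Tuple
    // whose first N fields form the key).
    Func<TItem, TKey> KeyExtractor { get; }

    // Logarithmic-time range lookup; each subsequent item costs O(1).
    IEnumerable<TItem> GetRange(TKey from, TKey to);

    // Logarithmic-time measurement over an arbitrary key range,
    // e.g. GetMeasurement("Count", from, to) or GetMeasurement("Sum", from, to).
    object GetMeasurement(string measureName, TKey from, TKey to);

    // Still O(log n), even though per-page measurements are maintained.
    void Add(TItem item);
}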

Why are measures so important? They allow OLAP cube support to be bundled right into the database. Moreover, they're better than OLAP cubes, since their dimensions aren't quantized (i.e. you can get a measure value not just for Q1\2007, Q2\2007, ..., but for any time range). Some examples of queries that measures may shorten to nearly constant time:

- Select count(*) from ... where X>=@XMin and X<@XMax
- Select sum(...), avg(...), min(...), max(...) from ... where X>=@XMin and X<@XMax
- etc...

All the above queries will require nearly logarithmic time to complete, if all the mentioned measures are defined for an index over the X value. Any existing database (at least we haven't found any that behaves differently) will require linear time for the same query. Let's show what this means:
- We need to fetch ~ 2 pages multiplied by the number of on-disk transactional layers to accomplish such a query (I'll describe the physical index structure in another post shortly; for now just believe this is correct, even for quite a large index). All the calculations are relatively fast compared to page fetch time; let's assume there are 10 transactional layers (a good enough estimate even for a ~ 1Tb index). So we should fetch just 20 pages to get the query done, or ~ 200ms. In the worst case! I.e. even if there are 1G items!
- A traditional database will spend time proportional to itemSize * itemCount / fillFactor - the amount of data it has to scan. So if itemSize is 100b, itemCount is 10M and fillFactor is 1 (ok, imagine the index is ideal ;) ), it will spend at least the time necessary to read 1Gb from disk, i.e. ~ 10 sec. on rather fast HDDs; if the number of items is 1G, it will spend at least 100 times more (1000 sec.). Why "at least": normally such large indexes are quite fragmented, so even sequential page access requires many seeks. This may easily increase the time by a factor of 10 (ok, you may get 3 hours ;) ). Compare this to our 0.2 sec. in the worst case ;)
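To make the logarithmic estimate concrete, here is a toy sketch of the underlying idea - my own illustration, not Xtensive.Indexing code (the real implementation keeps measures per B+-tree page rather than per node): when every node stores precomputed measures for its subtree, COUNT(*) over an arbitrary key range is answered by walking only the two boundary paths, i.e. in O(log n), instead of scanning the range.

public sealed class MeasuredNode
{
    public long Key;
    public long Value;
    public MeasuredNode Left, Right;
    public long SubtreeCount = 1;  // maintained on every insert
    public long SubtreeSum;        // maintained on every insert

    public MeasuredNode(long key, long value) { Key = key; Value = value; SubtreeSum = value; }
}

public static class MeasuredTree
{
    // Unbalanced insert, just enough to show how measures are maintained;
    // a real B+-tree would rebalance and keep measures per page.
    public static MeasuredNode Insert(MeasuredNode node, long key, long value)
    {
        if (node == null) return new MeasuredNode(key, value);
        if (key < node.Key) node.Left = Insert(node.Left, key, value);
        else node.Right = Insert(node.Right, key, value);
        node.SubtreeCount++;
        node.SubtreeSum += value;
        return node;
    }

    // COUNT(*) over the key range [min, max): only the boundary paths are visited.
    public static long CountInRange(MeasuredNode node, long min, long max)
    {
        if (node == null) return 0;
        if (node.Key < min) return CountInRange(node.Right, min, max);
        if (node.Key >= max) return CountInRange(node.Left, min, max);
        // This node lies inside the range: fully covered subtrees contribute
        // their precomputed counts without being descended into.
        return 1
            + CountFrom(node.Left, min)     // items in the left subtree with key >= min
            + CountBelow(node.Right, max);  // items in the right subtree with key < max
    }

    static long CountFrom(MeasuredNode node, long min)
    {
        if (node == null) return 0;
        if (node.Key < min) return CountFrom(node.Right, min);
        return 1 + (node.Right?.SubtreeCount ?? 0) + CountFrom(node.Left, min);
    }

    static long CountBelow(MeasuredNode node, long max)
    {
        if (node == null) return 0;
        if (node.Key >= max) return CountBelow(node.Left, max);
        return 1 + (node.Left?.SubtreeCount ?? 0) + CountBelow(node.Right, max);
    }
}

A Sum measure is maintained and queried the same way, which is what turns the SELECT sum(...) / avg(...) queries above into a pair of boundary-path walks.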

Some of our indexes are generic (BPlusTree, SortedListIndex) - i.e. they can deal with essentially any TItem and TKey types. Others explicitly require certain types - in practice this is always Tuple (with arbitrary structure, certainly). Some layers transform their keys and values (i.e. add or remove fields they need); normally they don't copy the Tuples during these operations, but return TupleView objects (also Tuples) handling these transforms on the fly.
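A tiny sketch of the view idea (hypothetical names; the real TupleView differs): the wrapper exposes a reordered or trimmed view of the source tuple's fields without copying any data.

public interface ITuple
{
    int Count { get; }
    object GetValue(int fieldIndex);
}

public sealed class MappedTupleView : ITuple
{
    private readonly ITuple source;
    private readonly int[] fieldMap; // view field i -> source field fieldMap[i]

    public MappedTupleView(ITuple source, int[] fieldMap)
    {
        this.source = source;
        this.fieldMap = fieldMap;
    }

    public int Count => fieldMap.Length;

    // Reads are delegated to the source tuple, so nothing is copied.
    public object GetValue(int fieldIndex) => source.GetValue(fieldMap[fieldIndex]);
}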

Ok, let's return to the logical structure now. There are the following layers:

1. Physical layer

There are two different index types: BPlusTree and SortedListIndex. The second one is used when the number of items is relatively small (<1000...2000). Both indexes provide the same features, but BPlusTree internally uses a much more complex approach to make them work, e.g. measurements are stored in each of its pages rather than just for the whole index.

Note: both physical indexes may operate in two modes:
- In-memory mode, read-write access
- In-stream mode, read-only access (in this mode SortedListIndex simply deserializes its content, while BPlusTree fetches pages from disk and utilizes an in-memory least-recently-used page cache).

Later I'll explain why this is not just enough to implement fully transactional indexes, but is actually much better in many respects than the traditional index architecture (a single in-stream mode with read-write access).

2. Physical layer selector (optional)

AutoIndex works here. It automatically chooses the best index implementation for the layer below - BPlusTree or SortedListIndex - and switches between them on demand based on the number of items.

Note: this part isn't implemented yet; we're just considering this optimization. For now we use only BPlusTrees as physical indexes, although using SortedListIndex (it exists, but isn't used here) seems attractive: almost all index changes made by a single transaction are quite small (<100 items), so SortedListIndex could be a good optimization for this common case.
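Since this layer isn't implemented, here is only a sketch of the idea under discussion (all names, the interface and the threshold below are my own assumptions): keep a cheap sorted-list backend while the index is small and migrate to a B+-tree once it grows.

using System;
using System.Collections.Generic;

public interface ISimpleIndex<TItem>
{
    int Count { get; }
    void Add(TItem item);
    IEnumerable<TItem> Items { get; }
}

public sealed class AutoIndex<TItem>
{
    private const int SwitchThreshold = 2000; // roughly the 1000...2000 range mentioned above

    private readonly Func<ISimpleIndex<TItem>> createSortedList;
    private readonly Func<ISimpleIndex<TItem>> createBPlusTree;
    private ISimpleIndex<TItem> backend;

    public AutoIndex(Func<ISimpleIndex<TItem>> createSortedList,
                     Func<ISimpleIndex<TItem>> createBPlusTree)
    {
        this.createSortedList = createSortedList;
        this.createBPlusTree = createBPlusTree;
        backend = createSortedList(); // small indexes start as a sorted list
    }

    public void Add(TItem item)
    {
        backend.Add(item);
        if (backend.Count == SwitchThreshold)
        {
            // Migrate to the B+-tree backend once the index grows too large.
            var tree = createBPlusTree();
            foreach (var existing in backend.Items)
                tree.Add(existing);
            backend = tree;
        }
    }

    public IEnumerable<TItem> Items => backend.Items;
}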

3. CompoundIndexSet (optional)

A compound index merges several indexes sharing a common leading part of the key into a single physical index. Basically, it transparently "injects" an index identifier right after the common key part.

So if the indexes it exposes have the following structure:

- Index1: long, string
- Index2: long, int, string
- Index3: long, guid

The structure of the physical index will be the following:

- Index: long, int (index identifier), {string | int, string | guid }
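To make the key injection concrete, here is a tiny illustrative sketch (hypothetical names, using plain object arrays instead of Tuples) of how a logical key of one member index maps to a key of the single physical index:

using System;

public static class CompoundKey
{
    // Logical key of Index2: (profileId, int, string)
    // Physical key:          (profileId, indexId, int, string)
    public static object[] ToPhysicalKey(long commonPart, int indexId, params object[] restOfKey)
    {
        var physical = new object[2 + restOfKey.Length];
        physical[0] = commonPart; // common key part keeps related data together
        physical[1] = indexId;    // injected index identifier (1 for Index1, 2 for Index2, ...)
        Array.Copy(restOfKey, 0, physical, 2, restOfKey.Length);
        return physical;
    }
}

// Example: the Index2 key (42L, 7, "abc") becomes (42L, 2, 7, "abc") in the
// physical index, so all data of profile 42 stays adjacent on its pages.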

Why is this index useful? Imagine that the first key value is e.g. a user profile key, and most of the data we're dealing with is bound to a user profile - i.e. normally we need to search just inside one user's data. Such a transform ensures that a user's data will be located close together in the underlying physical index - i.e. in a contiguous set of pages, or in a single page. Moreover, if distributed partitioning happens (ok, some day even this will work ;) ), a user's data will never be split between two different partition servers. Keeping the data close together ensures it is accessed much faster: access time mainly depends on the number of pages we fetch, at least on the primary storage we have now (hard drives) - a random page fetch normally implies a random read, and the smallest possible time for an HDD here is ~ 5ms now, so a single HDD can serve just 100-200 random page fetches per second. CompoundIndex reduces the number of random requests by relying on the assumption that the data we access is always related to some entity or data type.

Another possible scenario here is efficient date/time handling for several indexes. Again, we can merge a set of indexes having some date in the first key position into a single physical one, effectively saying "the data we access is mainly data related to the same time period".

Why do we think such scenarios are common now? The number of SaaS and web applications is growing quite fast; the number of users and the amount of data they maintain is growing even faster. So a generic framework like DataObjects.Net should be fully ready to deal with huge amounts of data. The age of modestly sized desktop databases has passed - lots of companies now try to provide a shared service for a huge number of users.

[To be continued tomorrow...]
