Thursday, January 10, 2008

Highlight: [Index] aspect, B+ tree-based indexes, measures

Let's start looking up highlights from [Index] aspect. Here is example code (NUnit test) and its output.

Here [Index] has nearly the same meaning as before - it creates an index on collection. It is not a persistent index, but in-memory index. I hope to show you fully persistent and transactional collections with persistent indexes created by absolutely the same way in the end of this month.

Note: All the things we're showing now aren't DataObjects.Net 4.0 itself. But they are (or will be) used either in models based on it, or inside it. So this isn't how indexes are defined in v4.0, but just a kind of "technology demo". Practically we don't use [Index] aspect at all for now, even in our own code, but create indexes manually (i.e. using factory \ constructors), although this doesn't mean it is useless.

What this example shows:
- How to create an index on collection supporting change notifications (our CollectionBase). In fact, you should just apply [Index] aspect on collection, and [Changer] aspect - on the property on which index is to be built.
- How to access it: if [Index] is applied, you may cast the collection to IHasIndexes interface in runtime, although it wasn't supported. PostSharp + our aspects make all the magic during compilation of assembly, so any type with [Index] supports this interface.
- How to add measures to it - see IndexedPersonCollection constructor. Measures are some additive values (such as key count, sum of indexed property and so on) that can be further calculated for any index range in logarithmic time - their values are maintained in index pages.
- How to get index range: IndexedPersonCollection.FilterByBirthday method.
- How to get measures for index ranges: see IndexedPersonCollection.CountByBithday, AverageAgeByBithday methods.

Note: nothing additional should be done to maintain the indexes of such collection intact - [Changer] ensures it will be notified before and after any change of its item. [Changer] aspect does very similar thing: it makes the declaring type of property or method it is applied on to properly support IChangeNotifier interface.

Finally, this test shows how B+ trees with automatic binding to collection \ item are compared to other possible index implementations:
- SortedDictionary-based indexes: see ManualPersonCollection and RegularPerson classes
- List without indexes: see RegularPersonCollection.

Some conclusions, that can be made from test output:
- Our index implementation is ~ 3 times slower on insertions \ deletions, then SortedDictionary. Actually they are quite comparable (actual slowdown factor is about 1.5) - the noticeable part of time is spent on automatic change notification ([Changer] implementation). Further we'll improve this (profiling shows that both PostSharp.Laos-generated code, as well as our [Changer] implementation has noticeable lacks).
- Taking this into account the performance of our B+ trees seems good - SortedDictionary internally uses red-black tree, which is almost perfect for keeping balanced trees in memory, so it's quite difficult to compete with it, especially having so many levels of abstraction (see e.g. tree providers we have now; more advanced ones are in development).
- On large sets of data B+ trees needs less memory then red-black trees. This can be seen in the last test sequence (1M items) - even although we maintain specified set of of measures.
- Indexing speed (i.e. seek time) is almost the same as for the SortedDictionary.

And the most important ones:
- "Filtering ..." tests: B+ tree allows to get index ranges (e.g. all the persons with age between 20 and 30). There is no built-in collection in .Net that allows this. Note that getting index range enumerator requires nearly the same time as indexing (i.e. item seek operation) in our case; so almost all the time in this test in case of B+ tree is spent on enumeration of selected items; tests for other collections do the same, but they should enumerate the whole collection to get the same result.
- "Count on ranges ...", "Average on ranges ..." tests show how measure calculation for index range is fast. Again, in our case it requires the time comparable to index seek operation.

We're using B+ trees everywhere - all non-SQL storages (memory, file system, and further - distributed) will be fully based on it. My current intention is to show some of features these storages will share. Do you know if e.g. Microsoft SQL Server is capable of storing arbitrary measures inside its indexes? Or at least item\key count? I'm not sure ;) (I know they use "statistics" for increasing speed of count(*)-like queries, but what are the details?).

Ok, the next step from our side here will be a demonstration of persistent and fully transactional index providing snapshot isolation. As I've mentioned, we expect getting most part of this working on the end of this month.

1 comment:

  1. Forgot to add - all you need to run the example, as well as its code, is available here.