Sunday, January 13, 2008

Indexes: why are we working on our own indexing engine?

Let's state one of our final goals first: we want to implement a distributed storage behind DataObjects.Net 4.0. Below I'll try to explain why.

First of all, I should say that initially we hadn't thought about this at all. We started working on indexes mainly because they were necessary to implement the offline layer well - the problems there are essentially the same: e.g. it's quite desirable to be able to find all the objects referencing a particular one (e.g. on its removal), to check the uniqueness of some value, or to order objects by the value of one of their properties. So indexes are necessary there anyway - of course, in-memory, non-concurrent ones, which are obviously simpler to implement.

But in May we found one quite attractive problem which isn't solved yet, and whose solution should be very interesting commercially. On the other hand, it isn't easy to solve with the technologies we had at that time - the application solving it needs an extremely big storage, probably running on hundreds of PCs. That's how the idea of implementing distributed indexes appeared.

We started working on the solution in mid-summer, from the lowest level - the distributed file system. And we immediately made a set of mistakes (~2 man-months of work were dropped because of this) - we had underestimated the problem. Finally this led us to studying the existing approaches ;) We read everything we could find about distributed databases and storage systems. The most interesting articles were Google's publications describing the architecture of Chubby (a distributed lock service using the Paxos algorithm), GFS (Google's distributed file system) and BigTable (a distributed indexed table), as well as the papers related to Microsoft's Boxwood project (a distributed file system). Google's publications were interesting mainly because of their innovative ideas; Boxwood was a very good example of the possible layering \ architecture of a distributed file system. The key conclusions we made:
- We really can implement a distributed storage - we understood everything; moreover, there were some code metrics related to GFS \ BigTable that looked quite good. So if we don't make too many mistakes, it will take reasonable time.
- We liked the approach used by Google a lot - they use so-called sandwiched (sliced) indexes. Each layer (slice) of such a sandwich (except the topmost one) is a read-only B+tree containing the differences that should be applied to the lower part of the "sandwich" to produce the state corresponding to that layer. Later I'll explain why this approach seems the most promising one in comparison to the others, taking current trends into account; a minimal sketch of a lookup over such slices follows this list.
- We understood that the index implementations in the databases we know seem much less promising than this one - from quite many points of view.
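
To make the idea more concrete, here is a minimal sketch (my own illustration, not actual Xtensive or Google code) of a point lookup over a stack of slices: the mutable topmost slice and every read-only slice below it map a key either to a value or to a deletion marker, and the newest slice that mentions the key wins. Producing new read-only slices and merging them in the background is omitted here.

using System.Collections.Generic;

public class SlicedIndex<TKey, TValue>
{
  private class Entry
  {
    public TValue Value;
    public bool IsTombstone;   // deletion marker
  }

  // slices[0] is the mutable topmost slice; all the others are read-only B+trees
  // in the real design (plain sorted dictionaries are enough for this illustration).
  private readonly List<SortedDictionary<TKey, Entry>> slices =
    new List<SortedDictionary<TKey, Entry>> { new SortedDictionary<TKey, Entry>() };

  public void Set(TKey key, TValue value)
  {
    slices[0][key] = new Entry { Value = value, IsTombstone = false };
  }

  public void Remove(TKey key)
  {
    slices[0][key] = new Entry { IsTombstone = true };
  }

  public bool TryGetValue(TKey key, out TValue value)
  {
    foreach (var slice in slices) {            // from the newest slice to the oldest
      Entry entry;
      if (slice.TryGetValue(key, out entry)) {
        value = entry.Value;
        return !entry.IsTombstone;             // a tombstone hides any older value
      }
    }
    value = default(TValue);
    return false;
  }
}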

So there is something we need, and it's not too complex... Moreover, it seems to be a good time to implement this, since nothing similar is widely available yet. And finally, it integrates quite well into the technology we have - the DataObjects.Net v4.0 design initially implied two kinds of storage providers (SQL-based and indexing-based), and distributed indexes can be supported as a provider of the second type with almost no additional effort (later we understood we'd get some problems distributing queries, but almost immediately found a good solution). Ok, we should implement this.

Of course, not "right now" - i.e. we don't intend to deliver it asap. But we developed an implementation plan for such a storage. And as we discovered, many pieces of it can be implemented and tested even without all the distributed stuff like the Paxos server and the distributed file system. Basically, we need to make such indexes work concurrently over a regular file system - their further "upgrade" to distributed ones is actually much less complex, once we get a working distributed file system (again, that's not a "right now" task - for now we just have some "wrong distributed file system" code, a newly working and tested Paxos server (election algorithm) using Xtensive.Messaging, an untested distributed state machine and the platform for running distributed NUnit tests - Xtensive.Distributed.Test). All work on the distributed stuff has been paused since mid-autumn - it really can be done later; moreover, it's always better to design any big system from the topmost layers to the lowest ones: in this case the requirements to the lower layers appear immediately when you implement or design the layers above them. If you start from the lowest ones, you may spend a lot of time implementing features you won't need at all. Since that time we've been working just on Xtensive.Storage and indexing.

Ok, now let's answer a few quite essential questions:

Is transactional data indexing all we need to implement a storage (e.g. an RDBMS)?

The answer is yes. The essential features of any RDBMS include:
- Table and index access and maintenance. A table is nothing more than a [primary] index whose key is the PK (primary key) and whose value includes all the other columns. A [secondary] index on this table is again the same kind of index, whose key includes the indexed columns and whose value is the key of the corresponding entry in the primary index (its PK). A minimal sketch of this modeling follows the list below.
- Transactions support. This implies implementing either lock-based or row versioning-based isolation.
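
Just to illustrate the point (a sketch of my own, not the actual Xtensive.Storage model): both a table and a secondary index can be expressed as ordinary sorted maps.

using System;
using System.Collections.Generic;

public class PersonRow
{
  public int Id;              // PK
  public string Name;
  public DateTime Birthday;
}

public class PersonTable
{
  // Primary index: PK -> all the other columns. This *is* the "table".
  private readonly SortedDictionary<int, PersonRow> primary = new SortedDictionary<int, PersonRow>();

  // Secondary index on Name: indexed column -> key of the corresponding primary index entry.
  // (Names are assumed unique here for brevity; a real index appends the PK to the key to allow duplicates.)
  private readonly SortedDictionary<string, int> byName = new SortedDictionary<string, int>();

  public void Insert(PersonRow row)
  {
    primary.Add(row.Id, row);
    byName.Add(row.Name, row.Id);
  }

  public PersonRow FindByName(string name)
  {
    int id;
    return byName.TryGetValue(name, out id) ? primary[id] : null;
  }
}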

Everything else is actually not essential. An example of a quite well-known database engine offering just these features (moreover, it provides just seek \ range access to indexes plus support for transactions - as far as I remember, there is no such term as "table" there at all, just "index") is Berkeley DB, which was recently acquired by Oracle.

Finally, there is a well-known trend which seems to have been confirmed over the last years: more and more complex logic tends to migrate from the RDBMS layer (e.g. from stored procedures and triggers) to the business logic layer - at least because it's much easier to write, maintain and test code written in high-level languages rather than in SQL. Having just an indexing engine as the RDBMS is the extreme case of this concept; and since we have very good experience migrating everything to the BLL and providing very general abstractions there, maybe it's time to take care of this "simple" storage layer as well...

Funny, but this is completely contrary to what Microsoft is doing: they offer to "bundle" application BLL assemblies into SQL Server. We're trying to bundle the RDBMS into the application domain (of course this isn't the case for the distributed database).

Why is Google's approach to data indexing more promising than other well-known ones, including the one used by SQL Server?

Well, the following fact plays the key role here: the guys at Google were choosing an approach to data indexing with one key requirement in mind: it should be possible to distribute any index (or a set of them - i.e. a BigTable) over thousands of PCs. A sandwiched index really seems the most promising option in this case:
- On average, any operation on it requires almost no locking (there are moments requiring locks, but they can be released almost immediately - they protect in-memory structures (say, the index map), and no intermediate disk I/O happens while they're held). AFAIK no other approach to implementing indexes allows such a level of parallelism. Why is this important? Just imagine the cost of reading a distributed lock status (at least one network round-trip). This approach seems quite promising now even on home PCs - the number of CPU cores is growing, as well as the number of hard drives.
- Such indexes are quite fast for data modification operations. Basically, the time of any operation on them is ~ the cost of running this operation in memory on a rather small slice (i.e. a B+tree containing 0...1000 keys) + the cost of logging it to an append-only log.
- It's possible to implement such indexes using just CreateFile, AppendToFile and DeleteFile operations. This is quite important, since appending to a distributed file can be implemented much more easily, and works much faster, than a safe overwrite of an existing part of a distributed file. If you're interested in the details - please refer to the GFS description.
- Since all writes are append-only, there is no requirement to have same-sized pages. Moreover, there is no need to deal with dead pages in the index file. If I compare this approach with the corresponding processes in SQL Server indexes \ files, the best analogy I can imagine is "sliced indexes" ~ ".NET garbage collection" (fast allocation, rare compactions) vs "commonly used indexes" ~ "C++ heap management" (slow allocation, slow reclamation, no compactions).
- Because of all the above, it's easy to compress such an index. Since any index page usually contains quite similar data (i.e. the binary part of the keys is usually the same within a page), any compression algorithm like LZW (possibly in conjunction with differential key compression) gives quite good results. Google reports they reduce the index size by ~10 times (!) this way (for their crawler). Why is compression important? Just imagine reading 10 times less data from disk.
- There are some disadvantages (e.g. any index seek operation implies a seek in all the slices - imagine 20 of them being big, so caching works badly for them, and there can be lots of disk operations), but it seems all of them can be resolved successfully (e.g. the last issue is resolved quite gracefully with Bloom filters - see the sketch below).
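
For those unfamiliar with Bloom filters: before touching a slice on disk, a small in-memory bit array can definitively answer "this slice does not contain the key", so most slices are simply skipped. A minimal sketch (my own illustration, with arbitrarily chosen hashing):

using System.Collections;
using System.Collections.Generic;

public class BloomFilter
{
  private readonly BitArray bits;
  private readonly int hashCount;

  public BloomFilter(int bitCount, int hashCount)
  {
    bits = new BitArray(bitCount);
    this.hashCount = hashCount;
  }

  public void Add(string key)
  {
    foreach (int position in Positions(key))
      bits[position] = true;
  }

  // false => the key is definitely absent; true => the key *may* be present.
  public bool MayContain(string key)
  {
    foreach (int position in Positions(key))
      if (!bits[position])
        return false;
    return true;
  }

  private IEnumerable<int> Positions(string key)
  {
    // Double hashing: derive hashCount positions from two base hashes.
    int h1 = key.GetHashCode();
    int h2 = (h1 << 1) | 1;                    // force h2 to be odd (and non-zero)
    for (int i = 0; i < hashCount; i++)
      yield return (int) ((uint) (h1 + i*h2) % (uint) bits.Length);
  }
}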

All the above conclusions can be found in the Google papers mentioned earlier. We made a few more of our own:
- We understood that such an approach allows implementing snapshot isolation with ease. Moreover, the cost of transaction completion (rollback or commit) is again almost equal to the cost of appending a mark to the transaction log (+ version conflict detection - on commit). That's a bit strange, but Google doesn't offer this at all (for now they limit transaction support in BigTable to atomic operations only - i.e. in fact there are no "honest" transactions yet).
- Finally, we can easily maintain additive measures in such indexes. Obviously they're quite useful - they reduce the time of queries like "Select avg(...) from ... where ..." from O(rowsToProcess) to O(log(totalRowCount)). A sketch of the idea follows below.
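
To illustrate the idea of additive measures in the simplest possible form (a sketch of mine, not the Xtensive.Indexing implementation - it uses a static sorted array with precomputed prefix sums instead of per-page measures in a B+tree): once counts and sums are maintained per range, an aggregate over any key range costs a couple of index descents instead of a scan.

using System;

public class MeasuredIndex
{
  private readonly long[] keys;         // sorted, assumed unique
  private readonly double[] prefixSum;  // prefixSum[i] = sum of the first i values

  public MeasuredIndex(long[] sortedKeys, double[] values)
  {
    keys = sortedKeys;
    prefixSum = new double[values.Length + 1];
    for (int i = 0; i < values.Length; i++)
      prefixSum[i + 1] = prefixSum[i] + values[i];
  }

  // "Select avg(value) where key between fromKey and toKey" in O(log n).
  public double Average(long fromKey, long toKey)
  {
    int lo = LowerBound(fromKey);        // first position with key >= fromKey
    int hi = LowerBound(toKey + 1);      // first position with key >  toKey
    if (hi <= lo)
      throw new InvalidOperationException("The range contains no keys.");
    return (prefixSum[hi] - prefixSum[lo]) / (hi - lo);   // sum / count over the range
  }

  private int LowerBound(long key)
  {
    int index = Array.BinarySearch(keys, key);
    return index >= 0 ? index : ~index;
  }
}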

If you compare this with indexing in e.g. SQL Server, you'll immediately understand that a SQL Server database will never be distributed at all - not just "not distributed so well" (of course, unless they significantly redesign the storage). They use B+tree page locking, page overwriting, complex page reclamation... This really can't be distributed well.

Btw, the reasons such an approach to indexing isn't widely used yet also seem partly historical: sandwiched indexes require more time for background maintenance (i.e. compaction) and may take 3-5 times more space on disk if no compression is used. It's nearly the same story as with snapshot isolation - it is much more attractive, but it appeared only in SQL Server 2005; most likely just because SQL Server grew up from Sybase, which initially provided lock-based isolation. The same is true for garbage collection - it wasn't used widely until the Java \ .NET era, because memory was quite a precious resource, and CPUs were not fast enough.

Finally, there is a quite important "proof of concept" showing that such indexes are much closer to the future: Google works quite well ;) So we'd like to have them too. Initially - as part of the file system-based storage, and later - as part of the distributed one. Hopefully God will help us ;)


P.S. There is one unanswered topic: how are we going to deal with queries, if all of the above is true? For now I can forward you to the DataObjects.Net 4.0 Platform Vision - there are some details related to this there. I'll try to describe our approach to querying the data more precisely in this blog later (e.g. some aspects related to distributed queries).

Highlight: Logging

If you design a solid framework, obviously there should be logging. The .NET Framework doesn't offer much here - just logging to the Windows event log. Certainly this isn't a good option, especially if you want to log diagnostics-related events.

After spending some time studying third-party logging libraries, we came to the conclusion that there are two primary options:
- Use the Logging Application Block from Microsoft Enterprise Library. Flexible, but slower (by some reports - much slower).
- Or settle on Apache log4net. A port of log4j (from my point of view "port" is a disadvantage, since it wasn't natively designed for .NET), but pretty good.

Other pros and cons can be found e.g. here, here and here.

And, maybe most important, both frameworks lack a few quite desirable features:
- A good approach to binding some default log to any type \ method automatically. Ideally we want to write something like Log.Info(...) anywhere, and have the message go to the default log for the place where such a call happens.
- Log capturing. Ideally, we want to be able to create something that will listen to all the events logged to a certain set of logs, capture them and allow browsing them later. This is quite important e.g. for the storage builder - this class inspects all the registered types and builds the storage model for them. This involves the activity of many other classes, which should log all warnings and errors to some log(s). Ideally, the storage builder should just capture all such logging events from the other logs, and if at least one error is found, throw an exception showing all the errors and warnings.
- Log indentation. It would be nice to be able to temporarily increase the indent in some log(s) to produce more readable output.

Finally we came to a conclusion which seems quite typical for us in such cases: we shouldn't rely on or bind ourselves strictly to either ;) But building our own logging framework isn't a cheap solution either. So we decided to provide a set of generic logging interfaces in Xtensive.Core.Diagnostics, and implement them in wrappers for log4net - that was pretty easy.

But in addition we had to solve the problems mentioned above. And as I see it now, we've solved them quite gracefully:

1. Default logs.

First of all, it's necessary to define what a "default log" is. We decided that the default log is either a log bound to the current assembly (i.e. the assembly from which the logging method is called), or a log bound to the current namespace (i.e. the namespace of the class from which the logging method is called). If there is no default log bound to the current namespace, we should use the default log of the enclosing one.

And as I've mentioned, the preferable way of logging for us is to write something like Log.Error(...) anywhere it's necessary - it seems to be the shortest possible form of getting something logged at all ;)

What we definitely didn't want to have:
- The necessity to use constructions like LogProvider.FindLog("Xtensive.Core").Debug(...) - in this case we specify the assembly \ namespace name explicitly, which may lead to logging to a wrong log - e.g. after moving a class to another namespace \ assembly.
- The necessity to declare a static ILog-like member in each of our classes (e.g. as log4net recommends) - of course this would solve the problem (i.e. we could write log.Debug(...) in any method of such a class), but declaring such a property in every class (+ probably a static constructor!) seems too much.
- Of course we could try to make PostSharp do this, but that is also not so easy - PostSharp can make any such class implement some interface (e.g. ILogProvider), but in this case it would still be awkward to log: to prevent compile-time errors, we'd have to write something like ((ILogProvider)(object)this).Log.Error(...) - again, too much code. Moreover, this approach doesn't allow accessing the log from static members.

Finally we found the solution. There is the following class in Xtensive.Core.Diagnostics:

/// <summary>
/// Log template - simplifies logging,
/// provides support for <see cref="LogCaptureScope"/>.
/// </summary>
/// <typeparam name="T">Should always be the type of descendant.</typeparam>
public class LogTemplate<T>
{
  private static ILog instance = null;

  /// <summary>
  /// Gets the <see cref="ILog"/> this type logs to.
  /// </summary>
  public static ILog Instance {
    get { return instance; }
    protected set {
      ArgumentValidator.EnsureArgumentNotNull(value, "value");
      instance = value;
    }
  }

  #region ILog-like static methods

  public static bool IsEnabled(LogEventTypes eventType)
  {
    return Instance.IsLogged(eventType);
  }

  [Conditional("DEBUG")]
  public static void Debug(object message, Exception exception)
  {
    Instance.Debug(message, exception);
  }

  [Conditional("DEBUG")]
  public static void Debug(string format, params object[] args)
  {
    Instance.Debug(format, args);
  }

  [Conditional("DEBUG")]
  public static void Debug(IFormatProvider provider, string format, params object[] args)
  {
    Instance.Debug(provider, format, args);
  }

  public static void Info(object message, Exception exception)
  {
    Instance.Info(message, exception);
  }

  public static void Info(string format, params object[] args)
  {
    Instance.Info(format, args);
  }

  public static void Info(IFormatProvider provider, string format, params object[] args)
  {
    Instance.Info(provider, format, args);
  }

  public static void Warning(object message, Exception exception)
  {
    Instance.Warning(message, exception);
  }

  public static void Warning(string format, params object[] args)
  {
    Instance.Warning(format, args);
  }

  public static void Warning(IFormatProvider provider, string format, params object[] args)
  {
    Instance.Warning(provider, format, args);
  }

  public static void Error(object message, Exception exception)
  {
    Instance.Error(message, exception);
  }

  public static void Error(string format, params object[] args)
  {
    Instance.Error(format, args);
  }

  public static void Error(IFormatProvider provider, string format, params object[] args)
  {
    Instance.Error(provider, format, args);
  }

  public static void FatalError(object message, Exception exception)
  {
    Instance.FatalError(message, exception);
  }

  public static void FatalError(string format, params object[] args)
  {
    Instance.FatalError(format, args);
  }

  public static void FatalError(IFormatProvider provider, string format, params object[] args)
  {
    Instance.FatalError(provider, format, args);
  }

  #endregion


  // Static constructor

  static LogTemplate()
  {
    Type t = typeof (T);
    string logName = "Unnamed";
    logName = (string) t.GetField("Name", BindingFlags.Public | BindingFlags.Static).GetValue(null);
    string skipPrefix = "Xtensive.";
    if (logName.StartsWith(skipPrefix))
      logName = logName.Substring(skipPrefix.Length);
    Instance = LogProvider.GetLog(logName);
    Diagnostics.Log.Info("{0} log initialized.", Instance);
  }
}

It is the base generic type for any default log. It's quite easy to create a default log based on this type:
using System.Reflection;
using Xtensive.Core.Diagnostics;

namespace Xtensive.Core
{
  /// <summary>
  /// Log for this namespace.
  /// </summary>
  public sealed class Log: LogTemplate<Log>
  {
    // Copy-paste this code!
    public static readonly string Name;

    static Log()
    {
      string className = MethodInfo.GetCurrentMethod().DeclaringType.FullName;
      Name = className.Substring(0, className.LastIndexOf('.'));
    }
  }
}

So to get the default log in some namespace, you should just copy-paste the last piece of code and set the appropriate namespace in it!

Why does this approach work? C# resolves a type name using the following pattern:
- It searches the current namespace (i.e. the namespace of the type where the reference appears) for a type with the name being resolved
- If no type with the specified name is found there, it searches the enclosing namespace, and so on
- Finally, it searches the namespaces listed in the using directives of the current file.
So if you declare such a Log in some namespace, the Log type name will resolve to it in that namespace and in all the namespaces nested inside it (unless, of course, a nested namespace declares its own Log). That's exactly the behavior we want to get!
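
For example (a hypothetical class and namespace, just to show the resolution):

// Assuming the Log class above is declared in the Xtensive.Core namespace:
namespace Xtensive.Core.Collections
{
  internal class SomeCollection   // hypothetical class, for illustration only
  {
    public void DoSomething()
    {
      // "Log" resolves to Xtensive.Core.Log: the compiler searches
      // Xtensive.Core.Collections first, then the enclosing Xtensive.Core namespace,
      // where it finds the default log. No using directive is needed.
      Log.Info("Doing something in {0}.", GetType().Name);
    }
  }
}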

Funny trick: have you noticed what the ancestor of Log is? It is LogTemplate<Log> - i.e. LogTemplate<ThisType>. The LogTemplate class has to be generic (i.e. have at least one generic parameter) to ensure that its static instance member is created separately for each of its closed instantiations. On the other hand, we should ensure that each such instantiation gets a different type as the T parameter (if it were the same, the static member would be shared). And the Log type itself (i.e. <ThisType>) suits here perfectly!
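
A tiny demonstration of why the generic parameter matters (illustration only, not Xtensive code): in .NET every closed generic type gets its own copy of the static fields.

using System;

public class Counter<T>
{
  // Each closed generic type (Counter<A>, Counter<B>, ...) gets its own copy of this field.
  public static int Value;
}

public class A : Counter<A> {}
public class B : Counter<B> {}

public static class StaticsDemo
{
  public static void Main()
  {
    A.Value = 1;   // actually sets Counter<A>.Value
    B.Value = 2;   // actually sets Counter<B>.Value
    Console.WriteLine("{0} {1}", A.Value, B.Value);   // prints "1 2" - the statics are not shared
  }
}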

So now we have Log.Info(...)-like calls everywhere - and don't worry about moving classes or methods. Copy-paste once, use everywhere in the assembly or namespace. Perfect!

2. Log capturing.

This problem is actually much less tricky to solve - just plain coding is necessary here. There is a LogCaptureScope type allowing to capture the output of a set of ILogs into another ILog - i.e. exactly what we need. Just an example, and nothing more:
using (new LogCaptureScope(LogProvider.ConsoleLog)) {
  Log.Info("This message will be written twice on console.");
}

3. Log indentation.

Again, a rather simple problem, so let's show its solution with one more small example:
Log.Info("Not indented");
using (new LogIndentScope()) {
using (new LogCaptureScope(LogProvider.ConsoleLog)) {
Log.Info(
"Indented (1)");
using (new LogIndentScope()) {
Log.Info(
"Indented (2)");
}
}
}

Highlight: Contexts, ContextBounds, Scopes, context activation

We've introduced a nice concept in Xtensive.Core represented by the IContext, Scope and IContextBound classes. It's used widely in many cases - e.g. most of the key classes in Xtensive.Storage implement IContext<T>.


So what's a context?

A context is an object implementing IContext<T>:

public interface IContext<T>
  where T: class, IDisposable
{
  bool IsActive { get; }
  T Activate();
}

Type T here is the type of the context's scope. Let's see what Scope<T> is:
public class Scope<TContext>: IDisposable
  where TContext: class
{
  [ThreadStatic]
  private static Scope<TContext> currentScope;

  private TContext context;
  private Scope<TContext> outerScope;

  /// <summary>
  /// Gets the context of the current scope.
  /// </summary>
  protected internal static TContext CurrentContext
  {
    get { return currentScope != null ? currentScope.context : default(TContext); }
  }

  /// <summary>
  /// Gets the current scope.
  /// </summary>
  protected internal static Scope<TContext> CurrentScope
  {
    get { return currentScope; }
  }

  /// <summary>
  /// Gets the context of this instance.
  /// </summary>
  protected TContext Context
  {
    get { return context; }
  }

  /// <summary>
  /// Gets the outer scope of this instance.
  /// </summary>
  protected Scope<TContext> OuterScope
  {
    get { return outerScope; }
  }

  // Initializer

  /// <summary>
  /// Initializes the scope with the new context.
  /// </summary>
  public virtual void Activate(TContext newContext)
  {
    if (context!=null)
      throw Exceptions.AlreadyInitialized("Context");
    context = newContext;
    outerScope = currentScope;
    currentScope = this;
  }


  // Constructors

  /// <summary>
  /// Initializes a new instance of this class with the specified context.
  /// </summary>
  protected Scope(TContext context)
  {
    Activate(context);
  }

  /// <summary>
  /// Initializes a new instance of this class.
  /// Doesn't invoke the <see cref="Activate"/> method.
  /// </summary>
  protected Scope()
  {
  }

  // Destructor

  protected virtual void Dispose(bool disposing)
  {
    try {
      while (currentScope!=this) {
        if (currentScope==null)
          throw new InvalidOperationException(Strings.ExScopeCantBeDisposed);
        currentScope.Dispose();
      }
      currentScope = outerScope;
      context = null;
      outerScope = null;
    }
    catch (Exception e) {
      Log.Error("Scope dispose error: {0}", (object)e);
      throw;
    }
  }

  public void Dispose()
  {
    Dispose(true);
  }
}

And finally, IContextBound<TContext>:
public interface IContextBound<TContext>
  where TContext : class
{
  /// <summary>
  /// Gets the context to which the current instance is bound.
  /// </summary>
  TContext Context { get; }
}

How do all these classes work together? Normally you should:
- Implement IContext<XxxScope> in your own XxxContext class, e.g. AtomicContext: IContext<AtomicScope>.
- Implement XxxScope: Scope<XxxContext>. Since almost all of Scope's members are protected (except Activate), you should publish some of them as public ones. Normally these members are:
public static new XxxContext CurrentContext {
  get { return Scope<XxxContext>.CurrentContext; }
}

public new XxxContext Context {
  get { return base.Context; }
}

- Finally, you should implement IContextBound<XxxContext> on any object that is bound to an XxxContext. A minimal end-to-end sketch of the pattern follows.
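
Here is a minimal sketch of the whole pattern put together (MyContext \ MyScope \ MyComponent are hypothetical names; the sketch assumes only the IContext, Scope and IContextBound definitions shown above):

public class MyScope : Scope<MyContext>
{
  public static new MyContext CurrentContext {
    get { return Scope<MyContext>.CurrentContext; }
  }

  public new MyContext Context {
    get { return base.Context; }
  }

  public MyScope(MyContext context)
    : base(context)
  {
  }
}

public class MyContext : IContext<MyScope>
{
  public bool IsActive {
    get { return MyScope.CurrentContext==this; }
  }

  public MyScope Activate()
  {
    // Allocate a scope only if this context isn't already the current one.
    return IsActive ? null : new MyScope(this);
  }
}

public class MyComponent : IContextBound<MyContext>
{
  private readonly MyContext context;

  public MyContext Context {
    get { return context; }
  }

  public MyComponent(MyContext context)
  {
    this.context = context;
  }
}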


What benefits does this bring?

1. First of all, you get a stack of XxxContexts for each of your threads. The topmost context on this stack is accessible via the XxxScope.CurrentContext property (the name is your choice). If there is no active context in the current thread, it normally returns null, but you can override this behavior (e.g. by providing some default context). A good example of this is SerializerScope - it allows getting the current ISerializer (in fact, a simplified IFormatter - we found that some IFormatter members aren't necessary for its consumers, but since there was no possibility to introduce this interface as IFormatter's base, it's added separately; + we provide FormatterWrapper, which can expose any IFormatter as an ISerializer). A sketch of such a fallback follows.
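
Here is a sketch of what such a fallback could look like (DefaultSerializer is a hypothetical name here; the real SerializerScope may differ in details):

public class SerializerScope : Scope<ISerializer>
{
  public static new ISerializer CurrentContext {
    get {
      // Fall back to a default serializer when no scope is active on the current thread.
      ISerializer current = Scope<ISerializer>.CurrentContext;
      return current ?? DefaultSerializer.Instance;   // DefaultSerializer is hypothetical
    }
  }

  public SerializerScope(ISerializer context)
    : base(context)
  {
  }
}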

Having such a stack gives you one quite attractive opportunity: imagine there are many methods \ classes in your assembly assuming there is an ambient (i.e. current) XxxContext available via XxxScope.CurrentContext - such methods may rely on it. Good examples of quite useful contexts are Session \ Transaction (they're IContexts, and there are SessionScope and TransactionScope). The guaranteed presence of such current contexts allows using e.g. new MyEntity() code to create a persistent entity: the very base constructor (Persistent..ctor) binds the entity to the current Session automatically, so you don't have to pass the Session everywhere.

2. You can easily activate the context of some IContextBound object, keep it active for the desired period and finally be sure it will be deactivated (i.e. removed from the context stack).

Let's see how easily we can clone a persistent object this way:
Person personClone;
using (originalPerson.Session.Activate()) {
  personClone = (Person) BinarySerializer.Clone(originalPerson);
}

So all we did here is ensure that the appropriate Session (as a context) is active during cloning. Any persistent type implements ISerializable, and its base deserializing constructor automatically binds any deserializing object to the current Session.

Note that the XxxContext.Activate method normally acts as follows:
public XxxScope Activate()
{
  if (!IsActive)
    return new XxxScope(this);
  else
    return null;
}

public bool IsActive {
  get { return XxxScope.CurrentContext==this; }
}

So almost nothing is done if the current context is already active; Activate returns null in this case, so no additional XxxScope object is allocated \ disposed - i.e. activation is quite fast. Btw, it's quite fast even if the current context isn't active - just one non-finalizable object is allocated; moreover, it doesn't use any locks.

3. Finally, it's pretty easy to build a PostSharp aspect that will activate the necessary object's context automatically for the duration of a method call. There is a ContextActivationHelper type allowing to do most of this job in the aspect automatically. Again, an example (from the ValidateAttribute aspect):
  public sealed class ValidateAttribute : OnMethodBoundaryAspect, 
ILaosWeavableAspect,
IDeserializationCallback
{
[NonSerialized]
private ContextActivationHelper<ValidationScope, ValidationContext> contextActivationHelper;

...

public override bool CompileTimeValidate(MethodBase method)
{
if (!contextActivationHelper.CompileTimeValidate(this, method))
return false;
...
return true;
}

public override void OnEntry(MethodExecutionEventArgs eventArgs)
{
IContextBound<ValidationContext> validationContextBound = (IContextBound<ValidationContext>)eventArgs.Instance;
eventArgs.MethodExecutionTag = contextActivationHelper.TryActivateScope(validationContextBound);
}

public override void OnExit(MethodExecutionEventArgs eventArgs)
{
ValidationScope validationScope = (ValidationScope)eventArgs.MethodExecutionTag;
if (validationScope!=null)
validationScope.Dispose();
}


// Constructors

public ValidateAttribute()
{
contextActivationHelper = new ContextActivationHelper<ValidationScope, ValidationContext>();
}

// IDeserializationCallback members

public void OnDeserialization(object sender)
{
contextActivationHelper = new ContextActivationHelper<ValidationScope, ValidationContext>();
}
}

So as you see, it's pretty easy to create an aspect activating all the necessary contexts automatically. This is ideal e.g. for methods of persistent types: when a top-level method is entered, its Storage and Session are activated as contexts; a new Transaction and AtomicContext (which in turn creates and activates an IRedoDescriptor and IUndoDescriptor) are created and activated. But when this method calls other methods on persistent types, almost nothing happens, since all the necessary contexts are already created and active.

Note: PostSharp supports compound aspects - such aspects do nothing themselves, but instruct PostSharp to apply a set of other aspects (which can be applied to generally anything). So normally you shouldn't even put an aspect attribute there to ensure context activation - this happens automatically, because an attribute put on some very base class (e.g. on Persistent) ensures that all its methods, as well as the methods of its descendants, will be aspected by default (e.g. unless a [Suppress(SuppressionType.Transactional)] attribute is applied to some method).

This means you don't even have to write using (...) blocks to get all the necessary contexts activated.

Isn't this beautiful? ;)

Thursday, January 10, 2008

Highlight: [Index] aspect, B+ tree-based indexes, measures

Let's start the highlights with the [Index] aspect. Here is the example code (an NUnit test) and its output.

Here [Index] has nearly the same meaning as before - it creates an index on a collection. It is not a persistent index, but an in-memory one. I hope to show you fully persistent and transactional collections with persistent indexes created in absolutely the same way at the end of this month.

Note: all the things we're showing now aren't DataObjects.Net 4.0 itself. But they are (or will be) used either in models based on it, or inside it. So this isn't how indexes are defined in v4.0, but rather a kind of "technology demo". In practice we don't use the [Index] aspect at all for now, even in our own code, and create indexes manually (i.e. using factories \ constructors), although this doesn't mean it is useless.

What this example shows:
- How to create an index on a collection supporting change notifications (our CollectionBase). In fact, you should just apply the [Index] aspect to the collection, and the [Changer] aspect - to the property the index is to be built on.
- How to access it: if [Index] is applied, you may cast the collection to the IHasIndexes interface at runtime, even though the type doesn't declare it. PostSharp + our aspects do all the magic during assembly compilation, so any type with [Index] supports this interface.
- How to add measures to it - see the IndexedPersonCollection constructor. Measures are additive values (such as key count, sum of an indexed property and so on) that can later be calculated for any index range in logarithmic time - their values are maintained in index pages.
- How to get an index range: the IndexedPersonCollection.FilterByBirthday method.
- How to get measures for index ranges: see the IndexedPersonCollection.CountByBithday and AverageAgeByBithday methods.

Note: nothing additional should be done to keep the indexes of such a collection intact - [Changer] ensures the collection is notified before and after any change of its items. The [Changer] aspect does a very similar thing: it makes the declaring type of the property or method it is applied to properly support the IChangeNotifier interface. A rough sketch of what such a declaration might look like is below.
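
Since the example code itself is linked above rather than inlined, here is a rough sketch of what such a declaration might look like (the attribute arguments, base class signature and member bodies here are my assumptions, not the actual test code):

using System;
using System.Collections.Generic;

// Hypothetical sketch - the real NUnit test linked above may differ in details.
public class IndexedPerson
{
  private DateTime birthday;

  [Changer]                    // notifies the owning collection before and after a change
  public DateTime Birthday {
    get { return birthday; }
    set { birthday = value; }
  }
}

[Index("Birthday")]            // assumed syntax: build an in-memory index on the Birthday property
public class IndexedPersonCollection : CollectionBase<IndexedPerson>
{
  public IEnumerable<IndexedPerson> FilterByBirthday(DateTime from, DateTime to)
  {
    // IHasIndexes is injected by PostSharp at compile time, so this cast succeeds at runtime.
    IHasIndexes hasIndexes = (IHasIndexes) this;
    // ... obtain the Birthday index from hasIndexes and return the [from, to] range
    throw new NotImplementedException();
  }
}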

Finally, this test shows how B+ trees with automatic binding to a collection \ item compare to other possible index implementations:
- SortedDictionary-based indexes: see the ManualPersonCollection and RegularPerson classes
- A List without indexes: see RegularPersonCollection.

Some conclusions that can be made from the test output:
- Our index implementation is ~3 times slower on insertions \ deletions than SortedDictionary. Actually they are quite comparable (the actual slowdown factor is about 1.5) - a noticeable part of the time is spent on automatic change notification (the [Changer] implementation). We'll improve this further (profiling shows that both the PostSharp.Laos-generated code and our [Changer] implementation have noticeable deficiencies).
- Taking this into account, the performance of our B+ trees seems good - SortedDictionary internally uses a red-black tree, which is almost perfect for keeping balanced trees in memory, so it's quite difficult to compete with it, especially with so many levels of abstraction (see e.g. the tree providers we have now; more advanced ones are in development).
- On large data sets B+ trees need less memory than red-black trees. This can be seen in the last test sequence (1M items) - even though we maintain the specified set of measures.
- Indexing speed (i.e. seek time) is almost the same as for SortedDictionary.

And the most important ones:
- "Filtering ..." tests: a B+ tree allows getting index ranges (e.g. all persons with age between 20 and 30). There is no built-in .NET collection that allows this. Note that getting an index range enumerator requires nearly the same time as an index seek operation in our case; so in the B+ tree case almost all the time in this test is spent on enumerating the selected items; the tests for the other collections do the same, but they have to enumerate the whole collection to get the same result.
- "Count on ranges ..." and "Average on ranges ..." tests show how fast measure calculation for an index range is. Again, in our case it requires time comparable to an index seek operation.

We're using B+ trees everywhere - all non-SQL storages (memory, file system and, further on, distributed) will be fully based on them. My current intention is to show some of the features these storages will share. Do you know whether e.g. Microsoft SQL Server is capable of storing arbitrary measures inside its indexes? Or at least an item \ key count? I'm not sure ;) (I know they use "statistics" to increase the speed of count(*)-like queries, but what are the details?).

Ok, the next step from our side will be a demonstration of a persistent and fully transactional index providing snapshot isolation. As I've mentioned, we expect to get most of this working by the end of this month.

Xtensive.Indexing, Xtensive.Integrity - early Community Preview

We've just published a preview of some of assemblies forming our new storage platform:
- Xtensive.Core - all we couldn't find in System \ mscorlib \ System.Core assemblies, as well as some essential aspects ([Synchronize], [Lockable], [Changer], [Trace])
- Xtensive.Indexing - data indexing framework. B+ tree-based indexes, measures, Bloom filters, [Index] aspect.
- Xtensive.Integrity - atomicity (undo \ redo), delayed validation. Staged (atomic) events and transactions will also be moved to this assembly shortly. [Atomic], [Validate] aspects.
- Xtensive.Sql - DOM for SQL language. Can be downloaded separately here.
- Xtensive.Messaging - protocol-independent message exchange framework.
- Xtensive.PluginManager - plugin search and loading framework.

You can download the binaries with NUnit tests for them here.

Comments \ highlights and examples will follow shortly.

Tuesday, January 01, 2008

Happy New Year!

First of all, Happy New Year! Especially to all the guys involved in software development ;)

Secondly, I apologize for failing to ship the DataObjects.Net 4.0 CTP release in 2007. The new date for the first CTP is February 15, 2008; internally we're targeting the end of January.

This month I'll publish a set of posts covering some of the new concepts we've introduced that already work. Most of them relate to the Xtensive.Core and Xtensive.Indexing assemblies - so I'll describe the very basic concepts of DataObjects.Net 4.0 and everything else we're building on top of them. Both assemblies will be available as well (free for now). I hope this will help you get an initial picture of our new platform.

P.S. The last year was quite complex for us - we've grown to a 20-person team, but actually lost some part of our good reputation (we still have two big problems: the v4.0 release & v3.X support quality). So resolving these issues and preventing similar ones is essential for us now, and I hope we'll prove this in 2008.