Monday, March 23, 2009

LINQ in DataObjects.Net v4.0: details, issues and news, part 2

3. Support for common LINQ workarounds for SQL syntax

I really like how LINQ is designed - its syntax is so unified, but still more natural and flexible than SQL. And because of this some SQL features looks quite differently in it - although more naturally. Let's see how LEFT OUTER JOIN may look like:

Version 1:
from p in Products
from nonEmptyC in (from c in Categories where c.CategoryID==p.CategoryID select c).DefaultIfEmpty()

select new {nonEmptyC, p}

The same, but in extension method calls:
Products.SelectMany (
p => Categories.Where (c => c.CategoryID == p.CategoryID).DefaultIfEmpty(),
(p, nonEmptyC) => new { nonEmptyC, p } )

Version 2:
from p in Products
join c in Categories on p.CategoryID equals c.CategoryID into jc
from nonEmptyC in jc.DefaultIfEmpty()
select new {nonEmptyC, p}

The same, but in extension method calls:
Categories, p => p.CategoryID, c => c.CategoryID, (p, jc) => new {p, jc} )
.SelectMany (
tmp => tmp.jc.DefaultIfEmpty(), (tmp, nonEmptyC) => new {nonEmptyC, tmp.p} )

So as you see, there are at least two different ways to implement the same LEFT OUTER JOIN in LINQ. And we must recognize both.

4. Optimization of RSE queries for indexing providers

As far as you might know, we translate any query into RSE query first. RSE uses indexes as lowest level abstraction of data sources. But there are two types of indexes:
- Primary indexes - in our case they have the same meaning as table in SQL (because we always have primary keys an any tables).
- Secondary indexes - they allow to locate primary keys by values of other columns from primary indexes they are bound to.

Our LINQ translation layer (as well as any other pat of DO40) "thinks" of primary indexes as of tables they're bound to - if it needs a data from some table, it queries corresponding primary index. Sexondary indexes are ignored at all by LINQ translator. Why? Because LINQ translator's goal is to produce the crudest query plan (RSE query). In particular, this plan:
- Uses only primary indexes
- Uses only index scans. Index seeks (RecordSet.Seek and RecordSet.Range operations) aren't used at all.

Further this plan (RSE query) is either:
- Passed directly to RSE compiler - this happens in underlying execution engine supports query optimization. In fact, this happens for SQL providers.
- Optimized and passed to RSE compiler - this happens for indexing providers.

So we perform query plan optimization. That's what any good SQL server does. And the most interesting quersion here is what exactly our query optimizer can do, and how our query optimizer can be compared with others. Brief answer:
1) For now we're mainly working on index usage related optimizations - e.g. replacement of PrimaryIndex.Filter(...) chain to SecondaryIndex1.Range(...) .Union(SecondaryIndex2.Range(...)) .Union(...) .Join(PrimaryIndex, ...) .Select(...).Filter(...) chain. We hope this will be working in the beginning of April.
2) Further plans include operation reordering, join sequence and aggregate calculation optimizations (earlier I wrote we already maintain some additive aggregates in indexes, so that's how we'll use them).
3) If part 1) will be implemented well, we'll "beat" many embedded database engines right on start - e.g. SQLite (its query optimizer implements this part, but with quite serious limitations). Btw, we're seriously considering implement query plan branching here - this will allow us to gracefully handle queries like "Select ID, Value from Simplest where (ID=@P1) or (@P2=1)" - most of RDBMS (including Microsoft SQL Server) can't do this now. If you don't see anything suspecious in this query, check out the description here.

So we hope our query optimizer will be good enough for embedded databases almost from scratch. It should provide expected optimization results for 95% of relatively simple queries and possibility to write a query translated to any desirable execution plan, if optimizer can't help.

LINQ team short term goals

We must show:
- DO perfectly works with regular databases - in particular, this means we must provide full LINQ support for them
- Embedded databases provided by DO are reliable and fast as well. Initially "fast" means "typical LINQ queries are executed without unnecessary index scans".

That's it.

P.S. This month I feel myself quite devoted to development - last week I worked overnight for few times. That's because I feel the progress, and thus feel myself much more happier ;) In September we switched a part of DO4 team to a new project that was requiring this to be completed in time, and now these guys are back. Moreover, we've hired few new developers after NY, and for now they joined DO4 team as well. So right now there are 7 developers + me working on DO4 - that's much more then 3 developers + me before. This pushes the development forward quite fast.


  1. Hello Alex,

    i know that the query optimizer is important for indexing providers. But don't forget the early adopters. We need at least schema evoution to deploy even alpha stages of our projects. Ant is ok to provide this for SQL Storage only for the moment.

    BTW. Providing nightly builds is great.

    Kind Regards Thomas

  2. Actually we're working on optimizer just because we can't assign more developers to schema evolution and LINQ tasks. Or, better to say, we can, but they won't go faster.

    Concerning schema evolution: I'm finishing with nice level of abstraction for such tasks now: Xtensive.Modelling. It should help to implemented this faster and better.

    So we're applying all the efforts to bring both LINQ and schema evolution ASAP.

  3. This sounds great. If LINQ and Schema evolution are ready, DO4 can be used for new projects.