Monday, March 23, 2009

LINQ in DataObjects.Net v4.0: details, issues and news, part 2

3. Support for common LINQ workarounds for SQL syntax

I really like how LINQ is designed - its syntax is so unified, but still more natural and flexible than SQL. And because of this some SQL features looks quite differently in it - although more naturally. Let's see how LEFT OUTER JOIN may look like:

Version 1:
from p in Products
from nonEmptyC in (from c in Categories where c.CategoryID==p.CategoryID select c).DefaultIfEmpty()

select new {nonEmptyC, p}

The same, but in extension method calls:
Products.SelectMany (
p => Categories.Where (c => c.CategoryID == p.CategoryID).DefaultIfEmpty(),
(p, nonEmptyC) => new { nonEmptyC, p } )

Version 2:
from p in Products
join c in Categories on p.CategoryID equals c.CategoryID into jc
from nonEmptyC in jc.DefaultIfEmpty()
select new {nonEmptyC, p}

The same, but in extension method calls:
Categories, p => p.CategoryID, c => c.CategoryID, (p, jc) => new {p, jc} )
.SelectMany (
tmp => tmp.jc.DefaultIfEmpty(), (tmp, nonEmptyC) => new {nonEmptyC, tmp.p} )

So as you see, there are at least two different ways to implement the same LEFT OUTER JOIN in LINQ. And we must recognize both.

4. Optimization of RSE queries for indexing providers

As far as you might know, we translate any query into RSE query first. RSE uses indexes as lowest level abstraction of data sources. But there are two types of indexes:
- Primary indexes - in our case they have the same meaning as table in SQL (because we always have primary keys an any tables).
- Secondary indexes - they allow to locate primary keys by values of other columns from primary indexes they are bound to.

Our LINQ translation layer (as well as any other pat of DO40) "thinks" of primary indexes as of tables they're bound to - if it needs a data from some table, it queries corresponding primary index. Sexondary indexes are ignored at all by LINQ translator. Why? Because LINQ translator's goal is to produce the crudest query plan (RSE query). In particular, this plan:
- Uses only primary indexes
- Uses only index scans. Index seeks (RecordSet.Seek and RecordSet.Range operations) aren't used at all.

Further this plan (RSE query) is either:
- Passed directly to RSE compiler - this happens in underlying execution engine supports query optimization. In fact, this happens for SQL providers.
- Optimized and passed to RSE compiler - this happens for indexing providers.

So we perform query plan optimization. That's what any good SQL server does. And the most interesting quersion here is what exactly our query optimizer can do, and how our query optimizer can be compared with others. Brief answer:
1) For now we're mainly working on index usage related optimizations - e.g. replacement of PrimaryIndex.Filter(...) chain to SecondaryIndex1.Range(...) .Union(SecondaryIndex2.Range(...)) .Union(...) .Join(PrimaryIndex, ...) .Select(...).Filter(...) chain. We hope this will be working in the beginning of April.
2) Further plans include operation reordering, join sequence and aggregate calculation optimizations (earlier I wrote we already maintain some additive aggregates in indexes, so that's how we'll use them).
3) If part 1) will be implemented well, we'll "beat" many embedded database engines right on start - e.g. SQLite (its query optimizer implements this part, but with quite serious limitations). Btw, we're seriously considering implement query plan branching here - this will allow us to gracefully handle queries like "Select ID, Value from Simplest where (ID=@P1) or (@P2=1)" - most of RDBMS (including Microsoft SQL Server) can't do this now. If you don't see anything suspecious in this query, check out the description here.

So we hope our query optimizer will be good enough for embedded databases almost from scratch. It should provide expected optimization results for 95% of relatively simple queries and possibility to write a query translated to any desirable execution plan, if optimizer can't help.

LINQ team short term goals

We must show:
- DO perfectly works with regular databases - in particular, this means we must provide full LINQ support for them
- Embedded databases provided by DO are reliable and fast as well. Initially "fast" means "typical LINQ queries are executed without unnecessary index scans".

That's it.

P.S. This month I feel myself quite devoted to development - last week I worked overnight for few times. That's because I feel the progress, and thus feel myself much more happier ;) In September we switched a part of DO4 team to a new project that was requiring this to be completed in time, and now these guys are back. Moreover, we've hired few new developers after NY, and for now they joined DO4 team as well. So right now there are 7 developers + me working on DO4 - that's much more then 3 developers + me before. This pushes the development forward quite fast.

Tuesday, March 10, 2009

LINQ in DataObjects.Net v4.0: details and issues

I think it's right time to give some details on implementation of LINQ in new DO. Let's start from issues:

1. Correlated subqueries. LINQ allows us to write generally anything in from clause (collectionSelector expression in SelectMany) - e.g. a correlated subquery: from a in MainQuery from b in Subquery select new {a, b} (translated by C# to MainQuery.SelectMany(a => Subquery, (a, b) => new { a = a, b = b } )). Note that Subquery is generally any IQueryable, including e.g. AnotherQuery.Where(a.X==...).

Standard SQL doesn't allow to do the same - you can't use correlated subqueries in from clause there. Quite strange, because generally this doesn't significantly affect on anything. But:
- SQL Server 2005 supports this via Cross / Outer apply statement.
- We also suppport this in RSE by ApplyProvider.

It's easy to implement its executable counterpart, but its translation to SQL is rather tricky: we've added this statement to SqlDom, although now we can complile it only for SQL Server 2005. We hope it will be possible to do this without serious limitations for Oracle and PostgreSQL as well, although we aren't going to rely on this: we'll translate correlated subqueries (actually this is possible for the most frequently used part of them) to RSE using JoinProvider or PredicateJoinProvider - in fact, we expect correlated subquery looks like NonCorrelatedSubquery.Where(correlatedExpression) in this case. In other cases we'll use ApplyProvider.

Btw, EF and LinqToSql behave the same (nearly - e.g. there is no RSE behind them ;) ).

2. DateTime and TimeSpan handling. Unfortunately, SQL Server doesn't support interval (or similar) data type, but many others do, so functions dealing with dates are quite different. We did a lot to handle this:
- First of all, we implemented corresponding abstraction layer in SqlDom.
- Secondly, we implemented unified method call translation framework - so-called MemberCompilerProvider - a base class in Xtensive.Core.Linq providing compilers (methods) for specified members and allowing to invoke them fast during the translation process. Now it allows us (and you!) to define compilers operating at any desirable translation layer (in this case, at RSE to SQL translation layer) for generally any invoked members. See sources in Xtensive.Storage.Providers.Sql.Mappings.FunctionMappings namespace (internal!) for many many examples. Here is just one:

[Compiler(typeof(DateTime), "Year", TargetKind.PropertyGet)]
public static SqlExpression DateTimeYear(SqlExpression this_)
return SqlFactory.Extract(SqlDateTimePart.Year, this_);

... To be continued.

DataObjects.Net v4.0: status update

Installer: finally this part is almost fully finished. The newest installer (check out the latest nightly build) downloads and installs all you need to either use or even compile new DataObjects.Net. The only feature we should add is test database creation (although this isn't necessary for running samples on IMDB), but this is for some later time.

LINQ support: in the end of February we've started to refactor our implementation of LINQ translator for the third time. First attempt was too crude - mainly for testing. We've finished with it in december. The second one was good, but not simple enough (i.e. too complex to deal with it further). Finally we've come to an approach allowing us to get enough simple unified implementation. This should lead LINQ translation project to completion by the end of March. Obviously, expect query optimizer for index storages - this part will be in development for few more months, but as it was mentioned, absence of this feature for now won't lead to any troubles.

Schema evolution: unfortunately, we're just starting this part. But the good news is that we decided to switch to index-based schema comparison rather then SqlDom-based. This must simplify everything a lot (index model is much simpler then SqlDom database model), and we anyway need this part developed till the launch of our upcoming file system storage provider. But on the other hand, this implies we will need to use a set of workarounds to make schema evolution layer fully compatible with the general SQL database. But for now the implementation speed is more important then internal unification - we still want to deliver this feature till the end of March.

Nightly builds: now they're live for DO4. The file with empty name here is nightly build of DO4 - for now our website improperly recognizes its display name, since we've switched to different naming pattern (to be fixed today).

Other features: almost untouched yet, mainly because of complexities with LINQ translation. It goes much slower then we initially planned, but in comparison to other known implementations we're rather far from reaching the averages - e.g. LLBLGen team spent 6 months on this. And I really don't want to reach this border.