Thursday, April 03, 2008

Status update

1. We're launching our new web site next week. New features include the announced ones, as well as a few new ones. The most useful is global search: you can search our support forum, news (blogs) and the web site itself using a single search form on the web site. We'll probably add search in our online help later.

2. The part of our team that has been working on it for the last 4 weeks is returning to v4.0 now. We're now targeting the beginning of May for the CTP delivery - the web site has taken an extra 1.5 weeks.

3. Funny or not, the new web site development led to a few new projects, which will become publicly available around the release of DataObjects.Net 4.0:

Xtensive.Web

It contains web crawler and e-mail sending components. Nothing very special.

As you might expect, the whole search feature depends on this component: we simply crawl everything we need into the built-in database.
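
A minimal sketch of the idea, assuming a hypothetical IPageStore abstraction (the actual Xtensive.Web API is not shown here):

// Illustrative only: fetch pages and persist them for later full-text indexing.
using System.Collections.Generic;
using System.Net;

public interface IPageStore
{
    void Save(string url, string html); // hypothetical storage abstraction
}

public class Crawler
{
    public void CrawlInto(IPageStore store, IEnumerable<string> urls)
    {
        using (var client = new WebClient()) {
            foreach (var url in urls) {
                string html = client.DownloadString(url); // fetch the page
                store.Save(url, html);                    // persist it for indexing
            }
        }
    }
}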

Xtensive.Fulltext

Here we have implemented a text analyzer (tokenizer & token processing chain) for full-text indexing & querying, a highlight region selector and the highlighter itself.

The usual question: why? The answer: although DO offers full-text indexing & search, it doesn't offer text highlighting (selecting the "best matching" part of the text and highlighting the found words in it, independently of their forms). Moreover, since we offer our own indexing engine in v4.0, the decision to build our own full-text indexing model on top of it (based just on regular indexes - i.e. on our own ones as well) was made long before this moment, so Xtensive.Fulltext will in any case help to solve some of the problems we'll face further down this path. For now, though, this part is used just for selecting the best matching text fragments and highlighting the found words in them.
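
For illustration, here is a minimal sketch of highlighting found words independently of their forms by matching on stems; all names are illustrative (not the actual Xtensive.Fulltext API), and the stem delegate stands for any stemmer:

using System;
using System.Collections.Generic;
using System.Linq;

public static class Highlighter
{
    // Wraps every word whose stem matches a query stem into <b>...</b>,
    // so a query like "index" also highlights "indexes" and "indexing".
    public static string Highlight(string text, string query, Func<string, string> stem)
    {
        var queryStems = new HashSet<string>(
            query.Split(' ').Select(stem), StringComparer.OrdinalIgnoreCase);
        var words = text.Split(' ');
        for (int i = 0; i < words.Length; i++) {
            var word = words[i].Trim('.', ',', '!', '?');
            if (word.Length > 0 && queryStems.Contains(stem(word)))
                words[i] = words[i].Replace(word, "<b>" + word + "</b>");
        }
        return string.Join(" ", words);
    }
}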

Certainly we evaluated the Lucene.Net highlighter, but found it simply unacceptable (I don't remember exactly why now, but the reasons were serious - at the very least, it couldn't handle Russian word forms and offered a quite limited approach to best matching fragment detection).

The token processing chain (default token types are word, number, symbol, URL, e-mail, etc.) may include arbitrary token filters, such as a "stop words" eliminator, a sentence detector and, finally, a stemmer (a converter of words to their base forms). Initially we tried to take the Snowball stemmers from Lucene.Net, but found them worse than bad - most likely because they are generated for Java by the Snowball grammar compiler and then ported to .Net by an automatic tool (like the whole Lucene.Net project). The Russian stemmer doesn't work there at all because of a buggy encoding conversion, and all the stemmers are extremely inefficient by .Net code optimization standards.

So for now we're using a different implementation of the Porter stemmer for English, and we're finishing our own implementation of a Russian stemmer based on the Snowball stemming algorithm for Russian. Further plans (not for "right now", of course) include either a Snowball-grammar-to-.Net compiler, or simply a set of stemmers for the most common European languages.
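
To make the chain idea concrete, here is a minimal sketch of composable token filters (a stop word eliminator and a stemming filter); all type names are illustrative, not the actual Xtensive.Fulltext API:

using System;
using System.Collections.Generic;
using System.Linq;

public enum TokenKind { Word, Number, Symbol, Url, Email }

public class Token
{
    public string Value;
    public TokenKind Kind;
    public Token(string value, TokenKind kind) { Value = value; Kind = kind; }
}

public interface ITokenFilter
{
    IEnumerable<Token> Apply(IEnumerable<Token> tokens);
}

// Drops common words that carry no search value.
public class StopWordFilter : ITokenFilter
{
    static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "a", "an", "the", "and", "or" }, StringComparer.OrdinalIgnoreCase);

    public IEnumerable<Token> Apply(IEnumerable<Token> tokens)
    {
        return tokens.Where(t => t.Kind != TokenKind.Word || !StopWords.Contains(t.Value));
    }
}

// Converts word tokens to their base forms, e.g. via a Porter stemmer.
public class StemmingFilter : ITokenFilter
{
    readonly Func<string, string> stem;
    public StemmingFilter(Func<string, string> stem) { this.stem = stem; }

    public IEnumerable<Token> Apply(IEnumerable<Token> tokens)
    {
        foreach (var t in tokens)
            yield return t.Kind == TokenKind.Word ? new Token(stem(t.Value), t.Kind) : t;
    }
}

// Filters are applied one after another, forming the processing chain.
public static class TokenChain
{
    public static IEnumerable<Token> Process(IEnumerable<Token> tokens, params ITokenFilter[] filters)
    {
        foreach (var filter in filters)
            tokens = filter.Apply(tokens);
        return tokens;
    }
}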

Xtensive.Core.Serialization (namespace) and Xtensive.Xml (assembly)

This is our own serialization layer. The differences from what .Net offers are:
- Strongly typed: .Net serialization uses boxing even for serializing values of such simple types as Int32.
- Identifier compression (used by binary serialization only): every identifier is serialized as-is just once (on its first appearance); afterwards only a reference to it is serialized. This significantly reduces the binary serializer's output size (almost any type name \ property name takes just 4 bytes) - see the sketch after this list.
- Arbitrary serialization formats - currently binary (fast) and XML (human readable & writable).
- Two selectable serialization strategies: quick (no queue) for small graphs & robust (with queue) for large ones. Moreover, the property serialization strategy can be selected by the object serializer itself.
- An arbitrary set of value serializers (e.g. you can write a native value serializer for your own Vector type, so it is serialized not as an object in the graph, but as the value of some property).
- Well-known objects support: the .Net serialization layer requires type substitution for this (to an IObjectReference implementor), which is certainly not good. We don't - i.e. we always allow a particular object serializer to decide whether a new object should be created or an existing one should be used.
- Custom serialization queues support: this allows serializing graphs that simply can't fit in RAM. The lack of this is probably one of the worst limitations of the .Net serialization layer.
- Etc.
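
Here is a minimal sketch of the identifier compression idea, assuming a plain BinaryWriter-based implementation (the actual Xtensive.Core.Serialization code may differ):

using System.Collections.Generic;
using System.IO;

public class IdentifierWriter
{
    readonly Dictionary<string, int> knownIds = new Dictionary<string, int>();
    readonly BinaryWriter writer;

    public IdentifierWriter(BinaryWriter writer) { this.writer = writer; }

    // Writes the full identifier only on its first appearance;
    // every further occurrence costs just a 4-byte back reference.
    public void Write(string identifier)
    {
        int index;
        if (knownIds.TryGetValue(identifier, out index)) {
            writer.Write(index);          // back reference: 4 bytes
        } else {
            writer.Write(-1);             // marker: a new identifier follows
            writer.Write(identifier);     // full text, written just once
            knownIds.Add(identifier, knownIds.Count);
        }
    }
}

The reader side would mirror this: on the -1 marker it reads the string and appends it to its own list; otherwise it looks the identifier up by index.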

As I wrote before, we need this to get the "default" content (common dictionaries, such as Currency, Sex, etc.) imported into the web site database, plus to handle download \ product definitions (Product, Edition, Version, etc. in the web site model). By the way, for now only deserialization works (i.e. is tested) - that's all we need at the moment; serialization will certainly be added shortly.

6 comments:

  1. In an earlier blog post you mentioned a delivery date (that was already delayed) of March 7. Now it's scheduled for the beginning of May?
    I see a lot of new stuff, but not a single directly ORM-related feature (like paging support, permissions in the db, removing FastLoadData, minimizing deadlocks, etc.).

  2. a) Correct.

    b) Concerning the new stuff: I agree with you. This means we're aiming at a kind of database + ORM combination. As I wrote before (starting from the Platform Vision), we see a lack of really scalable data storages like Google's BigTable, and demand for such storages will grow quite fast in the near future (applications are migrating to the SaaS model; databases are getting bigger and bigger). So we want to make a complete offering here: a good ORM as the front end, and generally any relational storage, including our own, as the back end.

  3. Btw, here I'm trying to explain why we feel we can compete with the existing ones, such as SQL Server (certainly only in some cases): most of them inherit the architecture of databases developed in the 80s and 90s. The environment (and thus the goals) at that time was completely different, and the design \ implementation they have doesn't look so good now.

    We noticed this when work on our in-memory indexing engine started, which in turn led us to the following conclusion: we can do better. Not just for in-memory indexing, but even for in-cluster indexing. And although we aren't fighting for clusters right now, we're trying to deliver an architecture that will allow us to do this in the near future.

  4. Concerning the features you enumerated:
    - Paging is actually a rather simple problem. I'm not sure it's even worth explaining the approaches - see the sketch after this list...
    - Permissions, when they appear, will definitely be fully relational. If this is interesting, I can briefly explain the model I have in mind. But before that we should get the core persistence working.
    - No FastLoadData: most of the ideas here have already been explained on our support forum. Briefly, we'll simply load all we can from the affected tables \ indexes (that's simple); everything else will be loaded on demand; finally, we must provide a way to express the intention to work with a set of objects and their properties - i.e. to implement preloading.
    - Deadlocks: with our own storage this is completely impossible, since we're going to support snapshot isolation only. Moreover, we'll provide a few more nice concurrency-related features you won't find in existing databases now; I'll cover them briefly here later. With regular databases, locking (or version conflict detection) will still fully depend on them.
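
    A minimal sketch of the usual approach in LINQ terms (illustrative only - this is not the final v4.0 query API; Customer and its Name property are made-up examples):

    using System.Collections.Generic;
    using System.Linq;

    public class Customer
    {
        public string Name; // made-up example type
    }

    public static class PagingDemo
    {
        // Returns 0-based page 'pageIndex' of size 'pageSize'. The ORM
        // translates Skip/Take into the storage's native paging construct
        // (e.g. ROW_NUMBER() on SQL Server 2005, or a range scan over an
        // ordered index in our own storage).
        public static List<Customer> GetPage(IQueryable<Customer> query, int pageIndex, int pageSize)
        {
            return query
                .OrderBy(c => c.Name)        // a stable order is required for paging
                .Skip(pageIndex * pageSize)  // skip the preceding pages
                .Take(pageSize)              // take exactly one page
                .ToList();
        }
    }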

  5. >> Paging is actually a rather simple problem. I'm not sure it's even worth explaining the approaches...

    It's a simple problem, but it's not efficiently solved in the 3.* codebase.

  6. I agree with this. So this time we'll provide a better solution.
