Sunday, January 13, 2008

Indexes: why we're working on our own indexing engine?

Let's state one of our final goals fist: we want to implement a distributed storage behind DataObjects.NET 4.0. Further I'll try to explain why.

First of all I should say that initially we haven't thought about this. We started working on indexing mainly because they were necessary to implement offline layer well - there are all the same problems, e.g. it's quite desirable to have an opportunity to find all the objects referencing a particular one (e.g. on its removal). In some cases it's desirable to check e.g. uniqueness of some value, or order the objects by value of some of its properties. So indexes are anyway necessary here. Of course, in-memory, non-concurrent indexes, which are obviously simpler to implement.

But in May we've found one quite attractive problem, which isn't yet solved, and its solution should be very interesting commercially. On the other hand, it isn't easy to solve it using the technologies we had that time - i.e. the application solving it needs extremely big storage - probably, working on hundreds of PCs. That's how the idea of implementing distributed indexes has appeared.

We started to work on solution from mid-summer, from the lowest level - distributed file system. And immediately made a set of mistakes (~ 2 man-months of work were dropped because of this) - we underestimated the problem. Finally this lead us to studying existing approaches ;) We've read all we could find about distributed databases and storage systems. The most interesting articles were Google's publications describing Chubby (distributed lock service using Paxos algorithm), GFS (Google's distributed file system) and BigTable (distributed indexed table) architecture, as well as the papers related to Microsoft's Boxwood (distributed file system) project. Google's publications were interesting mainly because of innovative ideas; Boxwood was very good as example of possible layering \ architecture of distributed file system. Key conclusions we've made:
- We really can implement distributed storage - we understood everything; moreover, there were some code metrics related to GFS \ BigTable, that were looking quite good. So if we won't make too many mistakes, it will take reasonable time.
- We liked a lot the approach used by Google - they are using so-called sandwiched (sliced) indexes. Each layer (slice) of such sandwich (except the topmost one) is read-only B+tree containing differences that should be applied to the lower part of "sandwich" to produce the state related to the current layer. Further I'll explain why this approach seems the most promising in comparison to others in near future, taking into account the trends.
- We understood that index implementations in databases we know seems much less promising then this one - from quite many points.

So there is something we need, not too complex... Moreover, it seems it's a good time to implement this, since for now something similar isn't widely available. And finally, it integrates quite well into the technology we have - DataObjects.Net v4.0 design was initially implying two kinds of storage providers (SQL-based and indexing-based), and distributed indexes can be supported as provider of the second type without almost any additional efforts (further we understood we should get some problems distributing queries, but almost immediately found a good solution). Ok, we should implement this.

Of course, not "right now" - i.e. we don't intend to deliver it asap. But we developed an implementation plan for such a storage. And as we discovered, many pieces of it can be implemented and tested even without all the distributed stuff like Paxos server and distributed file system. Basically, we need to make such indexes working concurrently over regular file system - their further "upgrade" to distributed ones is actually much less complex, after we'll get working distributed file system (again, it's not a "right now" task - for now we just have some "wrong distributed file system" code, newly working and tested Paxos server (election algorithm) using Xtensive.Messaging, untested distributed state machine and the platform for running distributed NUnit tests - Xtensive.Distributed.Test). All distributed stuff-related works are paused from mid-autumn - they really can be done later, moreover, it's always better to design any big system from topmost layers to the lowest ones: in this case requirements to lower layers immediately appear when you implement or design above ones. If you're starting from the lowest ones, you may spend a lot of time on implementing some of features you probably won't need at all. From that time we're working just on Xtensive.Storage and indexing.

Ok, now let's explain few quite essential questions:

Is transactional data indexing - all we need to implement a storage (e.g. RDBMS)?

The answer is yes. Essential features of any RDBMS include:
- Tables and indexes access and maintainance. Table is nothing more then [primary] index, which key is PK (primary key), and value includes all other columns. An index on this table is the same index again, which key includes indexed columns and value (FK - foreign key) - a key of corresponding entry from primary index (PK).
- Transactions support. This implies implementing either lock-based or row versioning-based isolation.

Everything else is actually not essential. An example of quite well-known database engine offering just these features (moreover, it provides just seek \ range access to indexes and support for transactions - i.e. as far as I remember, there is no such term as "table" at all - just "index") is Berkeley DB, which was recently acquired by Oracle.

Finally, there is well-known trend, which seems to be proved during last years: more and more complex logic tends to migrate from RDBMS layer (e.g. from stored procedures and triggers) to business logic layer - at least because it's much easier to write, maintain and test the code written on high-level languages, rather then on SQL. So having just indexing engine as RDBMS is an extremal case of this concept; and since we have very good experience related to migrating everything to BLL and providing very general abstractions there, may be it's time to take care about this "simple" storage layer as well...

Funny, but this is completely contrary to what Microsoft is doing: they're offering to "bundle" application BLL assemblies into SQL Server. We're trying to bundle RDBMS into the application domain (of course this isn't correct for the distributed database).

Why Google's approach to data indexing is more promising then other well-known ones, including the one used by SQL Server?

Well, the following fact plays key role here: the guys at Google were choosing an approach to data indexing keeping one key requirement in mind: it should be possible to distribute any index (or a set of them - i.e. BigTable) to thousands of PCs. Sandwiched index seems really the most promising option in this case:
- Any operation on it requires almost no locking in average (i.e. there will be some moments requiring locking, but these locks can be released almost immediately - they lock the structures in memory (let's say index map), and this happens without any intermediate disk i\o). AFAIK there is no other approach to implementing indexes allowing such a level of parallelism. Why this is important? Ok, imagine what is the cost of reading distributed lock status (at least one network round-trip). This approach seems quite promising now even on home PCs - the number of CPU cores is growing, as well as the amount of hard drives.
- Such indexes are quite fast for data modification operations. Basically, the time for any operation on them is ~ cost of running this operation in memory on rather small slice (i.e. B+tree containing 0...1000 keys) + cost of logging it to append-only log.
- It's possible to implement such indexes using just CreateFile, AppendToFile and DeleteFile operations. This is quite important, since appending to a distributed file can be implemented much easier and work much faster then safe overwrite of existing part of a distributed file. If you're interested in details - please refer to GFS description.
- Since there is append-only writes, there is no requirement to have same-sized pages. Moreover, there is no necessity to deal with dead pages in the index file. If I'd compare this approach to the same process in SQL Server indexes \ files, the best analogy I imagine seems "Sliced indexes" ~ ".Net garbage collection" (fast allocation, rare compactions) vs "Commonly used indexes" ~ "C++ heap management" (slow allocation, slow reclamation, no compactions).
- It's possible to compress the index with ease because of all above. Since any index page usually contains quite similar data (i.e. binary part of keys is usually the same on the same page), any compression algorithm like LZW (possibly - in conjunction with differential key compression) gives quite good results. Google reports they reduce the index size by ~ 10 times (!) by this (for their crawler). Why compression is important? Just imagine you read 10 times less from the disk.
- There are some disadvantages (e.g. any index seek operation implies seek in all the slices - imagine, if 20 of them are big (so caching works bad for them), so there can be lots of disk operations), but it seems all of them can be successfully resolved (e.g. the last issue is resolved quite gracefully with Bloom filters).

All above conclusions are listed in mentioned paper describing GFS. We've made few more:
- We understood such approach allows to implement snapshot isolation with ease. Moreover, the cost of transaction completion (rollback or commit) is again almost equal to the cost of appendinga mark to transaction log (+ version conflict detection - on commit). That's a bit strange, but Google doesn't offer this at all (for now they limit support of transactions in BigTables to support of atomic operations only - i.e. in fact there are no "honest" transactions yet).
- Finally, we can easily maintain additive measures in such indexes. Obviously they're quite useful - they reduce the time of queries like "Select avg(...) from ... where ..." from O(rowsToProcess) to O(log(totalRowCount)).

If you'd compare this to indexing in e.g. SQL Server, you'll immediately understand SQL Server database will never be distributed at all - not just "so well" (of course if they won't significantly redesign the storage). They use B+tree page locking, page overwriting, complex page reclamation... This really can't be distributed well.

Btw, such approach to indexing seems also historical: sandwiched indexes seem to require more time for background maintainance (i.e. compaction) and may take 3-5 times more space on disk, if no compression is used. It's nearly the same case as with snapshot isolation - it is much more attractive, but it appeared only in SQL Server 2005. Most likely just because SQL Server has grown up from Sybase, which was initially providing lock-based isolation. Ok, the same is correct for garbage collection - it wasn't used frequently till Java \ .Net times, because memory was quite precious resource, + CPUs were not fast enough.

Finally, there is quite important "proof of concept" that such indexes are much closer to the future: Google works quite well ;) So we like to have them too. Initially - as part of file system-based storage, and further - as the part of distributed one. Hopefully the God will help us ;)

P.S. There is one unanswered topic: so how we're going to deal with queries, if all above is true? For now I can forward you to DataObjects.Net 4.0 Platform Vision - there are some details related to this. I'll try to describe our approach to querying the data more precisely in this blog further (e.g. some aspects related to distributed queries).


  1. Thanks for the thoughtful reply. Apologies if i came off harsh, I have a lot of respect for X-tensive (DO.NET was great to work with) and I'd hate to see you guys lose ground because of what I see as common software engineering pitfalls.

    BTW - Have you guys checked out HBase? It's an open source attempt at implementing Bigtable AFAIK, I don't know how far they have come along in regards to indexing.

  2. I agree we're on rather risky way now - i.e. if our new proposal won't be competitive enough, we may loose a product at all.

    Concerning HBase - we've seen it. I can briefly compare the key features here:

    1. We're keeping ourselves closer to architecture of relational storages, rather then BigTable. BigTable is actually quite similar to a database where all the tables share the same primary key part (that's what acts as row key in BigTable); moreover, any index (except column family-related ones) in such a database includes this part of the key on the first place.

    Pros and cons:
    - It seems it's almost always possible to represent any model by such a way; at least with several BigTables. But actually this looks a bit strange for the first glance ;)
    - Such way of organizing the data \ indexes eliminates duplication of common part of any key everywhere.
    - Since they partition everything by this key, and the amount of tablets can be large (they report each tablet is ~ 128...256Mb), actually it's difficult to imagine how they use other indexes (of course if they want to find something regardless of its BigTable key). In fact, they should run a query against all the tablets in this case. But of course creation of another BigTable with different key may help here.
    - Everything else was mentioned above.

    We'll be keeping ourselves closer to relational storages. But to not loose key compression \ partitioning by key part \ index grouping features, we'll provide support for nested indexes. I.e. it will be possible to group the indexes having some common prefix in their key into one "big index" (its top-level B+tree will index this common part of the key, and nodes will be nested indexes). From the point of developer \ API this won't affect much - such aggregate index will expose their nested indexes as if they'd really share the common key part, although internally they won't store it.

    P.S. For now we don't work on this.

    2. We tend to use B+trees everywhere, not maps \ hashes. Simple estimation shows that B+tree with large page size acts almost with hash table efficiency: if I'd keep e.g. 100Gb per partition (~ 1G of keys), a B+tree with 1024 nodes (~ 64Kb of compressed data) per page will have just 3 levels. Top two of them (root page + a level below) can be cached in memory, so seek in such a tree requires just one disk IO operation. Ok, two in the worst case - this allows to address 1024 times more keys \ data.

    3. As I've mentioned, it seems quite strange for me they choose so small tablet (partition) size. We're targeting on ~ 100Gb partitions (i.e. 1000 times larger). The explanation is quite simple: the larger partition is, the less of them are affected by queries, thus distributed query needs much less network round-trips. For now don't see any serious reasons for not making partitions at least 1000 times larger then they.

    Moreover, in this case information on index partitioning can be distributed \ cached much more efficiently: index repartitioning is rather rare (once per X hours?), and if this happens, client simply asks to give him a full update (all the changes in index partitions from time X to now) after first partition miss. Maintaining even 1Pb index seems quite possible in this case: there are ~ 10000 partitions, so sending rare & small updates to each client on misses seems quite cheap.

    Note: 1Pb index is actually almost unimaginable. Ok, the whole database, but not one index ;)

    4. Probably we won't use distributed file system at all. In fact, the problem of replication can be solved on partition server level - mainly, we should just replicate transaction log to other servers (replication slaves).

    Btw, this approach even brings some benefits:
    - It's possible to avoid touching master partition server at all, while transaction is read-only, and rely on any slave, since they're always fully in sync.
    - Probably it's even possible to choose the master dynamically - i.e. for each transaction. This should allow to distribute the data read queries equally among them.

    So basically, this approach implies implementation of several distributed replicated state machines:
    - Storage server: manages indexes, distributed transactions, etc.
    - Index server manages the whole index, cares about repartitioning, etc.
    - Index partition server: manages a single partition.

    5. We store the data natively, i.e. not as strings.

    6. Large objects would never be stored in indexes. We're going to provide a converter on DataObjects.Net level, which will allow to use Stream fields. On RDBMS level they'll be represented as string columns containing URLs referencing actual stream data. The actual stream will be provided by another service (either simple or distributed stream storage). Operations with such streams will be transactional \ isolated as well.

  3. Thought about having no distributed file system a bit more today... I like this idea more and more: really, having a set of replicated state machines (servers) relying on Paxos seems quite good idea. There are few more attractive consequences:

    - It isn't necessary to deliver the data from DFS (GFS in Google) to these servers, since they keep it locally. This saves a part of network bandwidth, plus allows to forget about problems with caching of fetched data. Google says this is actually not necessary, since hitting the same data seems rather case + there is page cache on index server's level. Here it isn't necessary to even think about this - if in some case caching will be efficient, OS cache will probably handle this (although we also have page cache in B+trees).

    - GFS ensures durability of results of any atomic operation, such as appending to a file. RDBMS ensures durability of all the actions made in successfully committed transaction. A single transaction may involve a set of DFS operations, if we don't care much about optimization. If we care about this, we can reduce them to ~ a set of log appends... Why it's important to have as less DFS operations as it's possible? That's because any DFS operation is atomic, and its execution implies at least single communication round between the involved services (chunkservers in GFS case). So the less atomic operations we have, the better. But if we don't have DFS, and implement durability on RDBMS level, the atomic operation is transaction, which are more rare (~ continue longer) then DFS operations under any case.

    - I like the idea to develop such a universal component as replicated state machine itself ;) Actually we started work on it in autumn, but for now they're frozen.

    - Finally, it's pretty easy to implement DFS having such distributed indexing engine and stream servers - it's almost nothing more then index containing URLs of all the stored files (streams). Btw, since we already have measures, so e.g. calculation of folder size will be a peace of cake ;)

    So that's probably a good example showing that "top-to-bottom design" can save some time (for now - just theoretically) ;)