Since it's leveldb underneath the usual problems with write rate apply: its quit...

dsl · on April 28, 2013

I was also surprised this used leveldb underneath, especially since the goal is for larger than memory datasets.

For every level you add an additional disk read to every query. For optimal performance you should have RAM greater than 10% of the size of your dataset, so the OS disk cache can operate effectively.
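To make the "one more read per level" point concrete, here is a rough sketch. It assumes level 1 holds about 10 MB and each deeper level is 10x larger; those defaults, the function names, and the 10% RAM rule of thumb applied below are illustrative assumptions, and the estimate ignores bloom filters and level-0 files.

```python
def estimated_levels(dataset_bytes, level1_bytes=10 * 10**6, growth=10):
    """Count how many levels are needed before one level can hold the dataset.

    level1_bytes and growth are assumed defaults, not measured values.
    """
    levels = 1
    capacity = level1_bytes
    while capacity < dataset_bytes:
        capacity *= growth
        levels += 1
    return levels


def suggested_ram_bytes(dataset_bytes, fraction=0.10):
    """Apply the 10%-of-dataset rule of thumb from the comment above."""
    return int(dataset_bytes * fraction)


dataset = 100 * 10**9  # hypothetical 100 GB dataset
print(estimated_levels(dataset), "levels, so up to that many disk reads per uncached lookup")
print(suggested_ram_bytes(dataset) // 10**9, "GB of RAM suggested for the page cache")
```

Under these assumptions a 100 GB dataset spans five levels, so a cold lookup could touch disk up to five times.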

benbjohnson · on April 28, 2013

Do you have any links for more information on writes locking for minutes at a time? I haven't seen this with LevelDB and I couldn't find anything after searching around.

hosay123 · on April 28, 2013

Try searching their mailing list for "hang" and suchlike (of course they don't openly advertise such an easily triggered flaw).

LevelDB write rate initially seems amazing, since it's simply writing unsorted keys to an append-only file until the file hits 2MB or so. For bursty loads it feels great.

But the moment writes are sustained for longer than it can merge segments (say while doing a bulk load), per-write latency spikes appear (average op time jumps from <1ms to >30000ms for a single record), and eventually it'll get so far behind that all attempts to progress will hang entirely, waiting for the background compactor to free up room in the youngest generation. The effect seems to worsen exponentially with database size. To attempt to mitigate this, when LevelDB notices it's falling behind it begins sleeping for 0.1s on every write.
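A toy simulation of that stall pattern, not LevelDB's actual code: writes fill a buffer, each full buffer becomes a file awaiting compaction, and the writer is slowed and then blocked once too many files pile up. The thresholds, the 0.1s sleep (taken from the description above), and every name here are assumptions chosen for illustration.

```python
def simulate_writes(n_writes, flush_every, compact_time,
                    slowdown_at=8, stop_at=12,
                    slow_sleep=0.1, write_cost=0.001):
    """Return per-write latencies (simulated seconds) for a toy LSM writer."""
    clock = 0.0
    buffered = 0        # writes sitting in the in-memory buffer
    pending = 0         # flushed files waiting for the background compactor
    next_done = None    # simulated time at which the compactor finishes a file
    latencies = []

    def drain(now):
        # Retire every compaction that has finished by simulated time `now`.
        nonlocal pending, next_done
        while pending > 0 and next_done <= now:
            pending -= 1
            next_done = next_done + compact_time if pending > 0 else None

    for _ in range(n_writes):
        start = clock
        drain(clock)
        if pending >= stop_at:
            clock = next_done       # hard stall: wait for the compactor
            drain(clock)
        elif pending >= slowdown_at:
            clock += slow_sleep     # soft throttle on every write
        clock += write_cost
        buffered += 1
        if buffered == flush_every:
            buffered = 0
            pending += 1
            if next_done is None:
                next_done = clock + compact_time
        latencies.append(clock - start)
    return latencies


bursty = simulate_writes(1000, flush_every=100, compact_time=0.01)
sustained = simulate_writes(2000, flush_every=100, compact_time=60.0)
print("compactor keeps up, worst write:", max(bursty))
print("compactor falls behind, worst write:", max(sustained))
```

When the compactor keeps up, every write costs about `write_cost`; when it cannot, individual writes end up waiting for most of a compaction cycle.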

It's especially easy to trigger on slow CPUs with spinning rust drives.

https://groups.google.com/forum/?fromgroups=#!topic/leveldb/...

continuations · on April 28, 2013

> But the moment writes are sustained for longer than it can merge segments (say while doing a bulk load), per-write latency spikes appear

Isn't that a problem common to all SSTable-based databases?

Is LevelDB any worse than HBase or Cassandra in this area?

snaky · on April 29, 2013

Yes, it seems Cassandra suffers from compaction too, although maybe not as badly as LevelDB.

The only solution I've found is the Castle backend from Acunu [1], but there have been no updates since 2011 [2] and it looks really heavy (with a kernel module and all that).

[1] http://www.slideshare.net/acunu/cassandra-on-castle

[2] https://bitbucket.org/acunu/fs.hg

pilooch · on April 28, 2013

Any hints on how http://symas.com/mdb/ behaves?

Been using Tokyo Cabinet for a long time now, and have repeatedly hit similar hangs. Shopping for a future datastore of this sort!

hosay123 · on April 28, 2013

MDB had no issue on the same hardware and workload where I discovered the LevelDB behaviour. It does not defer work (unless running asynchronously, in which case the OS is technically deferring work), so performance is a predictable function of database size, and unaffected by prior load.

Tokyo Cabinet should behave similarly. Can you tell us a bit more about your setup?

pilooch · on April 29, 2013

Sure, we're using the TC hash datastore over tens of millions of entries (not hundreds of millions or billions), each being a compressed protobuf. The cost of each write grows exponentially after some time. We've played with the parameters (bucket size and count, cache) and tried SSDs vs regular HDs, with no big improvement. We've considered writing a sharded version of TC (there are a couple of implementations already, IIRC). Typically, the problem seems to be related to the size of the file on disk and the number of buckets. At some point, the reads and writes become prohibitive (at least for our usage).
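For anyone checking the bucket count against the record count, a back-of-the-envelope sketch. The 0.5x to 4x range is the guideline I recall from the Tokyo Cabinet documentation for the bucket array, and 131071 is the default I remember, so treat both as assumptions; the function names and the 20 million figure are hypothetical.

```python
def bucket_load(records, buckets):
    """Average number of records that land in each hash bucket."""
    return records / buckets


def suggested_bucket_range(records, low=0.5, high=4.0):
    """Bucket counts between low*records and high*records (assumed guideline)."""
    return int(records * low), int(records * high)


records = 20_000_000          # hypothetical: "tens of millions" of entries
default_buckets = 131071      # assumed default bucket count

print("load with default buckets:", round(bucket_load(records, default_buckets), 1))
print("suggested bucket range:", suggested_bucket_range(records))
```

If the load per bucket is in the hundreds, each lookup has to walk a long collision structure, which would fit the slowdown growing with file size.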

We like the speed of these datastores as some of our algorithms proceed with millions of calls every few seconds or so, and we like that it is not remote.

jamesaguilar · on April 28, 2013

If it is a bulk load, surely slow writes aren't a serious issue as long as throughput is good? Or are you saying the average write takes 30s?
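The throughput-versus-latency distinction in numbers, as a sketch with made-up figures (one million writes at 0.5 ms plus a single 30 s stall; nothing here is measured):

```python
def summarize(latencies):
    """Return mean latency, worst latency, and overall writes per second."""
    total = sum(latencies)
    return {
        "mean": total / len(latencies),
        "max": max(latencies),
        "throughput": len(latencies) / total,
    }


# Hypothetical bulk load: 999,999 fast writes and one write that stalls for 30 s.
latencies = [0.0005] * 999_999 + [30.0]
stats = summarize(latencies)
print(stats)
```

A single long stall barely moves the mean or the throughput, so a bulk load can look fine on average while one caller still waits 30 seconds.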

benbjohnson · on April 28, 2013

Thanks! I appreciate the heads up.