
Without knowing more about their schema, it's hard to speculate. However, let's assume they had a single, large "checkins" collection, sharded by user ID as they said. That collection would be distributed across the shards. If the mongod process is down on one shard and therefore the shard isn't responding, that shard's data just wouldn't be available.
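
To make that assumption concrete, here is a rough sketch of routing by user ID. The chunk boundaries, shard names and the fetch_checkins helper are all made up for illustration; this is not MongoDB's actual routing code, just the general idea of mapping a shard key to one shard.

```python
import bisect

# Hypothetical layout: shard i holds user IDs below UPPER_BOUNDS[i]
# and at or above the previous bound.
UPPER_BOUNDS = [1000, 2000, 3000]
SHARDS = ["shard0", "shard1", "shard2"]


class ShardDown(Exception):
    """Raised when the shard that owns the requested data is not responding."""


def shard_for(user_id):
    """Return the name of the shard whose range contains user_id."""
    idx = bisect.bisect_right(UPPER_BOUNDS, user_id)
    if idx >= len(SHARDS):
        raise KeyError(f"no shard covers user_id {user_id}")
    return SHARDS[idx]


def fetch_checkins(user_id, live_shards):
    """Pretend to read a user's checkins; fail if the owning shard is down."""
    shard = shard_for(user_id)
    if shard not in live_shards:
        raise ShardDown(shard)
    return f"checkins for user {user_id} from {shard}"
```

With shard1 down, users 1000 to 1999 get an error while everyone else is still served, which is the "data just wouldn't be available" case.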

But it sounds like the shard was still up. As per the sharding FAQ[1]:

"What if a shard is down or slow and I do a query? If a shard is down, the query will return an error. If a shard is responding slowly, mongos will wait for it. You won't get partial results."

They said they had a performance issue on the overloaded shard, so perhaps mongo believed the shard to be up when it was in fact overloaded. Any query that touched it would then just wait, taking down the whole site.
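
The behaviour the FAQ describes can be sketched as a toy scatter-gather. This is only a model of the semantics quoted above (error if a shard is down, wait if it is slow, never partial results), not how mongos is actually implemented; the shard dictionaries and function names are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ShardDownError(Exception):
    """Raised when a shard does not respond at all."""


def query_shard(shard):
    """Simulate one shard: fail if down, otherwise answer after its delay."""
    if shard["down"]:
        raise ShardDownError(shard["name"])
    time.sleep(shard["delay"])  # an overloaded shard just has a large delay
    return shard["rows"]


def scatter_gather(shards):
    """Query every shard and return all rows, or raise. Never partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(query_shard, s) for s in shards]
        rows = []
        for future in futures:
            # result() blocks until that shard answers and re-raises its error,
            # so the whole query is as slow as the slowest shard.
            rows.extend(future.result())
    return rows
```

In this model one overloaded shard delays every query that fans out to it, and with no timeout the caller has no way to give up early.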

As I said, this is speculation and as a massive production user of MongoDB ourselves[2], I'm interested to know more.

[1] http://www.mongodb.org/display/DOCS/Sharding+FAQ
[2] http://blog.boxedice.com/2010/02/28/notes-from-a-production-...



"What if a shard is down or slow and I do a query? If a shard is down, the query will return an error. If a shard is responding slowly, mongos will wait for it. You won't get partial results."

That's an approach to CAP that chooses consistency over availability. I wonder why, as it isn't a bank and they could have chosen availability instead.


If in doubt, choose consistency.


But it wasn't the overloaded shard that brought the site down (it was only slowing things down), it was the addition of the new shard:

"For reasons that are not entirely clear to us right now, though, the addition of this shard caused the entire site to go down."


If data was being moved to the new shard, that could put even more load on an already overloaded shard.


I know nothing about MongoDB but why would one shard that's responding slowly affect queries to all the other shards and take the whole site down? I can't believe any database system would be designed to work like that.



