Hacker News

In case people don't know, Mark Nottingham (the author) is the chair of the HTTP working group at the IETF. He isn't just some guy with opinions on the internet. (Sorry mnot!)

I've never found pub/sub quite the right abstraction, because almost every implementation I've seen has race conditions or issues on reconnect. Usually it's possible to lose messages during reconnection, and there are often other issues too. I usually want an event queue abstraction, not pub/sub.

I met mnot a few years ago when we (the braid group) took a stab at writing a spec for HTTP based streaming updates[1]. Our proposal is to do state synchronization around a shared object (the resource at a URL). Each event on the channel is (explicitly or implicitly) an update for some resource. So:

- The document (resource) has a version header (ETag?).

- Each event sent on the stream is a patch which changes the version from X to Y. Patches can either be incremental ("insert A at position 5") or, if the document is small, just contain a new copy of it.

- You can reconnect at any time, and specify a known version of the document. The server can bring you up to date by sending the patches you missed. If the server doesn't store historical changes, it should be able to just send you a fresh copy of the document. After that, the subscription should drip feed events as they come in.
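The reconnect flow above can be sketched in a few lines. This is a toy, not the Braid draft's actual wire format: the `Resource` class, the `("insert", pos, text)` patch shape, and the catch-up logic are all illustrative assumptions.

```python
# Toy model of versioned-resource catch-up: the server keeps a patch
# history keyed by version; a reconnecting client sends its last known
# version and gets either the missed patches or a fresh snapshot.

class Resource:
    def __init__(self, initial=""):
        self.doc = initial
        self.version = 0
        self.history = {}  # version -> patch that moves version -> version+1

    def insert(self, pos, text):
        """Apply an incremental patch and record it in the history."""
        self.history[self.version] = ("insert", pos, text)
        self.doc = self.doc[:pos] + text + self.doc[pos:]
        self.version += 1

    def catch_up(self, client_version):
        """If we still have every patch the client missed, replay them;
        otherwise fall back to sending a fresh copy of the document."""
        if all(v in self.history for v in range(client_version, self.version)):
            patches = [self.history[v] for v in range(client_version, self.version)]
            return ("patches", patches)
        return ("snapshot", self.doc, self.version)
```

A server that garbage-collects old history simply drops entries from `history`, and stale clients transparently degrade to receiving a snapshot.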

One nice thing about this is that CDNs can do highly efficient fan-out of changes. A cache could also use braid to subscribe to resources which change a lot on the origin server, in order to keep the cached values hot.

[1] https://datatracker.ietf.org/doc/html/draft-toomim-httpbis-b...

We've done some revisions since then but have been working on getting more running code before pushing forward more with the approach. Most recent draft & issues: https://github.com/braid-org/braid-spec/blob/master/draft-to...



pub/sub is a horrible abstraction. I've written about it here: http://www.adama-lang.org/blog/pubsub-sucks

My experience comes from building an exceptionally large pub/sub service, and pub/sub starts out nice until you really care about reliability. I made the mistake of patenting a protocol to sit on top of WebSocket/SSE/MQTT, as the ecosystem was a giant mess: https://uspto.report/patent/grant/11,228,626

What I learned was that without an end-to-end anti-entropy protocol, it's hard for pub/sub abstractions not to lose stuff or lie. That protocol emerged precisely because years of investment in pub/sub had yielded a leaky bucket of issues.

Braid looks interesting. I'm using JSON merge as my algebraic operator, since I have an ordered stream of updates for my system. http://www.adama-lang.org/blog/json-on-the-brain


One issue with primitive pub/sub is that it usually builds on top of unbounded queues, which are problematic in various ways. They might grow over time until the system runs out of resources and bad things happen. And even if we don't get that far, the queue might hold a ton of things that peers are no longer interested in. Pull-based systems are better here, because clients only request data when they actually need it and nothing has to be queued. But obviously this comes with some extra latency.


Buffer bloat can be a really complex, thorny issue. How do you even solve it, in principle for systems like this? What should happen if a client can't process all messages? Do you drop some? Should the server persist messages on disk? (And how / when?)

I was absolutely gobsmacked reading Kafka's API design, because Kafka avoids this problem with a single, simple API choice.

In Kafka, instead of a client subscribing indefinitely, clients can only request ranges of events. A client query will say, "Get me the next ~1 MB of events starting from offset 123." The Kafka server will send those events (from disk, or streaming), then stop sending to that client. Once the server has sent its ~1 MB quota of events, it stops, and the client is then expected to request the next 1 MB of events.

This simple API almost entirely prevents buffer bloat. (Can you see why?) If there's not enough bandwidth for the events to arrive, at most 1 MB of events will ever be buffered for that client, because the client waits to ask for the next 1 MB until after the first batch has been received. If a client processes events too slowly (or doesn't have the bandwidth to keep up), it'll just gracefully fall further and further behind, without consuming any extra server resources. It'll keep working - just slowly. And, sure - this is still a problem. But it's graceful, and it entirely fits with what people expect to happen when a client doesn't have enough bandwidth or CPU. And it's a much smaller problem than the entire server failing because of a single slow client. It's genius.
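The pull model described above can be shown with a toy in-memory log rather than a real Kafka client; `Log`, `fetch`, and the tiny 64-byte quota are illustrative stand-ins for Kafka's broker, Fetch request, and ~1 MB default.

```python
# Toy sketch of the Kafka-style pull model: the consumer asks for a
# bounded batch of events after an offset, processes them, then asks
# again. At most one batch is ever buffered per client.

BATCH_BYTES = 64  # stand-in for the ~1 MB quota

class Log:
    def __init__(self):
        self.events = []  # append-only list of byte strings

    def append(self, event: bytes):
        self.events.append(event)

    def fetch(self, offset: int, max_bytes: int):
        """Return events starting at `offset`, up to ~max_bytes total
        (always at least one event, so progress is guaranteed)."""
        batch, size = [], 0
        for e in self.events[offset:]:
            if batch and size + len(e) > max_bytes:
                break
            batch.append(e)
            size += len(e)
        return batch

log = Log()
for i in range(10):
    log.append(f"event-{i}".encode())

# The consumer drives the pace: it only requests the next batch after
# the previous one has been fully received and processed.
offset, seen = 0, []
while True:
    batch = log.fetch(offset, BATCH_BYTES)
    if not batch:
        break
    seen.extend(batch)
    offset += len(batch)
```

A slow consumer just calls `fetch` less often; the server never accumulates more than one batch of state on its behalf.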


That's a pattern to reduce queue size (the one between broker and consumer), but it still means the broker (here Kafka) needs to queue up everything that any consumer might potentially need.

It can work, and the data is at least limited by the amount and speed of producers. But it can also become a bottleneck.

> How do you even solve it, in principle for systems like this? What should happen if a client can't process all messages? Do you drop some? Should the server persist messages on disk? (And how / when?)

This will all pretty much depend on the problem domain. If one can't drop messages, then none of this helps. But usually clients do not need all the past history - they might just need the newest data. So it's possible to detect in various parts of the system (even on the server or broker, before the TCP stack) that some data is superseded by a newer version, and then drop the old variant instead of sending it too.
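Dropping superseded data can be sketched as a per-key coalescing buffer in front of the socket. The `CoalescingBuffer` name and the price-feed keys are illustrative assumptions, not from any specific system.

```python
# Sketch of "drop the old variant": an outgoing buffer keyed by
# resource, where a newer update for a key replaces the queued older
# one instead of being appended behind it.

from collections import OrderedDict

class CoalescingBuffer:
    def __init__(self):
        self.pending = OrderedDict()  # key -> latest value, in send order

    def publish(self, key, value):
        # Superseded data is dropped here, before it hits the network.
        self.pending.pop(key, None)
        self.pending[key] = value

    def drain(self):
        """Hand everything pending to the sender, oldest key first."""
        items = list(self.pending.items())
        self.pending.clear()
        return items

buf = CoalescingBuffer()
buf.publish("price/AAPL", 101)
buf.publish("price/MSFT", 330)
buf.publish("price/AAPL", 102)  # supersedes the queued 101
```

Whatever the publish rate, the buffer never holds more than one entry per key, so its size is bounded by the number of distinct resources.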

If purely "latest state" has to be transferred to a client, the max amount of data which needs to be queued is always sizeof(state). Now this gets tricky again, because the state could be big, and completely resending it might also be expensive. So a compromise solution that is sometimes used is sending the client one snapshot of the complete state when connecting, and then falling back to sending events/diffs. If a client falls behind on processing diffs, it will likely reconnect and get a new starting point. This obviously also has its gotchas, since you don't want to get to a point where clients constantly re-fetch the initial state, which is the most costly path. But at least it limits the amount of data queued anywhere in the system to "whatever is required between two state snapshots".
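The snapshot-then-diffs compromise can be sketched as a per-client session with a bounded diff queue; when the queue overflows, the client is forced back onto a fresh snapshot. `Session`, `MAX_QUEUED_DIFFS`, and the message shapes are all illustrative assumptions.

```python
# Sketch of snapshot + diffs with a bounded per-client queue: a client
# that falls too far behind gets resynced from a fresh snapshot, which
# caps the data queued for any one client.

MAX_QUEUED_DIFFS = 3  # illustrative bound

class Session:
    def __init__(self):
        self.synced = False  # has this client received a snapshot yet?
        self.queue = []      # diffs waiting to be sent

    def push_diff(self, diff):
        if not self.synced:
            return  # the next snapshot will cover this change anyway
        if len(self.queue) >= MAX_QUEUED_DIFFS:
            # Client fell behind: drop its diffs and force a resync.
            self.synced = False
            self.queue.clear()
        else:
            self.queue.append(diff)

    def next_message(self, current_state):
        """What the server should send this client next, if anything."""
        if not self.synced:
            self.synced = True
            return ("snapshot", dict(current_state))
        if self.queue:
            return ("diff", self.queue.pop(0))
        return None
```

The gotcha from the comment above shows up as tuning `MAX_QUEUED_DIFFS`: too small and clients thrash on expensive snapshots, too large and you are back to queuing.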


These are great insights by both Mnot and Josephg. I've followed up on this conversation on the HTTP working group mailing list:

https://lists.w3.org/Archives/Public/ietf-http-wg/2022JanMar...


Here is an event stream abstraction that has very strong semantics (exactly once in most cases) with simple usage examples, native HTTP and websockets APIs, strong atomicity and durability guarantees [1].

What it doesn’t have is clustering and a father that’s != null at marketing =]

[1] https://github.com/thibauts/styx


Very interesting. I've implemented something similar. It evolved out of the co-browsing solution I developed for the company I work for.

The solution uses MQTT. Clients subscribe to a topic on the server, and the server publishes patches to update the view. Patches can be incremental (a patch against the last frame), cumulative (a patch against the last keyframe), or a new keyframe. It allows for server-side rendered views. Multiple clients can subscribe to the same view and keep in sync. See: https://github.com/mmzeeman/zotonic_mod_teleview


An example proof of concept application: https://github.com/mmzeeman/zotonic_mod_doom_fire/
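The three update kinds described above can be sketched client-side. This is a generic illustration, not zotonic_mod_teleview's actual patch format: views are plain dicts here and a "patch" is just a dict merge.

```python
# Sketch of applying the three patch kinds on the client: a keyframe
# resets everything, a cumulative patch applies against the last
# keyframe, and an incremental patch applies against the last frame.

def apply_update(view, keyframe, update):
    kind, payload = update
    if kind == "keyframe":
        keyframe = dict(payload)
        view = dict(payload)
    elif kind == "cumulative":   # patch against the last keyframe
        view = {**keyframe, **payload}
    elif kind == "incremental":  # patch against the last frame
        view = {**view, **payload}
    return view, keyframe
```

Cumulative patches let a client that missed some incremental updates recover without a full keyframe, since they only depend on the last keyframe it already has.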


Did you consider using CRDTs for your document sync? It sounds like what you were implementing was somewhere between OTs and CRDTs and trying to create a standard event protocol for them?

CRDTs remove the need to keep track of all (or even just recent) revisions on the server, as you would with OTs; your server can be stateless and just act as a message broker between clients.

Edit:

Should have followed your links, yes that’s exactly what you are doing.

https://braid.org/


Thanks a lot for sharing this, it’s very relevant to my work at the moment.





