Hacker News

In case people don't know, Mark Nottingham (the author) is the chair of the HTTP working group at the IETF. He isn't just some guy with opinions on the internet. (Sorry mnot!)

I've never found pub/sub quite the right abstraction, because almost every implementation I've seen has race conditions or issues on reconnect. Usually it's possible to lose messages during reconnection, and there are often other issues too. I usually want an event queue abstraction, not pub/sub.

I met mnot a few years ago when we (the braid group) took a stab at writing a spec for HTTP based streaming updates[1]. Our proposal is to do state synchronization around a shared object (the resource at a URL). Each event on the channel is (explicitly or implicitly) an update for some resource. So:

- The document (resource) has a version header (ETag?).

- Each event sent on the stream is a patch which changes the version from X to Y. Patches can either be incremental ("insert A at position 5") or, if the document is small, just contain a new copy of it.

- You can reconnect at any time, and specify a known version of the document. The server can bring you up to date by sending the patches you missed. If the server doesn't store historical changes, it should be able to just send you a fresh copy of the document. After that, the subscription should drip feed events as they come in.
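The reconnect flow above can be sketched in a few lines. This is a toy, not the Braid draft's actual wire format: the `Resource` class, the `("insert", pos, text)` patch shape, and the catch-up logic are all illustrative assumptions.

```python
# Toy model of versioned-resource catch-up: the server keeps a patch
# history keyed by version; a reconnecting client sends its last known
# version and gets either the missed patches or a fresh snapshot.

class Resource:
    def __init__(self, initial=""):
        self.doc = initial
        self.version = 0
        self.history = {}  # version -> patch that moves version -> version+1

    def insert(self, pos, text):
        """Apply an incremental patch and record it in the history."""
        self.history[self.version] = ("insert", pos, text)
        self.doc = self.doc[:pos] + text + self.doc[pos:]
        self.version += 1

    def catch_up(self, client_version):
        """If we still have every patch the client missed, replay them;
        otherwise fall back to sending a fresh copy of the document."""
        if all(v in self.history for v in range(client_version, self.version)):
            patches = [self.history[v] for v in range(client_version, self.version)]
            return ("patches", patches)
        return ("snapshot", self.doc, self.version)
```

A server that garbage-collects old history simply drops entries from `history`, and stale clients transparently degrade to receiving a snapshot.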

One nice thing about this is that CDNs can do highly efficient fan-out of changes. A cache could also use braid to subscribe to resources which change a lot on the origin server, in order to keep the cached values hot.

[1] https://datatracker.ietf.org/doc/html/draft-toomim-httpbis-b...

We've done some revisions since then but have been working on getting more running code before pushing forward more with the approach. Most recent draft & issues: https://github.com/braid-org/braid-spec/blob/master/draft-to...



pub/sub is a horrible abstraction. I've written about it here: http://www.adama-lang.org/blog/pubsub-sucks

My experience comes from building an exceptionally large pub/sub service, and pub/sub starts out nice until you really care about reliability. I made the mistake of patenting a protocol to sit on top of WebSocket/SSE/MQTT, as the ecosystem was a giant mess: https://uspto.report/patent/grant/11,228,626

What I learned was that without an end-to-end anti-entropy protocol, it's hard for pub/sub abstractions not to lose stuff or lie. That protocol emerged precisely because years of investment in pub/sub had yielded a leaky bucket of issues.

Braid looks interesting. I'm using JSON merge as my algebraic operator, since I have an ordered stream of updates for my system. http://www.adama-lang.org/blog/json-on-the-brain


One issue with primitive pub/sub is that it usually builds on top of unbounded queues, which are problematic in various ways. They might grow over time until the system runs out of resources and bad things happen. And even if we don't get that far, the queue might hold a ton of things that peers are no longer interested in. Pull-based systems are better here, because clients only request data when they actually need it and nothing has to be queued. But obviously this comes with some extra latency.


Buffer bloat can be a really complex, thorny issue. How do you even solve it, in principle for systems like this? What should happen if a client can't process all messages? Do you drop some? Should the server persist messages on disk? (And how / when?)

I was absolutely gobsmacked reading Kafka's API design, because Kafka avoids this problem with a single, simple API choice.

In Kafka, instead of a client subscribing indefinitely, clients can only request ranges of events. A client query will say, "Get me the next ~1 MB of events starting from offset 123." The Kafka server will send those events (from disk, or streaming), then stop sending to that client. Once the server has sent its ~1 MB quota of events, it stops, and the client is then expected to request the next 1 MB of events.

This simple API almost entirely prevents buffer bloat. (Can you see why?) If there's not enough bandwidth for the events to arrive, at most 1 MB of events will ever be buffered for that client, because the client waits to ask for the next 1 MB until after the first batch has been received. If a client processes events too slowly (or doesn't have the bandwidth to keep up), it'll just gracefully fall further and further behind, without consuming any extra server resources. It'll keep working - just slowly. And, sure - this is still a problem. But it's graceful, and it entirely fits with what people expect to happen when a client doesn't have enough bandwidth or CPU. And it's a much smaller problem than the entire server failing because of a single slow client. It's genius.
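The pull model described above can be shown with a toy in-memory log rather than a real Kafka client; `Log`, `fetch`, and the tiny 64-byte quota are illustrative stand-ins for Kafka's broker, Fetch request, and ~1 MB default.

```python
# Toy sketch of the Kafka-style pull model: the consumer asks for a
# bounded batch of events after an offset, processes them, then asks
# again. At most one batch is ever buffered per client.

BATCH_BYTES = 64  # stand-in for the ~1 MB quota

class Log:
    def __init__(self):
        self.events = []  # append-only list of byte strings

    def append(self, event: bytes):
        self.events.append(event)

    def fetch(self, offset: int, max_bytes: int):
        """Return events starting at `offset`, up to ~max_bytes total
        (always at least one event, so progress is guaranteed)."""
        batch, size = [], 0
        for e in self.events[offset:]:
            if batch and size + len(e) > max_bytes:
                break
            batch.append(e)
            size += len(e)
        return batch

log = Log()
for i in range(10):
    log.append(f"event-{i}".encode())

# The consumer drives the pace: it only requests the next batch after
# the previous one has been fully received and processed.
offset, seen = 0, []
while True:
    batch = log.fetch(offset, BATCH_BYTES)
    if not batch:
        break
    seen.extend(batch)
    offset += len(batch)
```

A slow consumer just calls `fetch` less often; the server never accumulates more than one batch of state on its behalf.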


That's a pattern to reduce queue size (the one between broker and consumer), but it still means the broker (here Kafka) needs to queue up everything that any consumer might potentially need.

It can work, and the data is at least limited by the amount and speed of producers. But it can also become a bottleneck.

> How do you even solve it, in principle for systems like this? What should happen if a client can't process all messages? Do you drop some? Should the server persist messages on disk? (And how / when?)

This will all pretty much depend on the problem domain. If one can't drop messages, then none of this helps. But usually clients do not need all the past history - they might just need the newest data. So it's possible to detect in various parts of the system (even on the server or broker, before the TCP stack) that some data is superseded by a newer version, and then drop the old variant instead of sending it too.
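Dropping superseded data can be sketched as a per-key coalescing buffer in front of the socket. The `CoalescingBuffer` name and the price-feed keys are illustrative assumptions, not from any specific system.

```python
# Sketch of "drop the old variant": an outgoing buffer keyed by
# resource, where a newer update for a key replaces the queued older
# one instead of being appended behind it.

from collections import OrderedDict

class CoalescingBuffer:
    def __init__(self):
        self.pending = OrderedDict()  # key -> latest value, in send order

    def publish(self, key, value):
        # Superseded data is dropped here, before it hits the network.
        self.pending.pop(key, None)
        self.pending[key] = value

    def drain(self):
        """Hand everything pending to the sender, oldest key first."""
        items = list(self.pending.items())
        self.pending.clear()
        return items

buf = CoalescingBuffer()
buf.publish("price/AAPL", 101)
buf.publish("price/MSFT", 330)
buf.publish("price/AAPL", 102)  # supersedes the queued 101
```

Whatever the publish rate, the buffer never holds more than one entry per key, so its size is bounded by the number of distinct resources.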

If purely "latest state" has to be transferred to a client, the max amount of data which needs to be queued is always sizeof(state). Now this gets tricky again, because the state could be big, and completely resending it might also be expensive. So a compromise solution that is sometimes used is sending the client one snapshot of the complete state when connecting, and then falling back to sending events/diffs. If a client falls behind on processing diffs, it will likely reconnect and get a new starting point. This obviously also has its gotchas, since you don't want to get to a point where clients constantly re-fetch the initial state, which is the most costly path. But at least it limits the amount of data queued anywhere in the system to "whatever is required between two state snapshots".
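The snapshot-then-diffs compromise can be sketched as a per-client session with a bounded diff queue; when the queue overflows, the client is forced back onto a fresh snapshot. `Session`, `MAX_QUEUED_DIFFS`, and the message shapes are all illustrative assumptions.

```python
# Sketch of snapshot + diffs with a bounded per-client queue: a client
# that falls too far behind gets resynced from a fresh snapshot, which
# caps the data queued for any one client.

MAX_QUEUED_DIFFS = 3  # illustrative bound

class Session:
    def __init__(self):
        self.synced = False  # has this client received a snapshot yet?
        self.queue = []      # diffs waiting to be sent

    def push_diff(self, diff):
        if not self.synced:
            return  # the next snapshot will cover this change anyway
        if len(self.queue) >= MAX_QUEUED_DIFFS:
            # Client fell behind: drop its diffs and force a resync.
            self.synced = False
            self.queue.clear()
        else:
            self.queue.append(diff)

    def next_message(self, current_state):
        """What the server should send this client next, if anything."""
        if not self.synced:
            self.synced = True
            return ("snapshot", dict(current_state))
        if self.queue:
            return ("diff", self.queue.pop(0))
        return None
```

The gotcha from the comment above shows up as tuning `MAX_QUEUED_DIFFS`: too small and clients thrash on expensive snapshots, too large and you are back to queuing.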


These are great insights by both Mnot and Josephg. I've followed up on this conversation on the HTTP working group mailing list:

https://lists.w3.org/Archives/Public/ietf-http-wg/2022JanMar...


Here is an event stream abstraction that has very strong semantics (exactly once in most cases) with simple usage examples, native HTTP and websockets APIs, strong atomicity and durability guarantees [1].

What it doesn’t have is clustering and a father that’s != null at marketing =]

[1] https://github.com/thibauts/styx


Very interesting. I've implemented something similar. It evolved out of the co-browsing solution I developed for the company I work for.

The solution uses MQTT. Clients subscribe to a topic on the server, and the server publishes patches to update the view. Patches can be incremental (a patch against the last frame), cumulative (a patch against the last keyframe), or a new keyframe. It allows for server-side rendered views. Multiple clients can subscribe to the same view and keep in sync. See: https://github.com/mmzeeman/zotonic_mod_teleview


An example proof of concept application: https://github.com/mmzeeman/zotonic_mod_doom_fire/
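The three update kinds described above can be sketched client-side. This is a generic illustration, not zotonic_mod_teleview's actual patch format: views are plain dicts here and a "patch" is just a dict merge.

```python
# Sketch of applying the three patch kinds on the client: a keyframe
# resets everything, a cumulative patch applies against the last
# keyframe, and an incremental patch applies against the last frame.

def apply_update(view, keyframe, update):
    kind, payload = update
    if kind == "keyframe":
        keyframe = dict(payload)
        view = dict(payload)
    elif kind == "cumulative":   # patch against the last keyframe
        view = {**keyframe, **payload}
    elif kind == "incremental":  # patch against the last frame
        view = {**view, **payload}
    return view, keyframe
```

Cumulative patches let a client that missed some incremental updates recover without a full keyframe, since they only depend on the last keyframe it already has.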


Did you consider using CRDTs for your document sync? It sounds like what you were implementing was somewhere between OTs and CRDTs and trying to create a standard event protocol for them?

CRDTs remove the need to keep track of all (or even just recent) revisions on the server, as you would with OTs; your server can be stateless and just act as a message broker between clients.

Edit:

Should have followed your links, yes that’s exactly what you are doing.

https://braid.org/


Thanks a lot for sharing this, it’s very relevant to my work at the moment.





