Whenever I have a chance, I extol the virtues of message-passing and event-loops in structuring concurrent workflows in Rust. However, in my wide-eyed enthusiasm, I tend to make it sound almost as if it cannot go wrong.
So today, let's go into some detail about the archetype of buggy message-passing code.
I think this bug applies to pretty much all message-passing code (the exception might be using synchronous bounded channels).
Even though I think native threads and Crossbeam channels are easier to work with than their equivalents from the async/futures ecosystem (what do you mean the impl Future trait bound is not met? It's the channel from the !^@#% futures crate! 15 minutes later: ah, ok), I think the power of message-passing and event-loops really lies in how they logically abstract over the underlying concurrency mechanism.
What you need is something representing a channel/source of events, and something representing a thread of execution; that's it.
And this bug is essentially about how to break those.
This story is based on actual events. In certain cases incidents, characters and timelines have been changed for dramatic purposes. Certain characters may be composites, or entirely fictitious.
The pledge
Ok so first of all, we're using Rust. So that means we can be totally Fearless in our approach to concurrency, like:
By leveraging ownership and type checking, many concurrency errors are compile-time errors in Rust rather than runtime errors. (source: https://doc.rust-lang.org/book/ch16-00-concurrency.html)
While this is true, in the sense that you can't accidentally share some variable between threads, it doesn't prevent you from writing buggy concurrent business logic.
So, the Rust compiler has your back when it comes to certain classes of mistakes, but it cannot actually show you how to ensure your logic is something you can reason about, something with deterministic outcomes (up to a point).
So then, what can? Well, you can try using locks, although that will probably require pairing them with a condvar, or, if you can, go the easier route and use a combination of:
- components running an event-loop,
- channels, for sending and receiving messages (aka events).
The two are intrinsically linked: a component running an event-loop is simply a thread of execution that happens to be driven by a loop in which messages are received sequentially on one or several channels (via a select).
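As an illustration, here is a minimal sketch of such an event-loop using crossbeam channels (the message type, channel names, and overall setup are made up for the example, not taken from any real codebase):

```rust
use crossbeam_channel::{select, unbounded, Receiver};
use std::thread;

enum Msg {
    Work(u32),
    Quit,
}

// A minimal event-loop: one thread of execution, driven by messages
// received sequentially on one or several channels (via `select!`).
fn spawn_event_loop(data_rx: Receiver<Msg>, ctrl_rx: Receiver<Msg>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        select! {
            recv(data_rx) -> msg => match msg {
                Ok(Msg::Work(n)) => println!("handling work item {}", n),
                Ok(Msg::Quit) | Err(_) => break,
            },
            recv(ctrl_rx) -> msg => match msg {
                Ok(Msg::Quit) | Err(_) => break,
                Ok(Msg::Work(_)) => {},
            },
        }
    })
}

fn main() {
    let (data_tx, data_rx) = unbounded();
    let (_ctrl_tx, ctrl_rx) = unbounded();
    let handle = spawn_event_loop(data_rx, ctrl_rx);
    // Sends from this one thread, on this one channel, are handled in order.
    data_tx.send(Msg::Work(1)).unwrap();
    data_tx.send(Msg::Quit).unwrap();
    handle.join().unwrap();
}
```

The component is just a thread driven by a loop; swap the native thread for a task and the crossbeam channels for async ones, and the shape stays the same.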
One example of a thread of execution that is not a native thread is Chromium's sequence. I think that one really has a great name, since it accurately conveys the concept.
The magic consists of having multiple such event-loops communicate with each other, enabling you to build complicated (yet hopefully as simple as possible) concurrent workflows with deterministic characteristics.
What is this determinism based on? I think it comes down to two things:
- Multiple sends from the same thread of execution are received in order (see the sketch just below this list).
- While an event-loop is handling a given message, it only operates on data it owns uniquely.
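To illustrate point 1, here is a minimal sketch using a standard library channel (nothing Servo-specific, just made-up messages): no matter how the threads are scheduled, the receiver sees the messages in the order they were sent, because they all come from a single sender thread.

```rust
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    let (tx, rx) = channel();

    // A single thread of execution performing multiple sends:
    // the receiver is guaranteed to see them in send order.
    thread::spawn(move || {
        tx.send("first").unwrap();
        tx.send("second").unwrap();
        tx.send("third").unwrap();
    });

    assert_eq!(rx.recv().unwrap(), "first");
    assert_eq!(rx.recv().unwrap(), "second");
    assert_eq!(rx.recv().unwrap(), "third");
}
```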
Regarding point 2, it should be noted that combining shared state with event-loops is usually a recipe for subtle disaster. The reason is that while shared state can be effectively protected by a lock, the lock itself doesn't offer any kind of synchronization with the message-passing. So while it may appear to work, your logic is likely to be fundamentally broken by faulty assumptions about a (non-existent) synchronization between the messages and the lock(s).
The fix is usually to, for a given workflow, either only use messages, or only use locks (and then pair them with a condvar for signalling).
Finally, when I write "for a given workflow", this still allows workflows to be nested within each other, where each level can use a different model. For example, you could have a top-level workflow involving communicating event-loops and messages, and then one of these components could still internally own nested workflows, and those could then involve, say, some locks and condvars (example here).
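For readers who haven't used the lock-plus-condvar pairing before, here is a minimal sketch of what that signalling looks like (a generic example, not code from the linked PR): one thread flips a flag under the lock and notifies, the other waits on the condvar until the flag is set.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    // Shared state protected by a mutex, with a condvar for signalling.
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let pair_for_worker = Arc::clone(&pair);

    // Worker: flip the flag under the lock, then notify the waiter.
    thread::spawn(move || {
        let (lock, cvar) = &*pair_for_worker;
        let mut done = lock.lock().unwrap();
        *done = true;
        cvar.notify_one();
    });

    // Waiter: loop on the condvar until the flag has been set.
    let (lock, cvar) = &*pair;
    let mut done = lock.lock().unwrap();
    while !*done {
        done = cvar.wait(done).unwrap();
    }
    println!("nested workflow finished");
}
```

Note that the condvar only synchronizes the threads around this particular flag; it offers no ordering relative to any messages flowing elsewhere, which is why mixing the two models within one workflow is the thing to avoid.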
And when does this determinism break down? Let's take a look in the next section.
The turn
Let's start with an actual real-world example, and then try to turn it into something more abstract.
So, as I was working on this PR in Servo, I was surprised by some intermittent failures of a test that showed up after a change (not the final change in the PR, by the way; let's say it showed up in between commits).
The background for this is the following:
- There is a workflow crossing components running independently (previously described here).
- The two components are named script, and net.
- The goal is for script to stream chunks of data to net, and then shut down the workflow once done, or if there is an error.
- The workflow is pull-based, where it is net that requests the next chunk, and then script tries to read it from a source and forward it on.
- On the side of script, it involves two threads: one for receiving IPC messages, known as the IPC router thread, and another one for running code in a runtime (reading data from the source involves running code in this runtime). Let's call it the script thread.
- When reading data from the source, three things can happen: we get a chunk of data, we get a done flag, or we get an error (sketched as a small enum below).
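In code, the outcome of such a read could be modelled with a small enum along these lines (a hypothetical sketch for illustration, not the actual Servo types):

```rust
// Hypothetical modelling of the three possible outcomes of reading
// a chunk from the source (illustration only, not the Servo types).
enum ReadResult {
    Chunk(Vec<u8>),
    Done,
    Error(String),
}
```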
So before the PR, the IPC router thread would simply receive messages, in a kind of event-loop, from the net component asking for a chunk of data. The script thread would then receive a message from the router thread, and then run some user-supplied code in order to try to read a chunk.
When the result of the operation became available, the script thread would then directly send the result to net, which would be either a chunk of data or a "stream is done" signal.
Furthermore, in the done case, the script thread would, to trigger the shutdown of the workflow, first drop its sender to net, and then also send a message to the IPC router thread, telling it to drop its sender to net as well. The idea was that this would disconnect the corresponding receiver in net, which would result in dropping the sole sender used to request chunks of data, which would in turn disconnect the receiver on the IPC router thread.
Wow, what a shutdown workflow.
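The disconnect cascade relies on standard channel behavior: once every sender for a channel has been dropped, the receiving end observes a disconnection, but only after any messages already sent have been delivered. A minimal sketch of that behavior (made-up names, standard library channels rather than IPC ones):

```rust
use std::sync::mpsc::{channel, RecvError};
use std::thread;

fn main() {
    let (chunk_tx, chunk_rx) = channel::<Vec<u8>>();

    let handle = thread::spawn(move || {
        chunk_tx.send(vec![1, 2, 3]).unwrap();
        // Dropping the last sender disconnects the receiver below.
        drop(chunk_tx);
    });

    // The message sent before the drop is still delivered...
    assert_eq!(chunk_rx.recv().unwrap(), vec![1, 2, 3]);
    // ...and only then does the receiver see the disconnect.
    assert_eq!(chunk_rx.recv(), Err(RecvError));

    handle.join().unwrap();
}
```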
So this was actually working fine (with some exceptions); however, the problem arose when I added some code to make a distinction between the done and the error case.
Basically, both done and error should result in shutdown of the workflow; however, the error case required first doing something to propagate the error elsewhere in net.
And this is where the archetypal message-passing bug showed up.
My initial take on propagating the error had the script thread follow this sequence in the error case:
- Propagate the error via a message to net.
- Shut down the workflow like before, via a message to the IPC router thread, which would then propagate the shutdown of this particular workflow to net.
And the bug manifested itself through a test, the one asserting that the error would propagate, intermittently failing.
This actually took a few println! to figure out, and soon the problem became clear: I was expecting two messages from two different threads to magically synchronize their ordering.
Remember what I wrote above:
Multiple sends from the same thread of execution are received in order.
Yep, it doesn’t work when you send multiple messages, from multiple threads.
This is very clear when you read it like that, but it can creep up in business logic undetected until a test starts failing intermittently.
So this is what the script thread and the IPC router thread were doing:
- script sends the error message to net, expecting it to be handled and the error to propagate.
- script sends the done message to the IPC router thread, expecting that thread to then send a message to net to shut down the workflow.
Result: two messages, one error and one shutdown, are sent to net from two different threads, yet the logic expects the error to be received, and handled, by net first.
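Stripped of the Servo specifics, the faulty shape boils down to something like the following sketch (made-up message names, a plain standard library channel instead of the IPC machinery, and the router's trigger elided): two different threads each send one message to net, and nothing constrains which one arrives first.

```rust
use std::sync::mpsc::channel;
use std::thread;

#[derive(Debug)]
enum NetMsg {
    Error,
    Shutdown,
}

fn main() {
    // The receiving end plays the role of the net event-loop.
    let (net_tx, net_rx) = channel::<NetMsg>();
    let router_tx = net_tx.clone();

    // "script thread": sends the error directly to net.
    let script = thread::spawn(move || {
        net_tx.send(NetMsg::Error).unwrap();
    });

    // "IPC router thread": sends the shutdown to net.
    let router = thread::spawn(move || {
        router_tx.send(NetMsg::Shutdown).unwrap();
    });

    // Depending on scheduling, net can see Error then Shutdown,
    // or Shutdown then Error: no ordering across sender threads.
    for msg in net_rx.iter().take(2) {
        println!("net received: {:?}", msg);
    }

    script.join().unwrap();
    router.join().unwrap();
}
```

In the real code there was more structure (the router only sent the shutdown after hearing from the script thread, and the messages travelled over IPC), but the rule being violated is the same: ordering is only guaranteed for sends made from the same thread of execution.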
Yes, if using synchronous bounded channels, you wouldn't run into this problem. However, that could also dramatically reduce the throughput of the various components. Topic for another post.
On another note, as was pointed out by matklad, the shutdown part of the problem could be prevented by not sending an explicit shutdown message, and instead relying on dropping senders to signal shutdown. In that case, regardless of how threads are scheduled, so long as the script thread is holding on to its sender, and/or has already sent the error message, that message would be guaranteed to be handled at some point before the disconnect signal reaches the receiver.
However, this is about the archetype of message-passing bug, and there are countless situations where you might need to perform operation 1 followed by operation 2, where 2 is not a shutdown, and the bug might show up. When 2 is a shutdown, you can prevent this class of errors by relying on dropping senders for signalling.
In this particular case, I had to include an explicit shutdown message.
That was a sneaky one, and the reason is that the script thread would actually send the messages in order, however not to the same thread, and intermittently the shutdown message, sent by the IPC router thread, would be received before the error message sent by the script thread.
The IPC router thread would only send the shutdown message AFTER having received the second message sent by the script thread, but "second message" is meaningless here, because these are not messages to the same thread, and hence no ordering can be expected.
Yep, I know, this is a somewhat flippant way to write about some code. But almost every message-passing bug can be traced back to something like that. What makes them sneaky is the complexity of the business logic, which can hide the various fictitious ordering assumptions that are built into it.
Well, glad there are tests to catch these things; however, you have to be lucky to have these tests to begin with. If this can sneak into business logic, it also means it might not be covered by tests. Servo is lucky to be able to rely on the shared Web Platform Tests for broad coverage of various logic.
Next, let's see how we can bring back some order to this chaos.
The prestige
So the ordering breaks down, because two messages are sent to two different threads of execution running two different event-loops.
Actually, the HTML spec has a nice concept abstracting over this: the parallel queue.
This parallel queue is essentially the prototype of a thread of execution running an event-loop, defined in this case as continuously running algorithm steps that have been enqueued to it (similar to sending a message on a channel, really).
This is useful because it makes it very clear that if you enqueue two sets of steps to two different parallel queues, they will run independently and therefore not synchronize their ordering in any way (this is like sending two messages on two channels to two different threads).
Also, if you have two different in-parallel algorithms both enqueuing steps to the same parallel queue, you cannot know which set of steps will run first (this is like having two threads send two messages on the same channel to the same third thread).
You only get ordering guarantees when queuing steps to the same parallel queue, from the same algorithm.
Once again:
Multiple sends from the same thread of execution are received in order.
So in our case described earlier, we want to send two messages:
- Error (if there is one),
- Stop/Shutdown (always, when the workflow is finished, although potential errors should be handled first).
Those two messages need to be handled in that order by net. Yet they are sent by script from two different threads: the script thread sends the potential error, then it sends the shutdown to the IPC router thread, which, after shutting down its local part of the workflow, also sends it on to the net component.
But we always want the net component to first handle any potential error message.
So, using parallel queues, we need to ensure the error and done steps are enqueued in order, to the same queue, from a single algorithm.
Using Rust channels and some sort of thread of execution running an event-loop, we need to send two messages on the same channel, from the same thread.
And of course, that thread can then send further messages to another thread.
So the solution is to use one thread of execution running an event-loop (ok, let's just call it a parallel queue) to serialize the error and stop operations, and from that parallel queue we can then enqueue further steps onto another parallel queue (the one in net).
Simply stated: we need both messages to go through the IPC router thread, which then communicates with net. In other words, we cannot have the script thread send two messages, one to net and the other to the IPC router thread (which then sends another message to net), and expect the script thread -> net message to be handled before the IPC router thread -> net message.
So we restructure the operation: the script thread simply sends either an error or a done message to the IPC router thread, and it's that thread that is then responsible for either (see the sketch after this list):
A. First send the error to net, if there was an error, and then shut down the workflow, or
B. Immediately shut down the workflow when receiving the done message.
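Staying with the earlier simplified sketch (made-up names and plain standard library channels, not the actual Servo code), the restructured flow looks roughly like this: the script thread only ever talks to the router thread, and the router thread is the single thread of execution sending to net, so the error (if any) and the shutdown are two sends from one thread and arrive in order.

```rust
use std::sync::mpsc::channel;
use std::thread;

#[derive(Debug)]
enum NetMsg {
    Error,
    Shutdown,
}

// What the script thread tells the IPC router thread (case A or B).
enum StopReason {
    Errored,
    Done,
}

fn main() {
    let (net_tx, net_rx) = channel::<NetMsg>();
    let (router_tx, router_rx) = channel::<StopReason>();

    // "IPC router thread": the single thread of execution talking to net,
    // so the error (if any) and the shutdown are sends from one thread,
    // and are therefore received by net in that order.
    let router = thread::spawn(move || match router_rx.recv().unwrap() {
        StopReason::Errored => {
            net_tx.send(NetMsg::Error).unwrap();
            net_tx.send(NetMsg::Shutdown).unwrap();
        }
        StopReason::Done => {
            net_tx.send(NetMsg::Shutdown).unwrap();
        }
    });

    // "script thread": reports the outcome of the read to the router thread only.
    let script = thread::spawn(move || {
        let read_failed = true; // pretend the read from the source errored
        let reason = if read_failed { StopReason::Errored } else { StopReason::Done };
        router_tx.send(reason).unwrap();
    });

    // net now always sees Error (if any) strictly before Shutdown.
    for msg in net_rx.iter() {
        println!("net received: {:?}", msg);
    }

    script.join().unwrap();
    router.join().unwrap();
}
```

The point is not the enum itself but the single path to net: there is now exactly one thread whose sends determine what net sees, and in which order.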
In the actual code, this shows up through the introduction of an enum modelling A or B:
https://github.com/servo/servo/pull/26906/files#diff-ae0ff1fd98b06dfb13baf427cdffc28aR64
and in the IPC router thread, when receiving either the error or the done message from the script thread, the appropriate sequence of actions is taken:
https://github.com/servo/servo/pull/26906/files#diff-ae0ff1fd98b06dfb13baf427cdffc28aR136
The stop_reading operation on the IPC router thread will then simply send either the error or the done message, at:
https://github.com/servo/servo/pull/26906/files#diff-ae0ff1fd98b06dfb13baf427cdffc28aR205
And finally, in net, the messages will be handled in a way that propagates a potential error:
https://github.com/servo/servo/pull/26906/files#diff-aa469beb5619907dbccd88364264b9b8R544
So, for the few who haven't navigated away by now, let's recap:
We had a bug that was caused by having:
One parallel queue enqueuing two sets of algorithmic steps onto two different parallel queues, all the while expecting some sort of ordering to be preserved.
and this was fixed by instead having:
One parallel queue enqueuing a single set of algorithmic steps onto a single other parallel queue, having these steps result in a local state change when run, followed by enqueuing another set of steps onto yet another single parallel queue (which, when handled, would also result in local state changes).
The result is actual deterministic behavior, where any error will propagate before the shutdown of the streaming workflow.
The end
Ok, I think this article had the highest ratio of lines of English to lines of code so far; it was really zooming in on something that could easily be waved away as an irrelevant small bug, and yet I think this actually lies at the basis of almost every bug you encounter with message-passing code.