> No, you aren't suppose to be able to see how the word-count works by looking at this code. The complexity happens elsewhere, setting up the state-machine.
This is like most of the reason I opened this article. I wish they'd spend more time talking about this. In fact, most of the README covers details that are important, but, irrelevant to their algorithmic improvements.
Maybe they expect you to open up the code, but even there the comments are largely describing what things do and not why, so I feel you need to have some understanding of the algorithm in question. For example, the parts where the state table gets constructed rather unhelpful comments to the uninformed:
/**
* Build an ASCII row. This configures low-order 7-bits, which should
* be roughly the same for all states
*/
static void
build_basic(unsigned char *row, unsigned char default_state, unsigned char ubase)
...
/**
* This function compiles a DFA-style state-machine for parsing UTF-8
* variable-length byte sequences.
*/
static void
compile_utf8_statemachine(int is_multibyte)
Even `wc2o.c` doesn't delve into the magic numbers it has chosen. I was hoping this repo would be more educational and explain how the state table works and why it's constructed the way it is.
Does anyone have a good resource for learning more about asynchronous state-machine parsers, that also could hopefully help explain why this is better than just iterating over characters? I'm guessing maybe it's the lack of branching?
You can figure it out by just looking at the table and output. It prints the number of lines, then words, then characters. Since the lines count is counts[1], you can conclude that when a newline is encountered, the machine will transition to state 1, hence why the newline is the only 2 entry in the character table and all 2 entries in the state machine point to state 1. From the character table, we can see that characters are split between classes 0, 1, 2 which are word, space, and newline characters.
To get the definitions of all other states, by working back from the fact that state 2 is the word count, it must be entered when a word character (class 0) is hit after some space/newline characters. Hence state 0 represents the machine being in the middle of a sequence of whitespace, state 1 is when the machine is in a sequences of newlines, state 2 is the first letter of a word, and state 3 is any subsequent letter of a word. You can then verify that the machine transitions to the correct state from those by checking what it does in case of a word, space, or newline character. But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].
As for why the approach is better, I think it's because it avoids branching. If you think about the length of a word, it's quite short compared to a CPU pipeline, so on a branching version of this you're going to be stuck spending most of your execution time on branch miss-prediction.
There are fewer branches but there is now a data dependency between loop iterations which makes each iteration slightly slower (maybe 1-2 cycles additional latency per iteration).
Because newlines are relatively common and unpredictable a state machine is likely better. But on a long file with no new lines and no spaces the branching one should be slightly faster. (All reasoning from first principles; I have not done any benchmarking!)
The README covers the exact case you describe - the `word.txt` benchmark is just a file with 93MB of the char `x`. The state machine is still faster in this case.
wc2 is faster than wc, but unless I am missing something, I can't find an instance where the author benchmarked wc2 with the core state machine loop replaced with branches.
wc2 and wc might be compiled with different flags / use a different strategy for IO, which makes it hard to compare speeds directly. The theoretical speedup of branches is tiny compared potential speedup of changing your compiler flags!
> But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].
I think the fourth transition represents illegal Unicode. From there it stays in the illegal state until it hits legal UTF-8, then goes back to counting.
And actually there isn't a state for illegal ASCII so it makes no sense to have a transition to/from it. And it still can never receive a 3 as input and use that transition.
If you're like me, you need to stop thinking of it as "99% of call sites" and start thinking of it as "99% of iteration loops". I know it's super obvious on the face of it, but the implications is that most call sites can repeatedly fail but are totally overshadowed by insanely high frequency hot loops that it almost always predicts right (it's skewed, just like avg vs median)
This is the 1% of cases. As I understand it, most branches take the same path most of the time. Think of a loop for instance. With purely statistical prediction (go whichever direction the branch takes more often), loops will be predicted with extreme accuracy and the result is a huge number of correct branches. Then think about error handling code which isn't triggered most of the time due to input being valid. Between these two easy cases, I think you cover the majority of branches is programs which is why so many of them are predictable.
Only branches that capture exceptional events, e.g. error checks, show 99% hit rate.
Branches that are part of an algorithm are driven by the data you feed to them. For example in a classical binary search, the prediction is right only 50% of the time.
But there's an overall loop which branches backward to the same spot, regardless whether the iteration went left or right. At least that's predictable.
If we have an instruction that will conditionally load one of two pointers from memory, the left or right traversal can be made branch-free.
The only reason we need to switch between two code paths in the binary tree descent is that we have two different pieces of code for loading the left or right pointer to the different offsets used in the load instruction.
If we use a two element array for left and right, the indices [0] and [1] select between them. We make the comparison condition calculate a 0 or 1 result and use it as an index.
It may be true in some cases, but not in general. I think of it as an observation that for large swathes of code the branch is predictable. Eg, we usually don't go down the error checking codepath. Faced with flow control that depends on random data, such as "is this random number even", the branch predictor will have a hard time doing better than 50%.
Run a profiler on your code sometime and see how many branches it reports taking. You can easily get millions of branches per second, so 1% events happen hundreds of thousands of times.
I’m with you. It appears to be a fairly bog-standard DFA, but divining that from 4x4 `int`s and a 256 `char` array feels like I’d just be doing all of the heavy lifting, intellectually. However… it does look like it’s trying to detect spaces (that’s the array of 256 `char`, as `char` is self-indexing). I can’t be bothered to remember how `wc` works, but my guess is that the state-machine has transition states for `sp->sp`, `sp->wd`, `wd->wd`, and `wd-sp`. It’d also need error modes. I suppose it doesn’t contain states for escapes?
> why this is better than just iterating over characters?
The explanation it provides regarding Apache vs Ngnix seemed to me to imply that Apache requires memory allocation to read headers into buffers and Ngnix uses a state machine to avoid memory allocation. Memory allocation is slower than most people realise.
Because classic wc is not iterating over every byte once, but multiple times.
It's especially obvious in the Unicode case where it first takes 1-4 bytes to get a Unicode character and then checks this character with another function to see if it's whitespace
But even with with naive ASCII approach, if you don't hand roll a state machine you are checking multiple conditions on each byte (is it a space and am I leaving a word etc)
Those 1-4 bytes are sitting in a register the entire time and thus basically free to read as often as you want, though.
An actual sampled profile showing the two approaches would be interesting. Naively it seems like it's just because it has faster UTF8 handling and nothing to do with being a state machine exactly
Even something as simple as wc calling a libc function for parsing utf-8, that doesn't get inlined, would destroy its performance relative to anything optimized.
Personally, I'd expect SIMD to win over all of these. wc sounds like kind of challenge that's very easy to partition and process in chunks, though UTF-8 might ruin that.
Which counters to increment at each byte depends on the previous bytes, though. You could probably succeed using overlapping chunks, but I wouldn't call it very easy.
That sort of "find the correct chunk boundary" logic was very common with all the mapreduce processing that was done when people still used the phrase big data.
I think it keeps wc simple if an algorithm is introduced not directly related to count words it becomes more complicated to debug. I see this a an example of the Unix philosophy of do one thing well.
A project with all code working with special magic reminds me of another project that some attention a while back (XZ lib - supposed improvements with packages containing clever system back doors).
Edit: The code is likely harmless but the more opaqueness is in a system, the easier it is for malevolent opaqueness to hide.
That is not what they mean by “magic”. The “magic” in xz is the complexity in the build system. Here, the “magic” are seemingly arbitrary numbers that improve performance. The reason they’re magic is because we don’t know why they were chosen. The code itself is obvious and not magic.
This is like most of the reason I opened this article. I wish they'd spend more time talking about this. In fact, most of the README covers details that are important, but, irrelevant to their algorithmic improvements.
Maybe they expect you to open up the code, but even there the comments are largely describing what things do and not why, so I feel you need to have some understanding of the algorithm in question. For example, the parts where the state table gets constructed rather unhelpful comments to the uninformed:
Even `wc2o.c` doesn't delve into the magic numbers it has chosen. I was hoping this repo would be more educational and explain how the state table works and why it's constructed the way it is.Does anyone have a good resource for learning more about asynchronous state-machine parsers, that also could hopefully help explain why this is better than just iterating over characters? I'm guessing maybe it's the lack of branching?