> No, you aren't suppose to be able to see how the word-count works by looking a...

James_K · on June 20, 2024

You can figure it out by just looking at the table and output. It prints the number of lines, then words, then characters. Since the lines count is counts[1], you can conclude that when a newline is encountered, the machine will transition to state 1, hence why the newline is the only 2 entry in the character table and all 2 entries in the state machine point to state 1. From the character table, we can see that characters are split between classes 0, 1, 2 which are word, space, and newline characters.

To get the definitions of all other states, by working back from the fact that state 2 is the word count, it must be entered when a word character (class 0) is hit after some space/newline characters. Hence state 0 represents the machine being in the middle of a sequence of whitespace, state 1 is when the machine is in a sequences of newlines, state 2 is the first letter of a word, and state 3 is any subsequent letter of a word. You can then verify that the machine transitions to the correct state from those by checking what it does in case of a word, space, or newline character. But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].

As for why the approach is better, I think it's because it avoids branching. If you think about the length of a word, it's quite short compared to a CPU pipeline, so on a branching version of this you're going to be stuck spending most of your execution time on branch miss-prediction.

tylerhou · on June 20, 2024

There are fewer branches but there is now a data dependency between loop iterations which makes each iteration slightly slower (maybe 1-2 cycles additional latency per iteration).

Because newlines are relatively common and unpredictable a state machine is likely better. But on a long file with no new lines and no spaces the branching one should be slightly faster. (All reasoning from first principles; I have not done any benchmarking!)

anamexis · on June 20, 2024

The README covers the exact case you describe - the `word.txt` benchmark is just a file with 93MB of the char `x`. The state machine is still faster in this case.

tylerhou · on June 21, 2024

wc2 is faster than wc, but unless I am missing something, I can't find an instance where the author benchmarked wc2 with the core state machine loop replaced with branches.

wc2 and wc might be compiled with different flags / use a different strategy for IO, which makes it hard to compare speeds directly. The theoretical speedup of branches is tiny compared potential speedup of changing your compiler flags!

samatman · on June 20, 2024

> But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].

I think the fourth transition represents illegal Unicode. From there it stays in the illegal state until it hits legal UTF-8, then goes back to counting.

teo_zero · on June 22, 2024

For handling ASCII, table needs 4 states × 3 classes of chars. Why is it defined as 4×4?

To handle illegal chars, it would need a 4th class but also a 5th state, so that's not the reason.

Can it be to replace a 'modulo' operation with an 'and' in the access to table?

James_K · on June 20, 2024

That isn't relevant in the ASCII example, and the UTF-8 one uses a 256×256 table.

samatman · on June 21, 2024

Half of all bytes are illegal ASCII though.

James_K · on June 21, 2024

And actually there isn't a state for illegal ASCII so it makes no sense to have a transition to/from it. And it still can never receive a 3 as input and use that transition.

Vvector · on June 20, 2024

I've read recently (here on HN) that branch predictors are right 99% if the time. Is that inaccurate?

drchickensalad · on June 20, 2024

If you're like me, you need to stop thinking of it as "99% of call sites" and start thinking of it as "99% of iteration loops". I know it's super obvious on the face of it, but the implications is that most call sites can repeatedly fail but are totally overshadowed by insanely high frequency hot loops that it almost always predicts right (it's skewed, just like avg vs median)

James_K · on June 20, 2024

This is the 1% of cases. As I understand it, most branches take the same path most of the time. Think of a loop for instance. With purely statistical prediction (go whichever direction the branch takes more often), loops will be predicted with extreme accuracy and the result is a huge number of correct branches. Then think about error handling code which isn't triggered most of the time due to input being valid. Between these two easy cases, I think you cover the majority of branches is programs which is why so many of them are predictable.

teo_zero · on June 21, 2024

That's an over-generalization.

Only branches that capture exceptional events, e.g. error checks, show 99% hit rate.

Branches that are part of an algorithm are driven by the data you feed to them. For example in a classical binary search, the prediction is right only 50% of the time.

kazinator · on June 21, 2024

But there's an overall loop which branches backward to the same spot, regardless whether the iteration went left or right. At least that's predictable.

If we have an instruction that will conditionally load one of two pointers from memory, the left or right traversal can be made branch-free.

The only reason we need to switch between two code paths in the binary tree descent is that we have two different pieces of code for loading the left or right pointer to the different offsets used in the load instruction.

If we use a two element array for left and right, the indices [0] and [1] select between them. We make the comparison condition calculate a 0 or 1 result and use it as an index.

ncruces · on June 21, 2024

And any loop that iterates more than 100 times.

kardos · on June 20, 2024

It may be true in some cases, but not in general. I think of it as an observation that for large swathes of code the branch is predictable. Eg, we usually don't go down the error checking codepath. Faced with flow control that depends on random data, such as "is this random number even", the branch predictor will have a hard time doing better than 50%.

kllrnohj · on June 21, 2024

Run a profiler on your code sometime and see how many branches it reports taking. You can easily get millions of branches per second, so 1% events happen hundreds of thousands of times.

See for example this quick investigation and look at the number of branches for what's not a very big program https://bnikolic.co.uk/blog/hpc-perf-branchprediction

casey2 · on June 21, 2024

Here's how the table is made

For starters you need to know what the states the values represent (you should remember this from k&r)

0 we are out of a word, but not via newline (lets call this character type "a one").

1 we are out of a word, via a newline ("a two" or "a newline").

2 we entered a word ("a zero").

3 we are in a word ("a zero").

So we know we need 4 rows in the table

{} {} {} {}

lets fill the 0th row first, we aren't in a word so the only states we can enter are 0,1 and 2. That's helpful cause we can fill out the 1st element

{,0,} {} {} {} (a one sends us to the first row)

{[12],0,[12]} {} {} {}(we know 3 isn't a possible destination as we are out of a word)

{1,0,2} {} {} {} (actually we have a choice of {1,0,2} or {2,0,1} I'll show 1,0,2 but this swaps the first 2 args of print at the end)

{1,0,2} {} {1,0,2} {} (since newline is also out of a word we have the same options)

{1,0,2} {3,0,2} {1,0,2} {} (now 3 is a possible destination if we get zero. A one will take us to state 0 and a newline to state 2 as before)

{1,0,2} {3,0,2} {1,0,2} {3,0,2} (now 0,2, and 3 are our only possible states)

As you can see, there is nothing magic about it.

1vuio0pswjnm7 · on June 21, 2024

"As you can see, there is nothing magic about it."

The parent seems to be using the term "magic numbers" incorrectly.

https://en.wikipedia.org/wiki/Magic_number_(programming)

thechao · on June 20, 2024

I’m with you. It appears to be a fairly bog-standard DFA, but divining that from 4x4 `int`s and a 256 `char` array feels like I’d just be doing all of the heavy lifting, intellectually. However… it does look like it’s trying to detect spaces (that’s the array of 256 `char`, as `char` is self-indexing). I can’t be bothered to remember how `wc` works, but my guess is that the state-machine has transition states for `sp->sp`, `sp->wd`, `wd->wd`, and `wd-sp`. It’d also need error modes. I suppose it doesn’t contain states for escapes?

zadokshi · on June 20, 2024

> why this is better than just iterating over characters?

The explanation it provides regarding Apache vs Ngnix seemed to me to imply that Apache requires memory allocation to read headers into buffers and Ngnix uses a state machine to avoid memory allocation. Memory allocation is slower than most people realise.

loeg · on June 20, 2024

Why is this better for wc, though? wc is not doing memory allocation in the hot loop.

jxndnxu · on June 20, 2024

Because classic wc is not iterating over every byte once, but multiple times.

It's especially obvious in the Unicode case where it first takes 1-4 bytes to get a Unicode character and then checks this character with another function to see if it's whitespace

But even with with naive ASCII approach, if you don't hand roll a state machine you are checking multiple conditions on each byte (is it a space and am I leaving a word etc)

Using a dfa has fixed compute per byte

kllrnohj · on June 20, 2024

Those 1-4 bytes are sitting in a register the entire time and thus basically free to read as often as you want, though.

An actual sampled profile showing the two approaches would be interesting. Naively it seems like it's just because it has faster UTF8 handling and nothing to do with being a state machine exactly

brainwad · on June 21, 2024

According to the authors it's also faster on files full of 'x' or ' ', so there must be more than just better unicode support.

yencabulator · on June 21, 2024

Even something as simple as wc calling a libc function for parsing utf-8, that doesn't get inlined, would destroy its performance relative to anything optimized.

Personally, I'd expect SIMD to win over all of these. wc sounds like kind of challenge that's very easy to partition and process in chunks, though UTF-8 might ruin that.

teo_zero · on June 22, 2024

> very easy to partition and process in chunks

Which counters to increment at each byte depends on the previous bytes, though. You could probably succeed using overlapping chunks, but I wouldn't call it very easy.

yencabulator · on June 22, 2024

That sort of "find the correct chunk boundary" logic was very common with all the mapreduce processing that was done when people still used the phrase big data.

detourdog · on June 20, 2024

I think it keeps wc simple if an algorithm is introduced not directly related to count words it becomes more complicated to debug. I see this a an example of the Unix philosophy of do one thing well.

tkhattra · on June 21, 2024

rob graham wrote an article on this wc program in the humorously titled PoC||GTFO journal - see [1]. he also tweeted about it a few years ago [2].

[1] https://github.com/angea/pocorgtfo/blob/master/contents/arti... [2] https://twitter.com/ErrataRob/status/1494009849427992576

joe_the_user · on June 20, 2024

Hmm,

A project with all code working with special magic reminds me of another project that some attention a while back (XZ lib - supposed improvements with packages containing clever system back doors).

Edit: The code is likely harmless but the more opaqueness is in a system, the easier it is for malevolent opaqueness to hide.

_lvbh · on June 21, 2024

That is not what they mean by “magic”. The “magic” in xz is the complexity in the build system. Here, the “magic” are seemingly arbitrary numbers that improve performance. The reason they’re magic is because we don’t know why they were chosen. The code itself is obvious and not magic.

VMG · on June 20, 2024

I think the documentation might be stale, `build_basic` clearly has code for non-ASCII characters

worewood · on June 20, 2024

Writing text and writing code are two distinct abilities. It is very rare the professional that possesses both.

oopsallmagic · on June 21, 2024

That used to not be the case. Has something changed?

sebastianmestre · on June 20, 2024

Yeah, Sean Barrett has a good writeup about them and optimizing them for branch-prediction, etc

It's framed in the context of programming-language tokenization, but the principles are the same.

https://nothings.org/computer/lexing.html