Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> No, you aren't suppose to be able to see how the word-count works by looking at this code. The complexity happens elsewhere, setting up the state-machine.

This is like most of the reason I opened this article. I wish they'd spend more time talking about this. In fact, most of the README covers details that are important, but, irrelevant to their algorithmic improvements.

Maybe they expect you to open up the code, but even there the comments are largely describing what things do and not why, so I feel you need to have some understanding of the algorithm in question. For example, the parts where the state table gets constructed rather unhelpful comments to the uninformed:

    /**
     * Build an ASCII row. This configures low-order 7-bits, which should
     * be roughly the same for all states
     */
    static void
    build_basic(unsigned char *row, unsigned char default_state, unsigned char ubase)
    
    ...
    
    /**
     * This function compiles a DFA-style state-machine for parsing UTF-8
     * variable-length byte sequences.
     */
    static void
    compile_utf8_statemachine(int is_multibyte)
Even `wc2o.c` doesn't delve into the magic numbers it has chosen. I was hoping this repo would be more educational and explain how the state table works and why it's constructed the way it is.

Does anyone have a good resource for learning more about asynchronous state-machine parsers, that also could hopefully help explain why this is better than just iterating over characters? I'm guessing maybe it's the lack of branching?



You can figure it out by just looking at the table and output. It prints the number of lines, then words, then characters. Since the lines count is counts[1], you can conclude that when a newline is encountered, the machine will transition to state 1, hence why the newline is the only 2 entry in the character table and all 2 entries in the state machine point to state 1. From the character table, we can see that characters are split between classes 0, 1, 2 which are word, space, and newline characters.

To get the definitions of all other states, by working back from the fact that state 2 is the word count, it must be entered when a word character (class 0) is hit after some space/newline characters. Hence state 0 represents the machine being in the middle of a sequence of whitespace, state 1 is when the machine is in a sequences of newlines, state 2 is the first letter of a word, and state 3 is any subsequent letter of a word. You can then verify that the machine transitions to the correct state from those by checking what it does in case of a word, space, or newline character. But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].

As for why the approach is better, I think it's because it avoids branching. If you think about the length of a word, it's quite short compared to a CPU pipeline, so on a branching version of this you're going to be stuck spending most of your execution time on branch miss-prediction.


There are fewer branches but there is now a data dependency between loop iterations which makes each iteration slightly slower (maybe 1-2 cycles additional latency per iteration).

Because newlines are relatively common and unpredictable a state machine is likely better. But on a long file with no new lines and no spaces the branching one should be slightly faster. (All reasoning from first principles; I have not done any benchmarking!)


The README covers the exact case you describe - the `word.txt` benchmark is just a file with 93MB of the char `x`. The state machine is still faster in this case.


wc2 is faster than wc, but unless I am missing something, I can't find an instance where the author benchmarked wc2 with the core state machine loop replaced with branches.

wc2 and wc might be compiled with different flags / use a different strategy for IO, which makes it hard to compare speeds directly. The theoretical speedup of branches is tiny compared potential speedup of changing your compiler flags!


> But I'm not sure why the table is 4×4. The fourth transition from every state is unreachable since the symbols are all in [0, 2].

I think the fourth transition represents illegal Unicode. From there it stays in the illegal state until it hits legal UTF-8, then goes back to counting.


For handling ASCII, table needs 4 states × 3 classes of chars. Why is it defined as 4×4?

To handle illegal chars, it would need a 4th class but also a 5th state, so that's not the reason.

Can it be to replace a 'modulo' operation with an 'and' in the access to table?


That isn't relevant in the ASCII example, and the UTF-8 one uses a 256×256 table.


Half of all bytes are illegal ASCII though.


And actually there isn't a state for illegal ASCII so it makes no sense to have a transition to/from it. And it still can never receive a 3 as input and use that transition.


I've read recently (here on HN) that branch predictors are right 99% if the time. Is that inaccurate?


If you're like me, you need to stop thinking of it as "99% of call sites" and start thinking of it as "99% of iteration loops". I know it's super obvious on the face of it, but the implications is that most call sites can repeatedly fail but are totally overshadowed by insanely high frequency hot loops that it almost always predicts right (it's skewed, just like avg vs median)


This is the 1% of cases. As I understand it, most branches take the same path most of the time. Think of a loop for instance. With purely statistical prediction (go whichever direction the branch takes more often), loops will be predicted with extreme accuracy and the result is a huge number of correct branches. Then think about error handling code which isn't triggered most of the time due to input being valid. Between these two easy cases, I think you cover the majority of branches is programs which is why so many of them are predictable.


That's an over-generalization.

Only branches that capture exceptional events, e.g. error checks, show 99% hit rate.

Branches that are part of an algorithm are driven by the data you feed to them. For example in a classical binary search, the prediction is right only 50% of the time.


But there's an overall loop which branches backward to the same spot, regardless whether the iteration went left or right. At least that's predictable.

If we have an instruction that will conditionally load one of two pointers from memory, the left or right traversal can be made branch-free.

The only reason we need to switch between two code paths in the binary tree descent is that we have two different pieces of code for loading the left or right pointer to the different offsets used in the load instruction.

If we use a two element array for left and right, the indices [0] and [1] select between them. We make the comparison condition calculate a 0 or 1 result and use it as an index.


And any loop that iterates more than 100 times.


It may be true in some cases, but not in general. I think of it as an observation that for large swathes of code the branch is predictable. Eg, we usually don't go down the error checking codepath. Faced with flow control that depends on random data, such as "is this random number even", the branch predictor will have a hard time doing better than 50%.


Run a profiler on your code sometime and see how many branches it reports taking. You can easily get millions of branches per second, so 1% events happen hundreds of thousands of times.

See for example this quick investigation and look at the number of branches for what's not a very big program https://bnikolic.co.uk/blog/hpc-perf-branchprediction


Here's how the table is made

For starters you need to know what the states the values represent (you should remember this from k&r)

0 we are out of a word, but not via newline (lets call this character type "a one").

1 we are out of a word, via a newline ("a two" or "a newline").

2 we entered a word ("a zero").

3 we are in a word ("a zero").

So we know we need 4 rows in the table

{} {} {} {}

lets fill the 0th row first, we aren't in a word so the only states we can enter are 0,1 and 2. That's helpful cause we can fill out the 1st element

{,0,} {} {} {} (a one sends us to the first row)

{[12],0,[12]} {} {} {}(we know 3 isn't a possible destination as we are out of a word)

{1,0,2} {} {} {} (actually we have a choice of {1,0,2} or {2,0,1} I'll show 1,0,2 but this swaps the first 2 args of print at the end)

{1,0,2} {} {1,0,2} {} (since newline is also out of a word we have the same options)

{1,0,2} {3,0,2} {1,0,2} {} (now 3 is a possible destination if we get zero. A one will take us to state 0 and a newline to state 2 as before)

{1,0,2} {3,0,2} {1,0,2} {3,0,2} (now 0,2, and 3 are our only possible states)

As you can see, there is nothing magic about it.


"As you can see, there is nothing magic about it."

The parent seems to be using the term "magic numbers" incorrectly.

https://en.wikipedia.org/wiki/Magic_number_(programming)


I’m with you. It appears to be a fairly bog-standard DFA, but divining that from 4x4 `int`s and a 256 `char` array feels like I’d just be doing all of the heavy lifting, intellectually. However… it does look like it’s trying to detect spaces (that’s the array of 256 `char`, as `char` is self-indexing). I can’t be bothered to remember how `wc` works, but my guess is that the state-machine has transition states for `sp->sp`, `sp->wd`, `wd->wd`, and `wd-sp`. It’d also need error modes. I suppose it doesn’t contain states for escapes?


> why this is better than just iterating over characters?

The explanation it provides regarding Apache vs Ngnix seemed to me to imply that Apache requires memory allocation to read headers into buffers and Ngnix uses a state machine to avoid memory allocation. Memory allocation is slower than most people realise.


Why is this better for wc, though? wc is not doing memory allocation in the hot loop.


Because classic wc is not iterating over every byte once, but multiple times.

It's especially obvious in the Unicode case where it first takes 1-4 bytes to get a Unicode character and then checks this character with another function to see if it's whitespace

But even with with naive ASCII approach, if you don't hand roll a state machine you are checking multiple conditions on each byte (is it a space and am I leaving a word etc)

Using a dfa has fixed compute per byte


Those 1-4 bytes are sitting in a register the entire time and thus basically free to read as often as you want, though.

An actual sampled profile showing the two approaches would be interesting. Naively it seems like it's just because it has faster UTF8 handling and nothing to do with being a state machine exactly


According to the authors it's also faster on files full of 'x' or ' ', so there must be more than just better unicode support.


Even something as simple as wc calling a libc function for parsing utf-8, that doesn't get inlined, would destroy its performance relative to anything optimized.

Personally, I'd expect SIMD to win over all of these. wc sounds like kind of challenge that's very easy to partition and process in chunks, though UTF-8 might ruin that.


> very easy to partition and process in chunks

Which counters to increment at each byte depends on the previous bytes, though. You could probably succeed using overlapping chunks, but I wouldn't call it very easy.


That sort of "find the correct chunk boundary" logic was very common with all the mapreduce processing that was done when people still used the phrase big data.


I think it keeps wc simple if an algorithm is introduced not directly related to count words it becomes more complicated to debug. I see this a an example of the Unix philosophy of do one thing well.


rob graham wrote an article on this wc program in the humorously titled PoC||GTFO journal - see [1]. he also tweeted about it a few years ago [2].

[1] https://github.com/angea/pocorgtfo/blob/master/contents/arti... [2] https://twitter.com/ErrataRob/status/1494009849427992576


Hmm,

A project with all code working with special magic reminds me of another project that some attention a while back (XZ lib - supposed improvements with packages containing clever system back doors).

Edit: The code is likely harmless but the more opaqueness is in a system, the easier it is for malevolent opaqueness to hide.


That is not what they mean by “magic”. The “magic” in xz is the complexity in the build system. Here, the “magic” are seemingly arbitrary numbers that improve performance. The reason they’re magic is because we don’t know why they were chosen. The code itself is obvious and not magic.


I think the documentation might be stale, `build_basic` clearly has code for non-ASCII characters


Writing text and writing code are two distinct abilities. It is very rare the professional that possesses both.


That used to not be the case. Has something changed?


Yeah, Sean Barrett has a good writeup about them and optimizing them for branch-prediction, etc

It's framed in the context of programming-language tokenization, but the principles are the same.

https://nothings.org/computer/lexing.html




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: