I'm kind of speechless that "popping a message from the queue" requires knowing the length of the queue, which requires traversing the entire queue, a linked list of as many as 90,000 elements. Even the "fast" time of about half a millisecond is crazy slow for dequeuing an item.
If a linked list is a built-in Erlang data structure, why doesn't it keep a length count? If it's common knowledge that this operation is O(N), why does RabbitMQ call it on dequeue, and why doesn't it keep its own length count (and why isn't that the obvious fix)?
I can't think of any JavaScript built-in (the language where I spend my days) with a name like "length" that internally dereferences N pointers, but I would not be shocked to discover that dereferencing tens of thousands of pointers was taking a few milliseconds, and I would not write my data structure that way.
This is not true anymore, linked blog post is from 2010. RabbitMQ doesn't work like that now. It has queue indexes and various mechanisms to keep messages in memory and so on.
Well, you do need to know that there exists an item in the queue. Knowing that length > 0 is one way to go about it, but certainly it's enough to establish the existence of one item and then move on.
If one is unaware of the complexity of the .length() method, one might use it to find out if a dequeue is possible.
This seems like a pretty fundamental issue though. I'd feel like any systems programmer worth their weight in salt would know that yo can build an O(1) pop on a queue.
If you don't think about the underlying implementation of erlang:length(), this seems to me like an easy mistake to make.
Lots of data structures keep a size field and access it in O(1), so it's not completely unreasonable to expect erlang:length to be O(1) without making sure.
At least that's how I understood the context of the problem, I may be wrong of course.
oh, I suspect they do. Which is why the manual page for erlang:queue talks about the topic in detail and explains the rationale, gives a workaround, and links to the literature.
You fell victim to the confusion in the original post and this thread between what the original poster describes as popping, and what they were apparently actually doing (which involved taking the length, which is O(N) for the reasons explained here and in the documentation).
erlang's queue's pop ("out") is amortized O(1), worst case O(N) just like every other immutable double headed queue. Here's the source code:
It's the functional way to believe that the elegance of a singly linked list means that it is the right choice for the implementation, instead of something more complex and unwieldy that can fold both ways in the same time, for example.
> yet you can pick up a Bachelor's without ever hearing the word "allocate"
What university has such a program? Programming language design (students write an interpreter and compiler), operating systems (students write a minimal OS including virtual memory management), systems architecture (students design a CPU and write some flavor of ASSEMBLY) and algorithms (students learn about time and memory complexity) are core, required courses in all the CS programs around here. They're required topics for a CS program to be accredited by ABET. Maybe you're confusing actual CS programs with trade school certificate programs; you don't get a bachelors in CS by writing some Java.
I honestly don't know a thing about Erlang. But from my humble understanding of the interpreter source in the article: if is_list(list) is true, then the list is not empty. If you ask me, a respectable dequeue operation shoudn't need more than O(1).
The correct implementation for checking if a list is empty would have been a pattern match, something akin to
case List of
[] -> empty;
[_ | _] -> not_empty
end.
Or, you know. Reading the API docs and using queue:is_empty(Queue), which is O(1) instead of O(n) (which queue:len(Queue) is -explicitly stated to be-).
So this is only tangentially related, but it's all bloody neat, so I'm going to mention it anyway.
First, keeping a length of a list, as part of the library implementation, is a bad idea, because due to how immutability works, you'd need to keep a length from -every item in a linked list-. That is, rather than
[4, 3, 2, 1]
you would need to have
[{4, 4}, {3, 3}, {2, 2}, {1, 1}]
Why? Because adding a new element to the head of a list needs to be O(1), -and- maintain the existing list. That is, if I have
List1 = [4, 3, 2, 1].
I need to be able to do
List2 = [5 | List1]
That is, List2 is now
[5, 4, 3, 2, 1]
and List1 remains unaffected. So both List1 and List2 would need to still know their lengths, as would any other variables with heads further down in the list.
So we have ~doubled the memory usage of a list, in that we need both a pointer (to the contents), and an integer, for each cell of the list. We can't simply store a 'total' value for a list (well, we could, and then every time we perform -any- operation on a list, that generates a new list(s) we re-calculate the length for it/them. Which would be a pain to code, results in a lot of extra garbage (since the intermediate values of the counter are thrown away, again, immutable), is unnecessary...and is actually impossible in lazy languages like Haskell that allow infinite lists. Quite frankly, in functional languages you don't need the length enough to go through such hassles, and when you do, yeah, it's easy enough to write it for your one use case).
And that said, we don't need any of that noise for implementing an efficient queue. An immutable queue in Erlang is implemented via Okasaki's queue implementation (though with a recognizable API by default; Okasaki's API is still there). This is actually a really cool data structure, and has parallels to the interview question "how would you implement a (double ended) queue using two stacks".
In short, it looks like this -
{[], []}.
Where you have a tuple of two elements; two lists (and an immutable list works identically to a stack; operations upon the head of it work in O(1) time)
Enqueue 1 at the front ->
{[1], []}
Enqueue 2 at the front; note that this is actually stacking the 2 in front of the 1 ->
{[2,1], []}
Dequeue examines the second list, sees it's empty, reverses the first list, puts it in for the second (and an empty list in the first), and pops off the top of it, a la ->
Where 1 is your return value, and {[], [2]} is the new queue.
Further enqueues keep going onto that first list, dequeues come off the second, until it's empty, whereupon the first list is reversed again and placed in for the second. Etc.
This has an amortized constant time enqueue and dequeue, despite being immutable, and a constant time 'is_empty' check, where you just check each list to see if it's empty or not (easily doable by checking to see if it pattern matches [] or [_ | _].
Pushing onto the start of the second list, and pulling off the first, allows you to work on the other end of the queue.
So all your usual operations for a double ended queue work in the same big-O time as for the mutable implementation, but you maintain immutability.
The problem here is that the implementation relied on an O(n) behavior, when all it needed was an O(1) one.
While the article states "The problem is not with erlang:length() being slow. It’s about being unpredictably slow.", I would counter that asking for length when you just want to see if it's empty or not is INDEED the problem. Length explicitly tells you it will traverse the list, and yes, it makes no guarantees about where the elements of the list are in memory. Same as in every other language. Because contiguous pointers in memory are called "an array of pointers" not a "list of pointers".
Note, too, that this article is from 2010. I'm guessing that basic.get has since been rewritten to not use an O(n) implementation when an obvious O(1) implementation exists.
I think the queue of Okasaki's that you mention is only amortized constant-time in a persistent setting if certain laziness is carefully enforced.
I don't know if Erlang's queues do this, but I do know that implementations of Hinze and Paterson's finger trees, which rely on a similar laziness for their amortization argument, are often implemented without the laziness. It wouldn't surprise me if this were common in translations of purely functional data structures.
We're moving into an area whose terminology I'm not as familiar with as the underlying concepts, but what I recall, yes, the actual big-o of a -single- operation, if this is implemented strictly and used as a single ended queue, has a worst case bound of O(n), and a best case bound of O(1), and an average case of O(1), and would be described in big-O terms as "amortized O(1), worst case of O(n)"; much like an arraylist (where the general case for adding at the end is O(1), but when it's full, is actually O(n) as we create a new array and copy elements over, but we describe it as having an amortized cost of O(1)).
Or put another way, for n enqueues, rather than thinking of each enqueue taking one operation, think of it taking two, one now, one deferred, for a still constant enqueue time. On a dequeue where the right side is empty, you have n deferred operations 'stored up', that you then use to reverse the list, and then you use one operation to take the head. So while that -particular- operation actually uses up n+1 operations, the amortized time across the entire sequence of operations so far is constant. And so while you may not have real time dequeue performance of a single operation if implemented strictly (that is, in the worst case, where the right list is empty), the act of enqueueing n things, and dequeueing n things, leads to a total time proportional to n, leading to an averaged constant time per operation.
Now, that said, if you treat it as a double ended queue, there is a worst case in the strict implementation that runs in O(n) per operation; that's by inserting onto one end, then alternately pulling off one end, then the other (so that for every dequeue, we have to reverse the remaining list again). But that's extremely degenerate, and is more egregious than describing quicksort as O(n^2) due to its degenerate case of an already sorted list (since that's far more common a use case than inserting always on one end, and perfectly alternating your dequeues).
1. Average and amortized have very different meanings in algorithmic analysis.
2. One mistake - Okasaki's queues in "Simple and Efficient Purely Functional Queues and Deques" are actually constant-time in the worst-case.
3. Okasaki makes a strong case in his book that the kind of analysis that rejects amortization over ephemeral usage when a structure can be used in a persistent setting is not "more egregious than describing quicksort as O(n^2)". Note, for instance, that quicksort is O(n log n) with high probability with good randomness, and O(n log n) in the worst case with median-of-medians. This quadratic queue case may be degenerate, but the degenerate algorithmic complexity attack of turning hash tables into linked lists by manufacturing collisions is well-documented and worth worrying about. I claim the same is true of queues if a persistent interface is exposed to the user.
"So we have ~doubled the memory usage of a list, in that we need both a pointer (to the contents), and an integer, for each cell of the list."
While I'm not going to say storing the length with each cell is advisable, this exaggerates the cost of it. First, you're forgetting the links between cells, so it shouldn't be "double" in any case. Second, of course much of the time the list items will dwarf the spine of the list. Finally, you should be able to get away with a smaller representation for list length than for pointers.
Heh, as I just replied to the comment you made elsewhere -
"Yeah, RabbitMQ opted for the 2nd because of that. BQ in that case is whatever queue implementation was selected; if you look at both the lqueue.erl and priority_queue.erl, you see that they implemented len() by pulling apart a tuple and taking a value verbatim; no further operations. As such, I'd assume every queue type they support has its own internal wrappers to maintain a length."
EDIT: Just to explain the Erlang a bit, what is happening is the #q record that is being pattern matched on in the function you mention has a background_queue field. This field is an atom, corresponding to the module the queue implementation lives in. So 'lqueue' or 'priority_queue', or whatever else fits the interface (behavior in Erlang terms). So it actually ends up calling lqueue:len(BQS) or priority_queue:len(BQS), or similar. If you look at the implementations of those, you'll see they've wrapped up the actual queue into a tuple, bundling along a length, such that they can simply return the length, rather than recalculate it every time. So it looks like in the intervening 4-5 years, they -have- fixed it.
"They said high level languages relieved us from thinking about memory issues" nobody said that. I'd even claim the opposite. For example, to develop a game on Android (Java) or in the browser (JavaScript) you need to work around the garbage collector the best you can with techniques like object pooling, or else your entire program stops every 3 seconds. It's just nasty. Know thine enemy, know thine garbage collector.
Actually, many people said it all the time -- especially during 1995/1996 when Java hype was gearing up. It was touted as a selling point in Java books, JavaWorld articles, usenet postings, etc.
Examples:
--Java also implements automatic garbage collection, so you don't have to worry about memory management issues. All of this frees you from having to worry about dangling pointers, invalid pointer references, and memory leaks, so you can spend your time developing the functionality of your programs."[1]
--Programmers can be relatively fearless about dealing with memory because they don’t have to worry about it getting messed up.[2]
--Java also implements automatic garbage collection, so the programmer does not have to worry about memory management issues.[3]
--The entire interaction with memory is worry-free in Java.[4]
The issue of "GC pauses" or "extra memory footprint required for performance" is not mentioned in any of those sources. It's not Java's fault that it's presented this way because every technology is hyped by only listing the advantages and avoiding the talk about disadvantages.
I see a very big gap between "memory issues" and "memory management issues" ... for me the first thing is "thinking about memory, how much you will use, etc.", the second is "Okay, I've allocated a string here. Who the heck will be the one that cleans it up?" - high level languages fix the second case (if you LIKE the fix, that's a different topic, but it is a fix), they don't (and cannot and never were claimed to) fix the first case, because that would need an AI like runtime environment.
(Okay, your quote [4] is actually talking about both, so .. one of four examples fits?)
You are absolutely right on the fact that it doesn't, but I have seen plenty of people claim that high-level languages absolve you from having to think about memory.
Wikipedia says:
>"[High level languages] may [...] automate (or even hide entirely) significant areas of computing systems (e.g. memory management)"
And I think most people who make such claims have read similar statements and simply confuse memory-management with thinking about memory issues. (The distinction is that well defined anyway).
This isn't a case of a memory management issue. It's a case of someone using an O(n) algorithm instead of O(1). There's not a memory layout that would change this. Even if you used an array, if you iterated over it, it'd still be rather expensive.
This is also what abstractions are for. Instead of killing your perf by using an unsuitable data structure, implement the same interface, but with a fast implementation. Using linked list for this scenario seems suboptimal.
To be fair, for a lot of things, GC certainly does delay or obviate thinking too much about memory management. This lets you explore the problem space and get something built and then maybe later optimize where needs be, if it proves necessary.
It's an interesting phenomenon in "slowish" languages that people feel this way
> This function [length] is implemented in the VM so it’s expected to be very, very fast.
Even being aware of the data structure and algorithmic bounds forced upon such a function, being "in the VM" indicates speed and a safe choice in using it.
Of course, as this post notes, that's sort of insane. O(N) is a big computation on a big list and it means that there's a chance of memory competition. High level language or otherwise, you can't ignore algorithmic complexity. That's just absurd.
It just feels like a weird quirk of Erlang's queue module. If you're mutable, then getting down to raw memory is probably a good idea. If you're doing this all with linked lists the dear god go read Okasaki and at least have an amortized constant queue pop!
or at least read the erlang:queue() documentation which specifically references this case, outlines the rationale for why it is the way that it is, provides an Okasaki-style API, and references his book.
Bad title. Interesting bug, and it shows you just how much being aware of implementation details can pay off. I'm kind of surprised that Erlang doesn't keep a 'length' field under the hood for lists, but that's nothing the OP couldn't fix by keeping one for himself.
The Erlang VM fix isn't really a fix from a user point of view, installing a package like this should not require patching the VM, I do think that the Erlang VM maintainers should be made aware of this issue and then they can decide if they want to roll the patch or a more serious version of that patch into the distribution, it would seem to me that they are not the only people that have been hit by this bug.
As for the bad title: The higher level your language the more place there is for bugs to hide and if you actually believed that higher level languages would isolate you from the machine sufficiently that you could ignore reality then you had it coming.
"The higher level your language the more place there is for bugs to hide and if you actually believed that higher level languages would isolate you from the machine sufficiently that you could ignore reality then you had it coming."
I think this is because the abstraction is leaking. For example, such things as linked lists are not abstract enough to forget about underlying implementation, as they translate to actual memory blocks actually allocated in physical memory, and the programmer has to consider caching problems. If the system could analyze caching problems on its own and decide when to use contiguous arrays, and when to use linked lists, it would be closer to actual high level programming.
its not a bug, its hidden optimization during GC (hibernation). You would want that optimization if you intend to touch values. Its something Erlang devs decided upon after analysing use cases.
I know nothing about erlang, but if performance matters and you won't be deleting from the middle, an array-based implementation will usually be better than a linked structure. Better alignment, less pointer chasing, fewer allocations and reallocations. Also the synchronization cost can be lower in a multi threaded context. Cost might be more memory allocated in some cases (but less in others and less fragmentation in your memory allocator's pool).
An array would work, most of the time. The rest of the time, it'd be reallocating to extend the array, then copying. Which also has undesirable performance at unexpected times. Don't see it as any better.
That's almost true under the assumption that you are using the most elementary implementation possible with only a single array (insert at end of array, reallocate as needed to increase array length, copy elements down during dequeue, etc.) - you would still be avoiding the garbage collector relocating your array elements and generally get better performance due to nice alignment within cache lines in memory. With a more advanced implementation using multiple arrays, you can do even better (basically a blocked linked list), and you can cache the blocks as you dequeue and free up blocks to avoid hitting the memory allocator more than necessary. If you want to get really crazy with it, you can start doing blocks of multiple sizes that grow and shrink dynamically with the usage behavior of the list, but at that point what you have done is implement a memory allocator.
Until recently De Anza College had C as their introductory programming language (and even now, their C++ courses appear to be nothing more than a warmed over C with slightly different syntax). My opinion is that that was a horrible choice-students spend waaaaay too much time bogged down in pointers and memory allocation, and less time thinking about algorithms and just getting used to writing code.
C/C++ are excellent languages to learn, but dear god, at an upper division level.
please use the original title, unless it is misleading or linkbait.
I'd say this title should be changed to the original, as it is now it smells like clickbait. Or at least its purpose seems to draw unnecessary attention.
It is only an implementation issue if it is possible to solve the issue.
Nobody has ever built a garbage collector that does not slow your program down or cause it to use vastly more resources than it would otherwise. (Claims to the contrary are always implicitly caveated).
Given that this is the case, it really does start looking like a language issue. Yes you can rearchitect the GC to care more about locality, but you are just pushing the dust around on the floor: you will find a different problem.
Your comment isn't wrong in itself. For any GC you can write some program that exploits its weaknesses
But you could deal with this bug in various different ways. For example, here you could be implementing erlang queues so that the length is stored, instead of checking the length. You could implement hibernation differently. You could implement popping differently. There are probably ideas from people who know more than nothing about erlang.
Anyways, I don't think this particular issue was really an issue with the concept of a GC'd language, more of a specific bug issue.
My point is that you can solve this one symptom but your program will have many other problems due to GC (provided it does a lot of work). It is like whack-a-mole in that there are always more moles.
Horrible link title: "They said high level languages relieved us from thinking about memory issues"
High level languages do relieve us from thinking about memory. The linker allows you to forget about program layout in memory and executable. C and other "system" languages free you from managing byte alignment for your structures, byte order and more, and very high level languages even manage your memory use with automatic garbage collection. Locality is a concern in all these languages, and there's no abstraction for that (yet).
The simple solution to the issue presented in the article itself is any of the following:
1. Remove the call to len(). Handle the empty case from the drop/get/peek call, which would appear to be the idiomatic style for Erlang.
2. Track the size of the queue as is suggested in the docs, to improve the efficiency of this specific call.
Or, you could tackle the underlying issue, which is a lack of locality in the list data type following a GC cycle. This would probably involve allocating a chunk of nodes at once so it's the size of a cache line on the platform that's running the VM. Seems to me like that would make a patch worthy of being merged upstream rather than prefetching random chunks of memory.
Yeah, RabbitMQ opted for the 2nd because of that. BQ in that case is whatever queue implementation was selected; if you look at both the lqueue.erl and priority_queue.erl, you see that they implemented len() by pulling apart a tuple and taking a value verbatim; no further operations. As such, I'd assume every queue type they support has its own internal wrappers to maintain a length.
Frankly, the problem described is not a problem at all, as it would be weird to rely on cache hits to ensure performance of linked lists. Cache misses are an expected drawback of linked lists, as they have other advantages.
I would think that if you were to ask people about a sudden and unexpected 9-fold drop in performance, they would say that it was a problem.
Certainly to me, never knowing when your system might slow down by three orders of magnitude (base 2, nearly one order of magnitude base 10) is a problem.
Yes, I agree that this is a peculiar case and it is worth reading and keeping in mind that such things can happen.
I am just saying that I think it would be more correct to consider this as a 9-fold performance increase gained from caching, as no one should rely on caching when dealing with linked lists.
Sure, it's a problem because the programmer used a completely unsuitable algorithm that is slow for this use case regardless of memory layout. It's similar to looping over a predicateless SELECT instead of using COUNT. And then blaming databases because I misunderstood the "promises" provided. Although, the original article doesn't make the claims that your submitted title did. (No hostility, just pointing out that you're adding a wrong interpretation.)
Nobody ever said that. Memory and time are especially important for high-level languages, because they spend them in an uncontrollable way inside said languages.
For instance, the common approach of garbage collection is a huge topic in every high-level language. Because it takes a relatively simple operation (delete one item) and batches it up. So when it happens, it takes thousands or millions of times as long AND inevitably happens when you can least afford it.
You never stop thinking about memory (and cpu time) in any environment.
Erlang actually doesn't take the common approach, which is part of the reason it usually does "soft" real time pretty well. This is the first thing I found in Google when searching for a description of how it works:
In my experience, if you use a List supplied by the standard library of your language, it isn't a linked list. In fact it's usually an "ArrayList" style. (1)
This is so that length() and item = list[index] operations are not O(n), they are O(1). You do have to be aware of the tradeoffs, e.g. that inserts into the middle of the list can be slower (and YMMV with adding at the end. It depends on if a realloc is triggered).
It's worth checking which implementation you actually are using, but linked lists are the exception, for good reasons.
And when you have no choice but to walk the list, it's well known that instead of writing this (in c# functional style pseudocode)
if (someList.Count(item => someCond(item)) > 0) doSomething();
... it's better to write
if (someList.Any(item => someCond(item)) doSomething();
This is for much the same performance reasons - you don't need the exact count when all you want to know is if there are any, and the count can become expensive if the list is long.
Yes, that's the idea. ArrayLists don't work well at all if you want persistent data structures as your language's default. Also arrays are less sensible for pattern-matching. (Note that most functional languages offer normal stateful arrays as an option, if they're necessary; interestingly, Erlang does not.)
I'm not sure why they're hibernating when they have a big queue? Usually, one would hibernate when you're idle: you've done all the work (queue is zero), and you haven't seen a request in a while.
But certainly, don't do length(List) if you need to see if it's empty; pattern match against ([]) and ([Head | Tail]).
They dont. In java we fairly often dive into memory layouts and how bytecode is JITed into machine code. But to be fair Java is low-level language at some levels.
If a linked list is a built-in Erlang data structure, why doesn't it keep a length count? If it's common knowledge that this operation is O(N), why does RabbitMQ call it on dequeue, and why doesn't it keep its own length count (and why isn't that the obvious fix)?
I can't think of any JavaScript built-in (the language where I spend my days) with a name like "length" that internally dereferences N pointers, but I would not be shocked to discover that dereferencing tens of thousands of pointers was taking a few milliseconds, and I would not write my data structure that way.