In the interim I've been using standard C# web requests with the cookies taken from the ChromeDriver. Unfortunately, that doesn't work for downloads that aren't direct links.
I encountered that problem. The ad hoc fix was to have a version field in each event and functions that translate an old event into new event(s). The code that processes the events only processes events of the current version.
If your old events were denormalized, this might result in repeated events when they are split.
To add to that, you can treat it like you would schema migration on databases: implement v1 to v2, v2 to v3, etc., and replay the migrations in order to migrate from whatever version the event is at to the latest version. This allows keeping event migration code as immutable as the event versions it migrates between.
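A minimal sketch of that migration chain in Python (the event shapes, field names, and version semantics here are all hypothetical, just to show the replay mechanism):

```python
# Hypothetical sketch of chained event migrations (v1 -> v2 -> v3).
# Each migration function is immutable once written; to upgrade an old
# event, we replay the migrations in order until the current version.

CURRENT_VERSION = 3

def migrate_v1_to_v2(event):
    # Pretend v2 renamed "user" to "user_id".
    e = dict(event)
    e["user_id"] = e.pop("user")
    e["version"] = 2
    return e

def migrate_v2_to_v3(event):
    # Pretend v3 split a combined "name" field into two fields.
    e = dict(event)
    first, _, last = e.pop("name").partition(" ")
    e.update(version=3, first_name=first, last_name=last)
    return e

MIGRATIONS = {1: migrate_v1_to_v2, 2: migrate_v2_to_v3}

def upgrade(event):
    # Replay migrations until the event reaches the current version.
    while event["version"] < CURRENT_VERSION:
        event = MIGRATIONS[event["version"]](event)
    return event

old = {"version": 1, "user": "42", "name": "Ada Lovelace"}
print(upgrade(old))
# -> {'version': 3, 'user_id': '42', 'first_name': 'Ada', 'last_name': 'Lovelace'}
```

The processing code then only ever sees `CURRENT_VERSION` events, and each migration function never changes after the next version ships.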
Does anyone know details about the KSQL engine that computes the queries? According to their Git repo, there can be multiple KSQL engines in a client-server configuration. Is the workload for one query distributed? Is the SQL translated into a program using the Kafka Streams API?
Yes, queries are translated into Kafka Streams applications.
In client-server (cluster) mode, each query runs on every instance of the engine, the same way Kafka Streams apps run on multiple instances.
A pump-and-dump scheme drives the price up by promoting the commodity, only to allow insiders who bought it earlier to sell when the price peaks, just before it plunges again.
You can easily see it in penny stocks and/or OTC stocks: the price stays flat, then peaks, then crashes back to roughly the pre-pump-and-dump price.
I haven't seen such a behaviour in any of the new coins. Does anyone have an example or link to a chart?
The scammy ICOs (which is most of them) aren't really pump and dumps. They're more like variations of the "big store" and other phony front investor scams. It's more like outright stock fraud. They're selling shares in something that doesn't exist and isn't going to exist.
I think this is the biggest difference: selling things with no real intent to actually build the systems. The other issue is that a real company could get sued, and you'd be able to track the real assets bought with the ill-gotten gains. With these ICOs, I'm not sure how you track the money to even know who to sue.
Does anyone know how it differs from other languages like Stan?
Both use Hamiltonian Monte Carlo. As far as I know, Stan cannot model factors but can restrict a variable to integers. In this case, Turing.jl and Stan do the same thing and fall back to slower MCMC.
So why should I switch to Turing.jl?
Their wiki shows that Turing.jl is about 10 times slower than Stan:
Results:

    Model                                Turing    Stan   Ratio
    Simple Gaussian Model                  1.2     0.06   20.24
    Bernoulli Model                        1.53    0.05   32.73
    School 8                               2.34    0.1    24.41
    Binormal (sampling from the prior)     0.88    0.11    8.37
    Kid                                   32.66    4.72    6.92
    LDA-vec                               23.94    3.78    6.34
    LDA                                   72.78    3.78   19.28
    Mixture-of-Categorical                22.28    6.41    3.48

NOTE 1: Numbers are inference time in seconds; smaller indicates better performance.
We're setting up parameter estimation in DifferentialEquations.jl with both Stan.jl and Turing.jl as separate but connected projects. For data generated from the Lotka-Volterra equations, Turing.jl seems to be able to recover the four parameters well in minutes, whereas Stan ran for a few hours and didn't get it right. It's hard to know whether that's due to differences in the ODE solver and its accuracy (Stan has to use its internal solvers, whereas Turing.jl gets to use our whole stack; for testing we chose Tsit5, which should be roughly equivalent to their rk45, but without knowing the details of their solver it's hard to know how different it truly is) or due to the chosen MCMC method.

One major advantage of Turing.jl, though, is that it lets us use our full stack: Distributions.jl and all of JuliaDiffEq (ODE/SDE/DAE/DDE/etc. solvers), whereas Stan requires you to use their two ODE solvers. So we're going to finish up this DiffEq->Stan automatic generation bridge for testing and benchmarking purposes, but the Turing.jl route looks much more promising. They have done a really good job.
I'm surprised you're able to get Turing to run faster, to be honest. I looked at Turing and was impressed by it, but in general my experience with MCMC libraries in Julia is similar to what the OP posted, being much slower than Stan (MAMBA and Turing being the two Julia libraries I've tried--both have seemed nice but have run significantly slower on the tasks I've used them for).
Incidentally, I haven't been able to get Turing to run on Julia 0.6, and am not going to reinstall an earlier version just for Turing. I've been waiting for a little while now.
My experience with Turing and MAMBA have kind of diminished my enthusiasm for Julia. Both libraries kind of represent what I was looking for in Julia (similar to what you mention about using your Julia stack), but the speed was a kind of a rude awakening. I'm kind of coming to the opinion that LLVM-based languages need to demonstrate much more consistent performance before they're really ready to replace C (Rust, Nim, and Crystal seem like they might be on their way, though).
I recommend someone thinking about Julia pay attention to these benchmarks:
While I have absolutely no problem believing that a carefully tuned special-purpose system like Stan readily beats the equivalent Julia libraries, I'd caution against making inferences about the speed of any code written in a particular language as a result. Once a certain level of performance is reached, for most applications the primary performance factor is algorithmic (and I include things like memory layout/cache friendliness in this). As the simplest example, consider matrix multiply. The venerable triple loop will perform about the same whether it is written in Julia, C, Assembly, Rust, Nim or Fortran (ignoring for the moment compilers with specific matmul pattern recognition). However, depending on your CPU, you're leaving anywhere between 10-100x performance on the table by not using BLAS. Now, BLAS happens to be written in assembly and Fortran, but that's not the point. If you used the same algorithms in Julia code, it would achieve the same performance.
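The matrix-multiply point can be made concrete in a few lines of Python: the naive triple loop and a BLAS-backed call (NumPy's `@`, which dispatches to BLAS) compute the same result, but on large matrices the BLAS path is typically orders of magnitude faster:

```python
import numpy as np

def naive_matmul(A, B):
    # The venerable triple loop: O(n^3) and cache-unfriendly.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]
            C[i, j] = s
    return C

rng = np.random.default_rng(0)
A = rng.random((50, 50))
B = rng.random((50, 50))

# Same mathematical result; A @ B uses BLAS under the hood and is
# dramatically faster at larger sizes, regardless of host language.
assert np.allclose(naive_matmul(A, B), A @ B)
```

The algorithmic gap (blocking, vectorization, cache reuse inside BLAS), not the implementation language of the outer loop, is what dominates here.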
As for the benchmarks you posted, the worst I see Julia performing in that list is on the JSON benchmark.
However, again, that's benchmarking a library. Doing high performance text parsing well is a non-trivial problem and JSON.jl isn't a super high performance library. The fact that nobody has sat down and written one yet is more of a reflection of the user community (since JSON is mostly associated with the web stack, JSON-parsing tends to be more incidental for julia users, than languages/libraries that are more focused in that area) than any performance limitation of the language - in fact I think I recently did see somebody working on a higher-performance version, but I don't have the link handy.
To summarize, it is true that some languages (or rather their implementations) are slow, causing an automatic slowdown to projects written therein, but that is generally not the case for julia (or C or Nim or Rust, etc). Once you get to that point, other things matter more, and it is important to distinguish that. In the case of Turing.jl, the linked wiki mentions that one of the primary factors impacting performance is their use of forward AD, when reverse AD would be algorithmically better. It seems understandable that a newer project would be algorithmically less sophisticated than a more established one. However, that'll get fixed over time, and the other benefits will become more and more prominent as a result.
> Yes, Julia is really fast with some things, but for other things it's much slower
No sane person would deny this :) Regarding those benchmarks, however, I would note that:
a) They are two major Julia versions old, many of the benchmarks Julia appears in are string-centric, and since then Julia's string handling has been improved.
b) Glancing through, I see a few pretty basic ways in which the Julia code could be sped up without doing anything convoluted.
The benchmarks are based on old Julia--something I've wondered about--but they still are consistent with my experience (and with the benchmarks the OP posted).
I really like Julia, so I don't want to come across as a hostile critic, more of a friendly one. I'm someone who has tried, and continues to try, Julia for lots of things. But repeatedly I've bumped into this phenomenon where Julia is maybe 3-5x slower than what you'd see in C, maybe more, and am faced with what that means in practice. It's not just Turing, and it's not just a single version of Julia. It's a heck of a lot better than Python or R, but when I realize it's still significantly faster to do something in C, and you're going to wrap something around C anyway, maybe Python or R would be better just because of the resource base associated with them.
Maybe there's some fundamental issue I haven't grappled with, like Julia will never actually be as fast as C, and I just need to be aware of what that means, practically speaking... I don't know. Maybe it will eventually hit that performance range, and it's just a matter of optimizing algorithm choice and whatnot. But I get the sense that the way I was hoping to use Julia isn't actually doable, at least at the moment.
I do think many people are coming to Julia because of some sense they can get the performance of C with the expressivity of Python, and at the moment I think that's an unrealistic expectation for a lot of things. I think the parent post in this subthread exemplifies that quite well: for many, many reasons, it shouldn't be surprising that Stan outperforms Turing. If this was any other language, it would be a matter of course. But because of the expectations set up for Julia, you see discussions like this.
If you can turn one of those trials into a small example and post it on the forums, it'd be well worthwhile. We've often helped folks squeeze the last few bits of performance out of their code – and really, anything past 2x of C is considered a bug.
You defined rewrite rules that reduce an expression down to a 'normalized' form, such that two expressions are equivalent if they reduce to the same 'normalized' form. These rules are obvious for simple arithmetic expressions.
Is it possible to deduce the normalized form and the rewrite rules from a set of axioms automatically (e.g. for relational (SQL-like) expressions with operators like join, project, filter)?
In the terminology (may I say: terms) of term rewriting, you are looking for a so-called completion procedure. This takes a set of axioms and turns it into a convergent term rewriting system (or not, because it may not terminate!).
If the completion procedure succeeds, you end up with rules that always terminate, and which reduce any given term to its normal form. In the literature, search for Knuth-Bendix and Huet completion procedure, if you are interested. For SQL, it may get quite complex, but still doable, potentially with some extensions of the primary method.
There are all kinds of variations on the general scheme. If you are interested in term rewriting, a solid starting point is "Term Rewriting and All That" by Baader and Nipkow.
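As a toy illustration of reduction to normal form (this is not a completion procedure, just a hand-picked convergent rule set; all names and the term encoding are made up for the sketch):

```python
# Toy term rewriting sketch: terms are nested tuples like
# ("+", "x", 0), and we apply rules bottom-up until a fixpoint
# (the normal form). With a convergent rule set, this always
# terminates and yields a unique result per equivalence class.

RULES = [
    (lambda t: t[0] == "+" and t[2] == 0, lambda t: t[1]),  # x + 0 -> x
    (lambda t: t[0] == "*" and t[2] == 1, lambda t: t[1]),  # x * 1 -> x
    (lambda t: t[0] == "*" and t[2] == 0, lambda t: 0),     # x * 0 -> 0
]

def rewrite(term):
    if not isinstance(term, tuple):
        return term  # variables and constants are already normal
    # Normalize subterms first, then try each rule at the root.
    term = (term[0],) + tuple(rewrite(a) for a in term[1:])
    for matches, build in RULES:
        if matches(term):
            return rewrite(build(term))
    return term

# x*1 + y*0 reduces to x: two expressions are equivalent
# exactly when they rewrite to the same normal form.
print(rewrite(("+", ("*", "x", 1), ("*", "y", 0))))  # prints x
```

A completion procedure (Knuth-Bendix) is what would, when it succeeds, derive a rule set like `RULES` from the bare equational axioms automatically.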
Very cool. The whole pandas API is huge. What was your focus for your implementation?
Isn't there another well-known JS library that allows SQL operations, including aggregates? How fast is it compared to that?
What happens if I load a 100MB csv file into a pandas.js dataframe in the browser and compute an aggregate? How fast is this?
The initial focus was on using it for plotting in React. We wanted to make shouldComponentUpdate decisions quickly and the Immutable base of pandas-js allows for quick, cheap comparisons after querying data directly from our API. See the post on it here: “Pandas & Immutable.js” @StratoDem https://insights.stratodem.com/pandas-immutable-js-2d9bf0106...
The next step will involve performance optimizations as the aggregates will be slow at 100MB. At that scale, we do the operations server side (we've exclusively been using this client side). Thanks for raising that concern!
May I ask why you have chosen Spark? I get that it is more convenient than Hadoop. But why not use an analytics DB like Redshift or Impala? What's the use case for Spark? Do you store your data in flat files, and how do you access it with standard tools like Power BI, Tableau, or Looker?
The European Union General Data Protection Regulation regulates only personally identifiable information. If (just an assumption) the collected data does not identify the user, does it still apply?