No, the US is _leading_ the AI race, but the race isn't over.
What's the point of leading the race for 90% of it, if they're gonna slip on their own sweat and fall down by the end? In non-metaphorical terms, what's the point of spending billions of dollars rushing to get the best AI tech at all costs, when the competition can distil your progress and catch up in 6-12 months while only spending 1% of what you spent?
Even in the aspect the article cares about, commercialization, the US is starting to lose market share. I've seen people move from cc/codex plans to glm/opencode plans due to the recent squeeze the US companies put on plan usage. The US companies are screwed if that sticks: not everyone needs the bleeding-edge models, they just want to pay $20/month and have the models be decently capable.
What if it is not winner-take-all? What if there is no race? What if what the USA has been doing is just burning money, with a possibly unsustainable debt load and way overbuilt valuations...
AI being commodity server capacity might be a thing. And the customers might even manage without hyperscalers... In that sort of end scenario the whole current market might look rather foolish.
>What if there is no race. What if what USA has been doing is just burning money with possibly unsustainable debt load and way over build valuations...
You mean, what if the hype-based billionaire-class is wrong? Isn't suggesting that a sin in America these days?
When a cyclist is leading a pack and pushing themselves against the air resistance for half the race, do you expect that cyclist to win, or one of the ones behind that's been taking it easy in the slipstream?
I find it very strange that the GP felt the need to correct a difference between leading and winning. If you're at the front of the pack in a race, you are both leading the pack and winning the race.
If your team has more points than the other team, you are both leading the contest and winning the contest.
It is a distinction without a difference.
The elephant in the room, and where the analogy breaks down, is that a race has an end, the finish line. A sports match has a victory condition of some type. Nobody has a damn clue as to the victory condition of this hyperscaler craze. Anyone who says otherwise is incorrect.
GP here, leading and winning are different things in the race context/metaphor.
In foot/cycling races there's often a pack leader, that leader is often not the winner of the race, all they're doing is taking the brunt of the air resistance while everyone else slipstreams behind. For a casual observer it seems that the pack leader will win, but everyone knows that it's gonna be someone that paced themselves that's going to overtake the first spot at the tail end of the race.
You’re moving the goalposts and tying it to one specific sport. I didn’t say “winner” nor did anyone else. “Winning” is the operative word, and tying the whole analogy to cycling is as close to a strawman as one can get while having the ability to claim otherwise.
I have never ever heard a commentator say something like "Arsenal are currently winning with 2-0 against X". It is always leading: "Arsenal are currently leading with 2 goals against X".
Leading the race makes sense if it's a winner takes all market. AI cannot be a winner takes all market, because of national security reasons.
I would also argue that as AI gets better it will also be more fungible. It will be valuable like electricity. Lots of companies make good money producing electricity, but not the kind of money current investors are hoping for.
Mark Cuban in a recent interview answered your question: companies are afraid there is going to be just one in the end—sort of the way there is one ad-company now on the internet. They want to be that one.
Whether they're correct that there can be only one is of course a matter of debate. But that is at least the mind-set they are operating under according to Cuban.
That was all based on the assumption that scaling LLMs would lead to AGI. That didn't happen and won't, it can be proven for anyone capable of digging into the details of what models actually are and what they are doing (I recommend Chris Hay's videos).
It's becoming more obvious every day to people. They'll realize that 1000x the cost for marginal improvements isn't worth it and the market will open up. It'll become more based on tooling and smaller, more task-focused models instead of this crazy project to create a data center god to rule over humanity.
Why would Mark Cuban know anything about the motivations of today’s big tech companies? He has not been involved in tech businesses since he sold a radio on the internet website 26 years ago.
He was never based in Silicon Valley, and the closest he got was selling a website to Yahoo in 1999. After that, he has mainly sold sports and his media personality for TV shows.
Moreover, why would leaders of trillion dollar big tech companies subject to myriad securities laws be discussing intimate business details with random people that have no domain expertise or influence?
I do not see the names of people in big tech business leadership positions, except maybe Andreessen, if he counts. All the other ones look like media personalities or journalists or some two-bit SV founder.
If I'm understanding this right, this presupposes that the models were pre-trained on unfiltered data like with the "floor" models, so when comparing between the "retail" and uncensored models they will obviously not match the floor because they were not trained on the same data in the first place.
To me it stands to reason that a model that has only seen a limited amount of smut, hate speech, etc. can't just start writing that stuff at the same level just because it no longer refuses to do it.
The reason uncensored models are popular is because the uncensored models treat the user as an adult, nobody wants to ask the model some question and have it refuse because it deemed the situation too dangerous or whatever. Example being if you're using a gemma model on a plane or a place without internet and ask for medical advice and it refuses to answer because it insists on you seeking professional medical assistance.
> speculative decoding which, generally speaking, is not the same quality as serving the model without it.
I've never heard of ANY speculative decoding that wasn't lossless. If it was lossy it'd be called something else.
This page is just a port of DFLASH to gguf format. It only implements greedy decoding like you said, so the outputs will be inferior to what you'd get with proper sampling, but not inferior to greedy decoding on the original model. Tho that's just a matter of implementing temperature, top_k, etc.
Same reason why prompt processing is faster than text generation.
When you already know the tokens ahead of time you can calculate the probabilities of all tokens batched together, incurring significant bandwidth savings. This won't work if you're already compute bound, so people with macs/etc. won't get as much benefit from this.
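To make the "batched together" part concrete, here's a toy sketch of the greedy case. `target_next` and `draft_next` are made-up stand-ins for the big model and the small draft model (not any real library's API), and the inner verification loop represents what would be a single batched forward pass on real hardware. The point is only that every emitted token is the one the target would have picked anyway, so the output matches plain greedy decoding.

```python
def target_next(tokens):
    # Hypothetical stand-in for the large model: deterministic greedy next token.
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    # Hypothetical cheap draft model: agrees with the target most of the time,
    # but is deliberately wrong at every 4th position.
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def greedy_decode(prompt, n):
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n:
        # 1. Draft k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # 2. Verify all drafted positions. On a real model this is ONE batched
        #    target pass over out + draft, which is why it saves bandwidth.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched; the same pass also yields a bonus token.
            accepted.append(target_next(out + draft))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], target_passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2, 3], 20)
    print(tokens == greedy_decode([1, 2, 3], 20), passes)
```

With sampling instead of greedy decoding the accept/reject rule is probabilistic rather than an equality check, which this sketch does not cover.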
Are Macs/etc compute bound with their 'it fits in unified memory' language models? Certainly by the time you're streaming weights from SSD you must be back in a bandwidth-bound regime.
From what I understood, if we’re talking a single user on a mac (not batching) you’re rarely compute bound in the first place. More rows per pass is nearly free that way when cores were sitting idle anyway.
If that’s wrong I would certainly appreciate being corrected, though. But if it’s right, a 2.9x speed-up after rejected tokens, nearly for free, sounds amazing.
That will depend on the model, but they'll hit compute limits before a typical GPU in almost all cases. Macs will still get a speedup from this, just not one as big as the one reported.
Official sites make things worse on purpose after getting any sort of traction because they can't stop chasing profits.
I don't watch sports, but my father watches soccer. He really only cares about 1 team and the national games from our home country. He was spending over $100/month to be able to watch the games, and they weren't even in his native language. Now he pays $80/year for a pirate IPTV service and not only can he watch the games anywhere he wants, he also gets native language commentary for the games, national tv channels like news, etc.
When pirates can charge you money and offer a superior service, it absolutely is a service problem. You can claim that the realities of licensing and whatnot don't allow official channels to provide the best service they can, but that's not true in this case. When the same provider is splitting game broadcast from one team into different packages you know they're just trying to extract the most amount of money possible.
IDK the deal with scanlator sites nowadays, but I assume the official sites can provide more timely translations for manga since they can access the source material before anyone has seen it. I know most popular manga gets translated within hours of release, but if you're following some more niche stuff it can be several days. I also know a lot of scanlators have patreon pages so it's not like the demand from paying customers for translated media isn't there.
Not parent but I can guess from watching mostly from the sidelines.
They introduced a 1M context model semi-transparently without realizing the effects it would have, then refused to "make it right" for the customer, which is a trait most people expect from a business when they spend money on it, especially in the US, and especially when the money spent is often in the thousands of dollars.
Unless Anthropic has some secret sauce, I refuse to believe that their models perform anywhere near as well at >300k context sizes as they do at 100k. People don't realize it, but even a small drop in success rate becomes very noticeable if you're used to having near 100%, i.e. 99% -> 95% is more noticeable than 55% -> 50%.
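One way to see why, as a rough back-of-the-envelope sketch (the helper name is just for illustration): compare how often things fail before and after the drop, rather than how often they succeed.

```python
def failure_ratio(before, after):
    """How many times more often things fail after the success rate drops."""
    return (1 - after) / (1 - before)


print(round(failure_ratio(0.99, 0.95), 2))  # 5.0  -> failures are 5x as frequent
print(round(failure_ratio(0.55, 0.50), 2))  # 1.11 -> failures barely change
```

Going from 1 failure in 100 to 5 in 100 means you hit a failure five times as often, while 45 in 100 versus 50 in 100 feels about the same.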
I got my first claude sub last month (it expires in 4 days) and I've used it on some biggish projects with opencode. It went from compacting after 5-10 questions to just expanding the context window. I personally notice it deteriorating somewhere between 200-300k tokens, and I either fork a previous context or start a new one after that, because at that size even compacting seems to generate subpar summaries. It currently no longer works with opencode, so I can't attest to how well it worked the past week or so.
If the 1M model introduction is at fault for this mass user perception that the models are getting worse, then it's Anthropic's fault for introducing confusion into the ecosystem. Even if there were zero problems introduced and the 1M model was perfect, if your response when the users complain is to blame it on the user, then don't expect the user to be happy. Nobody wants to hear "you're holding it wrong", but it seems that Anthropic is trying to be the Apple of LLMs in all the wrong ways as well.
Especially since Codex faced the same issue but the team decided to explicitly default to only ~200k context to avoid surprises and degradation for users.
Instead of asking the model "Here's this codebase, report any vulnerability.", you ask "Here's this codebase, report any vulnerability in module\main.c".
The model can still explore references and other files inside the codebase, but you start over a new context/session for each file in the codebase.
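A minimal sketch of how that per-file loop could be set up, assuming a C codebase on disk. The function name and the suffix filter are my own choices for illustration, and actually sending each prompt to a fresh session is left out because that depends on whatever tool you use.

```python
from pathlib import Path

# Prompt wording taken from the comment above; {path} is filled in per file.
PROMPT = "Here's this codebase, report any vulnerability in {path}."


def per_file_prompts(root, suffixes=(".c", ".h")):
    """Yield one prompt per source file; each is meant for a brand-new session."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            yield PROMPT.format(path=path.relative_to(root).as_posix())
```

You'd iterate over `per_file_prompts("path/to/repo")` and start a new context for every prompt, so nothing from one file's analysis leaks into the next.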
Honestly, that's the only way I've ever been able to trust the output. Once you go beyond the scope of one file it really degrades. But within a single file I've seen amazing results.
Are you not supposed to include as many _preconditions_ (in the form of test cases or function constraints like "assert" macro in C) as you can into your prompt describing an input for a particular program file before asking AI to analyze the file?
Please, read my reply to one of the authors of Angr, a binary analysis tool. Here is an excerpt:
> A "brute-force" algorithm (an exhaustive search, in other words) is the easiest way to find an answer to almost any engineering problem. But it often must be optimized before being computed. The optimization may be done by an AI agent based on neural nets, or a learning Mealy machine.
> Isn't it interesting what is more efficient: neural nets or a learning Mealy machine?
...Then I describe what is a learning Mealy machine. And then:
> Some interesting engineering (and scientific) problems are:
> - finding an input for a program that hacks it;
> - finding a machine code for a controller of a bipedal robot, which makes it able to work in factories;
Does anyone familiar with the literature know if anyone has tried figuring out why we don't add "speaker" embeddings? So we'd have an embedding purely for system/assistant/user/tool, maybe even the turn number if, e.g., multiple tools are called in a row. Surely it would perform better than expecting the attention matrix to look for special tokens, no?
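To be clear about what I mean, here's a toy sketch in plain Python. The tables, dimension, and names are all made up for illustration, and a real model would learn these vectors during training. The idea is just that each token's input vector is the sum of its token embedding and an embedding for whoever "said" it, similar in spirit to the segment embeddings BERT adds to its inputs.

```python
import random

ROLES = ["system", "user", "assistant", "tool"]
DIM = 8  # toy embedding size, chosen arbitrarily

random.seed(0)


def rand_vec():
    # Random stand-in for a learned embedding vector.
    return [random.gauss(0.0, 0.02) for _ in range(DIM)]


token_table = {}  # token -> vector, filled lazily for this toy example
role_table = {role: rand_vec() for role in ROLES}  # one vector per speaker


def embed(tokens_with_roles):
    """Return token embedding + speaker embedding for each (token, role) pair."""
    out = []
    for token, role in tokens_with_roles:
        tok_vec = token_table.setdefault(token, rand_vec())
        out.append([t + r for t, r in zip(tok_vec, role_table[role])])
    return out


if __name__ == "__main__":
    vecs = embed([("delete", "user"), ("delete", "tool")])
    print(vecs[0] != vecs[1])  # same token, different speaker, different input
```

With this, the same text coming from the user and from a tool result would look different to the model at the input layer, instead of only being distinguishable via the surrounding special tokens.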
You can charge $10 on the account and get unlimited requests. I abused this last week with the nemotron super to test out some stuff and made probably over 10000 requests over a couple of days and didn't get blocked or anything, expect 5xx errors and slowdowns tho.
It's probably another ASR model that focuses on benchmarks and simple uses instead of more challenging real use cases.
I upload edited gameplay vods of twitch streams on youtube, and use whisper-large-v3 to provide subtitles for accessibility reasons (youtube's own auto-subtitles suck, tho they've been getting better).
My checklist for a good ASR model for my use case is:
1. Have timestamp support.
2. Support overlapping speakers.
3. Accurate transcripts that don't coalesce half words/interrupted sentences.
4. Support non verbal stuff like [coughs], [groans], [laughs], [sighs], etc.
5. Allow context injection of non-trivial sizes (10k+ words)
1 is obvious because without it we can't have subtitles. Forced alignment fails too often.
2 is crucial for real world scenarios because in the real world people talk over each other all the time, in my case it's a streamer talking over gameplay audio, or when the streamer has guests over. When 2 people speak the transcript either ignores one of them, or in the worst case, both of them.
3 and 4 are an accessibility thing, if you're deaf or hard of hearing having a more literal transcript of what's being said conveys better how the speaker is speaking. If all subtitles are properly "spell-checked" then it's clear your model is overfit to the benchmarks.
5 is not a requirement per se, but more of a nice to have. In my use case the streamer is often reading stream chat, so feeding the model the list of users that recently talked, recent chat messages, text on screen, etc. would make for more accurate transcripts.
I've tried many models, and the closest that fulfill my needs are LLM style models on top of forced alignment. That's too slow, so I've been sticking with whisper because with whisperx I can get a transcript in 5 minutes with just a single command.
One thing all these models do (including whisper) is just omit full sentences, it's the worst thing a model can do.