Hacker News | ansk's comments

I see what you're getting at, but determinism isn't the right word either. LLMs are fundamentally deterministic -- they are pure functions which output text as a function of the input text and the network parameters[1]. Depending on your views on free will, it could be effectively argued that humans are deterministic as well.

The concept you're touching on is the idea that LLMs (and humans) are functions which are inscrutable. Their behavior cannot be distilled into a series of logical steps that you can fit in your head; there are no invariants which neatly decompose their complexity into a few interpretable states; and the input and output spaces are unstructured, ambiguous, underspecified, and essentially infinite. This makes them just about impossible to reason about or compose using the same strategies and analysis we apply to traditional programs.

[1] Optionally, they can take in a source of entropy to add nondeterminism, but this is not essential. If LLM providers all fixed their PRNG seeds to a static value, hardly anyone would notice. I can't imagine there are many workflows which feed an LLM the exact same prompt multiple times and rely on the output following some statistical distribution. In fact, even if you wanted this, you may just end up getting a cached response.


Let's be real: if you and I both ask Claude to generate a feature on the same project, what are the chances that it spits out 100% identical code? But if we both build the project using the same Dockerfile, we will get the same binary and the same image. Products built around LLMs are nondeterministic, unlike compilers.

I can assure you that a fully deterministic and equally effective claude is possible to build. And yes, that would mean identical prompts would yield 100% identical output 100% of the time. It would still make the occasional logical or factual error, but it would do so deterministically. Would this solve any of the problems with building reliable programs using LLMs?

It's nondeterministic because we chose it to be, by turning up the 'temperature' setting. I bet if you run an open-weights model at temperature 0, on the same device, with the same prompt and parallelism turned off, you will get a much more deterministic result (excluding some floating-point nondeterminism).
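That intuition is easy to sketch in toy code (names and logits are hypothetical, not any provider's actual sampling code): the logits are divided by a temperature before the softmax, temperature 0 degenerates to a plain argmax, and any remaining randomness comes only from an explicitly seeded RNG.

```python
import numpy as np

def sample_token(logits, temperature, rng):
    # Temperature 0 collapses to greedy argmax -- fully deterministic.
    if temperature == 0:
        return int(np.argmax(logits))
    # Otherwise, sample from the temperature-scaled softmax distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5])
rng = np.random.default_rng(0)
# Temperature 0: the same token every time, regardless of the RNG.
assert all(sample_token(logits, 0, rng) == 0 for _ in range(10))
```

Even at temperature > 0 the output is reproducible once the seed is fixed, which is the grandparent's point about fixed PRNG seeds.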

> Optionally, they can take in a source of entropy to add nondeterminism, but this is not essential. If LLM providers all fixed their PRNG seeds to a static value, hardly anyone would notice

Everyone added /dev/random to their offerings, so every LLM coding tool is nondeterministic.


The guy writing a thumbnail pipeline isn't getting petabytes (exabytes?) of storage to cache all videos from the past week in their entirety. If this quantity of data is being stored, it's being stored deliberately and at significant cost.


The other explanations here don't explain the long delay between the start of the investigation and the release of the footage. Yes, storing customer data is what we'd expect from Google and yes, the FBI can coerce Google to provide this data for their investigations. But it does not take a week for Google to find a file on their servers.

My hunch is that Google initially tried to play dumb to avoid compliance, as to not reveal they do in fact retain customer data. They had a plausible excuse as well -- the owner had no subscription so they don't store the data -- and took a gamble that this explanation would suffice until the situation resolved itself. I suspect that authorities initially took Google's excuse at face value, since they parroted this explanation to the public as well. As pressure mounted on authorities to make some headway on the case, they likely formally exercised whatever legal mechanisms they have at their disposal to force Google's hand, and only then was the footage released.


This is a wild claim. I would think criminal charges for something like obstruction would be possible if Google intentionally hid this from investigators for up to a week. That could result in the difference between the victim being found alive or not.


The implication that OpenAI is a YC company in the same sense as the other listed companies is somewhere between misleading and dishonest. Even more distasteful to show founding teams for all the others, then just Sam for OpenAI.


Of all Schmidhuber's credit-attribution grievances, this is the one I am most sympathetic to. I think if he spent less time remarking on how other people didn't actually invent things (e.g. Hinton and backprop, LeCun and CNNs, etc.) or making tenuous arguments about how modern techniques are really just instances of some idea he briefly explored decades ago (GANs, attention), and instead just focused on how this single line of research (namely, gradient flow and training dynamics in deep neural networks) laid the foundation for modern deep learning, he'd have a much better reputation and probably a Turing award. That said, I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.


I think one of the best things to learn from Schmidhuber is that progress involves a lot of players and over a lot of time. Attribution is actually a difficult game and usually we are only assigning credit to those at the end of some milestone. It's like giving a gold medal to the runner in the last leg of a relay race or focusing only on the lead singer of a band. It's never one person that does it alone. Shoulders of giants, but those giants are just a couple of dudes in a really big trenchcoat.

Another important lesson is that often good ideas get passed over because of hype or politics. We often like to pretend that science is all about the merit and what is correct. Unfortunately this isn't true. It is that way in the long run, but in the short run there's a lot of politics and humans still get in their own way. This is a solvable problem, but we need to acknowledge it and create systematic changes. Unfortunately a lot of that is coupled to the aforementioned one.

  > I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.

As should we all. Clearly he was upset that others got credit for his contributions. But what I do appreciate is that he has recognized that it is a problem bigger than him, and is trying to combat the problem at large and not just his own little battlefield. That's respectable.


It's a bit of an aside but I believe this is one reason Zuckerberg's vision for establishing the superintelligence lab is misguided. Including VCs, too many people get distracted by rock stars in this gold rush.


Just last week I said something in line with that[0]. Many people conflated my claim that Meta has a lot of good people with "Meta /is/ winning the AI race". I only claimed they have some of what I think are the best researchers in the field, but that they do not give those researchers nearly the same resources or capacity to further their research that they give to these "rock stars". Tbh, the same is true of any top lab; I just think it happens more at Meta because Meta is so metric and rock star focused.

So I agree. The vision is misguided. I think they'd have done better had they taken that same money and just thrown it at the people they already have who are working in different research areas. Everyone is trying to win by doing the same things. That's not a smart strategy. You got all that money, you gotta take risks. It's all the money dumped into research that got us to this point in the first place.

It's good to shift funds around and focus on what is working now, but you also have to have a pipeline of people working on what will work tomorrow, next year, in 5 years, and in 10 years. The people who can do that work are there. The people who want to do that work are there. The only thing is there's hardly anyone who wants to fund that work. Unfortunately, it takes time to bake a cake.

Quite frankly, these companies also have more than enough money to do both. They have enough money to throw cash hand over fist at every wild and crazy idea. But they get caught in the hype, which is no different than an over focus on the attribution rather than the process or pipeline that got us the science in the first place.

[0] https://news.ycombinator.com/item?id=45554147


Also, it reminds us that the powerful write history. But history can be rewritten as the balance of power shifts. I imagine the world will hear all about China's contributions to the field if they continue their ascent.


> That said, I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.

Lol, I remember noticing him before covid, back when he was railing against Bengio, Hinton, and LeCun. Can't believe he's still going.


I can only imagine what the Taiwanese can do in Arizona. Truly a synergy for the ages.


Maybe that's why yields there are better? [1]

[1] https://www.tomshardware.com/tech-industry/semiconductors/ts...


Once you're in an air-conditioned environment, the outside world doesn't matter.

More likely he compared the 4nm yield to the 3nm yield in Taiwan?


The moisture of the outside world might not matter. But aircon doesn't protect you from earthquakes, alas.


Yep, you need to install a ground conditioner for that.


So, how many earthquakes have there been?


https://en.wikipedia.org/wiki/2025_Tainan%E2%80%93Chiayi_ear... has a recent example. Google or your favourite LLM can easily give you more or even a complete list of earthquakes in Taiwan.


I meant this part of the USA.


China Airlines recently opened a new direct flight route between Taoyuan and Phoenix. They've been plastering it all over their plane signage. I thought it was funny that the flight must be pretty empty other than the handful of TSMC employees that need to go there.


Apparently China Airlines and Starlux are both going to fly that route next year. I have a hard time imagining there's demand for one let alone both.


Phoenix is the fifth largest city in the United States. It is also one of the major hubs of the west, being in a good location (midway north), having good weather for planes, and having America West headquartered there.

I would think there should be plenty of traffic going through there to Taiwan, similar to the amount going through a hub such as Chicago or NY.


From Wikipedia:

PHX was the 11th-busiest airport in the United States in terms of passenger boardings and 35th-busiest in the world in 2024. The airport serves as a hub for American Airlines and a base for Frontier Airlines and Southwest Airlines.


Earlier this year Eva Air also announced a direct route to Dallas, supposedly starting next month. At the time I felt like it was a tariff negotiation tactic because that one also does not make sense.


FYI to all: China Airlines is a Taiwanese company (ROC) and has no affiliation with mainland China (PRC).


Then why would nations around the world protect Taiwan?


Why do you think the Chinese people from Taiwan want to do anything in Arizona? They're there just to placate the orange guy's rage. They'll never do anything special there.


My personal experience is that the cost of enduring a negative stimulus is not simply a function of the magnitude of the negative stimulus, but rather the magnitude of the negative stimulus in relation to the magnitude of all other concurrent negative stimuli. This study controls the environment so that a single negative stimulus is isolated and additional external negative stimuli are minimized, but it cannot control for the fact that a depressed person also endures a constant barrage of negative stimuli which are generated internally (hopelessness, exhaustion, fear, self-doubt, etc). The magnitude of these internally generated negative stimuli is likely much larger than that of the aversive external stimulus used in this study, so it seems reasonable that the marginal relief obtained by avoiding the external stimulus may be perceived as relatively negligible, or at least diminished to the point that the cost of avoiding is greater than the cost of enduring.


For future reference, if you want proper python bindings for ffmpeg* you should use pyav.

* To be more precise, these are bindings for the libav* libraries that underlie ffmpeg


And on the seventh day, God ended His work which He had done and began vibe coding the remainder of the human genome.


this should do the trick...

  while creatures:
    c = get_random_creature()
    if c.is_dead():
      creatures.remove(c)
    else:
      creatures.add(c.mutate())


You also need selection, not just mutation (I know you are being silly, so am I)


Selection is handled by asynchronous events which populate the is_dead() boolean.

Critiquing my own code, though, it should really be a check against 'can_reproduce()' rather than 'is_dead()'.


The scientific impact of the transformer paper is large, but in my opinion the novelty is vastly overstated. The primary novelty is adapting the (already existing) dot-product attention mechanism to be multi-headed. And frankly, the single-head -> multi-head evolution wasn't particularly novel -- it's the same trick the computer vision community applied to convolutions 5 years earlier, yielding the widely-adopted grouped convolution. The lasting contribution of the Transformer paper is really just ordering the existing architectural primitives (attention layers, feedforward layers, normalization, residuals) in a nice, reusable block. In my opinion, the most impactful contributions in the lineage of modern attention-based LLMs are the introduction of dot-product attention (Bahdanau et al, 2015) and the first attention-based sequence-to-sequence model (Graves, 2013). Both of these are from academic labs.
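The single-head to multi-head step described above is small enough to show in a few lines (a toy numpy sketch; the learned per-head projections and the output projection are omitted, so this is illustrative only): the feature dimension is split into independent heads, the same dot-product attention runs in each, and the results are concatenated, much like grouped convolutions split channels.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    # Scaled dot-product attention; only the 1/sqrt(d) scaling is new
    # relative to the pre-Transformer formulation.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads):
    # Split features into heads, attend within each head, concatenate.
    heads = np.split(X, num_heads, axis=-1)
    return np.concatenate(
        [dot_product_attention(h, h, h) for h in heads], axis=-1
    )

X = np.random.randn(5, 8)           # 5 tokens, 8 features
out = multi_head_attention(X, 2)    # 2 heads of width 4
assert out.shape == (5, 8)
```

With one head this reduces exactly to plain dot-product attention, which is the sense in which multi-head is an incremental change.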

As a side note, a similar phenomenon occurred with the Adam optimizer, where the ratio of public/scientific attribution to novelty is disproportionately large (the Adam optimizer is a very minor modification of the RMSProp + momentum optimization algorithm presented in the same Graves, 2013 paper mentioned above).
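To make the comparison concrete, here are the two update rules side by side in a toy sketch (hyperparameter names are conventional, not taken from either paper; this is one common RMSProp-with-momentum variant, not Graves's exact formulation). Adam's additions amount to folding momentum into a first-moment estimate and adding bias correction:

```python
import numpy as np

def rmsprop_momentum_step(theta, g, state, lr=0.01, rho=0.9, mu=0.9, eps=1e-8):
    # RMSProp with momentum: normalize the gradient by a running RMS,
    # then accumulate a momentum buffer.
    state["v"] = rho * state["v"] + (1 - rho) * g**2
    state["buf"] = mu * state["buf"] + g / (np.sqrt(state["v"]) + eps)
    return theta - lr * state["buf"]

def adam_step(theta, g, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: first/second moment estimates plus bias correction.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g**2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Both minimize f(x) = x^2 from the same starting point.
x_r, s_r = 5.0, {"v": 0.0, "buf": 0.0}
x_a, s_a = 5.0, {"m": 0.0, "v": 0.0, "t": 0}
for _ in range(200):
    x_r = rmsprop_momentum_step(x_r, 2 * x_r, s_r)
    x_a = adam_step(x_a, 2 * x_a, s_a)
assert abs(x_r) < 5.0 and abs(x_a) < 5.0
```

Structurally the two are near-identical: a second-moment normalizer plus a momentum term, which is the sense in which Adam's novelty is overstated relative to its attribution.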


I think the most novel part of it, and where a lot of the power comes from, is the key-based attention, which then operationally gives rise to the emergence of induction heads (whereby pairs of adjacent layers coordinate to provide a powerful context lookup-and-copy mechanism).

The reusable/stackable block is of course a key part of the design since the key insight was that language is as much hierarchical as sequential, and can therefore be processed in parallel (not in sequence) with a hierarchical stack of layers that each use the key-based lookup mechanism to access other tokens whether based on position or not.

In any case, if you look at the seq2seq architectures that preceded it, it's hard to claim that the Transformer is really based on or evolved from any of them (especially the prevailing recurrent approaches), notwithstanding that it obviously leveraged the concept of attention.

I find the developmental history of the Transformer interesting, and wish more had been documented about it. It seems from interviews with Uszkoreit that the idea of parallel language processing based on a hierarchical design using self-attention was his, but that he was personally unable to realize the idea in a way that beat other contemporary approaches. Noam Shazeer was the one who then took the idea and realized it in the form that would eventually become the Transformer, but it seems there was some degree of throwing the kitchen sink at it, followed by a later ablation process to minimize the design. What would be interesting to know is an honest assessment of how much of the final design was inspiration and how much was experimentation. It's hard to imagine that Shazeer anticipated the emergence of induction heads when this model was trained at sufficient scale, so the architecture does seem to be at least partly an accidental discovery, and more than the next-generation seq2seq model it seems to have been conceived as.


Key-based attention is not attributable to the Transformer paper. First paper I can find where keys, queries, and values are distinct matrices is https://arxiv.org/abs/1703.03906, described at the end of section 2. The authors of the Transformer paper are very clear in how they describe their contribution to the attention formulation, writing "Dot-product attention is identical to our algorithm, except for the scaling factor". I think it's fair to state that multi-head is the paper's only substantial contribution to the design of attention mechanisms.

I think you're overestimating the degree to which this type of research is motivated by big-picture, top-down thinking. In reality, it's a bunch of empirically-driven, in-the-weeds experiments that guide a very local search in an intractably large search space. I can just about guarantee the process went something like this:

- The authors begin with an architecture similar to the current SOTA, which was a mix of recurrent layers and attention

- The authors realize that they can replace some of the recurrent layers with attention layers, and performance is equal or better. It's also way faster, so they try to replace as many recurrent layers as possible.

- They realize that if they remove all the recurrent layers, the model sucks. They're smart people and they quickly realize this is because the attention-only model is invariant to sequence order. They add positional encodings to compensate for this.

- They keep iterating on the architecture design, incorporating best-practices from the computer vision community such as normalization and residual connections, resulting in the now-famous Transformer block.

At no point is any stroke of genius required to get from the prior SOTA to the Transformer. It's the type of discovery that follows so naturally from an empirically-driven approach to research that it feels all but inevitable.
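The order-invariance point in the third bullet is easy to verify directly: self-attention without positional encodings is permutation-equivariant, so shuffling the input tokens merely shuffles the output rows (a toy numpy sketch, single head, no learned projections):

```python
import numpy as np

def self_attention(X):
    # Single-head scaled dot-product self-attention, no projections.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # 6 tokens, 4 features
perm = rng.permutation(6)

# Permuting the tokens permutes the output identically: the layer has
# no notion of sequence order, hence the need for positional encodings.
assert np.allclose(self_attention(X[perm]), self_attention(X)[perm])
```

This is exactly why an attention-only model "sucks" until positions are injected some other way.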

