Hacker News | svat's comments

This is cool, good luck with your project! For this page, you may want to pick as the default examples some longer words (like say "nevertheless") that show both positive (odd number) and negative (even number) patterns.

Also for anyone else reading this, Frank Liang's thesis "Word Hy-phen-a-tion by Com-put-er" (https://tug.org/docs/liang/) is a great read, and the data structure it uses (packed tries) is clever too. The section of the TeX program that describes hyphenation is also an interesting read, and Knuth added a further twist (what we may call a hash-packed trie) in his frequent-words literate program for Bentley's column.
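For anyone who wants to play with the odd/even rule mentioned above, here is a minimal sketch of how Liang-style patterns combine. It makes several assumptions: the names `parse_pattern` and `hyphenate` are mine, the three patterns are invented for illustration (they are not taken from any real pattern file), and it scans with `str.find` instead of the packed trie from the thesis. It also ignores the minimum fragment lengths that real implementations enforce at the start and end of a word.

```python
def parse_pattern(pattern: str) -> tuple[str, list[int]]:
    """Split a pattern such as 'a1b2c' into its letters and inter-letter levels.

    The returned list has len(letters) + 1 entries: entry k is the level of
    the gap just before letters[k], and the last entry is the gap after the end.
    """
    letters = ""
    levels = [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    return letters, levels


def hyphenate(word: str, patterns: list[str]) -> str:
    """Insert '-' wherever the highest matching level at a gap is odd."""
    padded = "." + word + "."          # '.' marks the word boundaries
    gaps = [0] * (len(padded) + 1)     # gaps[i] = level of the gap before padded[i]
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            # Where several patterns cover the same gap, the largest level wins.
            for offset, level in enumerate(levels):
                gaps[start + offset] = max(gaps[start + offset], level)
            start = padded.find(letters, start + 1)
    pieces = []
    for i, ch in enumerate(word):
        # word[i] is padded[i + 1], so the gap before it is gaps[i + 1].
        if i > 0 and gaps[i + 1] % 2 == 1:
            pieces.append("-")
        pieces.append(ch)
    return "".join(pieces)


# Invented patterns, purely to show levels overriding one another.
print(hyphenate("zero", ["e1r"]))                  # ze-ro (level 1 allows a break)
print(hyphenate("zero", ["e1r", "2ro."]))          # zero  (level 2 forbids it)
print(hyphenate("zero", ["e1r", "2ro.", "ze3r"]))  # ze-ro (level 3 allows it again)
```

The point of the numbered levels is that a later, more specific pattern can override an earlier, more general one in either direction, which is why words where the patterns disagree make the more interesting examples.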


Thanks for the suggestion, implemented it! Now all suggestions have at least one hyphen and one position where the patterns disagree (one pattern says to hyphenate, another says not to). I discovered you don't necessarily need long words to get "interesting" hyphenation results; e.g. https://hyphenate.dev/zero.

I think by "have 32 bit factors" the GP actually means "has a factorization as a product of two 32-bit numbers", rather than the natural meaning that doesn't work, as you pointed out.


It's related, but not the same thing. For example, for b=10, the number 70=2x5x7 is b-smooth, but it cannot be written as the product of two numbers less than b. Here are the other b-smooth (counter)examples for b=10:

    | n   | factorization  | products of two numbers
    |-----|----------------|------------------------------------
    | 50  | 2 * 5^2        | 1x50, 2x25, 5x10
    | 60  | 2^2 * 3 * 5    | 1x60, 2x30, 3x20, 4x15, 5x12, 6x10
    | 70  | 2 * 5 * 7      | 1x70, 2x35, 5x14, 7x10
    | 75  | 3 * 5^2        | 1x75, 3x25, 5x15
    | 80  | 2^4 * 5        | 1x80, 2x40, 4x20, 5x16, 8x10
    | 84  | 2^2 * 3 * 7    | 1x84, 2x42, 3x28, 4x21, 6x14, 7x12
    | 90  | 2 * 3^2 * 5    | 1x90, 2x45, 3x30, 5x18, 6x15, 9x10
    | 96  | 2^5 * 3        | 1x96, 2x48, 3x32, 4x24, 6x16, 8x12
    | 98  | 2 * 7^2        | 1x98, 2x49, 7x14
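The list above can be reproduced by brute force. This is only a sketch under two assumptions of mine: "b-smooth" is taken to mean that every prime factor is less than b, and the search covers n < b^2 (the range a product of two numbers below b can reach). The function names are hypothetical.

```python
def is_smooth(n: int, b: int) -> bool:
    """True if every prime factor of n is less than b."""
    for p in range(2, b):
        # Dividing by composite p is harmless: its prime factors are already gone.
        while n % p == 0:
            n //= p
    return n == 1


def is_product_of_two_below(n: int, b: int) -> bool:
    """True if n = x * y for some 1 <= x, y < b."""
    return any(n % x == 0 and n // x < b for x in range(1, b))


def smooth_counterexamples(b: int) -> list[int]:
    """b-smooth numbers below b*b that are not a product of two numbers below b."""
    return [
        n
        for n in range(1, b * b)
        if is_smooth(n, b) and not is_product_of_two_below(n, b)
    ]


print(smooth_counterexamples(10))
# prints [50, 60, 70, 75, 80, 84, 90, 96, 98]
```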


The word “most” is indeed not very surprising if taken to mean >50% (as you do), but the surprising fact (admittedly not relevant for 2^64 in the quoted sentence) is that “most” can be replaced with “almost all” in the technical sense meaning “tending to 1” i.e. “1 - o(1)”, i.e. “eventually greater than 1-epsilon for any epsilon”.

Here's the multiplication table up to 9 (so for n=10 in place of n=2^64), and it already contains only 37 distinct products among its 100 entries:

     × |  0  1  2  3  4  5  6  7  8  9
    ---+------------------------------
     0 |  0  0  0  0  0  0  0  0  0  0
     1 |  0  1  2  3  4  5  6  7  8  9
     2 |  0  2  4  6  8 10 12 14 16 18
     3 |  0  3  6  9 12 15 18 21 24 27
     4 |  0  4  8 12 16 20 24 28 32 36
     5 |  0  5 10 15 20 25 30 35 40 45
     6 |  0  6 12 18 24 30 36 42 48 54
     7 |  0  7 14 21 28 35 42 49 56 63
     8 |  0  8 16 24 32 40 48 56 64 72
     9 |  0  9 18 27 36 45 54 63 72 81
That is, in decimal, only 37% of (up to) two-digit numbers can be written as products of two one-digit numbers. This fraction, which drops to 28% at n=100, only drops to 17% at n=2^64 (per the article). So it decreases VERY slowly, and it's nontrivial that it actually goes to 0.
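The count for the table above is easy to check by collecting the products into a set. A minimal sketch follows; the function name `distinct_products` is mine, and the factors run from 0 to n-1 to match the table.

```python
def distinct_products(n: int) -> int:
    """Number of distinct values a * b with 0 <= a, b < n."""
    return len({a * b for a in range(n) for b in range(n)})


for n in (10, 100):
    count = distinct_products(n)
    # Fraction of the n*n possible values below n^2 that appear in the table.
    print(n, count, f"{count / n**2:.0%}")

# The first line printed is: 10 37 37%
```

Brute force like this is only feasible for small n, since the table has n^2 entries.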


It's an interesting fact, but it's weakened a lot by happening so slowly and not matching the everyday definition of "almost all".

And it's weakened even more by realizing that while you can get the raw fraction as low as you want, shrinking your list of products by n digits requires numbers with an exponential number of digits.


I think your second paragraph is the same as “so slowly”, just quantified. And I'd argue that the technical meaning of “almost all” does match the everyday definition: if you told someone that “for sufficiently large N, almost all 2N-bit numbers are not the product of two N-bit numbers”, they'd probably think of “almost all” as a fraction like 99.99% or 99.999999% or whatever, and whatever fraction they picked, the statement would be true (with the threshold for “sufficiently large” depending on the fraction they picked).

So whether “only 17%” is interesting or not depends on whether you see it as a stand-in for “less than half”, or “a number close to 0”.

(Posting this comment mainly to correct an error in my previous comment: in both places that I wrote “n=2^64” I should have instead written “n=2^32”.)


> if you told someone that “for sufficiently large N, almost all 2N-bit numbers are not the product of two N-bit numbers”, they'd probably think of “almost all” as a fraction like 99.99% or 99.999999% or whatever, and whatever fraction they picked, the statement would be true (with the threshold for “sufficiently large” depending on the fraction they picked).

I think people would agree with that, yes. But that's a significantly weaker claim than your original one. The original version was "tends to 1 at large n" = "almost all", but this version is that once you reach large n it's "almost all". These different tests give completely different answers for the numbers you'd ever actually use.

And entirely separate from that, if you laid it out as "for 8 million* digit numbers, only one in a billion are products of 4 million digit numbers, so only enough to fill out 7,999,991 digits", I don't know if that really qualifies for "almost all" anymore. The fraction of hits is important, but so is the fraction of digits and entropy, and as you make the numbers bigger you approach 0.0% loss of digits and entropy.

* Placeholder number, I did not do the actual calculation here.


See “The Concept of a Meta-Font”: https://gwern.net/doc/design/typography/1982-knuth.pdf


An Applied Mathematician's Apology by Nick Trefethen[1] has a chapter (a couple of pages) titled "Cleve Moler and Matlab".

> I first met Cleve Moler when I was a graduate student and he visited Stanford, where his loud and friendly voice reverberated around Serra House. Moler is the antithesis of a European, and as a transatlantic soul, I love both Europeans and their antitheses. A room with Moler in it is a no-nonsense zone. He has no interest in showing you how your problem is connected with the theory of pseudodifferential operators. He just wants to get things done computationally, and nobody has done it better. Moler is about the same age as Knuth, and while Knuth was writing his great books on the analysis of discrete algorithms, Moler was creating the modern era of numerical software. He was an author of both of the foundational software packages of the 1970s, EISPACK and LINPACK, and he also published two influential software-based numerical analysis textbooks. And then, in around 1977 in the Computer Science department at the University of New Mexico, he invented Matlab, which changed the world.

I never liked using Matlab (the little I used it), but after reading this I understood better what its innovation was: “All the right algorithms would be invoked in all the right places, without the user needing to know the details.”

A footnote I found interesting:

> For me `eig(A)` epitomizes the successful contribution of numerical analysis to our technological world. Physicists, chemists, engineers, and mathematicians know that computing eigenvalues of matrices is a solved problem. Simply invoke `eig(A)`, or its equivalent in whatever language you are using, and you tap into the work of generations of numerical analysts. The algorithm involved, the QR algorithm, is completely reliable, utterly nonobvious, and amazingly fast. On my laptop, for a 1000 × 1000 matrix A, `eig(A)` computes all 1000 eigenvalues in half a second.

[1]: https://sites.math.rutgers.edu/~zeilberg/akherim/NickApology...


Actually:

• He had already published the first editions of Volumes 1, 2, and 3, and the second edition of Volume 1, by 1973. In 1977 the publishers sent him galley proofs for the second edition of Volume 2, having switched to phototypesetting (away from hot-metal typesetting a la Linotype, though IIRC it was actually Monotype), and he was disappointed with the results. He had some back-and-forth with them and they did improve their fonts (https://tex.stackexchange.com/a/367133/48), but he was still dissatisfied.

> I didn't know what to do. I had spent 15 years writing those books, but if they were going to look awful I didn't want to write any more.

• At this time he came to know of the existence of digital typesetters. Typesetting with computers had existed before, but it had always seemed a crude toy, rather than something suitable for “real books”. But he saw Patrick Winston's Artificial Intelligence, which had just been published (I think he got an early proof copy to review or something), and he realized for the first time that digital typesetting was an option (apparently Winston's book was printed at >1000dpi, and Knuth later got his hands on a machine that claimed a resolution of 5333 dpi: see this wonderful comment from Knuth's student and “right-hand man”, David Fuchs: https://news.ycombinator.com/item?id=20009875).

• In fact it was the fonts that he was dissatisfied with rather than the typesetting, so METAFONT was in some sense the primary/motivating project and TeX was only written in order to be able to use METAFONT.

• Actually his first idea was to simply take the old fonts, get high-resolution scans of them (not easy to obtain at that time) and use them directly. He approached Xerox Research Center but:

> I asked if I could use Xerox's lab facilities to create my fonts. The answer was yes, but there was a catch: Xerox insisted on all rights to the use of any fonts that I developed with their equipment. Of course that was their privilege, but such a deal was unacceptable to me: A mathematical formula should never be "owned" by anybody! Mathematics belongs to God.

• So he went home and (after trying a bit with TV cameras) tried projecting photographs of the pages onto the wall and tracing the outlines, and it was while staring at these images that he realized that the shapes of letters were not arbitrary but there was some logic to them (e.g. in the font he was using, the spacing between the vertical strokes in 'm' was equal, and equal to that in 'n'), and he decided (as a computer programmer) to capture this design in code — something that had never before been done. The hardest letter to capture this way is S, hence the paper in the OP.

> Finally, a simple thought struck me. Those letters were designed by people. If I could understand what those people had in their minds when they were drawing the letters, then I could program a computer to carry out the same ideas. Instead of merely copying the form of the letters my new goal was therefore to copy the intelligence underlying that form. I decided to learn what type designers knew, and to teach that knowledge to a computer.

• This is also why METAFONT never really caught on among typographers: as Charles Bigelow (quoted by Richard Southall, https://luc.devroye.org/Southall-METAFONT1986.pdf) observed, “the designer thinks with images, not about images”. Knuth did not want crude “geometric” constructions of letters (as some prior 16th century typographers had attempted: https://www.ams.org/journals/bull/1979-01-02/S0273-0979-1979... and as some typographers only passingly familiar with METAFONT think!). He wanted actual real typographically beautiful shapes, but to be able to generate those shapes with code. This is obviously much harder than simply drawing the shapes using visual intuition, even if it enables variation. (See “The Concept of a Meta-Font”: https://gwern.net/doc/design/typography/1982-knuth.pdf — again, many people in the typography world confuse the abstract concept of a meta-font introduced in this paper with (their incorrect impressions of) the METAFONT program, and omit crediting Knuth for variable fonts).

• The second edition of Volume 2 was not printed with Linotype. Yes the machines still existed in Europe and he talked to typesetters (he mentions in particular a person from Belfast), but it was in fact published using TeX (the first version, TeX77 and MF78). He was still unhappy with the results, though, and spent a few more years learning more about typography and working with people like Bigelow and Hermann Zapf, before the rewrite into the current TeX82 and MF84 (and current version of Computer Modern). I think it's only with the third edition (1997) that he's finally satisfied.


Thanks for this great comment!

The motivation behind METAFONT is amusing to me because it seems to have some of the same hubris of the most extreme AI proponents nowadays: we can replace art by technology. I'm fascinated with TeX (and have spent a lot of my life rewriting it http://github.com/jamespfennell/texcraft) but I always found the situation with fonts in the TeX ecosystem a bit odd. There are people in our society whose vocation is font design (e.g. https://en.wikipedia.org/wiki/Robert_Slimbach). But the TeX ecosystem landed in a place where we use fonts created by computer scientists rather than font designers.


Hi (looks like I've already starred your repo at some point months/years ago)!

The motivation was not to replace art with technology, but to preserve/resurrect an art that was going away, by truly capturing the human understanding[1]. When the rest of the industry was perfectly content with the deterioration of typesetting, Knuth set out to capture the aesthetics of the best journals of the past. A quote from the Mathematical Typography paper I linked above (https://websites.umich.edu/~millerpd//docs/501_Winter13/Knut...):

> At this point I regretfully stopped submitting papers to the American Mathematical Society, since the finished product was just too painful for me to look at. Similar fluctuations of typographical quality have appeared recently in all technical fields, especially in physics where the situation has gotten even worse.

Frankly, I think the "replace art by technology" impression is a very shallow one, that I alluded to earlier. When Knuth wrote his “The Concept of a Meta-Font” in a journal (Visible Language) mostly read by designers/typographers, many of them wrote letters in response (https://shreevatsa.net/tex/metafont/concept#reactions). What you can see is that the best of them were supportive (even bringing up new points like how it could be useful in educating the next generation of font designers), but some were sharply critical, more or less resenting this intrusion of technology into their art medium. But now a few decades later, basically all fonts are distributed and stored digitally anyway, except that (without METAFONT) the shapes of letters are now basically just stored as binary blobs / sequences of numbers, without any METAFONT-like understanding of typographically relevant quantities like (say) x-height, comma depth, slab thickness, etc. Which one is truer to the art?

(Not a rhetorical question BTW: as in the Bigelow/Southall quote above, one could say that Knuth's approach is to achieve typographical/artistic excellence through understanding, but the artistic approach is visual and intuitive without a cognitive component. But this is a different complaint from the "replace art by technology" take.)

(BTW apart from the default Computer Modern fonts designed by Knuth, who based them on earlier Monotype fonts, almost all fonts that people use with TeX are designed by font designers, not computer scientists.)

[1]: Related quote from Knuth (sorry paraphrasing from memory): “People say that the best way to understand something is to teach it. I say no: the best way to understand something is to teach it to a computer.” But then again he has also said: "Science is what we understand well enough to explain to a computer. Art is everything else we do."


The idea wasn't to replace type designers, but to create a new tool for artists to use.


I had a slightly different reaction, though I'm only going by what I read in this thread. It wasn't to replace art with technology for everyone, but to scratch a personal itch. He liked the artist-made typography just fine, but it was going away regardless and that was demotivating to him. I think this is in the finest tradition of hacks, even if it took decades.


The first METAFONT was Computer Modern, which (short version) was a re-creation of Monotype 8A.


Have you considered having a detailed version history for each book (etext)? The process of submitting fixes to typos etc in books involves sending an email (https://www.gutenberg.org/help/errata.html) and although the last time I did this (2011) the fixes did get applied reasonably quickly (couple of days), it all felt a bit opaque. The version history could also include the project (usually PGDP correct?) the etext originated from; that way one would be able to compare against the actual page scans.

I have very mixed feelings about Standard Ebooks and would much prefer being able to use Project Gutenberg directly, but one good thing Standard Ebooks does is that every book has an associated git repository (on GitHub), so it's (in principle) possible to see a history of fixes to the text over time.


We're using git repos internally to keep history for each book. They existed on GitHub for a while, but our implementation was awkward, and too big a project for the volunteer dev team. But it's likely that we'll evolve towards that.


> I have very mixed feelings about Standard Ebooks[…]

Why?


I was hoping to reply to this in detail but as I never got around to it, I'll keep it short: mostly it's about the editorial changes they make to the text, modernizing spelling etc. Many of the changes are unjustified IMO, and often detract from the charm of the original, and I'm uncomfortable reading a text I know has been tampered with in this way. Of course it's their project and they can do whatever they want, and they clearly love books, so with strong opinions there will be some that I may disagree with. I'd much rather read books from Project Gutenberg or Wikisource, both of which don't even correct obvious typos without marking up in some way that they've done so.

I also have many positive things to say about Standard Ebooks, but I don't think you were asking about those. :)

----

Edit: Without going into what I think are the most egregious sort of changes they introduce (which I think will require a longer post) and limiting myself to ones easy to find immediately:

See the earlier discussion (linked in a sibling comment here) where the editor-in-chief says it's ok to change punctuation because "The sounds out of his mouth do not include an apostrophe whether it's there in the spelling or not." (a very American view IMO): https://news.ycombinator.com/item?id=16956931

And looking at one of their books, here's a recent commit (https://github.com/standardebooks/agatha-christie_the-secret...) that reverts one of their aggressive "modernizations" from 2024 (https://github.com/standardebooks/agatha-christie_the-secret...). In line with their usual practice, that 2024 commit had changed "every one" to "everyone" (in one place even when referring to "a good many risks"), and it made other changes too, including one still present: in "they ought to have it lithographed. It must be a frightful nuisance doing every one separately." the last four words were turned into "doing everyone separately."!


On the “every one” example, that’s a definite mistake that shouldn’t have made its way into the book in the first place. The production process has a specific step for “every one” (https://standardebooks.org/contribute/producing-an-ebook-ste...) that guides producers through making the correct choices when modern usage has two different possible choices. It shouldn’t have happened, but it’s a mistake that was fixed at least.


Your comment makes it sound as though the mistake was introduced by an inexperienced contributor who did not read the guide, when in fact it was introduced by the founder/editor-in-chief of the project. :) And in case it wasn't clear, only one of the mistakes was reverted, and the other one I quoted is still present in the book even as of this moment.

More broadly, the position of Standard Ebooks is that a modern reader would be distracted by spellings like "some one" and "every thing", and by time written like "2.30" instead of "2:30", and that books in British quotation style must be converted to American quotation style. I think most readers can in fact tolerate such small differences, and this position is frankly insulting — the punctuation and spelling of works are part of their character, and if anything, I'm more distracted by such anachronisms in style introduced as part of the Standard Ebooks process.


And to be honest, that position is totally reasonable, and the good thing is that you have the option of Gutenberg, Faded Page, and a bunch of other archival sites, also for free, if you don’t want that.

But nearly all print publishers also do what SE does. Why do you think they do, when it costs additional money and time to do that? A reasonable answer is that some, or a majority of, people prefer it.


> But nearly all print publishers also do what SE does.

Do they? To check, I tried to find a recent publication of Agatha Christie, and found the collection “Country Christie: Twelve Devonshire Mysteries” which says “First published by HarperCollins Publishers Ltd 2025”. It still has British-style punctuation (throughout the book), and times like “1.30”, “9.30”, “11.30”, “7.30 a.m.”, “12.30 p.m.”, and “8.30”. I checked a couple of other recent publications and admittedly they do modernize (though not in phrases like “every one of you”), but again I found the collection “The Last Seance: Haunting Tales from the Queen of Mystery” (2019) which does not. So it seems mixed.

In any case, I think it's fine to do what Standard Ebooks does, and if it were instead called something like “Modernized Ebooks with American punctuation”—if readers would know before picking one up—it would be totally unobjectionable. The name “Standard” gives the wrong impression. It's a bit like colorizing old black-and-white movies (or dubbing foreign-language movies instead of subtitling them): yes possibly even a majority of people may prefer it, but IMO it would be good to be more explicit what has been done.


For one, it splits the community and the number of possible volunteer hours. It also splits the canon into different versions. More projects fight for the attention (and possibly donations) of the audience.

There are lots of reasons it could be preferable to centralize. OTOH their mission is limited and some competition is healthy, if only to explore alternative ways to do things.


It’s a different mission.

PG focuses on an accurate digital translation of the source material, sometimes hosting multiple different versions of the same text, and doing things like putting work into recreating the adverts at the back of some novels.

SE focuses less on preservation and more on making readers’ versions of the texts, like other publishing imprints. So there’s typography standardisation, a light-touch modernisation of hyphenation and soundalike spelling, and things like author-wide collections of short fiction and poetry even if none previously existed.

Both are valuable, but they serve different segments.


Not the GP, but I also have mixed feelings about Standard Ebooks. They modernise texts for American readers. This means changing the punctuation, merging some words, altering the syntax, etc.

When I read an old novel, written two centuries ago in England, the little differences to modern English are part of the charm, and I certainly don't want any Americanism mixed in. For one of my favorite novels, The Forsyte Saga, the author deliberately used some rare forms of words, which SE replaced with the mainstream forms.


SE editor in chief here. What you describe is incorrect. The only thing we do is very light sound-alike spelling modernization, like "to-night" -> "tonight". We do not do things like change from en-GB to en-US, replace old words with different modern words, or change text for "American readers", whatever that means. I have no idea where you got that impression.

I personally worked on the Forsyte saga. If you think something was done in error, please let us know and we'll be happy to fix it.


I commented on this kind of editing several years ago:

https://news.ycombinator.com/item?id=16957359

The edit is still in place, and I still maintain that changing 'phone to phone in dialogue changes the meaning.


Yeah, that edit clearly changes the meaning of the text.


> The only thing we do is very light sound-alike spelling modernization, like "to-night" -> "tonight".

Curious. Why even bother?


Guess: screen readers and such.


One could argue that this falls into the previous poster's thought about "the little differences to modern English are part of the charm" ...


You may already be aware, but SE marks all commits making those kinds of changes as '[Editorial]', so it is generally trivial to use their tooling to build your own high-quality ebook without any of the editorial changes.


When I tried this in the past, it was non-trivial because the editorial changes are mixed with the technical changes. Reverting the editorial changes broke the technical changes.


SE sounds truly, truly awful. Thanks for making me aware of its existence so I can avoid it.


They're providing beautifully made ebooks for free...

The only thing they are is truly, truly wonderful.


But why not be true to the original author's text? What's the need to modify it?


Not parent, but while I can appreciate your viewpoint, I would like to point out that many many many books have abridged, reworded, simplified, or disambiguated versions for different audiences.

The Bible is I daresay the most famous of these. Translations aside, even the English versions have had significant alterations done to wording, spelling, and meaning depending on the version.

There's also the Great Illustrated Classics imprint for certain classic novels like H.G. Wells's The Invisible Man. (I read that one like 10 times as a kid and it's what got me into sci-fi as a whole I'd argue. Haha.)

Whether these alternate versions are good or bad is obviously up for debate and depends on the person, but I'm just saying that what SE does is hardly new in the publishing world.


SE is an amazing and wonderful resource


I believe our new-ish CEO Eric Hellman actually did some work on something very similar


That's an interesting idea. Not a small feat to accomplish though ...


In what way? And from what sources? (Wikipedia as a tertiary source is supposed to be a summary of information present in reliable secondary sources — see for instance https://en.wikipedia.org/wiki/Wikipedia:Based_upon. So if the information on the Wikipedia article is incomplete or out of date, where is the correct information available?)


There's quite a lot of information here: https://www.gutenberg.org/about/ All our text is now utf-8. No Plucker! Almost every book is HTML(5).


good question. Eric - any pointers?


Curious, what are the advantages you see in each relative to the other?

Also one should probably compare the former to the single-page version on standardebooks: https://standardebooks.org/ebooks/william-shakespeare/romeo-...


Personally I find the formatting used by the Gutenberg one to be a lot nicer/easier to read, despite (or perhaps because of) being simpler, more plain.

At least for the first few pages of content that I looked at on both versions.

