Pieces is (correction: used to be, prior to the AI slopification) an app for storing code snippets, so I think you can imagine the general idea of, e.g., "cool API usage example from a YouTube video, let me screenshot it!"
> To best support software engineers when they want to transcribe code from images, we fine-tuned our pre-processing pipeline to screenshots of code in IDEs, terminals, and online resources like YouTube videos and blog posts.
Even with these examples that seems like a very narrow use case.
It worries me that making stuff like this easier will normalize wacky data pipelines (pulling display output off systems and "scraping" it for data of dubious quality, versus just building a proper interface). The kind of crowd that likes "low code" tools like MSFT's "Power Automate" is going to love making Rube Goldberg nightmares out of tools like this.
It fills me with a deep sadness that we created deterministic machines and then, through laziness, exploit every opportunity to "contaminate" them with sloppy practices that make them produce output with the same fuzzy inaccuracy as human brains.
Old-man-yells-at-neural-networks take: we're entering a "The Machine Stops" era where nobody is going to know how to formulate basic algorithms.
"We need to add some numbers. Let's point a camera at the input, OCR it, then feed it to an LLM that 'knows math'. Then we don't have to figure out an algorithm to add numbers."
I wish compute "cost" more so people would be forced to actually make efficient use of hardware. Sadly, I think it'll take mass societal and infrastructure collapse for that to happen. Until it does, though, let the excess compute flow freely!
Has anyone tried feeding the admittedly noisy OCR-ed text (at a document level) to an LLM to make sense of it? Presumably some of the less capable models should be quite affordable, and accurate at scale as well.
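For what it's worth, a minimal sketch of what that cleanup step might look like. The prompt wording and the helper name are my own assumptions for illustration, not any particular vendor's API; any instruction-following model, including a small and cheap one, could be slotted in behind the prompt:

```python
def build_cleanup_prompt(ocr_text: str) -> str:
    """Wrap noisy document-level OCR output in a correction prompt.

    Hypothetical example prompt: the idea is that passing the whole
    document gives the model enough context to fix character-level
    OCR confusions (0/O, 1/l, rn/m, ...) without rewriting content.
    """
    return (
        "The following text was extracted by OCR and may contain "
        "character-level errors (0/O, 1/l, rn/m, etc.). Reproduce the "
        "text with those errors corrected, changing nothing else:\n\n"
        + ocr_text
    )

# The noisy OCR output goes through whole, so document-level context
# is available to whatever model consumes the prompt.
prompt = build_cleanup_prompt("Tbe qulck brown f0x")
```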
Quite simply, you’re completely wrong. Modern Tesseract versions include a modern LSTM engine. It can very affordably be deployed on CPU, yet its performance is competitive with much more expensive large GPU-based models. Especially if you handle a high volume of scans, chances are that Tesseract will have the best bang for the buck.
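For reference, picking the LSTM engine is just a CLI flag (`--oem 1`). A minimal sketch, assuming the `tesseract` binary is installed and on PATH; the `--psm 6` choice (treat the image as a single uniform block of text) is my own example, not a universal best setting:

```python
import shutil
import subprocess

def ocr_image(path: str, use_lstm: bool = True) -> str:
    """Run the Tesseract CLI on an image and return its stdout text.

    --oem 1 selects the LSTM neural-net engine (CPU-friendly);
    --oem 0 would select the legacy engine, if its traineddata
    files are installed. Returns "" when tesseract is unavailable
    or produces no output.
    """
    if shutil.which("tesseract") is None:
        return ""  # tesseract binary not installed
    oem = "1" if use_lstm else "0"
    result = subprocess.run(
        ["tesseract", path, "stdout", "--oem", oem, "--psm", "6"],
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Batching a directory of scans through this on a CPU box is the kind of workload where Tesseract's cost profile shines.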
My company probably spent close to six figures overall creating Tesseract 5 custom models for various languages. Surya beats them all and is open source (and is quite a bit faster).
Surya weights for the models are licensed CC BY-NC-SA 4.0. They have an exception for small companies. If your company is not small, you either need to pay them or use the models illegally.
Their training code and data are closed source. They are barely open weight, and only inference is open source.
5.5.0 released November last year. Still a very active project as far as I can tell and runs on CPU. Even compared to best open source GPU option it is still pretty good. VLMs work very differently and don't work as well for everything. Why is it out of date?
Surya weights for the models are licensed CC BY-NC-SA 4.0, so not free for commercial usage. Also, as far as I know, the training data is 100% unavailable. Given they use well-trained but standard models, it isn't really open source and is barely, maybe, open weight. I kinda hate how their repo says GPL, because that is only true for the inference code. The training code is closed source.
I’ve tested a bunch of vision models on particularly difficult documents (handwritten in a German script that’s no longer used), and I have yet to be impressed. They’re good at BSing to the point that you almost think they nailed it, until you realize that it’s mostly/all made-up text that doesn’t appear in the document.
Is it, though? If the important parts of the code are new, does it matter that other parts are older or derived from older code? (Of course, I think this whole line of thought is pointless; what matters is not age, but how well it works, and tesseract generally does seem to work.)
Not the actual implementations, heh... I heard even Linus has dropped support for the 486. Even the infra is finally giving way. Did you see the NVLink Spine announcement a few days ago? It's going to be deployed in the Stargate UAE that was announced Thursday.
No son, Linux is not a version of Unix, any more than MINIX is.
NeXTStep was real UNIX, but macOS is not.
BTW, I was taught to program in C by one of the original core Unix team members and I worked for DEC long before I could have discussed TesseractOCR with people who didn't. Keep those ignorant downvotes commin'
Perhaps the specific idea is to harvest coding textbooks as training data for LLMs?