DOGE employee

SwordInStone@lemmy.world · 2 months ago

DOGE employee

tetris11@lemmy.ml · edit-2 2 months ago

I have to admit, PDF parsing being such a hot and profitable topic in computer science was really something I never saw coming.

PDFs? The things you can select text from? And when not, there’s decent OCR? And when not, you just ask the person to send you an email or a word doc?

It sounds like LLMs are looking for a new unpolluted source of historical data that they can learn from, and this source exists in the form of old scanned-in paper documents. That’s the only reason I can fathom as to why this is such a big thing now.

chicken@lemmy.dbzer0.com · 2 months ago

Every time I try to convert a PDF to epub or something, or OCR one that doesn’t actually have selectable text, it turns out shit. I assume the real reason people would want to get LLMs involved is that there is actually a lot of ambiguity in what a correct conversion would be, and there are a lot of PDFs out there.

JustAnotherKay@lemmy.world · edit-2 2 months ago

I self-host Stirling-PDF and I haven’t had an issue with file conversion in… When did I set this thing up?

To be truthful, the machine I had it running on has been sent to the grave (I sold it), so I don’t actually have this service running right now.

sudo@programming.dev · 2 months ago

Training the most insane AI model on classified federal documents.

MonkderVierte@lemmy.ml · 2 months ago

Selecting text doesn’t work in most multi-column PDFs, and good OCR costs money. And if the original source is lost and you want an exact copy in Word, the OCR tools need to be really good at guessing the whitespace-to-line ratio, because PDF is only an output format and not a processing format.
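To illustrate the multi-column problem, here is a minimal Python sketch. It assumes a page has already been reduced to text runs with page coordinates; the `Fragment` type, the coordinates, and the `gutter_x` parameter are all made up for illustration, and no real PDF library is involved. Reading the runs purely by position interleaves the two columns, while knowing where the gutter sits recovers the intended order. Real documents are much messier, since the gutter has to be guessed.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A hypothetical positioned text run, roughly what a PDF page boils down to."""
    x: float  # left edge of the run
    y: float  # distance from the top of the page
    text: str


def naive_order(fragments: List[Fragment]) -> List[str]:
    """Read strictly top-to-bottom, then left-to-right, ignoring columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments: List[Fragment], gutter_x: float) -> List[str]:
    """Read the whole left column first, then the right one.

    gutter_x is an assumed, already-known x position between the columns.
    """
    left = sorted((f for f in fragments if f.x < gutter_x), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= gutter_x), key=lambda f: f.y)
    return [f.text for f in left + right]


# Invented two-column page: two lines per column.
page = [
    Fragment(50, 100, "Left column, line 1"),
    Fragment(320, 100, "Right column, line 1"),
    Fragment(50, 120, "Left column, line 2"),
    Fragment(320, 120, "Right column, line 2"),
]

if __name__ == "__main__":
    print(naive_order(page))        # columns interleaved line by line
    print(column_order(page, 300))  # left column, then right column
```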

For most other converting needs, there’s pandoc, imagemagick and ffmpeg.
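As a rough sketch of how those three tools split the work, the Python snippet below builds the basic command line for a conversion without running anything. The extension-to-tool table and the `conversion_command` helper are hypothetical; the invocations are the plain `pandoc IN -o OUT`, `magick IN OUT` and `ffmpeg -i IN OUT` forms. It assumes ImageMagick 7 (older installs use `convert` instead of `magick`).

```python
from pathlib import Path
from typing import List

# Hypothetical table: which tool usually handles which input extension.
TOOL_BY_EXTENSION = {
    ".md": "pandoc", ".docx": "pandoc", ".html": "pandoc", ".epub": "pandoc",
    ".png": "magick", ".jpg": "magick", ".webp": "magick",
    ".mp4": "ffmpeg", ".mkv": "ffmpeg", ".mp3": "ffmpeg",
}


def conversion_command(src: str, dst: str) -> List[str]:
    """Return an argv list for converting src to dst; nothing is executed."""
    tool = TOOL_BY_EXTENSION.get(Path(src).suffix.lower())
    if tool is None:
        raise ValueError(f"no converter known for {src!r}")
    if tool == "pandoc":
        return ["pandoc", src, "-o", dst]
    if tool == "ffmpeg":
        return ["ffmpeg", "-i", src, dst]
    # ImageMagick 7 entry point; assumed, see the note above.
    return ["magick", src, dst]


if __name__ == "__main__":
    print(conversion_command("notes.md", "notes.epub"))
    print(conversion_command("talk.mp4", "talk.mp3"))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the relevant tool is installed.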