You will see that I have posted about this before, asking for suggestions on which software I can use to convert PDF to docx/odt.
I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw upon them to teach my students. I often have to take the text, modify it, or build upon it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft of mine was stored as a PDF. Converting these files to text has become the bane of my life.
I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep suggesting pandoc, but pandoc does not convert PDF to any other format. PDF can only be the output format.
Is there a magic open source solution that I have missed?
The problem lies in the PDFs themselves. They contain objects that represent lines of glyphs, if you are lucky. A conversion tool can guess which of those lines belong together and produce the text.
It cannot know the intentions behind them, though. Take a numbered list. The first line of an item consists of two line objects: the number plus the "." or the ")", and the first line of text. The conversion tool can only guess. As the line blocks with the numbers are all to the left of the line blocks with text, this could be a numbered list. Or it could be a table with two columns. Nothing in the PDF gives any hints.
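To make the guessing concrete, here is a minimal sketch of the kind of heuristic a conversion tool might apply. It is not taken from any real converter: `LineBlock`, `group_rows` and `guess_structure` are hypothetical names, and the coordinates are made up. It assumes PDF-style coordinates where y grows upward.

```python
import re
from dataclasses import dataclass

@dataclass
class LineBlock:
    x: float    # left edge of the block
    y: float    # baseline position (larger y is higher on the page)
    text: str

# A block that looks like a list marker: digits followed by "." or ")"
MARKER = re.compile(r"\d+[.)]")

def group_rows(blocks, y_tolerance=2.0):
    """Group blocks whose baselines lie within y_tolerance of each other."""
    rows = []
    for block in sorted(blocks, key=lambda b: (-b.y, b.x)):
        if rows and abs(rows[-1][0].y - block.y) <= y_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    # Order each row left to right
    return [sorted(row, key=lambda b: b.x) for row in rows]

def guess_structure(blocks):
    """Guess what a set of blocks is; the PDF itself does not say."""
    rows = group_rows(blocks)
    if not rows or any(len(row) != 2 for row in rows):
        return "paragraph"
    if all(MARKER.fullmatch(row[0].text.strip()) for row in rows):
        # Still only a guess: a table whose first column holds "1.", "2." looks identical
        return "numbered_list"
    return "table"

listing = [
    LineBlock(72, 700, "1."), LineBlock(90, 700, "First item"),
    LineBlock(72, 686, "2."), LineBlock(90, 686, "Second item"),
]
print(guess_structure(listing))  # numbered_list
```

Swap the markers for ordinary words and the very same geometry is reported as a table, which is exactly the ambiguity described above.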
And that is the easy part. This assumes that the document either uses default fonts or keeps its embedded fonts untouched. If it uses embedded fonts and was run through a PDF optimizer that embeds only the characters actually used and renumbers them, any copy or conversion tool is bound to fail.
The same goes for protected PDFs, where you simply cannot copy the text in the first place.
And then there are PDFs that just consist of scanned pages. Here you would need OCR software to get something readable out of them.
PDF is an archival output format, the end of a process. It is not something to work from.
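A toy simulation of that renumbering, as a rough sketch only: the function names are hypothetical and real optimizers work on glyph tables, not Python strings. The assumption here is that the optimizer hands out new codes in order of first use and that no mapping back to Unicode (what PDF calls a ToUnicode map) survives in the file.

```python
def subset_font(text):
    """Hypothetical optimizer: give each used character a new code 1, 2, 3, ..."""
    code_for_char = {}
    for ch in text:
        if ch not in code_for_char:
            code_for_char[ch] = len(code_for_char) + 1
    return code_for_char

def encode(text, code_for_char):
    """What ends up in the page content: a sequence of renumbered codes."""
    return [code_for_char[ch] for ch in text]

def naive_copy(codes):
    """A copy tool without a mapping can only treat the codes as character codes."""
    return "".join(chr(code) for code in codes)

def copy_with_map(codes, to_unicode):
    """With a code-to-character mapping, the original text can be recovered."""
    return "".join(to_unicode[code] for code in codes)

table = subset_font("hello")
codes = encode("hello", table)
print(codes)                     # [1, 2, 3, 3, 4]
print(repr(naive_copy(codes)))   # '\x01\x02\x03\x03\x04'

to_unicode = {code: ch for ch, code in table.items()}
print(copy_with_map(codes, to_unicode))  # hello
```

The page still renders correctly because the embedded font draws the right shape for each code; only the way back from code to character is lost.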
Always preserve the original file. Keep it safe. If you change tools, make sure you have a conversion path into something editable. The PDF is for giving away, nothing else.
Renumbering characters during font minimization? I haven’t encountered that; it would break searching and copying.
Anyway, PDFs don’t even say, for example, whether a line of text is left-aligned, centered or justified. They usually store the coordinates of the first character and then the spacing to each subsequent one, unless that is defined by the font.
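Since only coordinates are stored, alignment has to be inferred. A minimal sketch of how that inference might look, assuming the left and right edge of each line in a paragraph have already been extracted (the `guess_alignment` function, the tolerance and the sample numbers are all made up for illustration):

```python
def guess_alignment(lines, tolerance=1.0):
    """Guess paragraph alignment from (left_x, right_x) pairs, one per line."""
    if len(lines) < 3:
        return "unknown"  # too few lines to tell anything apart

    def same(values):
        return max(values) - min(values) <= tolerance

    lefts = [left for left, _ in lines]
    rights = [right for _, right in lines]
    centers = [(left + right) / 2 for left, right in lines]

    # The last line of a justified paragraph is usually short, so ignore it
    if same(lefts) and same(rights[:-1]):
        return "justified"
    if same(lefts):
        return "left"
    if same(rights):
        return "right"
    if same(centers):
        return "center"
    return "unknown"

print(guess_alignment([(72, 500), (72, 500), (72, 310)]))     # justified
print(guess_alignment([(72, 480), (72, 495), (72, 310)]))     # left
print(guess_alignment([(100, 472), (150, 422), (200, 372)]))  # center
```

Even this simple case is fragile: a left-aligned paragraph whose lines happen to end at the same position would be reported as justified.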
And what if the document contains text boxes, or other Word objects? Well, the text is separate from the underlying rectangle (if there is one) and it’s up to the conversion tool to guess if it’s part of the main text layer.
Sorry, it’s really hard to edit PDFs. You might want to use Inkscape for editing the graphical parts. If you also need to edit paragraphs, I suggest recreating the document by pasting them into Word/LibreOffice, and importing any graphical shapes as SVGs (use Inkscape for the conversion, then you can try Word’s “Graphic > Convert to Shapes” feature).
Really, every piece of software that outputs PDF should treat it as an export process, which would hopefully make it clearer that “saving as PDF” is visually lossless but structurally lossy and messy.
The compressing and renumbering seem to be more common with embedded Chinese fonts; space-wise it makes a lot of sense. But yes: mark and copy the text, paste it into Word or Writer, and you get gibberish. I can’t verify the search, though. And, of course, Google Translate can’t do anything with it, either.