I’ve just scanned a section of a book (in French) that unfortunately uses a very fine typeface and a lot of italics that seem to confuse the OCR.

I’m on Linux, so I’m alternating between gscan2pdf (which makes use of the remarkable unpaper program) and Master PDF Editor (a proprietary program) to clean up and deskew the scans before OCRing them, since each program has its own strengths and weaknesses. Once the scanned pages looked pretty good, I OCRed them with Tesseract (which is an option in gscan2pdf). I also tried GOCR, which produced garbage-level results.

Tesseract didn’t do too badly, but it occasionally mixes lines of text together, despite my trying to get them as straight as possible (and doing what I thought was a pretty good job!). It also puts spaces in the middle of words, like this: “J e t’ai m e”, which is annoying to go through and fix, especially since there are a lot of those spaces! Can anyone recommend a better approach, some different software maybe, or is this the best I can reasonably hope for?

  • hedge@beehaw.org (OP), 11 months ago

    @MasterBuilder@lemmy.one & @donio@beehaw.org, revisiting the subject, if I may: I’ve now put ocrmypdf through its paces and am pretty impressed with the results. One thing, though: I would like to be able to edit the OCR text that it generates, usually to join hyphenated words, remove line breaks, preserve paragraph (¶) breaks, and correct the rare spelling error. Is there a way to do this? (I believe gImageReader lets you edit the text, but I don’t think it will let you save the OCR’d text to the PDF!) Looking at ocrmypdf on GitHub, it looks like there might be a way to do this, but darned if I can figure out how. I wasn’t able to find anything about this in the documentation either. I’d be much obliged for any suggestions you might have.
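
To make the kind of cleanup I mean concrete, here is a rough Python sketch. It assumes the OCR text has first been exported to a plain text file (ocrmypdf’s `--sidecar` option writes one), and it only tidies that text copy; it does not touch the text layer inside the PDF. The function name `reflow_ocr_text` is my own invention, and the hyphen rule is a blunt heuristic that will also glue together genuine compounds split at a line end (e.g. “peut-être”), so the result still needs a read-through.

```python
import re


def reflow_ocr_text(text: str) -> str:
    """Join words hyphenated across line breaks, unwrap lines,
    and keep blank-line paragraph breaks."""
    # Normalise Windows/old-Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Treat one or more blank lines as a paragraph separator.
    paragraphs = re.split(r"\n\s*\n", text.strip())

    cleaned = []
    for para in paragraphs:
        # "impri-\nmeur" -> "imprimeur". Heuristic: this also removes the
        # hyphen from real compounds that happen to break at a line end.
        para = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", para)
        # Turn every remaining line break into a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        # Collapse runs of spaces or tabs left over from the OCR.
        para = re.sub(r"[ \t]{2,}", " ", para).strip()
        if para:
            cleaned.append(para)

    # Put exactly one blank line between paragraphs.
    return "\n\n".join(cleaned)


sample = "Le petit impri-\nmeur travaille\ntous les jours.\n\nJe t’aime."
print(reflow_ocr_text(sample))
```

On the sample above, the first three lines come out as one paragraph with “imprimeur” rejoined, followed by a blank line and the second paragraph.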

    • MasterBuilder@lemmy.one, 11 months ago

      I don’t know, but there might be PDF viewers that permit editing layers. Try LibreOffice Draw or gscan2pdf. Maybe GIMP can do it.