    dms: use ghostscript to convert PDF to text · f775724e
    Jérome Perrin authored
    For historical reasons, PDF to text conversion first converted the PDF to
    png, then this png to tiff, and the tiff was sent to tesseract. This works, but
    it consumes a lot of resources with large PDFs, especially because the
    intermediate png/tiff are created with a resolution of 300 DPI, which easily
    needs several GB of RAM and temporary disk space.
    This was observed with the PDFs created by erp5_document_scanner, which are
    usually high quality (1 or 2 MB per page); even a one-page PDF sometimes
    took more than one minute to OCR.
    
    Since version 9.53, ghostscript integrates the tesseract engine directly, so we
    don't need to prepare a tiff beforehand: we can send the PDF data directly to
    ghostscript.
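
    As a rough illustration of sending the PDF data directly to ghostscript, here
    is a minimal Python sketch. It is not the code from this commit: the helper
    names build_gs_ocr_command and pdf_to_text are hypothetical, the exact flags
    are an assumption, and it presumes a gs build that provides the "ocr" output
    device (ghostscript compiled with tesseract support).

```python
import subprocess


def build_gs_ocr_command(resolution=300):
    """Build a ghostscript command line that OCRs a PDF read from stdin.

    Hypothetical helper: the flag selection is an assumption, not the
    command used by the actual commit.
    """
    return [
        "gs",
        "-q",                     # suppress the startup banner
        "-dSAFER",                # restrict file system access
        "-dBATCH",                # exit after processing the input
        "-dNOPAUSE",              # do not wait for a key press between pages
        "-sDEVICE=ocr",           # text-producing OCR device (tesseract-enabled builds)
        "-r%d" % resolution,      # rendering resolution handed to the OCR engine
        "-sOutputFile=-",         # write the recognized text to stdout
        "-",                      # read the PDF from stdin
    ]


def pdf_to_text(pdf_data):
    """Run ghostscript on the raw PDF bytes and return the recognized text."""
    result = subprocess.run(
        build_gs_ocr_command(),
        input=pdf_data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8", "replace")
```

    With this shape, no intermediate png or tiff is written by the caller; the
    PDF bytes go in on stdin and the text comes back on stdout.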
    
    This change uses ghostscript if available and otherwise falls back to the same
    pipeline as before. This allows a transition period until all ERP5 instances
    are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before
    SlapOS included ghostscript 9.54, the ERP5 software release did not have
    ghostscript in $PATH, so we don't have to check the ghostscript version: we
    assume that if gs is in $PATH, we have a recent enough SlapOS.
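
    The "gs in $PATH" test described above could be sketched as follows. This is
    an illustration only: pick_ocr_pipeline and its return values are
    hypothetical names, and the lookup function is passed as a parameter purely
    so the sketch can be exercised without ghostscript installed.

```python
import shutil


def pick_ocr_pipeline(which=shutil.which):
    """Choose the OCR pipeline based on whether a gs binary is on $PATH.

    No version check is done: the presence of gs is taken as proof that
    the installation is recent enough, as explained in the commit message.
    """
    if which("gs"):
        return "ghostscript"
    # No gs available: keep the historical png -> tiff -> tesseract pipeline.
    return "legacy"
```

    Calling pick_ocr_pipeline() with no argument uses the real $PATH lookup.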
    
    This new approach is less tolerant of broken or password-protected PDFs,
    so we now check that the PDF is valid and not encrypted before
    trying to use OCR.
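
    The commit message does not say how this check is implemented. As a
    deliberately crude sketch of the idea, assuming a byte-level heuristic is
    acceptable, one could look for the PDF header, the end-of-file marker and
    the /Encrypt key. The name looks_safe_for_ocr is hypothetical, and a real
    check would more likely rely on a PDF parsing tool.

```python
def looks_safe_for_ocr(pdf_data):
    """Heuristically decide whether PDF bytes are worth sending to OCR.

    Assumption-laden sketch: this does not parse the PDF, it only looks
    for markers that a well-formed, unencrypted file normally has.
    """
    # A PDF file starts with a "%PDF-" header.
    if not pdf_data.startswith(b"%PDF-"):
        return False
    # A complete file ends with an "%%EOF" marker near the end.
    if b"%%EOF" not in pdf_data[-1024:]:
        return False
    # Encrypted documents reference an /Encrypt dictionary in the trailer.
    if b"/Encrypt" in pdf_data:
        return False
    return True
```

    A heuristic like this can reject a valid PDF that merely contains the bytes
    "/Encrypt" in its content, which only means that OCR is skipped for it.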
document.erp5.PDFDocument.py 14.3 KB