bt5/erp5_dms/DocumentTemplateItem/portal_components/document.erp5.PDFDocument.py · 9e375b8e9f14c3eb0b8fe6b3ac5a17a7997b7935 · nexedi / erp5

For historical reasons, PDF to text involved conversion first of the PDF to png, then this png to tiff and the tiff was sent to tesseract. This works, but it consumes a lot of resources with large PDFs, especially because the intermediate png/tiff are created with a resolution of 300 DPI, which easily needs serveral Go of RAM and temporary disk space. This was obsorved with the PDF created by erp5_document_scanner, which are usually high quality (1 or 2Mo per page) and even a one page PDF sometimes took more than one minute to OCR. Since 9.53 ghostscript integrates tesseract engine directly, we don't need to prepare a tiff beforehand, we can directly send the PDF data to ghostscript. These change use ghostscript if available and otherwise fallback to the same pipeline as before. This will allow the transition until all ERP5 instances are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before SlapOS include ghostscript 9.54, ERP5 software release did not have ghostscript in $PATH, so we don't have to check ghostscript version, we assume that if gs is in $PATH, it means we have a recent enough SlapOS. This new approach was less tolerant regarding broken/password-protected PDFs so we perform a new check that the PDF is valid and not encrypted before trying to use OCR.

document.erp5.PDFDocument.py 14.3 KB

Replace document.erp5.PDFDocument.py