1. 02 Jun, 2021 1 commit
    • dms: use ghostscript to convert PDF to text · f775724e
      Jérome Perrin authored
      For historical reasons, PDF to text conversion first converted the PDF to
      png, then the png to tiff, and the tiff was sent to tesseract. This works, but
      it consumes a lot of resources with large PDFs, especially because the
      intermediate png/tiff files are created at a resolution of 300 DPI, which easily
      needs several GB of RAM and temporary disk space.
      This was observed with the PDFs created by erp5_document_scanner, which are
      usually high quality (1 or 2 MB per page); even a one-page PDF sometimes
      took more than one minute to OCR.
      
      Since 9.53, ghostscript integrates the tesseract engine directly, so we don't need to
      prepare a tiff beforehand: we can send the PDF data directly to ghostscript.
      
      This change uses ghostscript if available and otherwise falls back to the same
      pipeline as before. This allows a transition period until all ERP5 instances
      are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before
      SlapOS included ghostscript 9.54, the ERP5 software release did not have ghostscript
      in $PATH, so we don't have to check the ghostscript version: we assume that if gs
      is in $PATH, we have a recent enough SlapOS.
      
      This new approach is less tolerant of broken or password-protected PDFs,
      so we now check that the PDF is valid and not encrypted before
      trying to use OCR.
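
The fallback rule described in this commit message can be sketched in Python as follows. This is a minimal illustration, not the code from the commit: the function names are hypothetical, and the ghostscript options shown (the `ocr` device and `-sOCRLanguage`) are an assumption about a ghostscript build that includes tesseract. The check for broken or encrypted PDFs is left out.

```python
import shutil


def select_pdf_to_text_pipeline(which=shutil.which):
    """Pick the conversion pipeline based only on whether gs is in $PATH.

    No version check is done: as in the commit message, the presence of
    gs is taken to mean that it is recent enough.
    """
    return "ghostscript" if which("gs") else "legacy"


def build_gs_ocr_command(pdf_path, out_path, language="eng"):
    """Build (but do not run) a ghostscript OCR command line.

    Assumption: the "ocr" device and the -sOCRLanguage option are
    available in the installed ghostscript.
    """
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=ocr",
        "-sOCRLanguage=" + language,
        "-sOutputFile=" + out_path,
        pdf_path,
    ]
```

Passing `which` as a parameter keeps the selection logic testable on machines without ghostscript installed.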
  2. 27 May, 2021 2 commits
    • PortalTransforms/tiff_to_text: run tesseract with OMP_THREAD_LIMIT=1 · d74981c3
      Jérome Perrin authored
      By default, tesseract runs on 4 CPUs; setting OMP_THREAD_LIMIT=1 makes it
      run on only one CPU (as documented on
      https://tesseract-ocr.github.io/tessdoc/FAQ.html)
      
      In ERP5, we tend to use one zope node per CPU, so we don't want each
      of these zope nodes to spawn a process which will run on 4 CPUs.
      
      In a quick benchmark, disabling threads is not slower and is even a bit faster:
      
          ## a big image in French (a picture of an invoice)
          $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
          Error in pixClipBoxToForeground: box not within image
          Error in pixClipBoxToForeground: box not within image
      
          ________________________________________________________
          Executed in   14.41 secs   fish           external
            usr time   27.88 secs  1002.00 micros   27.88 secs
            sys time    0.74 secs    0.00 micros    0.74 secs
      
          $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
          Error in pixClipBoxToForeground: box not within image
          Error in pixClipBoxToForeground: box not within image
      
          ________________________________________________________
          Executed in   12.58 secs   fish           external
            usr time   11.84 secs  955.00 micros   11.84 secs
            sys time    0.52 secs  503.00 micros    0.52 secs
      
          ## a small Japanese image
      
          $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
      
          ________________________________________________________
          Executed in    2.16 secs   fish           external
            usr time    3.77 secs  590.00 micros    3.77 secs
            sys time    0.27 secs  209.00 micros    0.27 secs
      
          $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
      
          ________________________________________________________
          Executed in    2.02 secs   fish           external
            usr time  1766.07 millis  1437.00 micros  1764.63 millis
            sys time  214.06 millis  522.00 micros  213.54 millis
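
From Python, the same environment variable can be set on the child process only, leaving the calling process untouched. The sketch below uses hypothetical helper names and is not the PortalTransforms code; the demonstration runs a small Python probe instead of tesseract so that it works without tesseract installed.

```python
import os
import subprocess
import sys


def single_thread_env(base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT forced to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_single_threaded(command):
    """Run a command with OMP_THREAD_LIMIT=1 set for that process only."""
    return subprocess.run(
        command,
        env=single_thread_env(),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # The probe prints the variable as seen by the child process.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['OMP_THREAD_LIMIT'])"]
    print(run_single_threaded(probe).stdout.strip())
```

In the benchmark above, the limit cut user CPU time from 27.88 s to 11.84 s on the large image while wall-clock time went from 14.41 s to 12.58 s.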
    • officejs_support_request_ui: typo in Post_ingestWebMessageForSupportRequest script name · 2cbd5640
      Jérome Perrin authored
      This script creates Web Message, not Mail Message
  3. 26 May, 2021 6 commits
    • Migrate app objects for wildcard frontend · fb733871
      Roque authored
      See merge request nexedi/erp5!1411
    • Improve Developer experience (mostly ERP5 Workflow/Python Scripts) · 1b31fcbd
      Jérome Perrin authored
      Fixes [#20210517-960A47](https://erp5js.nexedi.net/#/bug_module/20210517-960A47)
      
      The most important changes are:
       - coding style is enabled again for workflow scripts and starts to be enabled for ERP5 Python scripts
       - monaco editor support for workflow scripts, SQL methods and .less
       - small fixes for python/workflow scripts forms and ZMI
      
      See merge request !1422
    • ingestion: review publication_state argument · 015bc1c1
      Jérome Perrin authored
      Changing state directly in Base_contribute only worked when metadata was
      discovered asynchronously. With synchronous discovery, the state was changed
      first and Document_convertToBaseFormatAndDiscoverMetadata was executed
      afterwards, which caused an Unauthorized error like this:
      
            Module script, line 10, in Document_convertToBaseFormatAndDiscoverMetadata
            - <PythonScript at /erp5/Document_convertToBaseFormatAndDiscoverMetadata used for /erp5/document_module/163>
            - Line 10
              return context.discoverMetadata(filename=filename,
          Unauthorized: You are not allowed to access 'discoverMetadata' in this context
      
      because once the state has changed, regular users no longer have
      permission to access discoverMetadata, as that method needs the ModifyPortalContent
      permission.
      
      Instead of handling publication_state only in Base_contribute, treat it
      like other user input parameters and change state during discovery.
      
      Tests were also reorganised to move Base_contribute related tests to testIngestion
      and to run Base_contribute tests as a non-manager user.
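
The ordering problem can be illustrated with a toy model. Everything in this sketch is hypothetical (`ToyDocument`, its methods and the permission rule are simplified stand-ins, not ERP5 APIs): it only assumes that metadata discovery is allowed while the document is still a draft.

```python
class Unauthorized(Exception):
    """Stand-in for the Unauthorized error shown in the traceback."""


class ToyDocument:
    def __init__(self):
        self.state = "draft"
        self.title = None

    def discover_metadata(self, title):
        # Toy rule: regular users may only modify draft documents.
        if self.state != "draft":
            raise Unauthorized("discover_metadata")
        self.title = title

    def publish(self):
        self.state = "published"


def contribute_state_first(doc, title):
    """Old order: change state, then discover metadata (fails)."""
    doc.publish()
    doc.discover_metadata(title)


def contribute_state_during_discovery(doc, title, publication_state=None):
    """New order: discover metadata, then apply the requested state."""
    doc.discover_metadata(title)
    if publication_state == "published":
        doc.publish()
```

With the first function the call fails as soon as discovery runs on the already published document; with the second, discovery completes while the user still has permission and the state change follows.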
    • l10n: remove all translations for transitions with [transition in $workflow_id] · e3e2c240
      Jérome Perrin authored
      This was never supported; only [state in $workflow_id] is supported.
      
      See also:
      
        https://erp5js.nexedi.net/#/bug_module/1740
      
        b6dcbc19 (l10n_fr,l10n_jp: Fix translation of "Open", 2021-04-30)
      
      Generated from this script:
      
          #!/srv/slapgrid/slappart3/srv/runner/software/cc0326f0dcb093f56c01291c300c8481/parts/erp5/venv/bin/python
      
          import polib
          import sys
          import re
      
          pofile = polib.pofile(sys.argv[1])
      
          msgs = dict()
          for entry in pofile:
            msgs[entry.msgid] = entry.msgstr
      
          transition_re = re.compile(r'(.*) \[transition in .*\]')
      
          fixed_messages = dict()
          for entry in pofile:
            match = transition_re.match(entry.msgid)
            if match:
              # in erp5_l10n_de some msgstr also have the [transition in ...], we drop them
              if transition_re.match(entry.msgstr):
                continue
              short = match.groups()[0]
              if short.endswith('Action'):
                continue
              if short not in msgs:
                print(f"🤔  {short} not translated ( from {entry.msgid} )")
                fixed_messages[short] = entry.msgstr
            else:
              fixed_messages[entry.msgid] = entry.msgstr
      
          pofile.clear()
          for k, v in fixed_messages.items():
            pofile.append(polib.POEntry(msgid=k, msgstr=v))
      
          pofile.save(sys.argv[1])
      
          import subprocess
          subprocess.check_output(
            [
              '/opt/slapos-shared/gettext/4df93a547efd86e0eb70495b88a5d3b1/bin/msgattrib',
              sys.argv[1],
              "--no-fuzzy",
              "--translated",
              "-s",
              "--no-wrap",
              "-o",
              sys.argv[1]
      
            ]
          )
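
To see what the regular expression in the script above matches, the snippet below applies the same pattern to a few sample message ids. The helper name and the workflow id `ticket_workflow` are made up for the illustration.

```python
import re

# Same pattern as in the script above.
transition_re = re.compile(r'(.*) \[transition in .*\]')


def strip_transition_suffix(msgid):
    """Return the message without its [transition in ...] suffix,
    or None when the message id has no such suffix."""
    match = transition_re.match(msgid)
    return match.group(1) if match else None
```

Message ids with a `[state in ...]` suffix do not match and are therefore kept unchanged by the script.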
    • l10n: sort all message catalogs and remove non translated messages · e70ba50b
      Jérome Perrin authored
      using:
      
          msgattrib translation.po --no-fuzzy --translated -s --no-wrap -o translation.po
    • catalog: make translated_title behave more like title regarding % · 063de327
      Jérome Perrin authored
      translated_title is used in listbox search columns, so it's very confusing
      for users if they cannot use the usual % character for partial matches.
      
      This changes the behaviour of translated_title to autodetect the presence of
      % and use the LIKE comparison operator in that case.
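
The autodetection can be pictured with a small sketch. The function below is hypothetical and is not the ERP5 catalog code: it simply chooses between `=` and `LIKE` for a parameterised SQL condition depending on whether the search text contains `%`.

```python
def translated_title_condition(value):
    """Return a (sql_fragment, parameters) pair for a translated_title search.

    LIKE is used only when the user typed a % wildcard; otherwise an
    exact comparison is used.
    """
    operator = "LIKE" if "%" in value else "="
    return "translated_title {} %s".format(operator), (value,)
```

For example, searching for `Valid%` would produce a LIKE condition, while `Validated` would produce an exact match.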
  4. 24 May, 2021 11 commits
  5. 21 May, 2021 4 commits
  6. 20 May, 2021 1 commit
  7. 19 May, 2021 2 commits
  8. 18 May, 2021 3 commits
  9. 17 May, 2021 3 commits
  10. 12 May, 2021 4 commits
  11. 11 May, 2021 3 commits