  1. 04 Jun, 2021 1 commit
    • Lighter processing for OCR activities · 9e375b8e
      Jérome Perrin authored
      When running OCR, we sometimes have issues because processing is "too heavy":
       - [x] uses 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert pdf -> png -> tiff before sending to tesseract. Modern Ghostscript supports running tesseract directly, so we use it if it's available.
       - [x] uses 300% of CPU. Fixed by setting `OMP_THREAD_LIMIT` when running tesseract. This only applies when running OCR on images; the OCR embedded in Ghostscript does not seem to need it.
       - [x] ... and often crashes, so it is restarted. This is fixed by updating tesseract.
      
      Updates of ghostscript and tesseract are part of nexedi/slapos!985
      
      See merge request nexedi/erp5!1420
  2. 03 Jun, 2021 2 commits
    • Base: support more image formats · f084c646
      Jérome Perrin authored
      By relying on PIL after our monkey-patched OFS.Image.getImageInfo.
      
      We keep this monkey-patch for now, because it adds support for svg.
      
      See merge request nexedi/erp5!1426
    • dms: drop PDF thumbnail view · 6dce55b0
      Jérome Perrin authored
      Since 7f32f8cd (erp5_dms: Add PDF Reader using the pdf.js, 2016-06-24)
      we have a PDF preview with a JavaScript PDF viewer, which is a much better way
      of viewing PDFs.
      
      That commit made the thumbnail preview obsolete, and the thumbnail preview
      does not really work on ERP5JS, so remove it.
  3. 02 Jun, 2021 2 commits
    • Julien Muchembled's avatar
    • dms: use ghostscript to convert PDF to text · f775724e
      Jérome Perrin authored
      For historical reasons, PDF to text conversion first converted the PDF to
      png, then this png to tiff, and the tiff was sent to tesseract. This works, but
      it consumes a lot of resources with large PDFs, especially because the
      intermediate png/tiff are created with a resolution of 300 DPI, which easily
      needs several GB of RAM and temporary disk space.
      This was observed with the PDFs created by erp5_document_scanner, which are
      usually high quality (1 or 2 MB per page), and even a one-page PDF sometimes
      took more than one minute to OCR.
      
      Since 9.53 ghostscript integrates tesseract engine directly, we don't need to
      prepare a tiff beforehand, we can directly send the PDF data to ghostscript.
      
      This change uses ghostscript if available and otherwise falls back to the same
      pipeline as before. This will allow the transition until all ERP5 instances
      are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before
      SlapOS included ghostscript 9.54, the ERP5 software release did not have ghostscript
      in $PATH, so we don't have to check the ghostscript version: we assume that if gs
      is in $PATH, it means we have a recent enough SlapOS.
      
      This new approach was less tolerant regarding broken/password-protected PDFs
      so we perform a new check that the PDF is valid and not encrypted before
      trying to use OCR.
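      
      A minimal sketch of the selection logic described above, assuming only that
      the presence of a `gs` executable in $PATH is the signal. The function name
      and the returned labels are hypothetical, not the names used in ERP5.

```python
import shutil


def choose_pdf_to_text_pipeline(which=shutil.which):
    """Pick the PDF to text pipeline (hypothetical helper).

    `which` is injectable so the choice can be exercised without
    depending on what is installed on the machine.
    """
    # No version check: per the commit message, gs being in $PATH is
    # taken to mean ghostscript is recent enough to embed tesseract.
    if which("gs"):
        return "ghostscript"
    # Otherwise keep the historical pdf -> png -> tiff -> tesseract path.
    return "legacy"
```

      Called without arguments, the helper inspects the real $PATH; passing a
      stub for `which` makes both branches easy to test.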
  4. 01 Jun, 2021 1 commit
  5. 31 May, 2021 3 commits
  6. 27 May, 2021 9 commits
    • PortalTransforms/tiff_to_text: run tesseract with OMP_THREAD_LIMIT=1 · d74981c3
      Jérome Perrin authored
      By default, tesseract runs on 4 CPUs and this can be controlled by
      OMP_THREAD_LIMIT=1 to run on only one CPU (as documented on
      https://tesseract-ocr.github.io/tessdoc/FAQ.html)
      
      In ERP5, we tend to use one zope node per CPU, so we don't want each
      of these zope nodes to spawn a process which will run on 4 CPUs.
      
      In a quick benchmark, disabling threads is not slower, and is even a bit faster:
      
          ## a big image in france (a picture of an invoice)
          $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
          Error in pixClipBoxToForeground: box not within image
          Error in pixClipBoxToForeground: box not within image
      
          ________________________________________________________
          Executed in   14.41 secs   fish           external
            usr time   27.88 secs  1002.00 micros   27.88 secs
            sys time    0.74 secs    0.00 micros    0.74 secs
      
          $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
          Error in pixClipBoxToForeground: box not within image
          Error in pixClipBoxToForeground: box not within image
      
          ________________________________________________________
          Executed in   12.58 secs   fish           external
            usr time   11.84 secs  955.00 micros   11.84 secs
            sys time    0.52 secs  503.00 micros    0.52 secs
      
          ## a small japanese image
      
          $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
      
          ________________________________________________________
          Executed in    2.16 secs   fish           external
            usr time    3.77 secs  590.00 micros    3.77 secs
            sys time    0.27 secs  209.00 micros    0.27 secs
      
          $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
          Tesseract Open Source OCR Engine v4.1.1 with Leptonica
          Page 1
      
          ________________________________________________________
          Executed in    2.02 secs   fish           external
            usr time  1766.07 millis  1437.00 micros  1764.63 millis
            sys time  214.06 millis  522.00 micros  213.54 millis
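      
      A minimal sketch of how a caller could apply the same setting from Python,
      assuming tesseract is started as a subprocess. The helper names are
      hypothetical and this is not the actual tiff_to_text code.

```python
import os
import subprocess


def tesseract_env(base_env=None):
    """Return a copy of the environment with tesseract limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_tesseract(tiff_path, output_base, tesseract="tesseract"):
    """Run tesseract single-threaded (hypothetical wrapper)."""
    return subprocess.run(
        [tesseract, tiff_path, output_base],
        env=tesseract_env(),
        check=True,
    )
```

      Building a copy of the environment keeps the limit local to the tesseract
      child process instead of changing it for the whole zope node.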
    • Jérome Perrin's avatar
      9ac96204
    • dms: run testERP5Base.TestImage with DMS installed · 1d0aeccb
      Jérome Perrin authored
      because DMS extends image portal types with interaction workflows etc,
      it's better to also cover the case where DMS is installed.
    • core_test: enable coding style · 56ece63b
      Jérome Perrin authored
    • Jérome Perrin's avatar
    • Image: fallback to PIL to guess images content and size · befb7d84
      Jérome Perrin authored
      This fixes the problem that some formats such as tiff were not supported.
    • core_test: fix list of included tests · a148fcc9
      Jérome Perrin authored
      testSQLCachedWorklist is now part of a dedicated erp5_worklist_sql_test
      business template.
    • mimetypes_registry: merge image/x-ms-bmp in Windows BMP image · 0991f839
      Jérome Perrin authored
      We had two mimetype entries, which caused inconsistencies depending on
      whether the lookup was done by mimetype, by glob or by extension.
      
      We had:
      
      - name:
         "Windows BMP image"
      - mimetypes:
          image/bmp
          image/x-bmp
          image/x-MS-bmp
      - extensions:
      - globs:
          *.bmp
      
      and
      
      - name:
         "image/x-ms-bmp"
      - mimetypes:
          image/x-ms-bmp
      - extensions:
          bmp
      - globs:
      
      With this commit they are merged into one:
      
      - name:
         "Windows BMP image"
      - mimetypes:
          image/x-ms-bmp
          image/bmp
          image/x-bmp
          image/x-MS-bmp
      - extensions:
          bmp
      - globs:
          *.bmp
      
      This way we only have one consistent mimetype.
      
      For compatibility with extension lookups (that are done in
      Document_guessMimeType interaction workflow from DMS), image/x-ms-bmp is
      kept as default. This might not be the best choice, according to
      https://www.iana.org/assignments/media-types/media-types.xhtml
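      
      To illustrate why a single merged entry removes the inconsistency, here is
      a toy model of the three lookups. The data structure and function names are
      hypothetical and are not the mimetypes_registry API; only the entry
      contents come from the listing above.

```python
from fnmatch import fnmatch

# Toy registry holding the merged entry; the first mimetype is the default.
REGISTRY = [
    {
        "name": "Windows BMP image",
        "mimetypes": ("image/x-ms-bmp", "image/bmp", "image/x-bmp", "image/x-MS-bmp"),
        "extensions": ("bmp",),
        "globs": ("*.bmp",),
    },
]


def by_mimetype(mimetype):
    """Return the entry declaring this mimetype, or None."""
    return next((e for e in REGISTRY if mimetype in e["mimetypes"]), None)


def by_extension(filename):
    """Return the entry declaring the file's extension, or None."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return None
    return next((e for e in REGISTRY if ext.lower() in e["extensions"]), None)


def by_glob(filename):
    """Return the first entry with a glob matching the file name, or None."""
    name = filename.lower()
    return next(
        (e for e in REGISTRY if any(fnmatch(name, g) for g in e["globs"])),
        None,
    )
```

      With one entry, all three lookups resolve to the same object, so the
      default mimetype no longer depends on which lookup path was taken.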
    • officejs_support_request_ui: typo in Post_ingestWebMessageForSupportRequest script name · 2cbd5640
      Jérome Perrin authored
      This script creates Web Message, not Mail Message
  7. 26 May, 2021 6 commits
    • Migrate app objects for wildcard frontend · fb733871
      Roque authored
      See merge request nexedi/erp5!1411
    • Improve Developer experience (mostly ERP5 Workflow/Python Scripts) · 1b31fcbd
      Jérome Perrin authored
      Fixes [#20210517-960A47](https://erp5js.nexedi.net/#/bug_module/20210517-960A47)
      
      The most important changes are:
       - coding style is enabled again for workflow scripts and starts to be enabled for ERP5 Python scripts
       - monaco editor support for workflow scripts, SQL methods and .less
       - small fixes for python/workflow scripts forms and ZMI
      
      See merge request nexedi/erp5!1422
    • ingestion: review publication_state argument · 015bc1c1
      Jérome Perrin authored
      Changing state directly in Base_contribute was only functional for the case
      where metadata was discovered asynchronously. In the case of synchronous
      discovery, the state was changed first, and then Document_convertToBaseFormatAndDiscoverMetadata
      was executed, but this was causing Unauthorized like this:
      
            Module script, line 10, in Document_convertToBaseFormatAndDiscoverMetadata
            - <PythonScript at /erp5/Document_convertToBaseFormatAndDiscoverMetadata used for /erp5/document_module/163>
            - Line 10
              return context.discoverMetadata(filename=filename,
          Unauthorized: You are not allowed to access 'discoverMetadata' in this context
      
      because once we have already changed state, regular users no longer have
      permission to access discoverMetadata, because that method needs the ModifyPortalContent
      permission.
      
      Instead of handling publication_state only in Base_contribute, treat it
      like other user input parameters and change state during discovery.
      
      Tests were also reorganised to move Base_contribute related tests to testIngestion
      and also to run Base_contribute tests as a non-manager user.
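      
      The ordering problem can be reproduced with a toy model. Everything below
      is hypothetical (class, method names and the "only drafts are modifiable"
      rule stand in for the real workflow permissions); it only illustrates why
      the state change has to happen during discovery rather than before it.

```python
class Unauthorized(Exception):
    """Stand-in for the Zope Unauthorized error (toy definition)."""


class ToyDocument:
    def __init__(self):
        self.state = "draft"
        self.metadata = {}

    def change_state(self, new_state):
        self.state = new_state

    def discover_metadata(self, user_input):
        # Toy permission rule: a regular user may only modify drafts.
        if self.state != "draft":
            raise Unauthorized("not allowed to access 'discoverMetadata'")
        user_input = dict(user_input)
        publication_state = user_input.pop("publication_state", None)
        self.metadata.update(user_input)
        # The state change happens last, as part of discovery.
        if publication_state:
            self.change_state(publication_state)


def contribute_old_order(document, publication_state, **metadata):
    """Old synchronous flow: change state first, then discover."""
    document.change_state(publication_state)
    document.discover_metadata(metadata)


def contribute_new_order(document, publication_state, **metadata):
    """New flow: publication_state is passed along like any user input."""
    document.discover_metadata(dict(metadata, publication_state=publication_state))
```

      In this model the old order raises Unauthorized for any non-draft target
      state, while the new order records the metadata and then moves the
      document to the requested state.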
    • l10n: remove all translations for transitions with [transition in $workflow_id] · e3e2c240
      Jérome Perrin authored
      This was never supported; we only support [state in $workflow_id]
      
      See also:
      
        https://erp5js.nexedi.net/#/bug_module/1740
      
        b6dcbc19 (l10n_fr,l10n_jp: Fix translation of "Open", 2021-04-30)
      
      Generated from this script:
      
          #!/srv/slapgrid/slappart3/srv/runner/software/cc0326f0dcb093f56c01291c300c8481/parts/erp5/venv/bin/python
      
          import polib
          import sys
          import re
      
          pofile = polib.pofile(sys.argv[1])
      
          msgs = dict()
          for entry in pofile:
            msgs[entry.msgid] = entry.msgstr
      
          transition_re = re.compile(r'(.*) \[transition in .*\]')
      
          fixed_messages = dict()
          for entry in pofile:
            match = transition_re.match(entry.msgid)
            if match:
              # in erp5_l10n_de some msgstr also have the [transition in ...], we drop them
              if transition_re.match(entry.msgstr):
                continue
              short = match.groups()[0]
              if short.endswith('Action'):
                continue
              if short not in msgs:
                print(f"🤔  {short} not translated ( from {entry.msgid} )")
                fixed_messages[short] = entry.msgstr
            else:
              fixed_messages[entry.msgid] = entry.msgstr
      
          pofile.clear()
          for k, v in fixed_messages.items():
            pofile.append(polib.POEntry(msgid=k, msgstr=v))
      
          pofile.save(sys.argv[1])
      
          import subprocess
          subprocess.check_output(
            [
              '/opt/slapos-shared/gettext/4df93a547efd86e0eb70495b88a5d3b1/bin/msgattrib',
              sys.argv[1],
              "--no-fuzzy",
              "--translated",
              "-s",
              "--no-wrap",
              "-o",
              sys.argv[1]
      
            ]
          )
    • l10n: sort all message catalogs and remove non translated messages · e70ba50b
      Jérome Perrin authored
      using:
      
          msgattrib translation.po --no-fuzzy --translated -s --no-wrap -o translation.po
    • catalog: make translated_title behave more like title regarding % · 063de327
      Jérome Perrin authored
      translated_title is used in listbox search columns, so it's very confusing
      for users if they cannot use the usual % character for partial matches.
      
      This changes the behaviour of translated_title to autodetect the presence of
      % and use the LIKE comparison operator in that case.
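      
      The autodetection can be sketched as below. The function name is
      hypothetical, and the exact-match operator used when no % is present is an
      assumption; the commit message only states that LIKE is used when % is
      found.

```python
def translated_title_operator(search_value):
    """Choose the comparison operator for a translated_title search.

    Hypothetical helper: "like" when the user typed a % wildcard,
    otherwise "=" (assumed default for this sketch).
    """
    return "like" if "%" in search_value else "="
```

      For example, a search for `Inv%` would be compared with LIKE, while a
      search for `Invoice` would use the assumed exact comparison.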
  8. 24 May, 2021 11 commits
  9. 21 May, 2021 4 commits
  10. 20 May, 2021 1 commit