Commits · 9e375b8e9f14c3eb0b8fe6b3ac5a17a7997b7935 · nexedi / erp5

04 Jun, 2021 1 commit

Lighter processing for OCR activities · 9e375b8e

Jérome Perrin authored Jun 04, 2021

When running OCR, we sometimes have issues because processing is "too heavy":
 - [x] use 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert pdf -> png -> tiff before sending to tesseract. Modern Ghostscript supports running tesseract directly, so we use it if it's available.
 - [x] use 300% of CPU. Fixed by setting `OMP_THREAD_LIMIT` when running tesseract. This only applies when running OCR from images. OCR embedded in Ghostscript does not seem to need this.
 - [x] ... and often crash, so it is restarted. This is fixed by updating tesseract.

Updates of ghostscript and tesseract are part of nexedi/slapos!985

See merge request nexedi/erp5!1420

9e375b8e

03 Jun, 2021 2 commits

Base: support more image formats · f084c646

Jérome Perrin authored Jun 03, 2021

By relying on PIL after our monkey-patched OFS.Image.getImageInfo.

We keep this monkey-patch for now, because it adds support for svg

See merge request nexedi/erp5!1426

f084c646

dms: drop PDF thumbnail view · 6dce55b0

Jérome Perrin authored Jun 01, 2021

Since 7f32f8cd (erp5_dms: Add PDF Reader using the pdf.js, 2016-06-24)
we have a PDF preview with a JavaScript PDF viewer, which is a much better way
of viewing PDFs.

That commit made the thumbnail preview obsolete, and it does not really
work on ERP5JS, so remove the thumbnail preview.

6dce55b0

02 Jun, 2021 2 commits

fixup! MailHost: Set SMTP socket timeout to 16s (#20161019-4A3BD2) · 685810c3
Julien Muchembled authored Jun 02, 2021
```
See commit ec3c9cbc.
```
685810c3

dms: use ghostscript to convert PDF to text · f775724e

Jérome Perrin authored May 28, 2021

For historical reasons, PDF to text involved first converting the PDF to
png, then this png to tiff, and the tiff was sent to tesseract. This works, but
it consumes a lot of resources with large PDFs, especially because the
intermediate png/tiff are created with a resolution of 300 DPI, which easily
needs several GB of RAM and temporary disk space.
This was observed with the PDFs created by erp5_document_scanner, which are
usually high quality (1 or 2 MB per page), and even a one-page PDF sometimes
took more than one minute to OCR.

Since 9.53 ghostscript integrates tesseract engine directly, we don't need to
prepare a tiff beforehand, we can directly send the PDF data to ghostscript.

This change uses ghostscript if available and otherwise falls back to the same
pipeline as before. This will allow the transition until all ERP5 instances
are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before
SlapOS included ghostscript 9.54, the ERP5 software release did not have ghostscript
in $PATH, so we don't have to check the ghostscript version: we assume that if gs
is in $PATH, it means we have a recent enough SlapOS.

This new approach was less tolerant regarding broken/password-protected PDFs
so we perform a new check that the PDF is valid and not encrypted before
trying to use OCR.
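
The selection logic described above can be sketched roughly as follows. This is only an illustration, not the code from this commit: the helper name `choose_pdf_to_text_pipeline` is hypothetical, and the Ghostscript options used here (`-sDEVICE=ocr`, `-sOCRLanguage`) are assumed from Ghostscript's OCR device documentation rather than taken from ERP5.

```python
import shutil


def choose_pdf_to_text_pipeline(pdf_path, language="eng", which=shutil.which):
  """Return (pipeline_name, command) for extracting text from a PDF.

  Hypothetical helper: if `gs` is found in $PATH we assume it is recent
  enough to embed tesseract, otherwise we report the legacy pipeline.
  """
  gs = which("gs")
  if gs is None:
    # No ghostscript in $PATH: keep the old pdf -> png -> tiff -> tesseract path.
    return "legacy", None
  command = [
    gs,
    "-q", "-dBATCH", "-dNOPAUSE", "-dSAFER",
    "-r300",                      # assumed resolution, mirrors the 300 DPI above
    "-sDEVICE=ocr",               # assumed: Ghostscript's plain-text OCR device
    "-sOCRLanguage=" + language,  # assumed option name
    "-sOutputFile=-",             # write the recognised text to stdout
    pdf_path,
  ]
  return "ghostscript", command
```

The `which` argument is only there so that the choice can be exercised without a real ghostscript installation; a caller would normally leave it at its default.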

f775724e

01 Jun, 2021 1 commit
- core_test: Add test to make sure that wendelin.core basically works · 248940fe
  Rafael Monnerat authored Jun 01, 2021
```
See merge request nexedi/erp5!1429
```
  248940fe
31 May, 2021 3 commits

core_test: Add test to make sure that wendelin.core basically works · 5796a17a

Kirill Smelkov authored May 31, 2021

Wendelin.core is now an integral part of ERP5 (see [1,2]), but nothing
inside ERP5 currently uses it. And even though wendelin.core has its own
testsuite, integration problems are always possible.

-> Add test to erp5_core_test that minimally makes sure that basic
wendelin.core operations work.

This test currently passes with wendelin.core 1, which is the default.
It also passes as live test with wendelin.core 2.
However, with wendelin.core 2 it currently fails on testnodes with errors such as

ValueError: ZODB.MappingStorage.MappingStorage is in-RAM storage
in-RAM storages are not supported:
a zurl pointing to in-RAM storage in one process would lead to
another in-RAM storage in WCFS process.

and

RuntimeError: wcfs: join file:///srv/slapgrid/slappart8/srv/testnode/djk/test_suite/unit_test.2/var/Data.fs: server not started
(https://nexedijs.erp5.net/#/test_result_module/20210530-92EF3124/102)

because we need to amend ERP5 test driver

1) to run tests on a real storage instead of in-RAM Mapping Storage(*), and
2) to spawn WCFS server for each such storage.

I will try to address those points in a later patch.

In the meantime there should be no reason not to merge this, because we
do not use wendelin.core 2 yet, and solving "1" and "2" first are
preconditions to begin such a usage.

/cc @rafael, @tomo, @seb, @jerome, @romain, @vpelletier, @Tyagov, @klaus, @jp

(*) Combining Zope and WCFS working together requires data to be on a real
storage, not on in-RAM MappingStorage inside Zope's Python process.

[1] nexedi/slapos@7f877621
[2] nexedi/slapos!874 (comment 122339)

5796a17a

core: fix script selection on InteractionWorkflowInteraction_view · 53d2ec13
Jérome Perrin authored May 31, 2021
```
We don't want to show the ID which has a prefix, but the reference
```
53d2ec13
Enable coding style for erp5_core_test · d10cf0da
Jérome Perrin authored May 31, 2021
```
See merge request nexedi/erp5!1427
```
d10cf0da

27 May, 2021 9 commits

PortalTransforms/tiff_to_text: run tesseract with OMP_THREAD_LIMIT=1 · d74981c3

Jérome Perrin authored May 20, 2021

By default, tesseract runs on 4 CPUs. This can be controlled with
OMP_THREAD_LIMIT: setting OMP_THREAD_LIMIT=1 makes it run on only one CPU
(as documented on https://tesseract-ocr.github.io/tessdoc/FAQ.html)

In ERP5, we tend to use one zope node per CPU, so we don't want each
of these zope nodes to spawn a process which will run on 4 CPUs.

In a quick benchmark, disabling threads is not slower, and is even a bit faster:

    ## a big image in French (a picture of an invoice)
    $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image

    ________________________________________________________
    Executed in   14.41 secs   fish           external
      usr time   27.88 secs  1002.00 micros   27.88 secs
      sys time    0.74 secs    0.00 micros    0.74 secs

    $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image

    ________________________________________________________
    Executed in   12.58 secs   fish           external
      usr time   11.84 secs  955.00 micros   11.84 secs
      sys time    0.52 secs  503.00 micros    0.52 secs

    ## a small japanese image

    $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1

    ________________________________________________________
    Executed in    2.16 secs   fish           external
      usr time    3.77 secs  590.00 micros    3.77 secs
      sys time    0.27 secs  209.00 micros    0.27 secs

    $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1

    ________________________________________________________
    Executed in    2.02 secs   fish           external
      usr time  1766.07 millis  1437.00 micros  1764.63 millis
      sys time  214.06 millis  522.00 micros  213.54 millis
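
A minimal sketch of how a caller might apply this setting when spawning tesseract from Python. It is not the PortalTransforms code: `single_thread_env` and `run_tesseract` are hypothetical names, and the paths passed to them are placeholders.

```python
import os
import subprocess


def single_thread_env(base_env=None):
  """Return a copy of the environment with OMP_THREAD_LIMIT forced to 1."""
  env = dict(os.environ if base_env is None else base_env)
  env["OMP_THREAD_LIMIT"] = "1"
  return env


def run_tesseract(tiff_path, output_base, tesseract="tesseract"):
  """Run tesseract on a single CPU (hypothetical wrapper)."""
  return subprocess.run(
    [tesseract, tiff_path, output_base],
    env=single_thread_env(),
    check=True,
  )
```

Copying the environment instead of mutating `os.environ` keeps the limit local to the tesseract subprocess, so the zope process itself is not affected.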

d74981c3

dms: test more cases of converting PDFs to images · 9ac96204
Jérome Perrin authored May 27, 2021

9ac96204

dms: run testERP5Base.TestImage with DMS installed · 1d0aeccb

Jérome Perrin authored May 27, 2021

because DMS extends image portal types with interaction workflows etc,
it's better to also cover the case where DMS is installed.

1d0aeccb

core_test: enable coding style · 56ece63b
Jérome Perrin authored May 27, 2021

56ece63b
core_test: address pylint messages and other small cleanups · 8d8066c9
Jérome Perrin authored May 27, 2021

8d8066c9
Image: fallback to PIL to guess images content and size · befb7d84
Jérome Perrin authored May 27, 2021
```
This fixes the problem that some formats such as tiff were not supported.
```
befb7d84

core_test: fix list of included tests · a148fcc9

Jérome Perrin authored May 27, 2021

testSQLCachedWorklist is now part of a dedicated erp5_worklist_sql_test
business template.

a148fcc9

mimetypes_registry: merge image/x-ms-bmp in Windows BMP image · 0991f839

Jérome Perrin authored May 27, 2021

We had two mimetype entries, which caused inconsistencies depending on
whether the lookup was done by mimetype, by glob or by extension.

We had:

- name:
   "Windows BMP image"
- mimetypes:
    image/bmp
    image/x-bmp
    image/x-MS-bmp
- extensions:
- globs:
    *.bmp

and

- name:
   "image/x-ms-bmp"
- mimetypes:
    image/x-ms-bmp
- extensions:
    bmp
- globs:

With this commit they are merged into one:

- name:
   "Windows BMP image"
- mimetypes:
    image/x-ms-bmp
    image/bmp
    image/x-bmp
    image/x-MS-bmp
- extensions:
    bmp
- globs:
    *.bmp

This way we only have one consistent mimetype.

For compatibility with extension lookups (that are done in
Document_guessMimeType interaction workflow from DMS), image/x-ms-bmp is
kept as default. This might not be the best choice, according to
https://www.iana.org/assignments/media-types/media-types.xhtml
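
To illustrate why a single merged entry gives consistent results, here is a small self-contained sketch. It does not use the real mimetypes_registry API: `MimeTypeEntry`, `REGISTRY` and the three `lookup_by_*` functions are hypothetical stand-ins, and only the entry data comes from the listing above.

```python
import fnmatch


class MimeTypeEntry(object):
  """Toy registry entry; the first mimetype is treated as the default."""

  def __init__(self, name, mimetypes, extensions, globs):
    self.name = name
    self.mimetypes = tuple(mimetypes)
    self.extensions = tuple(extensions)
    self.globs = tuple(globs)


BMP = MimeTypeEntry(
  name="Windows BMP image",
  mimetypes=("image/x-ms-bmp", "image/bmp", "image/x-bmp", "image/x-MS-bmp"),
  extensions=("bmp",),
  globs=("*.bmp",),
)
REGISTRY = [BMP]


def lookup_by_mimetype(mimetype):
  for entry in REGISTRY:
    if mimetype in entry.mimetypes:
      return entry
  return None


def lookup_by_extension(filename):
  if "." not in filename:
    return None
  extension = filename.rsplit(".", 1)[-1].lower()
  for entry in REGISTRY:
    if extension in entry.extensions:
      return entry
  return None


def lookup_by_glob(filename):
  for entry in REGISTRY:
    if any(fnmatch.fnmatch(filename, pattern) for pattern in entry.globs):
      return entry
  return None
```

With one entry, all three lookups resolve to the same object, and its default mimetype is `image/x-ms-bmp` regardless of how the entry was found.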

0991f839

officejs_support_request_ui: typo in Post_ingestWebMessageForSupportRequest script name · 2cbd5640
Jérome Perrin authored May 26, 2021
```
This script creates Web Message, not Mail Message
```
2cbd5640

26 May, 2021 6 commits

Migrate app objects for wildcard frontend · fb733871
Roque authored May 26, 2021
```
See merge request nexedi/erp5!1411
```
fb733871

Improve Developer experience (mostly ERP5 Workflow/Python Scripts) · 1b31fcbd

Jérome Perrin authored May 26, 2021

Fixes [#20210517-960A47](https://erp5js.nexedi.net/#/bug_module/20210517-960A47)

The most important changes are:
 - coding style is enabled again for workflow scripts and starts to be enabled for ERP5 Python scripts
 - monaco editor support for workflow scripts, SQL methods and .less
 - small fixes for python/workflow scripts forms and ZMI

See merge request nexedi/erp5!1422

1b31fcbd

ingestion: review publication_state argument · 015bc1c1

Jérome Perrin authored May 24, 2021

Changing state directly in Base_contribute was only functional for the case
where metadata was discovered asynchronously. In the case of synchronous
discovery, the state was changed first, and then Document_convertToBaseFormatAndDiscoverMetadata
was executed - but this was causing Unauthorized errors like this:

      Module script, line 10, in Document_convertToBaseFormatAndDiscoverMetadata
      - <PythonScript at /erp5/Document_convertToBaseFormatAndDiscoverMetadata used for /erp5/document_module/163>
      - Line 10
        return context.discoverMetadata(filename=filename,
    Unauthorized: You are not allowed to access 'discoverMetadata' in this context

because once we have already changed state, regular users no longer have
permission to access discoverMetadata, because that method needs ModifyPortalContent
permission.

Instead of handling publication_state only in Base_contribute, treat it
like other user input parameters and change state during discovery.

Tests were also re-organised to move Base_contribute related tests to testIngestion
and also to run Base_contribute tests as a non-manager user.

015bc1c1

l10n: remove all translations for transitions with [transition in $workflow_id] · e3e2c240

Jérome Perrin authored May 21, 2021

This was never supported, we support only [state in $workflow_id]

See also:

  https://erp5js.nexedi.net/#/bug_module/1740

  b6dcbc19 (l10n_fr,l10n_jp: Fix translation of "Open", 2021-04-30)

Generated from this script:

    #!/srv/slapgrid/slappart3/srv/runner/software/cc0326f0dcb093f56c01291c300c8481/parts/erp5/venv/bin/python

    import polib
    import sys
    import re

    pofile = polib.pofile(sys.argv[1])

    msgs = dict()
    for entry in pofile:
      msgs[entry.msgid] = entry.msgstr

    transition_re = re.compile(r'(.*) \[transition in .*\]')

    fixed_messages = dict()
    for entry in pofile:
      match = transition_re.match(entry.msgid)
      if match:
        # in erp5_l10n_de some msgstr also have the [transition in ...], we drop them
        if transition_re.match(entry.msgstr):
          continue
        short = match.groups()[0]
        if short.endswith('Action'):
          continue
        if short not in msgs:
          print(f"🤔  {short} not translated ( from {entry.msgid} )")
          fixed_messages[short] = entry.msgstr
      else:
        fixed_messages[entry.msgid] = entry.msgstr

    pofile.clear()
    for k, v in fixed_messages.items():
      pofile.append(polib.POEntry(msgid=k, msgstr=v))

    pofile.save(sys.argv[1])

    import subprocess
    subprocess.check_output(
      [
        '/opt/slapos-shared/gettext/4df93a547efd86e0eb70495b88a5d3b1/bin/msgattrib',
        sys.argv[1],
        "--no-fuzzy",
        "--translated",
        "-s",
        "--no-wrap",
        "-o",
        sys.argv[1]

      ]
    )

e3e2c240

l10n: sort all message catalogs and remove non translated messages · e70ba50b
Jérome Perrin authored May 21, 2021
```
using:

    msgattrib translation.po --no-fuzzy --translated -s --no-wrap -o translation.po
```
e70ba50b

catalog: make translated_title behave more like title regarding % · 063de327

Jérome Perrin authored May 17, 2021

translated_title is used in listbox search columns, so it's very confusing
for users if they can not use the usual % character for partial matches.

This changes the behaviour of translated_title to autodetect the presence of
% and use LIKE comparison operator in such case.
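
The autodetection can be pictured with the following sketch. It is not the catalog implementation (ERP5 search keys are more involved): `comparison_for` and `build_condition` are hypothetical helpers that only show the idea of switching operators when `%` is present.

```python
def comparison_for(value):
  """Pick the SQL comparison operator for a search value."""
  return "LIKE" if "%" in value else "="


def build_condition(column, value):
  """Return a parameterised (sql, params) pair; the value is never inlined."""
  sql = "{} {} %s".format(column, comparison_for(value))
  return sql, (value,)
```

For example, searching `translated_title` for `Inv%` would produce a `LIKE` condition, while searching for `Invoice` keeps an exact match.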

063de327

24 May, 2021 11 commits
- core: fix pylint warnings · b1f36ffd
  Jérome Perrin authored May 24, 2021
```
these ERP5 Python Scripts were not covered by coding style tests
```
  b1f36ffd
- accounting_ui_test: fix pylint warnings · 3304fe9b
  Jérome Perrin authored May 24, 2021
```
This ERP5 Python script was not checked until now, and had wrong indentation
```
  3304fe9b
- core: fix script listbox columns on Workflow · 08a1b22a
  Jérome Perrin authored May 21, 2021
```
The ID is calculated from reference, showing reference is enough.
There is no "callable_type" property on workflow script
```
  08a1b22a
- core: fix roles on Python Script proxy role view · a6fb9ff2
  Jérome Perrin authored May 21, 2021
```
Don't hardcode a few roles, use the API to get all roles
```
  a6fb9ff2
- core: enable source code editor for SQL Methods · 7f3c6086
  Jérome Perrin authored May 17, 2021
  
  7f3c6086
- core: check ERP5 scripts coding style tests · 0d522ef3
  Jérome Perrin authored May 17, 2021
```
Only zope Python Scripts were checked
```
  0d522ef3
- core: fix naming of fields for python scripts · e33833f7
  Jérome Perrin authored May 17, 2021
```
the convention is to use a my_ or your_ prefix
```
  e33833f7
- SourceCodeEditorZMI: support less (the CSS preprocessor) · affd0574
  Jérome Perrin authored May 17, 2021
```
This is only supported in monaco editor
```
  affd0574
- monaco_editor: support SQL Methods · 6f536bf7
  Jérome Perrin authored May 21, 2021
```
use syntax highlighting for SQL language, not really correct because of
dtml, but probably better than plain text
```
  6f536bf7
- monaco_editor: support Workflow Script · ed6765e8
  Jérome Perrin authored May 16, 2021
  
  ed6765e8
- core: improve support of ERP5 scripts in ZMI · 4c701c70
  Jérome Perrin authored May 16, 2021
```
 - fix order of ZMI tabs to make `Edit` the default
 - enable ZMI code editor
 - add an icon
 - fix links to python scripts in BusinessTemplate_getPythonSourceCodeMessageList
```
  4c701c70
21 May, 2021 4 commits
- erp5_officejs_appstore_base: software product interaction workflow · ff17a9b5
  Roque authored May 18, 2021
  
  ff17a9b5
- erp5_officejs_appstore_base: software product action to change app url · 4738032b
  Roque authored May 07, 2021
  
  4738032b
- pdm: fix a wrongly named and not translated field · 8fa6e5b8
  Jérome Perrin authored May 17, 2021
```
this field did not use a (my_ / your_) prefix
```
  8fa6e5b8
- simulation: fix unused variable · c8226458
  Jérome Perrin authored May 21, 2021
  
  c8226458
20 May, 2021 1 commit
- erp5_l10n_zh: update translation · 852f7ce4
  Lu Xu authored May 20, 2021
```
See merge request nexedi/erp5!1419
```
  852f7ce4