- 04 Jun, 2021 1 commit
-
-
Jérome Perrin authored
When running OCR, we sometimes have issues because processing is "too heavy":

- [x] use 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert pdf -> png -> tiff before sending to tesseract. Modern Ghostscript supports running tesseract directly, so we use it if it's available.
- [x] use 300% of CPU. Fixed by setting `OMP_THREAD_LIMIT` when running tesseract. This only applies when running OCR on images; OCR embedded in Ghostscript does not seem to need this.
- [x] ... and often crash, so it is restarted. This is fixed by updating tesseract.

Updates of ghostscript and tesseract are part of nexedi/slapos!985

See merge request nexedi/erp5!1420
-
- 03 Jun, 2021 2 commits
-
-
Jérome Perrin authored
By relying on PIL after our monkey-patched OFS.Image.getImageInfo. We keep this monkey-patch for now, because it adds support for SVG.

See merge request nexedi/erp5!1426
-
Jérome Perrin authored
Since 7f32f8cd (erp5_dms: Add PDF Reader using the pdf.js, 2016-06-24) we have a PDF preview with a javascript PDF viewer, which is a much better way of viewing PDFs. That commit made the Thumbnail preview obsolete. It also does not really work on ERP5JS, so remove the thumbnail preview.
-
- 02 Jun, 2021 2 commits
-
-
Julien Muchembled authored
See commit ec3c9cbc.
-
Jérome Perrin authored
For historical reasons, PDF to text first converted the PDF to png, then this png to tiff, and the tiff was sent to tesseract. This works, but it consumes a lot of resources with large PDFs, especially because the intermediate png/tiff are created with a resolution of 300 DPI, which easily needs several GB of RAM and temporary disk space. This was observed with the PDFs created by erp5_document_scanner, which are usually high quality (1 or 2 MB per page), and even a one-page PDF sometimes took more than one minute to OCR.

Since 9.53 ghostscript integrates the tesseract engine directly, so we don't need to prepare a tiff beforehand, we can directly send the PDF data to ghostscript.

This change uses ghostscript if available and otherwise falls back to the same pipeline as before. This will allow the transition until all ERP5 instances are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before SlapOS included ghostscript 9.54, the ERP5 software release did not have ghostscript in $PATH, so we don't have to check the ghostscript version: we assume that if gs is in $PATH, we have a recent enough SlapOS.

This new approach was less tolerant regarding broken/password-protected PDFs, so we perform a new check that the PDF is valid and not encrypted before trying to use OCR.
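The selection rule described above (use ghostscript when `gs` is in $PATH, otherwise keep the previous pipeline) can be sketched as follows. This is a minimal illustration, not the ERP5 implementation: the function name `choose_ocr_pipeline` and the two return labels are hypothetical, and only the "is `gs` in $PATH" test comes from the commit message.

```python
import shutil


def choose_ocr_pipeline(which=shutil.which):
    """Return "ghostscript" when a gs binary is found in $PATH, else "legacy".

    `which` is injectable so the rule can be exercised without touching
    the real $PATH (hypothetical helper, for illustration only).
    """
    # As in the commit message, no version check is attempted: finding gs
    # in $PATH is taken to mean the SlapOS is recent enough.
    if which("gs") is not None:
        return "ghostscript"
    # Previous pipeline: pdf -> png -> tiff -> tesseract.
    return "legacy"


if __name__ == "__main__":
    print(choose_ocr_pipeline())
```

Keeping the lookup injectable makes both branches easy to cover in a unit test, whatever is installed on the test node.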
-
- 01 Jun, 2021 1 commit
-
-
Rafael Monnerat authored
See merge request !1429
-
- 31 May, 2021 3 commits
-
-
Kirill Smelkov authored
Wendelin.core is now an integral part of ERP5 (see [1,2]), but nothing inside ERP5 currently uses it. And even though wendelin.core has its own testsuite, integration problems are always possible.

-> Add test to erp5_core_test that minimally makes sure that basic wendelin.core operations work.

This test currently passes with wendelin.core 1, which is the default. It also passes as live test with wendelin.core 2. However with wendelin.core 2 it currently fails on testnodes like e.g.

    ValueError: ZODB.MappingStorage.MappingStorage is in-RAM storage
    in-RAM storages are not supported: a zurl pointing to in-RAM storage in one process would lead to another in-RAM storage in WCFS process.

and

    RuntimeError: wcfs: join file:///srv/slapgrid/slappart8/srv/testnode/djk/test_suite/unit_test.2/var/Data.fs: server not started

(https://nexedijs.erp5.net/#/test_result_module/20210530-92EF3124/102)

because we need to amend ERP5 test driver

1) to run tests on a real storage instead of in-RAM Mapping Storage(*), and
2) to spawn WCFS server for each such storage.

I will try to address those points in a later patch.

In the meantime there should be no reason not to merge this, because we do not use wendelin.core 2 yet, and solving "1" and "2" first are preconditions to begin such a usage.

/cc @rafael, @tomo, @seb, @jerome, @romain, @vpelletier, @Tyagov, @klaus, @jp

(*) Combining Zope and WCFS working together requires data to be on a real storage, not on in-RAM MappingStorage inside Zope's Python process.

[1] slapos@7f877621
[2] slapos!874 (comment 122339)
-
Jérome Perrin authored
We don't want to show the ID which has a prefix, but the reference
-
Jérome Perrin authored
See merge request !1427
-
- 27 May, 2021 9 commits
-
-
Jérome Perrin authored
By default, tesseract runs on 4 CPUs and this can be controlled by OMP_THREAD_LIMIT=1 to run on only one CPU (as documented on https://tesseract-ocr.github.io/tessdoc/FAQ.html).

In ERP5, we tend to use one zope node per CPU, so we don't want each of these zope nodes to spawn a process which will run on 4 CPUs.

In a quick benchmark, disabling threads is not slower, and is even a bit faster:

    ## a big image in france (a picture of an invoice)

    $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image
    ________________________________________________________
    Executed in   14.41 secs    fish           external
       usr time   27.88 secs  1002.00 micros   27.88 secs
       sys time    0.74 secs     0.00 micros    0.74 secs

    $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image
    ________________________________________________________
    Executed in   12.58 secs    fish           external
       usr time   11.84 secs   955.00 micros   11.84 secs
       sys time    0.52 secs   503.00 micros    0.52 secs

    ## a small japanese image

    $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    ________________________________________________________
    Executed in    2.16 secs    fish           external
       usr time    3.77 secs   590.00 micros    3.77 secs
       sys time    0.27 secs   209.00 micros    0.27 secs

    $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    ________________________________________________________
    Executed in    2.02 secs      fish            external
       usr time  1766.07 millis  1437.00 micros  1764.63 millis
       sys time   214.06 millis   522.00 micros   213.54 millis
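From Python, the same limit can be applied by passing a modified environment to the child process. The sketch below only builds the environment and the argument list; the helper names `single_thread_env` and `tesseract_command` are hypothetical and are not taken from ERP5. The command layout mirrors the invocations shown in the benchmark.

```python
import os


def single_thread_env(base_env=None):
    """Copy an environment mapping and cap tesseract's OpenMP threads at one."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def tesseract_command(input_path, output_path, languages=None, binary="tesseract"):
    """Build the argument list, e.g. tesseract -l jpn+eng input output."""
    command = [binary]
    if languages:
        # Several languages are joined with "+", as in "-l jpn+eng".
        command += ["-l", "+".join(languages)]
    command += [input_path, output_path]
    return command


if __name__ == "__main__":
    print(tesseract_command("/tmp/inputjp.tiff", "/tmp/out.txt", ["jpn", "eng"]))
    print(single_thread_env({"PATH": "/usr/bin"}))
```

A caller would then run something like `subprocess.run(tesseract_command(...), env=single_thread_env(), check=True)`, so that the limit affects only the spawned tesseract process and not the zope node itself.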
-
Jérome Perrin authored
-
Jérome Perrin authored
Because DMS extends image portal types with interaction workflows etc., it's better to also cover the case where DMS is installed.
-
Jérome Perrin authored
-
Jérome Perrin authored
-
Jérome Perrin authored
This fixes the problem that some formats, such as tiff, were not supported.
-
Jérome Perrin authored
testSQLCachedWorklist is now part of a dedicated erp5_worklist_sql_test business template.
-
Jérome Perrin authored
We had two mimetype entries, which caused inconsistencies depending on whether the lookup was done by mimetype, by glob or by extension.

We had:

- name: "Windows BMP image"
- mimetypes: image/bmp image/x-bmp image/x-MS-bmp
- extensions:
- globs: *.bmp

and

- name: "image/x-ms-bmp"
- mimetypes: image/x-ms-bmp
- extensions: bmp
- globs:

With this commit they are merged into one:

- name: "Windows BMP image"
- mimetypes: image/x-ms-bmp image/bmp image/x-bmp image/x-MS-bmp
- extensions: bmp
- globs: *.bmp

This way we only have one consistent mimetype.

For compatibility with extension lookups (that are done in Document_guessMimeType interaction workflow from DMS), image/x-ms-bmp is kept as default. This might not be the best choice, according to https://www.iana.org/assignments/media-types/media-types.xhtml
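To illustrate why a single merged entry removes the inconsistency, here is a small self-contained sketch in which all three lookup styles resolve to the same record. The registry structure and function names are hypothetical, not the mimetypes registry API, and treating the first listed mimetype as the default is an assumption made for this example.

```python
from fnmatch import fnmatch

# The merged entry from the commit message, as a plain dict (illustrative).
BMP_ENTRY = {
    "name": "Windows BMP image",
    "mimetypes": ("image/x-ms-bmp", "image/bmp", "image/x-bmp", "image/x-MS-bmp"),
    "extensions": ("bmp",),
    "globs": ("*.bmp",),
}
REGISTRY = [BMP_ENTRY]


def lookup_by_mimetype(mimetype):
    return next((e for e in REGISTRY if mimetype in e["mimetypes"]), None)


def lookup_by_extension(filename):
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return next((e for e in REGISTRY if extension in e["extensions"]), None)


def lookup_by_glob(filename):
    name = filename.lower()
    return next(
        (e for e in REGISTRY if any(fnmatch(name, g) for g in e["globs"])), None
    )


def default_mimetype(entry):
    # Assumption for this sketch: the first listed mimetype is the default.
    return entry["mimetypes"][0]


if __name__ == "__main__":
    for entry in (
        lookup_by_mimetype("image/bmp"),
        lookup_by_extension("scan.bmp"),
        lookup_by_glob("scan.bmp"),
    ):
        print(entry["name"], default_mimetype(entry))
```

With two separate entries, the mimetype and glob lookups would have returned one record and the extension lookup the other, which is the inconsistency the merge removes.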
-
Jérome Perrin authored
This script creates Web Message, not Mail Message
-
- 26 May, 2021 6 commits
-
-
Jérome Perrin authored
Fixes [#20210517-960A47](https://erp5js.nexedi.net/#/bug_module/20210517-960A47)

The most important changes are:

- coding style is enabled again for workflow scripts and starts to be enabled for ERP5 Python scripts
- monaco editor support for workflow scripts, SQL methods and .less
- small fixes for python/workflow scripts forms and ZMI

See merge request !1422
-
Jérome Perrin authored
Changing state directly in Base_contribute was only functional for the case where metadata was discovered asynchronously. In the case of synchronous discovery, the state was changed first, and then Document_convertToBaseFormatAndDiscoverMetadata was executed, but this was causing Unauthorized like this:

    Module script, line 10, in Document_convertToBaseFormatAndDiscoverMetadata
     - <PythonScript at /erp5/Document_convertToBaseFormatAndDiscoverMetadata used for /erp5/document_module/163>
     - Line 10
       return context.discoverMetadata(filename=filename,
    Unauthorized: You are not allowed to access 'discoverMetadata' in this context

because once the state has already changed, regular users no longer have permission to access discoverMetadata, since that method needs the ModifyPortalContent permission.

Instead of handling publication_state only in Base_contribute, treat it like other user input parameters and change state during discovery.

Tests were also re-organised to move Base_contribute related tests into testIngestion and also to run Base_contribute tests as a non-manager user.
-
Jérome Perrin authored
This was never supported, we support only [state in $workflow_id]

See also:

- https://erp5js.nexedi.net/#/bug_module/1740
- b6dcbc19 (l10n_fr,l10n_jp: Fix translation of "Open", 2021-04-30)

Generated from this script:

    #!/srv/slapgrid/slappart3/srv/runner/software/cc0326f0dcb093f56c01291c300c8481/parts/erp5/venv/bin/python
    import polib
    import sys
    import re

    pofile = polib.pofile(sys.argv[1])
    msgs = dict()
    for entry in pofile:
      msgs[entry.msgid] = entry.msgstr

    transition_re = re.compile(r'(.*) \[transition in .*\]')

    fixed_messages = dict()
    for entry in pofile:
      match = transition_re.match(entry.msgid)
      if match:
        # in erp5_l10n_de some msgstr also have the [transition in ...], we drop them
        if transition_re.match(entry.msgstr):
          continue
        short = match.groups()[0]
        if short.endswith('Action'):
          continue
        if short not in msgs:
          print(f"🤔 {short} not translated ( from {entry.msgid} )")
          fixed_messages[short] = entry.msgstr
      else:
        fixed_messages[entry.msgid] = entry.msgstr

    pofile.clear()
    for k, v in fixed_messages.items():
      pofile.append(polib.POEntry(msgid=k, msgstr=v))
    pofile.save(sys.argv[1])

    import subprocess
    subprocess.check_output(
      [
        '/opt/slapos-shared/gettext/4df93a547efd86e0eb70495b88a5d3b1/bin/msgattrib',
        sys.argv[1],
        "--no-fuzzy",
        "--translated",
        "-s",
        "--no-wrap",
        "-o",
        sys.argv[1]
      ]
    )
-
Jérome Perrin authored
using: msgattrib translation.po --no-fuzzy --translated -s --no-wrap -o translation.po
-
Jérome Perrin authored
translated_title is used in listbox search columns, so it's very confusing for users if they cannot use the usual % character for partial matches. This changes the behaviour of translated_title to autodetect the presence of % and use the LIKE comparison operator in that case.
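The autodetection rule can be pictured with a tiny sketch: pick LIKE when the search text contains `%`, and plain equality otherwise. The function name and the returned operator labels are hypothetical and do not reflect the actual ERP5 catalog code; using equality as the fallback is also an assumption of this example.

```python
def translated_title_operator(search_text):
    """Pick the comparison operator for a translated_title search (sketch)."""
    # A "%" in the user input is read as a request for a partial match.
    if "%" in search_text:
        return "like"
    # Assumed fallback for this illustration: exact comparison.
    return "="


if __name__ == "__main__":
    for text in ("Sale%", "Sale Order"):
        print(repr(text), "->", translated_title_operator(text))
```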
- 24 May, 2021 11 commits
-
-
Jérome Perrin authored
These ERP5 Python Scripts were not covered by coding style tests.
-
Jérome Perrin authored
This ERP5 Python script was not checked until now, and had wrong indentation.
-
Jérome Perrin authored
The ID is calculated from the reference, so showing the reference is enough. There is no "callable_type" property on workflow scripts.
-
Jérome Perrin authored
Don't hardcode a few roles, use the API to get all roles
-
Jérome Perrin authored
-
Jérome Perrin authored
Only zope Python Scripts were checked
-
Jérome Perrin authored
the convention is to use a my_ or your_ prefix
-
Jérome Perrin authored
This is only supported in monaco editor
-
Jérome Perrin authored
Use syntax highlighting for the SQL language. This is not really correct because of DTML, but probably better than plain text.
-
Jérome Perrin authored
-
Jérome Perrin authored
- fix order of ZMI tabs to make `Edit` the default
- enable ZMI code editor
- add an icon
- fix links to python scripts in BusinessTemplate_getPythonSourceCodeMessageList
-
- 21 May, 2021 4 commits
-
-
Roque authored
-
Roque authored
-
Jérome Perrin authored
this field did not use a (my_ / your_) prefix
-
Jérome Perrin authored
-
- 20 May, 2021 1 commit
-