- 02 Jun, 2021 1 commit
-
-
Jérome Perrin authored
For historical reasons, PDF to text conversion first converted the PDF to png, then the png to tiff, and the tiff was sent to tesseract. This works, but it consumes a lot of resources with large PDFs, especially because the intermediate png/tiff files are created at a resolution of 300 DPI, which easily needs several GB of RAM and temporary disk space. This was observed with the PDFs created by erp5_document_scanner, which are usually high quality (1 or 2 MB per page): even a one-page PDF sometimes took more than one minute to OCR.

Since version 9.53, ghostscript integrates the tesseract engine directly, so we no longer need to prepare a tiff beforehand: we can send the PDF data directly to ghostscript.

This change uses ghostscript if available and otherwise falls back to the same pipeline as before. This allows a transition period until all ERP5 instances are running a recent enough SlapOS with ghostscript 9.54. Fortunately, before SlapOS included ghostscript 9.54, the ERP5 software release did not have ghostscript in $PATH, so we don't have to check the ghostscript version: we assume that if gs is in $PATH, we have a recent enough SlapOS.

This new approach is less tolerant of broken or password-protected PDFs, so we now check that the PDF is valid and not encrypted before trying to use OCR.
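The "use gs if it is in $PATH, otherwise fall back" decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `choose_ocr_pipeline` and `build_gs_ocr_command` are hypothetical names, and the `ocr` device and `-sOCRLanguage` option are assumed to be available in the ghostscript build (they are only present when ghostscript is built with tesseract support).

```python
import shutil


def choose_ocr_pipeline(which=shutil.which):
    """Pick the OCR pipeline (hypothetical helper, not ERP5 API).

    As in the commit message, no version check is done: finding `gs`
    in $PATH is taken to mean that it is recent enough.
    """
    return "ghostscript" if which("gs") else "legacy"


def build_gs_ocr_command(pdf_path, out_path, language="eng"):
    """Build a ghostscript command line that OCRs a PDF to text.

    Assumption: the `ocr` device and `-sOCRLanguage` are supported by
    the installed ghostscript (built with tesseract).
    """
    return [
        "gs",
        "-q",
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=ocr",
        "-sOCRLanguage=" + language,
        "-sOutputFile=" + out_path,
        pdf_path,
    ]
```

The `which` argument only exists so that the decision can be exercised without a real ghostscript installation; the PDF validity and encryption check mentioned above is left out of this sketch.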
-
- 27 May, 2021 2 commits
-
-
Jérome Perrin authored
By default, tesseract runs on 4 CPUs. This can be controlled with OMP_THREAD_LIMIT=1 to run on only one CPU (as documented on https://tesseract-ocr.github.io/tessdoc/FAQ.html).

In ERP5, we tend to use one zope node per CPU, so we don't want each of these zope nodes to spawn a process which will run on 4 CPUs.

In a quick benchmark, disabling threads is not slower, and is even a bit faster:

## a big image in French (a picture of an invoice)

    $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image
    ________________________________________________________
    Executed in   14.41 secs   fish            external
       usr time   27.88 secs   1002.00 micros  27.88 secs
       sys time    0.74 secs      0.00 micros   0.74 secs

    $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image
    ________________________________________________________
    Executed in   12.58 secs   fish            external
       usr time   11.84 secs    955.00 micros  11.84 secs
       sys time    0.52 secs    503.00 micros   0.52 secs

## a small Japanese image

    $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    ________________________________________________________
    Executed in    2.16 secs   fish            external
       usr time    3.77 secs    590.00 micros   3.77 secs
       sys time    0.27 secs    209.00 micros   0.27 secs

    $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    ________________________________________________________
    Executed in    2.02 secs     fish            external
       usr time 1766.07 millis   1437.00 micros  1764.63 millis
       sys time  214.06 millis    522.00 micros   213.54 millis
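When tesseract is launched from Python rather than from a shell, the same limit can be applied by passing a modified environment to the subprocess. The following is a minimal sketch under that assumption; `single_thread_env` and `run_tesseract` are hypothetical helper names, not functions from ERP5.

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT=1.

    The calling process' own environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_tesseract(tiff_path, output_base, tesseract="tesseract"):
    """Run tesseract on one CPU (hypothetical wrapper).

    `tesseract` is the path of the binary; it is an assumption that it
    is reachable from the calling process.
    """
    return subprocess.run(
        [tesseract, tiff_path, output_base],
        env=single_thread_env(),
        check=True,
    )
```

Setting the variable only in the child's environment keeps the limit local to the OCR process and does not affect anything else running in the zope node.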
-
Jérome Perrin authored
This script creates Web Message, not Mail Message
-
- 26 May, 2021 6 commits
-
-
Roque authored
See merge request nexedi/erp5!1411
-
Jérome Perrin authored
Fixes [#20210517-960A47](https://erp5js.nexedi.net/#/bug_module/20210517-960A47)

The most important changes are:
  - coding style is enabled again for workflow scripts and starts to be enabled for ERP5 Python scripts
  - monaco editor support for workflow scripts, SQL methods and .less
  - small fixes for python/workflow scripts forms and ZMI

See merge request !1422
-
Jérome Perrin authored
Changing state directly in Base_contribute was only functional for the case where metadata was discovered asynchronously. In the case of synchronous discovery, the state was changed first, and then Document_convertToBaseFormatAndDiscoverMetadata was executed, but this was causing Unauthorized errors like this:

      Module script, line 10, in Document_convertToBaseFormatAndDiscoverMetadata
       - <PythonScript at /erp5/Document_convertToBaseFormatAndDiscoverMetadata used for /erp5/document_module/163>
       - Line 10
        return context.discoverMetadata(filename=filename,
    Unauthorized: You are not allowed to access 'discoverMetadata' in this context

because once the state has been changed, regular users no longer have permission to access discoverMetadata: that method needs the ModifyPortalContent permission.

Instead of handling publication_state only in Base_contribute, treat it like the other user input parameters and change state during discovery.

Tests were also reorganised to move Base_contribute related tests to testIngestion and to run Base_contribute tests as a non-manager user.
-
Jérome Perrin authored
This was never supported, we support only [state in $workflow_id]

See also:
  https://erp5js.nexedi.net/#/bug_module/1740
  b6dcbc19 (l10n_fr,l10n_jp: Fix translation of "Open", 2021-04-30)

Generated from this script:

    #!/srv/slapgrid/slappart3/srv/runner/software/cc0326f0dcb093f56c01291c300c8481/parts/erp5/venv/bin/python
    import polib
    import sys
    import re

    pofile = polib.pofile(sys.argv[1])
    msgs = dict()
    for entry in pofile:
      msgs[entry.msgid] = entry.msgstr

    transition_re = re.compile(r'(.*) \[transition in .*\]')
    fixed_messages = dict()
    for entry in pofile:
      match = transition_re.match(entry.msgid)
      if match:
        # in erp5_l10n_de some msgstr also have the [transition in ...], we drop them
        if transition_re.match(entry.msgstr):
          continue
        short = match.groups()[0]
        if short.endswith('Action'):
          continue
        if short not in msgs:
          print(f"🤔 {short} not translated ( from {entry.msgid} )")
          fixed_messages[short] = entry.msgstr
      else:
        fixed_messages[entry.msgid] = entry.msgstr

    pofile.clear()
    for k, v in fixed_messages.items():
      pofile.append(polib.POEntry(msgid=k, msgstr=v))
    pofile.save(sys.argv[1])

    import subprocess
    subprocess.check_output(
      [
        '/opt/slapos-shared/gettext/4df93a547efd86e0eb70495b88a5d3b1/bin/msgattrib',
        sys.argv[1],
        "--no-fuzzy",
        "--translated",
        "-s",
        "--no-wrap",
        "-o",
        sys.argv[1]
      ]
    )
-
Jérome Perrin authored
using: msgattrib translation.po --no-fuzzy --translated -s --no-wrap -o translation.po
-
Jérome Perrin authored
translated_title is used in listbox search columns, so it's very confusing for users if they cannot use the usual % character for partial matches. This changes the behaviour of translated_title to autodetect the presence of % and use the LIKE comparison operator in that case.
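The autodetection described above amounts to choosing the comparison operator from the search text. A minimal sketch of that idea follows; `choose_comparison_operator` is a hypothetical name and the operator strings are illustrative, not the identifiers used by the ERP5 catalog.

```python
def choose_comparison_operator(search_text):
    """Return the comparison to use for a translated_title search.

    Hypothetical helper: a % in the text is read as a wildcard, so a
    LIKE comparison is used; otherwise an exact match is used.
    """
    return "like" if "%" in search_text else "="
```

With this rule, searching for `Draft%` is treated as a partial match while `Draft` stays an exact match.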
-
- 24 May, 2021 11 commits
-
-
Jérome Perrin authored
these ERP5 Python Scripts were not covered by coding style tests
-
Jérome Perrin authored
This ERP5 Python script was not checked until now, and had wrong indentation
-
Jérome Perrin authored
The ID is calculated from reference, showing reference is enough. There is no "callable_type" property on workflow script
-
Jérome Perrin authored
Don't hardcode a few roles, use the API to get all roles
-
Jérome Perrin authored
-
Jérome Perrin authored
Only zope Python Scripts were checked
-
Jérome Perrin authored
the convention is to use a my_ or your_ prefix
-
Jérome Perrin authored
This is only supported in monaco editor
-
Jérome Perrin authored
use syntax highlighting for SQL language, not really correct because of dtml, but probably better than plain text
-
Jérome Perrin authored
-
Jérome Perrin authored
- fix order of ZMI tabs to make `Edit` the default - enable ZMI code editor - add an icon - fix links to python scripts in BusinessTemplate_getPythonSourceCodeMessageList
-
- 21 May, 2021 4 commits
-
-
Roque authored
-
Roque authored
-
Jérome Perrin authored
this field did not use a (my_ / your_) prefix
-
Jérome Perrin authored
-
- 20 May, 2021 1 commit
-
-
Lu Xu authored
See merge request nexedi/erp5!1419
-
- 19 May, 2021 2 commits
-
-
Lu Xu authored
-
Gabriel Monnerat authored
-
- 18 May, 2021 3 commits
-
-
Gabriel Monnerat authored
-
Georgios Dagkakis authored
With the previous implementation, categories_list=[...] could be included and it would delete existing categories not included in it. So use getPropertyAndCategoryList instead of propertyIds, which should return a cleaner list for input_parameter_dict here
-
Georgios Dagkakis authored
-
- 17 May, 2021 3 commits
-
-
Romain Courteaud authored
-
Romain Courteaud authored
-
Jérome Perrin authored
-
- 12 May, 2021 4 commits
-
-
Jérome Perrin authored
2 letters code as reference, 3 letters codes as codifications
-
Jérome Perrin authored
When viewing the worklist page, a request is made to traverse portal_workflow, which translates global worklist actions including the document count.

The non-regression test was failing with:

    [u'Draft To Validate (1)'] != ['Draft To Validate']

and in real usage, several messages were added to Localizer.
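To illustrate why translating a label that already contains the count pollutes the message catalog, here is a toy sketch. It is not the code from this commit: `translate` is a hypothetical stand-in for a Localizer-like catalog that records every unknown message, and `worklist_label` shows one possible way to keep the count out of the translated message.

```python
def translate(catalog, message):
    """Toy catalog lookup (hypothetical stand-in for Localizer).

    Unknown messages are recorded in the catalog, untranslated.
    """
    return catalog.setdefault(message, message)


def worklist_label(catalog, title, count):
    """Translate only the title, then append the document count."""
    return "%s (%d)" % (translate(catalog, title), count)
```

Calling `translate(catalog, "Draft To Validate (1)")` would add one catalog entry per distinct count, whereas `worklist_label` only ever records `Draft To Validate`.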
-
Gabriel Monnerat authored
-
Jérome Perrin authored
We should not show a link to jump to workflow configuration to end users, these are not relevant to them and can only cause confusion.
-
- 11 May, 2021 3 commits
-
-
Romain Courteaud authored
Some ERP5 view actions use a TALES expression to be visible only in draft state. When validating a workflow transition dialog, ERP5JS tries to go back to the previous view, which is now hidden. In such a case, switch back to the default document view.
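The fallback rule can be summarised as a small decision function. ERP5JS itself is JavaScript, so this Python sketch only illustrates the logic; `view_after_transition` and the `"view"` default are hypothetical names chosen for the example.

```python
def view_after_transition(previous_view, visible_views, default_view="view"):
    """Return the view to display once a transition has been validated.

    Hypothetical helper: go back to the previous view if it is still
    visible in the new state, otherwise use the default document view.
    """
    return previous_view if previous_view in visible_views else default_view
```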
-
Romain Courteaud authored
-
Boxiang Sun authored
Do not return wrong content type
-