PortalTransforms/tiff_to_text: run tesseract with OMP_THREAD_LIMIT=1

By default, tesseract runs on 4 CPUs. This can be controlled with the
OMP_THREAD_LIMIT environment variable: setting OMP_THREAD_LIMIT=1 makes it
run on only one CPU (as documented on
https://tesseract-ocr.github.io/tessdoc/FAQ.html).

In ERP5, we tend to use one zope node per CPU, so we don't want each of
these zope nodes to spawn a process which will run on 4 CPUs.

In a quick benchmark, disabling threads is not slower, and is even a bit
faster:

    ## a big French image (a picture of an invoice)

    $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image

    ________________________________________________________
    Executed in   14.41 secs    fish           external
       usr time   27.88 secs  1002.00 micros   27.88 secs
       sys time    0.74 secs     0.00 micros    0.74 secs

    $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image

    ________________________________________________________
    Executed in   12.58 secs    fish           external
       usr time   11.84 secs   955.00 micros   11.84 secs
       sys time    0.52 secs   503.00 micros    0.52 secs

    ## a small japanese image

    $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1

    ________________________________________________________
    Executed in    2.16 secs    fish           external
       usr time    3.77 secs   590.00 micros    3.77 secs
       sys time    0.27 secs   209.00 micros    0.27 secs

    $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1

    ________________________________________________________
    Executed in    2.02 secs      fish             external
       usr time 1766.07 millis  1437.00 micros  1764.63 millis
       sys time  214.06 millis   522.00 micros   213.54 millis
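
For illustration, here is a minimal Python sketch of how a caller could apply the same limit when spawning tesseract. It is not the code of this change: the helper names `single_thread_env` and `tiff_to_text`, and the `tesseract_bin` default, are hypothetical. It only assumes that the subprocess inherits its environment from the `env` argument and that tesseract writes the recognized text to standard output when the output base is `stdout`.

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT forced to 1.

    Hypothetical helper: the input mapping is copied, never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def tiff_to_text(tiff_path, languages="eng", tesseract_bin="tesseract"):
    """Run tesseract on one CPU and return the recognized text.

    Hypothetical helper: `tesseract_bin` is assumed to be on PATH or to be
    an absolute path to the tesseract executable.
    """
    result = subprocess.run(
        # "stdout" as the output base makes tesseract print the text
        # instead of writing an output file.
        [tesseract_bin, tiff_path, "stdout", "-l", languages],
        env=single_thread_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Setting the variable only in the child's environment, rather than in `os.environ`, keeps the limit scoped to the tesseract process and leaves the zope node's own environment untouched.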