    PortalTransforms/tiff_to_text: run tesseract with OMP_THREAD_LIMIT=1 · d74981c3
    Jérome Perrin authored
    By default, tesseract runs on 4 CPUs. This can be controlled with the
    OMP_THREAD_LIMIT environment variable: OMP_THREAD_LIMIT=1 makes it run
    on only one CPU (as documented on
    https://tesseract-ocr.github.io/tessdoc/FAQ.html).
    
    In ERP5, we tend to use one Zope node per CPU, so we don't want each
    of these Zope nodes to spawn a process which will run on 4 CPUs.
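
A minimal sketch of how a transform could pass this setting to the tesseract subprocess. This is an illustration only, not the actual change made in tiff_to_text.py: the helper names (`build_tesseract_env`, `build_command`, `run_tesseract`) and the default `tesseract_bin` value are hypothetical, and the argument order simply mirrors the command lines used in the benchmark.

```python
import os
import subprocess


def build_tesseract_env(base_env=None, thread_limit=1):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    Hypothetical helper: the caller's environment is copied, not mutated.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def build_command(tiff_path, output_base, tesseract_bin="tesseract", lang=None):
    """Build the tesseract argument list (hypothetical helper)."""
    cmd = [tesseract_bin]
    if lang:
        # e.g. lang="jpn+eng", as in the benchmark command line
        cmd += ["-l", lang]
    cmd += [tiff_path, output_base]
    return cmd


def run_tesseract(tiff_path, output_base, tesseract_bin="tesseract", lang=None):
    """Run tesseract restricted to a single OpenMP thread."""
    return subprocess.run(
        build_command(tiff_path, output_base, tesseract_bin, lang),
        env=build_tesseract_env(),
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
```

Setting the variable only in the child's environment keeps the limit local to the tesseract process instead of changing the environment of the whole Zope node.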
    
    In a quick benchmark, disabling threads is not slower; it is even a bit faster:
    
        ## a big image in French (a picture of an invoice)
        $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
        Tesseract Open Source OCR Engine v4.1.1 with Leptonica
        Page 1
        Error in pixClipBoxToForeground: box not within image
        Error in pixClipBoxToForeground: box not within image
    
        ________________________________________________________
        Executed in   14.41 secs   fish           external
          usr time   27.88 secs  1002.00 micros   27.88 secs
          sys time    0.74 secs    0.00 micros    0.74 secs
    
        $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
        Tesseract Open Source OCR Engine v4.1.1 with Leptonica
        Page 1
        Error in pixClipBoxToForeground: box not within image
        Error in pixClipBoxToForeground: box not within image
    
        ________________________________________________________
        Executed in   12.58 secs   fish           external
          usr time   11.84 secs  955.00 micros   11.84 secs
          sys time    0.52 secs  503.00 micros    0.52 secs
    
        ## a small Japanese image
    
        $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
        Tesseract Open Source OCR Engine v4.1.1 with Leptonica
        Page 1
    
        ________________________________________________________
        Executed in    2.16 secs   fish           external
          usr time    3.77 secs  590.00 micros    3.77 secs
          sys time    0.27 secs  209.00 micros    0.27 secs
    
        $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
        Tesseract Open Source OCR Engine v4.1.1 with Leptonica
        Page 1
    
        ________________________________________________________
        Executed in    2.02 secs   fish           external
          usr time  1766.07 millis  1437.00 micros  1764.63 millis
          sys time  214.06 millis  522.00 micros  213.54 millis
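
To make the comparison easier to read, the timings above can be reduced to ratios. The sketch below only restates the wall-clock ("Executed in") and user CPU ("usr time") figures from the `time` output; the labels `invoice` and `japanese` and the helper `ratios` are names chosen for this illustration.

```python
# (wall-clock seconds, user CPU seconds), copied from the `time` output above
runs = {
    "invoice": {"default": (14.41, 27.88), "one_thread": (12.58, 11.84)},
    "japanese": {"default": (2.16, 3.77), "one_thread": (2.02, 1.76607)},
}


def ratios(default, limited):
    """Return (wall ratio, user CPU ratio) of the default run over the limited run."""
    wall = default[0] / limited[0]
    cpu = default[1] / limited[1]
    return wall, cpu


for name, run in runs.items():
    wall, cpu = ratios(run["default"], run["one_thread"])
    print(f"{name}: wall {wall:.2f}x, user CPU {cpu:.2f}x")
```

On these two samples, the single-threaded runs finish slightly sooner while using less than half the user CPU time, which is the relevant figure when each Zope node is expected to stay on one CPU.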
tiff_to_text.py 1.89 KB