Python Tesseract

.. image:: https://img.shields.io/pypi/pyversions/pytesseract.svg
   :target: https://pypi.python.org/pypi/pytesseract
   :alt: Supported Python versions

.. image:: https://img.shields.io/github/release/madmaze/pytesseract.svg
   :target: https://github.com/madmaze/pytesseract/releases
   :alt: GitHub release

.. image:: https://img.shields.io/pypi/v/pytesseract.svg?color=blue
   :target: https://pypi.python.org/pypi/pytesseract
   :alt: PyPI release

.. image:: https://img.shields.io/conda/vn/conda-forge/pytesseract.svg?color=blue
   :target: https://anaconda.org/conda-forge/pytesseract
   :alt: Conda release

.. image:: https://results.pre-commit.ci/badge/github/madmaze/pytesseract/master.svg
   :target: https://results.pre-commit.ci/latest/github/madmaze/pytesseract/master
   :alt: Pre-commit CI status

.. image:: https://github.com/madmaze/pytesseract/workflows/CI/badge.svg?branch=master
   :target: https://github.com/madmaze/pytesseract/actions?query=workflow%3ACI
   :alt: CI workflow status

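Python-tesseract is an optical character recognition (OCR) tool for Python. It recognizes and "reads" the text embedded in images.

Python-tesseract is a wrapper for `Google's Tesseract-OCR Engine <https://github.com/tesseract-ocr/tesseract>`_. It is also useful as a stand-alone invocation script for tesseract, because it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, when used as a script, Python-tesseract prints the recognized text instead of writing it to a file.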

Usage

Quickstart

Note: Test images are located in the ``tests/data`` folder of the Git repo.

Library usage:

.. code-block:: python

    from PIL import Image

    import pytesseract

    # If you don't have the tesseract executable in your PATH, include the following:
    pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
    # Example: tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

    # Simple image to string
    print(pytesseract.image_to_string(Image.open('test.png')))

    # To bypass pytesseract's image conversions, pass a relative or absolute image path directly
    # NOTE: In this case you should provide an image that Tesseract supports, or Tesseract will return an error
    print(pytesseract.image_to_string('test.png'))

    # List the available languages
    print(pytesseract.get_languages(config=''))

    # Convert an image of French text to a string
    print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

    # Batch processing with a single file containing the list of multiple image file paths
    print(pytesseract.image_to_string('images.txt'))

    # Time out/terminate the tesseract job after a period of time
    try:
        print(pytesseract.image_to_string('test.jpg', timeout=2))  # Time out after 2 seconds
        print(pytesseract.image_to_string('test.jpg', timeout=0.5))  # Time out after half a second
    except RuntimeError as timeout_error:
        # Tesseract processing was terminated
        pass

    # Get bounding box estimates
    print(pytesseract.image_to_boxes(Image.open('test.png')))

    # Get verbose data including boxes, confidences, line and page numbers
    print(pytesseract.image_to_data(Image.open('test.png')))

    # Get information about orientation and script detection
    print(pytesseract.image_to_osd(Image.open('test.png')))

    # Get a searchable PDF
    pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
    with open('test.pdf', 'w+b') as f:
        f.write(pdf)  # pdf is of type bytes by default

    # Get HOCR output
    hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

    # Get ALTO XML output
    xml = pytesseract.image_to_alto_xml('test.png')

    # Get multiple types of output with a single call to save compute time
    # Currently supports mixing and matching the following: txt, pdf, hocr, box, tsv
    text, boxes = pytesseract.run_and_get_multiple_output('test.png', extensions=['txt', 'box'])
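
The string returned by ``image_to_boxes`` in the quickstart above follows Tesseract's box-file layout: one line per character, holding the symbol, then the left, bottom, right, and top coordinates and the page number, with the origin at the bottom-left corner of the image. The sketch below shows one way to turn such text into records. ``CharBox`` and ``parse_boxes`` are hypothetical names introduced here for illustration, and the sample string stands in for real OCR output.

.. code-block:: python

    from typing import List, NamedTuple


    class CharBox(NamedTuple):
        """One recognized character and its bounding box (hypothetical type)."""
        char: str
        left: int
        bottom: int
        right: int
        top: int
        page: int


    def parse_boxes(box_text: str) -> List[CharBox]:
        """Parse box-format text into CharBox records, skipping malformed lines."""
        boxes = []
        for line in box_text.splitlines():
            # Split from the right so the symbol stays intact in the first field.
            parts = line.rsplit(' ', 5)
            if len(parts) != 6:
                continue
            char, *numbers = parts
            try:
                boxes.append(CharBox(char, *(int(n) for n in numbers)))
            except ValueError:
                continue  # a coordinate field was not an integer
        return boxes


    # Sample text in box-file layout; in real use this would come from
    # pytesseract.image_to_boxes(Image.open('test.png')).
    sample = "T 10 20 30 60 0\nh 32 20 48 60 0\n"
    boxes = parse_boxes(sample)
    widths = [box.right - box.left for box in boxes]

Because the origin is at the bottom-left, ``top`` is numerically larger than ``bottom``; flip the y axis against the image height before drawing the boxes with a library that uses a top-left origin.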

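Support for OpenCV image/NumPy array objects

.. code-block:: python

    import cv2
    from PIL import Image

    import pytesseract

    img_cv = cv2.imread(r'/<path_to_image>/digits.png')

    # By default OpenCV stores images in BGR format, while pytesseract assumes RGB format,
    # so we need to convert from BGR to RGB format/mode:
    img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
    print(pytesseract.image_to_string(img_rgb))
    # OR
    # PIL expects the size as (width, height), while a NumPy shape is
    # (height, width, channels), so the first two shape entries are reversed here.
    img_rgb = Image.frombytes('RGB', img_cv.shape[1::-1], img_cv, 'raw', 'BGR', 0, 0)
    print(pytesseract.image_to_string(img_rgb))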

If you need custom configuration such as oem/psm, use the ``config`` keyword.

.. code-block:: python

    # Example of adding any additional options
    custom_oem_psm_config = r'--oem 3 --psm 6'
    pytesseract.image_to_string(image, config=custom_oem_psm_config)

    # Example of using a pre-defined tesseract config file with options
    cfg_filename = 'words'
    pytesseract.run_and_get_output(image, extension='txt', config=cfg_filename)

If you get a tessdata error such as "Error opening data file...", add the following config:

.. code-block:: python

    # Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
    # It's important to add double quotes around the dir path.
    tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
    pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)
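
The config strings above are plain command-line fragments, so they can be assembled programmatically. The following is a minimal sketch, assuming only the ``--oem``, ``--psm``, and ``--tessdata-dir`` flags already shown in this section. ``build_config`` is a hypothetical helper, not part of pytesseract.

.. code-block:: python

    from typing import Optional


    def build_config(psm: Optional[int] = None,
                     oem: Optional[int] = None,
                     tessdata_dir: Optional[str] = None) -> str:
        """Assemble a config string for the ``config`` keyword (hypothetical helper)."""
        parts = []
        if oem is not None:
            parts.append(f'--oem {oem}')
        if psm is not None:
            parts.append(f'--psm {psm}')
        if tessdata_dir is not None:
            # Double quotes keep a path that contains spaces together as one argument.
            parts.append(f'--tessdata-dir "{tessdata_dir}"')
        return ' '.join(parts)


    oem_psm = build_config(psm=6, oem=3)
    with_dir = build_config(
        psm=6, tessdata_dir=r'C:\Program Files (x86)\Tesseract-OCR\tessdata')

The resulting string would then be passed as ``config=...`` to a call such as ``pytesseract.image_to_string``.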

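Functions

* **get_languages** Returns all languages currently supported by Tesseract OCR.
* **get_tesseract_version** Returns the Tesseract version installed on the system.
* **image_to_string** Returns the unmodified output of Tesseract OCR processing as a string.
* **image_to_boxes** Returns a result containing recognized characters and their box boundaries.
* **image_to_data** Returns a result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the `Tesseract TSV documentation <https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html>`_.
* **image_to_osd** Returns a result containing information about orientation and script detection.
* **image_to_alto_xml** Returns a result in the form of Tesseract's ALTO XML format.
* **run_and_get_output** Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.
* **run_and_get_multiple_output** Works like ``run_and_get_output`` but can handle multiple extensions. It replaces the ``extension: str`` keyword argument with an ``extensions: List[str]`` keyword argument, which takes a list of extensions and returns the corresponding data after a single tesseract call. This reduces the number of tesseract calls when multiple output formats, such as text and bounding boxes, are needed.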
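
By default ``image_to_data`` returns its result as tab-separated text whose header row names the Tesseract TSV columns (``level``, ``page_num``, ``block_num``, ``par_num``, ``line_num``, ``word_num``, ``left``, ``top``, ``width``, ``height``, ``conf``, ``text``). The sketch below reads such text with the standard ``csv`` module and keeps only words above a confidence threshold. ``confident_words`` and the threshold of 60 are illustrative assumptions, and the sample rows are made up rather than real OCR output.

.. code-block:: python

    import csv
    import io
    from typing import List

    TSV_COLUMNS = ['level', 'page_num', 'block_num', 'par_num', 'line_num',
                   'word_num', 'left', 'top', 'width', 'height', 'conf', 'text']


    def confident_words(tsv_text: str, min_conf: float = 60.0) -> List[str]:
        """Return words whose confidence is at least min_conf (hypothetical helper)."""
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t',
                                quoting=csv.QUOTE_NONE)
        words = []
        for row in reader:
            text = (row.get('text') or '').strip()
            # Rows for pages, blocks, and lines carry conf -1 and no text.
            if text and float(row['conf']) >= min_conf:
                words.append(text)
        return words


    # Made-up sample in the TSV layout; in real use this would come from
    # pytesseract.image_to_data(Image.open('test.png')).
    rows = [
        TSV_COLUMNS,
        ['4', '1', '1', '1', '1', '0', '36', '92', '500', '24', '-1', ''],
        ['5', '1', '1', '1', '1', '1', '36', '92', '60', '24', '96.06', 'This'],
        ['5', '1', '1', '1', '1', '2', '102', '92', '20', '24', '41.50', 'is'],
    ]
    sample_tsv = '\n'.join('\t'.join(row) for row in rows) + '\n'
    good_words = confident_words(sample_tsv)

``csv.QUOTE_NONE`` is used so that a recognized quotation mark is treated as ordinary text and not as a CSV quote character.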

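Parameters

``image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)``

* **image** Object or String - either a PIL Image object, a NumPy array, or the file path of the image to be processed by Tesseract. If you pass an object instead of a file path, pytesseract implicitly converts the image to `RGB mode <https://pillow.readthedocs.io/en/stable/handbook/concepts.html#modes>`_.
* **lang** String - Tesseract language code string. Defaults to ``eng`` if not specified. Example for multiple languages: ``lang='eng+fra'``.
* **config** String - any additional custom configuration flags that are not available through the pytesseract function. For example: ``config='--psm 6'``.
* **nice** Integer - modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of Unix-like processes.
* **output_type** Class attribute - specifies the type of the output, defaults to ``string``. For the full list of supported types, please check the definition of the `pytesseract.Output <https://github.com/madmaze/pytesseract/blob/master/pytesseract/pytesseract.py>`_ class.
* **timeout** Integer or Float - duration in seconds for the OCR processing, after which pytesseract terminates it and raises RuntimeError.
* **pandas_config** Dict - only for the ``Output.DATAFRAME`` type. Dictionary with custom arguments for `pandas.read_csv <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv>`_. Allows you to customize the output of ``image_to_data``.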
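
Since a timeout surfaces as a ``RuntimeError``, callers who want to continue after a slow image can wrap the OCR call. Here is a minimal sketch of that pattern. ``ocr_or_none`` is a hypothetical helper, and the two ``fake_*`` functions are stand-ins that mimic an OCR function accepting a ``timeout`` keyword, so the example runs without Tesseract installed.

.. code-block:: python

    from typing import Callable, Optional


    def ocr_or_none(ocr_func: Callable[..., str], image,
                    timeout: float) -> Optional[str]:
        """Run ocr_func and return None if it raises RuntimeError (hypothetical helper)."""
        try:
            return ocr_func(image, timeout=timeout)
        except RuntimeError:
            # Broad on purpose: any RuntimeError from the call is treated as "no result".
            return None


    def fake_slow_ocr(image, timeout=0):
        """Stand-in for an OCR call that exceeds its timeout."""
        raise RuntimeError('Tesseract process timeout')


    def fake_fast_ocr(image, timeout=0):
        """Stand-in for an OCR call that finishes in time."""
        return 'hello'


    slow_result = ocr_or_none(fake_slow_ocr, 'test.jpg', timeout=0.5)
    fast_result = ocr_or_none(fake_fast_ocr, 'test.jpg', timeout=2)

In real use the first argument would be a function such as ``pytesseract.image_to_string``. Note that catching ``RuntimeError`` this broadly can also hide failures unrelated to the timeout, so logging the exception before returning may be worthwhile.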

Command-line interface (CLI) usage:

.. code-block:: bash

    pytesseract [-l lang] image_file

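Installation

Prerequisites:

* Python-tesseract requires Python 3.6+
* You will need the Python Imaging Library (PIL) (or its fork `Pillow <https://pypi.org/project/Pillow/>`_).
  Please check the `Pillow documentation <https://pillow.readthedocs.io/en/stable/installation.html#basic-installation>`_ for the basic Pillow installation.
* Install `Google Tesseract OCR <https://github.com/tesseract-ocr/tesseract>`_ (additional information on how to install the engine on Linux, Mac OSX and Windows).
  You must be able to invoke tesseract as the ``tesseract`` command. If this isn't the case, for example because tesseract isn't in your PATH, you will have to change the "tesseract_cmd" variable ``pytesseract.pytesseract.tesseract_cmd``.
  Under Debian/Ubuntu you can use the package ``tesseract-ocr``.
  For Mac OS users, please install the homebrew package ``tesseract``.

Note: In some rare cases, you might need to additionally install ``tessconfigs`` and ``configs`` from `tesseract-ocr/tessconfigs <https://github.com/tesseract-ocr/tessconfigs>`_ if the OS-specific package doesn't include them.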
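
Whether the ``tesseract`` command is reachable can be checked from Python before the first OCR call. The sketch below uses the standard library's ``shutil.which`` for that lookup. ``find_tesseract`` is a hypothetical helper name, and the message text is only illustrative.

.. code-block:: python

    import shutil
    from typing import Optional


    def find_tesseract(command: str = 'tesseract') -> Optional[str]:
        """Return the full path of the command if it is on PATH, else None."""
        return shutil.which(command)


    path = find_tesseract()
    if path is None:
        print('tesseract not found on PATH; '
              'set pytesseract.pytesseract.tesseract_cmd to its full path')
    else:
        print(f'tesseract found at {path}')

If the lookup returns ``None``, assigning the full executable path to ``pytesseract.pytesseract.tesseract_cmd`` is the workaround described above.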

| Installing via pip:

Check the `pytesseract package page <https://pypi.python.org/pypi/pytesseract>`_ for more information.

.. code-block:: bash

    pip install pytesseract

| Or if you have git installed:

.. code-block:: bash

    pip install -U git+https://github.com/madmaze/pytesseract.git

| Installing from source:

.. code-block:: bash

    git clone https://github.com/madmaze/pytesseract.git
    cd pytesseract && pip install -U .

| Installing with conda (via `conda-forge <https://anaconda.org/conda-forge/pytesseract>`_):

.. code-block:: bash

    conda install -c conda-forge pytesseract

Testing

To run this project's test suite, install and run ``tox``. Make sure tesseract is installed and in your PATH.

.. code-block:: bash

    pip install tox
    tox

License

Check the LICENSE file included in the Python-tesseract repository/distribution.
As of Python-tesseract 0.3.1 the license is Apache License Version 2.0.

Contributors

* Originally written by `Samuel Hoffstaetter <https://github.com/h>`_
* `Full list of contributors <https://github.com/madmaze/pytesseract/graphs/contributors>`_

Project URL: https://github.com/madmaze/pytesseract
