How to search if a substring is in a binary file in Python?

Problem description

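I would like to extract a piece of information from a PDF file. If I open the PDF in a text editor, I can always find what I am looking for in the following form: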

/Contents (Localidad: XXXXXXXXXXX)

The X's are what I want.

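I know PyPDF2 exists for reading PDF files; the problem is that I have not found a way to get this piece with it. I can read the rest of the file, convert it to text, and so on, but I cannot get this particular piece of information using PyPDF2.

So I thought I could do something like this: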

file = open("yada.pdf", "rb")
for line in file:
    if "(Localidad:" in line:
        # Extract the XXX's

Of course, if I do this, Python raises an error: TypeError: a bytes-like object is required, not 'str'

What do you think? Am I going about this completely wrong? Is there a better way?

Reference solution

Method 1:

Try it like this:

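import re

with open("yada.pdf", "rb") as f:
    data = f.read()  # raw bytes; str(data) would only give the "b'...'" repr in Python 3

# Searching bytes needs a bytes pattern; the raw literal keeps the backslashes intact.
# Each match is a bytes object. The with block closes the file, so no f.close() is needed.
location = re.findall(rb"/Contents \(Localidad: (.+?)\)", data)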
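
A minimal self-contained sketch of the same idea, keeping everything as bytes until a match is found and only then decoding it. The sample data and the helper name find_localidad are hypothetical and stand in for the contents of yada.pdf. It also assumes the string is stored uncompressed in the file, as the question suggests; if the PDF stores it inside a compressed stream, a raw byte search like this will not find it.

```python
import re
from typing import List

# Hypothetical sample bytes standing in for the contents of yada.pdf
sample = b"<< /Type /Annot\n/Contents (Localidad: Madrid)\n>>\n\xff\xfe binary junk\n"


def find_localidad(data: bytes) -> List[str]:
    """Return every value that follows '/Contents (Localidad: ' in data."""
    # The pattern is a bytes literal, so it can be matched against bytes
    pattern = re.compile(rb"/Contents \(Localidad: (.+?)\)")
    # Latin-1 maps every byte to exactly one character, so decoding cannot fail
    return [match.decode("latin-1") for match in pattern.findall(data)]


# The membership test from the question also works once the needle is bytes
has_marker = any(b"(Localidad:" in line for line in sample.splitlines())

print(has_marker)              # True
print(find_localidad(sample))  # ['Madrid']
```

To run it on a real file, the bytes would come from open("yada.pdf", "rb").read() instead of the sample literal.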

(by luisfer, Carles Mitjans)

References

How to search if a substring is into a binary file in Python? (CC BY‑SA 2.5/3.0/4.0)

#Python #pdf
