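Low-Level Interfaces#

A number of methods are available for accessing and manipulating PDF files at a fairly low level. Admittedly, a clear distinction between "low-level" and "normal" functionality is not always possible, or is a matter of personal taste.

It may also happen that functionality previously regarded as low-level is later reassessed as part of the normal interface. This happened in v1.14.0 for the class Tools: you can now find it as an item in the Classes chapter.

Which chapter of the documentation contains what you are looking for is purely a documentation matter. Everything is always available through the same interface.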

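How to Iterate through the `xref` Table#

A PDF's xref table is a list of all objects defined in the file. This table may easily contain many thousands of entries: the manual Adobe PDF References, for example, has 127,000 objects. Table entry "0" is reserved and must not be touched. The following script loops through the xref table and prints each object's definition: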

>>> xreflen = doc.xref_length()  # length of objects table
>>> for xref in range(1, xreflen):  # skip item 0!
        print("")
        print("object %i (stream: %s)" % (xref, doc.xref_is_stream(xref)))
        print(doc.xref_object(xref, compressed=False))

This produces the following output:

object 1 (stream: False)
<<
    /ModDate (D:20170314122233-04'00')
    /PXCViewerInfo (PDF-XChange Viewer;2.5.312.1;Feb  9 2015;12:00:06;D:20170314122233-04'00')
>>

object 2 (stream: False)
<<
    /Type /Catalog
    /Pages 3 0 R
>>

object 3 (stream: False)
<<
    /Kids [ 4 0 R 5 0 R ]
    /Type /Pages
    /Count 2
>>

object 4 (stream: False)
<<
    /Type /Page
    /Annots [ 6 0 R ]
    /Parent 3 0 R
    /Contents 7 0 R
    /MediaBox [ 0 0 595 842 ]
    /Resources 8 0 R
>>
...
object 7 (stream: True)
<<
    /Length 494
    /Filter /FlateDecode
>>
...

A PDF object definition is an ordinary ASCII string.
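
The loop above can also be wrapped in a small reusable function. The following is only a sketch: the name `count_objects` is hypothetical, and the function assumes that `doc` is an opened pymupdf.Document (or any object offering the two methods Document.xref_length() and Document.xref_is_stream() used above).

```python
def count_objects(doc):
    """Return (total, streams) for the xref table of an opened PDF.

    Assumption: `doc` offers xref_length() and xref_is_stream(),
    as a pymupdf.Document does. Entry 0 is reserved and skipped.
    """
    total = 0
    streams = 0
    for xref in range(1, doc.xref_length()):  # skip item 0!
        total += 1
        if doc.xref_is_stream(xref):
            streams += 1
    return total, streams
```

A call such as `total, streams = count_objects(doc)` then gives a quick impression of how many objects carry stream data.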

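How to Handle Object Streams#

Some object types contain additional data apart from their object definition. Examples are images, fonts, embedded files, or commands describing the appearance of a page.

Objects of these types are called "stream objects". PyMuPDF lets you read an object's stream via method Document.xref_stream(), with the object's xref as argument. A modified version of a stream can be written back with Document.update_stream().

Assume that the following snippet wants to read all streams of a PDF, for whatever reason: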

>>> xreflen = doc.xref_length() # number of objects in file
>>> for xref in range(1, xreflen): # skip item 0!
        if stream := doc.xref_stream(xref):
            # do something with it (it is a bytes object or None)
            # e.g. just write it back:
            doc.update_stream(xref, stream)

Document.xref_stream() automatically returns the stream decompressed as a bytes object, and Document.update_stream() automatically compresses it if beneficial.
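
Because the returned streams are plain bytes objects, ordinary Python bytes operations apply to them. As an illustration, this sketch lists the xref numbers of all streams containing a given byte sequence. The function name `find_in_streams` is hypothetical, and `doc` is assumed to behave like a pymupdf.Document with respect to xref_length() and xref_stream().

```python
def find_in_streams(doc, needle):
    """Return the xrefs of all streams whose decompressed data contain `needle`.

    Assumption: doc.xref_stream(xref) returns bytes for stream objects
    and None otherwise, as described above.
    """
    hits = []
    for xref in range(1, doc.xref_length()):  # skip item 0!
        stream = doc.xref_stream(xref)
        if stream is not None and needle in stream:
            hits.append(xref)
    return hits
```

For example, `find_in_streams(doc, b"/Im1")` would report the streams that mention the name `/Im1` in their decompressed data.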

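How to Handle Page Contents#

A PDF page can have zero or more contents objects. These are stream objects describing what appears where and how on a page (such as text and images). They are written in a special mini-language described, for example, in chapter "Appendix A - Operator Summary" on page 643 of the Adobe PDF References.

Every PDF reader application must be able to interpret the contents syntax to reproduce the intended appearance of the page.

If multiple contents objects are provided, they must be interpreted in the specified sequence, exactly as if they were a single object formed by concatenating them.

There are good technical arguments for having multiple contents objects:

Adding a new contents object is a lot easier and faster than maintaining a single big one (which entails reading, decompressing, modifying, recompressing, and rewriting it on every change).
When working with incremental updates, a modified big contents object bloats the update delta and can thus easily negate the efficiency of incremental saves.

For example, PyMuPDF adds new, small contents objects in methods Page.insert_image(), Page.show_pdf_page() and the Shape methods.

However, there are also situations in which a single contents object is beneficial: it is easier to interpret and more compressible than multiple smaller ones.

Here are two ways of combining multiple contents of a page: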

>>> # method 1: use the MuPDF clean function
>>> page.clean_contents()  # cleans and combines multiple Contents
>>> xref = page.get_contents()[0]  # only one /Contents now!
>>> cont = doc.xref_stream(xref)
>>> # this has also reformatted the PDF commands

>>> # method 2: extract concatenated contents
>>> cont = page.read_contents()
>>> # the /Contents source itself is unmodified

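The clean function Page.clean_contents() does a lot more than just glue contents objects together: it also corrects and optimizes the PDF operator syntax of the page and removes any inconsistencies with the page's object definition.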
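
To make the "interpreted as a concatenation" rule tangible, the following sketch glues the contents streams of a page together by hand. It is not how Page.read_contents() is implemented: the function name `concat_contents` is hypothetical, and the newline used as separator is our own choice (any whitespace between the streams should do, because streams may only be split at token boundaries).

```python
def concat_contents(doc, page):
    """Concatenate all /Contents streams of `page` in their given order.

    Assumptions: page.get_contents() returns the list of contents xrefs
    and doc.xref_stream() returns the decompressed bytes of each one.
    """
    parts = []
    for xref in page.get_contents():
        data = doc.xref_stream(xref)
        if data:  # skip empty or missing streams
            parts.append(data)
    return b"\n".join(parts)  # separator is an assumption, see above
```

In practice Page.read_contents() already delivers the combined stream, so a helper like this is mainly useful for understanding what happens.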

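How to Access the PDF Catalog#

This is a central ("root") object of a PDF. It serves as the starting point for reaching other important objects, and it also contains some global options for the PDF: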

>>> import pymupdf
>>> doc=pymupdf.open("PyMuPDF.pdf")
>>> cat = doc.pdf_catalog()  # get xref of the /Catalog
>>> print(doc.xref_object(cat))  # print object definition
<<
    /Type/Catalog                 % object type
    /Pages 3593 0 R               % points to page tree
    /OpenAction 225 0 R           % action to perform on open
    /Names 3832 0 R               % points to global names tree
    /PageMode /UseOutlines        % initially show the TOC
    /PageLabels<</Nums[0<</S/D>>2<</S/r>>8<</S/D>>]>> % labels given to pages
    /Outlines 3835 0 R            % points to outline tree
>>

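Note

Indentation, line breaks and comments are inserted here for clarification purposes only and will not normally appear. For more information on the PDF catalog see section 7.7.2 on page 71 of the Adobe PDF References.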
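
Single catalog entries can also be read without printing the whole object, using Document.xref_get_key() (covered in the section on reading and updating PDF objects). The sketch below returns a name-valued entry such as PageMode without its leading slash. The function name `catalog_name` and its `default` parameter are hypothetical.

```python
def catalog_name(doc, key, default=None):
    """Return the value of a name-typed catalog entry, or `default`.

    Assumption: doc.xref_get_key() returns a (type, value) pair of
    strings, e.g. ('name', '/UseOutlines'), or ('null', 'null') if the
    key is not present.
    """
    cat = doc.pdf_catalog()  # xref of the /Catalog
    kind, value = doc.xref_get_key(cat, key)
    if kind != "name":
        return default
    return value[1:] if value.startswith("/") else value
```

For the catalog shown above, `catalog_name(doc, "PageMode")` would be expected to deliver the string UseOutlines.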

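How to Access the PDF File Trailer#

The trailer of a PDF file is a dictionary located towards the end of the file. It contains special objects and pointers to other important information. See Adobe PDF References p. 42. Here is an overview: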

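Key	Type	Value
Size	int	Number of entries in the cross-reference table + 1.
Prev	int	Offset of the previous `xref` section (indicates incremental updates).
Root	dictionary	(indirect) Pointer to the catalog. See previous section.
Encrypt	dictionary	Pointer to the encryption object (encrypted files only).
Info	dictionary	(indirect) Pointer to information (metadata).
ID	array	File identifier consisting of two byte strings.
XRefStm	int	Offset of a cross-reference stream. See Adobe PDF References p. 49.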

Access this information via PyMuPDF with Document.pdf_trailer() or, equivalently, via Document.xref_object() using -1 instead of a valid xref number.

>>> import pymupdf
>>> doc=pymupdf.open("PyMuPDF.pdf")
>>> print(doc.xref_object(-1))  # or: print(doc.pdf_trailer())
<<
/Type /XRef
/Index [ 0 8263 ]
/Size 8263
/W [ 1 3 1 ]
/Root 8260 0 R
/Info 8261 0 R
/ID [ <4339B9CEE46C2CD28A79EBDDD67CC9B3> <4339B9CEE46C2CD28A79EBDDD67CC9B3> ]
/Length 19883
/Filter /FlateDecode
>>
>>>

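How to Access XML Metadata#

A PDF may contain XML metadata in addition to the standard metadata format. In fact, most PDF viewer or modification software adds this type of information when saving the PDF (Adobe, Nitro PDF, PDF-XChange, etc.).

PyMuPDF has no way to interpret or change this information directly, because it contains no XML features. XML metadata is, however, stored as a stream object, so it can be read, modified with appropriate software and written back.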

>>> xmlmetadata = doc.get_xml_metadata()
>>> print(xmlmetadata)
<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-702">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
...
omitted data
...
<?xpacket end="w"?>

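Using some XML package, the XML data can be interpreted and/or modified and then stored back. The following also works if the PDF previously had no XML metadata: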

>>> # write back modified XML metadata:
>>> doc.set_xml_metadata(xmlmetadata)
>>>
>>> # XML metadata can be deleted like this:
>>> doc.del_xml_metadata()
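
As one possible choice of "some XML package", Python's standard module xml.etree.ElementTree can parse the string returned by Document.get_xml_metadata(). The following sketch extracts the document title from an XMP packet. It assumes that the packet stores the title in the common `dc:title` / `rdf:li` layout, which not every PDF follows, and the function name `xmp_title` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of Dublin Core and RDF as used in XMP packets
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def xmp_title(xml_text):
    """Return the first non-empty dc:title entry of an XMP packet, or None."""
    root = ET.fromstring(xml_text)
    for title in root.iter("{%s}title" % DC):
        for item in title.iter("{%s}li" % RDF):
            if item.text and item.text.strip():
                return item.text.strip()
    return None
```

A modified tree could be serialized again with `ET.tostring(root, encoding="unicode")` and passed to Document.set_xml_metadata(). Note that ElementTree does not keep the enclosing `<?xpacket ...?>` processing instructions, so treat this as a starting point only.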

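How to Extend PDF Metadata#

Attribute Document.metadata is designed to work for all supported document types in the same way: it is a Python dictionary with a fixed set of key-value pairs. Correspondingly, Document.set_metadata() only accepts standard keys.

However, PDFs may contain items that are not accessible this way. There may also be reasons to store additional information, such as copyrights. Here is a way to handle arbitrary metadata items by using PyMuPDF low-level functions.

As an example, look at this standard metadata output of some PDF: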

# ---------------------
# standard metadata
# ---------------------
pprint(doc.metadata)
{'author': 'PRINCE',
 'creationDate': "D:2010102417034406'-30'",
 'creator': 'PrimoPDF http://www.primopdf.com/',
 'encryption': None,
 'format': 'PDF 1.4',
 'keywords': '',
 'modDate': "D:20200725062431-04'00'",
 'producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
             'AppendMode 1.1',
 'subject': '',
 'title': 'Full page fax print',
 'trapped': ''}

Use the following code to see all items stored in the metadata object:

# ----------------------------------
# metadata including private items
# ----------------------------------
metadata = {}  # make my own metadata dict
what, value = doc.xref_get_key(-1, "Info")  # /Info key in the trailer
if what != "xref":
    pass  # PDF has no metadata
else:
    xref = int(value.replace("0 R", ""))  # extract the metadata xref
    for key in doc.xref_get_keys(xref):
        metadata[key] = doc.xref_get_key(xref, key)[1]
pprint(metadata)
{'Author': 'PRINCE',
 'CreationDate': "D:2010102417034406'-30'",
 'Creator': 'PrimoPDF http://www.primopdf.com/',
 'ModDate': "D:20200725062431-04'00'",
 'PXCViewerInfo': 'PDF-XChange Viewer;2.5.312.1;Feb  9 '
                 "2015;12:00:06;D:20200725062431-04'00'",
 'Producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
             'AppendMode 1.1',
 'Title': 'Full page fax print'}
# ---------------------------------------------------------------
# note the additional 'PXCViewerInfo' key - ignored in standard!
# ---------------------------------------------------------------
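
The snippet above obtains the xref number by removing the text "0 R" from the reference string, which silently assumes a generation number of 0. A slightly more defensive way of splitting a reference string is sketched below. The function name `parse_ref` is hypothetical, and the only assumption is that references arrive in the form "N G R" as shown in the outputs on this page.

```python
import re

# An indirect reference: object number, generation number, letter R
_REF = re.compile(r"^\s*(\d+)\s+(\d+)\s+R\s*$")


def parse_ref(value):
    """Return (xref, generation) for a string like '8261 0 R'.

    Raises ValueError if `value` is not an indirect reference.
    """
    match = _REF.match(value)
    if match is None:
        raise ValueError("not an indirect reference: %r" % value)
    return int(match.group(1)), int(match.group(2))
```

With this helper, `xref, gen = parse_ref(value)` could replace the `int(value.replace("0 R", ""))` expression.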

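Vice versa, you can also store private metadata items in a PDF. It is your responsibility to make sure that these items conform to the PDF specification; in particular, they must be (unicode) strings. Consult section 14.3 (p. 548) of the Adobe PDF References for details and caveats: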

what, value = doc.xref_get_key(-1, "Info")  # /Info key in the trailer
if what != "xref":
    raise ValueError("PDF has no metadata")
xref = int(value.replace("0 R", ""))  # extract the metadata xref
# add some private information
doc.xref_set_key(xref, "mykey", pymupdf.get_pdf_str("北京 is Beijing"))
#
# after executing the previous code snippet, we will see this:
pprint(metadata)
{'Author': 'PRINCE',
 'CreationDate': "D:2010102417034406'-30'",
 'Creator': 'PrimoPDF http://www.primopdf.com/',
 'ModDate': "D:20200725062431-04'00'",
 'PXCViewerInfo': 'PDF-XChange Viewer;2.5.312.1;Feb  9 '
                  "2015;12:00:06;D:20200725062431-04'00'",
 'Producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
             'AppendMode 1.1',
 'Title': 'Full page fax print',
 'mykey': '北京 is Beijing'}

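To delete selected keys, use doc.xref_set_key(xref, "mykey", "null"). As explained in the next section, the string "null" is the PDF equivalent of Python's None. A key with that value is treated as not specified, and it is physically removed during garbage collection.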

How to Read and Update PDF Objects#

There are also granular, elegant ways to access and manipulate selected PDF dictionary keys.

Document.xref_get_keys() returns the PDF keys of the object at xref:

In [1]: import pymupdf
In [2]: doc = pymupdf.open("pymupdf.pdf")
In [3]: page = doc[0]
In [4]: from pprint import pprint
In [5]: pprint(doc.xref_get_keys(page.xref))
('Type', 'Contents', 'Resources', 'MediaBox', 'Parent')

Compare with the full object definition:

In [6]: print(doc.xref_object(page.xref))
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources 1296 0 R
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
>>

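Single keys can also be accessed directly via Document.xref_get_key(). The value is always a string, returned together with type information that helps to interpret it: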
```
In [7]: doc.xref_get_key(page.xref, "MediaBox")
Out[7]: ('array', '[0 0 612 792]')
```

Here is the full list of the above page keys:

In [9]: for key in doc.xref_get_keys(page.xref):
...:        print("%s = %s" % (key, doc.xref_get_key(page.xref, key)))
...:
Type = ('name', '/Page')
Contents = ('xref', '1297 0 R')
Resources = ('xref', '1296 0 R')
MediaBox = ('array', '[0 0 612 792]')
Parent = ('xref', '1301 0 R')

An inquiry for an undefined key returns ('null', 'null'): the PDF object type null corresponds to None in Python. The same applies to the booleans true and false.
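
Since every value arrives as a string, a small converter can turn the (type, value) pairs into Python objects. The sketch below is not part of PyMuPDF, and the name `to_python` is hypothetical. The type names 'name', 'xref', 'array' and 'null' appear in the outputs on this page, whereas 'bool', 'int' and 'float' are assumed here. Anything else, as well as arrays that contain more than plain numbers, is returned unchanged.

```python
def to_python(kind, value):
    """Convert a (type, value) pair from xref_get_key() to a Python object."""
    if kind == "null":
        return None
    if kind == "bool":  # assumed type name
        return value == "true"
    if kind == "int":  # assumed type name
        return int(value)
    if kind == "float":  # assumed type name
        return float(value)
    if kind == "name":
        return value.lstrip("/")  # '/Page' becomes 'Page'
    if kind == "xref":
        return int(value.split()[0])  # '1297 0 R' becomes 1297
    if kind == "array":
        items = value.strip()[1:-1].split()
        try:
            return [float(item) for item in items]
        except ValueError:
            return value  # nested or non-numeric array: keep the raw string
    return value  # dictionaries, strings and anything unknown
```

For instance, `to_python(*doc.xref_get_key(page.xref, "MediaBox"))` would deliver the four MediaBox coordinates as a list of floats.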

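Let us add a new key to the page definition that sets its rotation to 90 degrees (you are aware that Page.set_rotation() actually exists for this, aren't you?):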

In [11]: doc.xref_get_key(page.xref, "Rotate")  # no rotation set:
Out[11]: ('null', 'null')
In [12]: doc.xref_set_key(page.xref, "Rotate", "90")  # insert a new key
In [13]: print(doc.xref_object(page.xref))  # confirm success
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources 1296 0 R
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
  /Rotate 90
>>

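This method can also be used to remove a key from the xref dictionary by setting its value to null. The following removes the rotation specification from the page: doc.xref_set_key(page.xref, "Rotate", "null"). Similarly, to remove all links, annotations and fields from a page, use doc.xref_set_key(page.xref, "Annots", "null"). Because Annots is by definition an array, setting an empty array with doc.xref_set_key(page.xref, "Annots", "[]") does the same job in this case.

PDF dictionaries can be hierarchically nested. In the following page object definition, both Font and XObject are subdictionaries of Resources: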

In [15]: print(doc.xref_object(page.xref))
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources <<
    /XObject <<
      /Im1 1291 0 R
    >>
    /Font <<
      /F39 1299 0 R
      /F40 1300 0 R
    >>
  >>
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
  /Rotate 90
>>

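The above situation is supported by methods Document.xref_set_key() and Document.xref_get_key(): use a path-like notation to point at the required key. For example, to retrieve the value of key Im1 above, specify the complete chain of dictionaries "above" it in the key argument: "Resources/XObject/Im1":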
```
In [16]: doc.xref_get_key(page.xref, "Resources/XObject/Im1")
Out[16]: ('xref', '1291 0 R')
```

The path notation can also be used to set a value directly: use the following to let Im1 point to a different object:

In [17]: doc.xref_set_key(page.xref, "Resources/XObject/Im1", "9999 0 R")
In [18]: print(doc.xref_object(page.xref))  # confirm success:
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources <<
    /XObject <<
      /Im1 9999 0 R
    >>
    /Font <<
      /F39 1299 0 R
      /F40 1300 0 R
    >>
  >>
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
  /Rotate 90
>>

Be aware that no semantic checks whatsoever are done here: if the PDF has no xref 9999, this will not be detected at this point.
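
If you want at least a basic plausibility check before storing a reference, you can compare the target number against the length of the xref table yourself. This is only a sketch with a hypothetical function name, `xref_in_range`. It does not prove that the target object is of the expected kind, or even in use, only that its number lies inside the table.

```python
def xref_in_range(doc, reference):
    """Return True if a reference like '9999 0 R' lies inside the xref table.

    Assumption: doc.xref_length() returns the length of the xref table,
    and entry 0 is reserved.
    """
    xref = int(reference.split()[0])  # object number is the first token
    return 0 < xref < doc.xref_length()
```

A caller could then write `if xref_in_range(doc, "9999 0 R"): doc.xref_set_key(...)` and handle the other case as an error.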

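If a key does not exist, it is created by setting its value. Moreover, if any intermediate keys do not exist either, they are also created as necessary. The following code creates an array D several levels below the existing dictionary A. The intermediate dictionaries B and C are created automatically: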

In [5]: print(doc.xref_object(xref))  # some existing PDF object:
<<
  /A <<
  >>
>>
In [6]: # the following will create 'B', 'C' and 'D'
In [7]: doc.xref_set_key(xref, "A/B/C/D", "[1 2 3 4]")
In [8]: print(doc.xref_object(xref))  # check out what happened:
<<
  /A <<
    /B <<
      /C <<
        /D [ 1 2 3 4 ]
      >>
    >>
  >>
>>

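When setting key values, MuPDF does basic PDF syntax checking. For example, new keys can only be created below a dictionary. The following tries to create a new string item E below the previously created array D: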

In [9]: # 'D' is an array, no dictionary!
In [10]: doc.xref_set_key(xref, "A/B/C/D/E", "(hello)")
mupdf: not a dict (array)
--- ... ---
RuntimeError: not a dict (array)

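It is also not possible to create a key if some higher-level key is an "indirect" object, i.e. an xref. In other words, xrefs can only be modified directly, not implicitly via other objects referencing them: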

In [13]: # the following object points to an xref
In [14]: print(doc.xref_object(4))
<<
  /E 3 0 R
>>
In [15]: # 'E' is an indirect object and cannot be modified here!
In [16]: doc.xref_set_key(4, "E/F", "90")
mupdf: path to 'F' has indirects
--- ... ---
RuntimeError: path to 'F' has indirects
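
One way to cope with such a situation is to follow the indirect references yourself and then address the owning object directly. The sketch below walks a key path, switches to the referenced xref whenever an intermediate key is of type 'xref', and returns the xref and the remaining path to use with Document.xref_set_key(). The function name `resolve_owner` is hypothetical, and the logic relies only on the (type, value) pairs returned by Document.xref_get_key() as shown earlier.

```python
def resolve_owner(doc, xref, path):
    """Return (xref, path) addressing the same key without indirect hops.

    Every intermediate key that is an indirect reference ('xref') is
    followed; the last path component is never followed, so that the
    key itself can still be set or replaced.
    """
    names = path.split("/")
    pending = []  # path components relative to the current xref
    for index, name in enumerate(names):
        pending.append(name)
        if index == len(names) - 1:
            break  # keep the final key as it is
        kind, value = doc.xref_get_key(xref, "/".join(pending))
        if kind == "xref":
            xref = int(value.split()[0])  # continue inside the target object
            pending = []
    return xref, "/".join(pending)
```

For the example above, `resolve_owner(doc, 4, "E/F")` would be expected to return `(3, "F")`, so that `doc.xref_set_key(3, "F", "90")` modifies object 3 directly.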

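Note

These are expert functions! There are no validations as to whether valid PDF objects, xrefs, etc. are specified. As with other low-level methods, there is a risk of rendering the PDF, or parts of it, unusable.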
