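Selectors

When you're scraping web pages, the most common task is extracting data from the HTML source. Several libraries are available for this, for example:

BeautifulSoup is a very popular web scraping library among Python programmers. It builds a Python object based on the structure of the HTML code and deals with bad markup reasonably well, but it has one drawback: it's slow.
lxml is an XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. They're called selectors because they "select" certain parts of the HTML document, specified by either XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Note

Scrapy Selectors is a thin wrapper around the parsel library; the purpose of this wrapper is to provide better integration with Scrapy Response objects.

parsel is a stand-alone web scraping library that can be used without Scrapy. It uses the lxml library under the hood and implements an easy API on top of the lxml API. This means Scrapy selectors are very similar to lxml in speed and parsing accuracy.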

Using selectors

Constructing selectors

Response objects expose a Selector instance on the .selector attribute:

>>> response.selector.xpath("//span/text()").get()
'good'

Querying responses using XPath and CSS is so common that responses include two more shortcuts: response.xpath() and response.css():

>>> response.xpath("//span/text()").get()
'good'
>>> response.css("span::text").get()
'good'

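Scrapy selectors are instances of the Selector class, constructed by passing either a TextResponse object or markup as a string (in the text argument).

Usually there is no need to construct Scrapy selectors manually: the response object is available in Spider callbacks, so in most cases it is more convenient to use the response.css() and response.xpath() shortcuts. By using response.selector or one of these shortcuts you can also ensure the response body is parsed only once.

But if required, it is possible to use Selector directly. Constructing from text: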

>>> from scrapy.selector import Selector
>>> body = "<html><body><span>good</span></body></html>"
>>> Selector(text=body).xpath("//span/text()").get()
'good'

Constructing from a response - HtmlResponse is one of the TextResponse subclasses:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="http://example.com", body=body, encoding="utf-8")
>>> Selector(response=response).xpath("//span/text()").get()
'good'

Selector automatically chooses the best parsing rules (XML vs HTML) based on the input type.

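Using selectors

To explain how to use the selectors we'll use the Scrapy shell (which provides interactive testing) and an example page located on the Scrapy documentation server: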

https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

For the sake of completeness, here's its full HTML code:

<!DOCTYPE html>

<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' alt='image1'/></a>
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' alt='image2'/></a>
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' alt='image3'/></a>
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' alt='image4'/></a>
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' alt='image5'/></a>
    </div>
  </body>
</html>

First, let's open the shell:

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

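Then, after the shell loads, you'll have the response available as the response shell variable, and its attached selector in the response.selector attribute.

Since we're dealing with HTML, the selector will automatically use an HTML parser.

So, by looking at the HTML code of that page, let's construct an XPath for selecting the text inside the title tag: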

>>> response.xpath("//title/text()")
[<Selector query='//title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector .get() or .getall() methods, as follows:

>>> response.xpath("//title/text()").getall()
['Example website']
>>> response.xpath("//title/text()").get()
'Example website'

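.get() always returns a single result; if there are several matches, the content of the first match is returned; if there are no matches, None is returned. .getall() returns a list with all results.

Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements: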

>>> response.css("title::text").get()
'Example website'

As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:

>>> response.css("img").xpath("@src").getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

If you want to extract only the first matched element, you can call the selector .get() (or its alias .extract_first(), commonly used in previous Scrapy versions):

>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '

It returns None if no element was found:

>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True

A default return value can be provided as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').get(default="not-found")
'not-found'

Instead of using e.g. the '@src' XPath, it is possible to query for attributes using the .attrib property of a Selector:

>>> [img.attrib["src"] for img in response.css("img")]
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

As a shortcut, .attrib is also available on SelectorList directly; it returns the attributes of the first matching element:

>>> response.css("img").attrib["src"]
'image1_thumb.jpg'

This is most useful when only a single result is expected, e.g. when selecting by id, or when selecting unique elements on a web page:

>>> response.css("base").attrib["href"]
'http://example.com/'

Now we're going to get the base URL and some image links:

>>> response.xpath("//base/@href").get()
'http://example.com/'

>>> response.css("base::attr(href)").get()
'http://example.com/'

>>> response.css("base").attrib["href"]
'http://example.com/'

>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

>>> response.css("a[href*=image]::attr(href)").getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

>>> response.css("a[href*=image] img::attr(src)").getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

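Extensions to CSS Selectors

Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:

to select text nodes, use ::text
to select attribute values, use ::attr(name), where name is the name of the attribute that you want the value of

Warning

These pseudo-elements are Scrapy-/Parsel-specific. They will most probably not work with other libraries like lxml or PyQuery.

Examples:

title::text selects children text nodes of a descendant <title> element: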

>>> response.css("title::text").get()
'Example website'

*::text selects all descendant text nodes of the current selector context:

>>> response.css("#images *::text").getall()
['\n   ',
'Name: My image 1 ',
'\n   ',
'Name: My image 2 ',
'\n   ',
'Name: My image 3 ',
'\n   ',
'Name: My image 4 ',
'\n   ',
'Name: My image 5 ',
'\n  ']

foo::text returns no results if the foo element exists but contains no text (i.e. the text is empty):

>>> response.css("img::text").getall()
[]

This means .css('foo::text').get() could return None even if an element exists. Use default='' if you always want a string:

>>> response.css("img::text").get()
>>> response.css("img::text").get(default="")
''

a::attr(href) selects the href attribute value of descendant links:

>>> response.css("a::attr(href)").getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

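Note

See also: Selecting element attributes.

Note

You cannot chain these pseudo-elements. But in practice it would not make much sense: text nodes do not have attributes, and attribute values are already string values and do not have children nodes.

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here's an example: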

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>',
'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg" alt="image2"></a>',
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg" alt="image3"></a>',
'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg" alt="image4"></a>',
'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg" alt="image5"></a>']

>>> for index, link in enumerate(links):
...     href_xpath = link.xpath("@href").get()
...     img_xpath = link.xpath("img/@src").get()
...     print(f"Link number {index} points to url {href_xpath!r} and image {img_xpath!r}")
...
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

Selecting element attributes

There are several ways to get the value of an attribute. First, one can use XPath syntax:

>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

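XPath syntax has a few advantages: it is a standard XPath feature, and @attributes can be used in other parts of an XPath expression - e.g. it is possible to filter by attribute value.

Scrapy also provides an extension to CSS selectors (::attr(...)) which allows getting attribute values: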

>>> response.css("a::attr(href)").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In addition to that, Selector has an .attrib property. You can use it if you prefer to look up attributes in Python code, without using XPaths or CSS extensions:

>>> [a.attrib["href"] for a in response.css("a")]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

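This property is also available on SelectorList; it returns a dictionary with the attributes of the first matching element. It is convenient to use when a selector is expected to give a single result (e.g. when selecting by element ID, or when selecting a unique element on a page):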

>>> response.css("base").attrib
{'href': 'http://example.com/'}
>>> response.css("base").attrib["href"]
'http://example.com/'

The .attrib property of an empty SelectorList is empty:

>>> response.css("foo").attrib
{}

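Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() or .css() methods, .re() returns a list of strings, so you can't construct nested .re() calls.

Here's an example used to extract image names from the HTML code above: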

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r"Name:\s*(.*)")
['My image 1 ',
'My image 2 ',
'My image 3 ',
'My image 4 ',
'My image 5 ']

There's an additional helper for .re(), named .re_first(), which is the counterpart of .get() (and its alias .extract_first()). Use it to extract just the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r"Name:\s*(.*)")
'My image 1 '

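extract() and extract_first()

If you're a long-time Scrapy user, you're probably familiar with the .extract() and .extract_first() selector methods. Many blog posts and tutorials use them as well. These methods are still supported by Scrapy, and there are no plans to deprecate them.

However, the Scrapy usage docs are now written using the .get() and .getall() methods. We feel that these new methods result in more concise and readable code.

The following examples show how these methods map to each other.

SelectorList.get() is the same as SelectorList.extract_first():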

>>> response.css("a::attr(href)").get()
'image1.html'
>>> response.css("a::attr(href)").extract_first()
'image1.html'

SelectorList.getall() is the same as SelectorList.extract():

>>> response.css("a::attr(href)").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css("a::attr(href)").extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Selector.get() is the same as Selector.extract():

>>> response.css("a::attr(href)")[0].get()
'image1.html'
>>> response.css("a::attr(href)")[0].extract()
'image1.html'

For consistency, there is also Selector.getall(), which returns a list:

>>> response.css("a::attr(href)")[0].getall()
['image1.html']

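So, the main difference is that the output of the .get() and .getall() methods is more predictable: .get() always returns a single result, and .getall() always returns a list of all extracted results. With the .extract() method it was not always obvious whether a result is a list or not; to get a single result, either .extract() (on a Selector) or .extract_first() (on a SelectorList) should be called.

Working with XPaths

Here are some tips which may help you use XPath with Scrapy selectors effectively. If you are not very familiar with XPath yet, you may want to take a look first at this XPath tutorial.

Note

Some of the tips are based on this post from Zyte's blog.

Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you're calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements: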

>>> divs = response.xpath("//div")

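At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements: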

>>> for p in divs.xpath("//p"):  # this is wrong - gets all <p> from the whole document
...     print(p.get())
...

And this is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath(".//p"):  # extracts all <p> inside
...     print(p.get())
...

Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath("p"):
...     print(p.get())
...

For more details about relative XPaths see the Location Paths section in the XPath specification.

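When querying by class, consider using CSS

Because an element can contain multiple CSS classes, the XPath way to select elements by class is rather verbose: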

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

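If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements than you want, if they have a different class name that shares the string someclass.

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed: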

>>> from scrapy import Selector
>>> sel = Selector(
...     text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>'
... )
>>> sel.css(".shout").xpath("./time/@datetime").getall()
['2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions that follow.

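Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example: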

>>> from scrapy import Selector
>>> sel = Selector(
...     text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>"""
... )
>>> xp = lambda x: sel.xpath(x).getall()

This gets all first <li> elements under whatever their parent is:

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
['<li>1</li>']

This gets all first <li> elements under an <ul> parent:

>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element under an <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
['<li>1</li>']

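Using text nodes in a condition

When you need to use the text content as an argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements, a node-set. When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first element only.

Example: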

>>> from scrapy import Selector
>>> sel = Selector(
...     text='<a href="#">Click here to go to the <strong>Next Page</strong></a>'
... )

Converting a node-set to a string:

>>> sel.xpath("//a//text()").getall()  # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall()  # convert it to string
['Click here to go to the ']

A node converted to a string, however, puts together the text of itself plus that of all its descendants:

>>> sel.xpath("//a[1]").getall()  # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall()  # convert it to string
['Click here to go to the Next Page']

So, using the .//text() node-set won't select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]

But using the . to mean the node works:

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

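Variables in XPath expressions

XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.

Here's an example to match an element based on its "id" attribute value, without hard-coding it (that was shown previously):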

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath("//div[@id=$val]/a/text()", val="images").get()
'Name: My image 1 '

Here's another example, to find the "id" attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath("//div[count(a)=$cnt]/@id", cnt=5).get()
'images'

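All variable references must have a binding value when calling .xpath() (otherwise you'll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.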

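parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.

Removing namespaces

When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with element names, to write simpler and more convenient XPaths. You can use the Selector.remove_namespaces() method for that.

Let's show an example that illustrates this with the Python Insider blog atom feed.

First, we open the shell with the URL we want to scrape: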

$ scrapy shell https://feeds.feedburner.com/PythonInsider

This is how the file starts:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:blogger="http://schemas.google.com/blogger/2008"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:thr="http://purl.org/syndication/thread/1.0"
      xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  ...

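You can see several namespace declarations, including a default "http://www.w3.org/2005/Atom" and another one using the gd: prefix for "http://schemas.google.com/g/2005".

Once in the shell we can try selecting all <link> objects and see that it doesn't work (because the Atom XML namespace is obfuscating those nodes):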

>>> response.xpath("//link")
[]

But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:

>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector query='//link' data='<link rel="alternate" type="text/html" h'>,
    <Selector query='//link' data='<link rel="next" type="application/atom+'>,
    ...

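If you wonder why the namespace removal procedure isn't always called by default instead of having to call it manually, this is because of two reasons, which, in order of relevance, are:

Removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive operation to perform by default for all documents crawled by Scrapy.
There could be some cases where using namespaces is actually required, in case some element names clash between namespaces. These cases are very rare, though.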

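Using EXSLT extensions

Being built atop lxml, Scrapy selectors support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:

prefix	namespace	usage
re	http://exslt.org/regular-expressions	regular expressions
set	http://exslt.org/sets	set manipulation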

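Regular expressions

The test() function, for example, can prove quite useful when XPath's starts-with() or contains() are not sufficient.

Example selecting links in list items with a "class" attribute ending with a digit: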

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath("//li//@href").getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']

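Warning

The C library libxslt doesn't natively support EXSLT regular expressions, so lxml's implementation uses hooks to Python's re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.

Set operations

These can be handy for excluding parts of a document tree before extracting text elements, for example.

Example extracting microdata (sample content taken from https://schema.org/Product) with groups of itemscopes and corresponding itemprops: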

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...   Customer reviews:
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath("//div[@itemscope]"):
...     print("current scope:", scope.xpath("@itemtype").getall())
...     props = scope.xpath(
...         """
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)"""
...     )
...     print(f"    properties: {props.getall()}")
...     print("")
...

current scope: ['http://schema.org/Product']
    properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']

current scope: ['http://schema.org/AggregateRating']
    properties: ['ratingValue', 'reviewCount']

current scope: ['http://schema.org/Offer']
    properties: ['price', 'availability']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

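Here we first iterate over itemscope elements, and for each one we look for all itemprops elements and exclude those that are themselves inside another itemscope.

Other XPath extensions

Scrapy selectors also provide a sorely missed XPath extension function, has-class, that returns True for nodes that have all of the specified HTML classes.

For the following HTML: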

>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(
...     url="http://example.com",
...     body="""
... <html>
...     <body>
...         <p class="foo bar-baz">First</p>
...         <p class="foo">Second</p>
...         <p class="bar">Third</p>
...         <p>Fourth</p>
...     </body>
... </html>
... """,
...     encoding="utf-8",
... )

You can use it like this:

>>> response.xpath('//p[has-class("foo")]')
[<Selector query='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
<Selector query='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector query='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> response.xpath('//p[has-class("foo", "bar")]')
[]

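So the XPath //p[has-class("foo", "bar-baz")] is roughly equivalent to the CSS p.foo.bar-baz. Please note that it is slower in most cases, because it's a pure-Python function that's invoked for every node in question, whereas the CSS lookup is translated into XPath and thus runs more efficiently. Performance-wise, its use is therefore limited to situations that are not easily described with CSS selectors.

Parsel also simplifies adding your own XPath extensions with set_xpathfunc().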

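Built-in Selectors reference

Selector objects

class scrapy.selector.Selector(*args: Any, **kwargs: Any)

An instance of Selector is a wrapper over a response, used to select certain parts of its content.

response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.

text is a unicode string or utf-8 encoded text for cases when a response isn't available. Using text and response together is undefined behavior.

type defines the selector type; it can be "html", "xml", "json" or None (default).

If type is None, the selector automatically chooses the best type based on the response type (see below), or defaults to "html" in case it is used together with text.

If type is None and a response is passed, the selector type is inferred from the response type as follows:

"html" for HtmlResponse type
"xml" for XmlResponse type
"json" for TextResponse type
"html" for anything else

Otherwise, if type is set, the selector type will be forced and no detection will occur.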

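xpath(query: str, namespaces: Mapping[str, str] | None = None, **kwargs: Any) → SelectorList[_SelectorType]

Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.

query is a string containing the XPath query to apply.

namespaces is an optional prefix: namespace-uri mapping (dict) for additional prefixes to those registered with register_namespace(prefix, uri). Contrary to register_namespace(), these prefixes are not saved for future calls.

Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.: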

selector.xpath('//a[@href=$url]', url="http://www.example.com")

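Note

For convenience, this method can be called as response.xpath()

css(query: str) → SelectorList[_SelectorType]

Apply the given CSS selector and return a SelectorList instance.

query is a string containing the CSS selector to apply.

In the background, CSS queries are translated into XPath queries using the cssselect library and run through the .xpath() method.

Note

For convenience, this method can be called as response.css()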

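jmespath(query: str, **kwargs: Any) → SelectorList[_SelectorType]

Find objects matching the JMESPath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.

query is a string containing the JMESPath query to apply.

Any additional named arguments are passed to the underlying jmespath.search call, e.g.: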

selector.jmespath('author.name', options=jmespath.Options(dict_cls=collections.OrderedDict))

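Note

For convenience, this method can be called as response.jmespath()

get() → Any

Serialize and return the matched nodes.

For HTML and XML, the result is always a string, and percent-encoded content is unquoted.

See also: extract() and extract_first()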

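attrib

Return the attributes dictionary for the underlying element.

See also: Selecting element attributes.

re(regex: str | Pattern[str], replace_entities: bool = True) → List[str]

Apply the given regex and return a list of strings with the matches.

regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex).

By default, character entity references are replaced by their corresponding character (except for the &amp; and &lt; entities). Passing replace_entities as False switches off these replacements.

re_first(regex: str | Pattern[str], default: None = None, replace_entities: bool = True) → str | None

re_first(regex: str | Pattern[str], default: str, replace_entities: bool = True) → str

Apply the given regex and return the first string which matches. If there is no match, return the default value (None if the argument is not provided).

By default, character entity references are replaced by their corresponding character (except for the &amp; and &lt; entities). Passing replace_entities as False switches off these replacements.

register_namespace(prefix: str, uri: str) → None

Register the given namespace to be used in this Selector. Without registering namespaces you can't select or extract data from non-standard namespaces. See Selector examples on XML response.

remove_namespaces() → None

Remove all namespaces, permitting to traverse the document using namespace-less xpaths. See Removing namespaces.

__bool__() → bool

Return True if there is any real content selected or False otherwise. In other words, the boolean value of a Selector is given by the contents it selects.

getall() → List[str]

Serialize and return the matched node in a 1-element list of strings.

This method is added to Selector for consistency; it is more useful with SelectorList. See also: extract() and extract_first()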

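SelectorList objects

class scrapy.selector.SelectorList(iterable=(), /)

The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.

xpath(xpath: str, namespaces: Mapping[str, str] | None = None, **kwargs: Any) → SelectorList[_SelectorType]

Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.

xpath is the same argument as the one in Selector.xpath()

namespaces is an optional prefix: namespace-uri mapping (dict) for additional prefixes to those registered with register_namespace(prefix, uri). Contrary to register_namespace(), these prefixes are not saved for future calls.

Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.: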

selector.xpath('//a[@href=$url]', url="http://www.example.com")

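css(query: str) → SelectorList[_SelectorType]

Call the .css() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.css()

jmespath(query: str, **kwargs: Any) → SelectorList[_SelectorType]

Call the .jmespath() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.jmespath().

Any additional named arguments are passed to the underlying jmespath.search call, e.g.: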

selector.jmespath('author.name', options=jmespath.Options(dict_cls=collections.OrderedDict))

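getall() → List[str]

Call the .get() method for each element in this list and return their results flattened, as a list of strings.

See also: extract() and extract_first()

get(default: None = None) → str | None

get(default: str) → str

Return the result of .get() for the first element in this list. If the list is empty, return the default value.

See also: extract() and extract_first()

re(regex: str | Pattern[str], replace_entities: bool = True) → List[str]

Call the .re() method for each element in this list and return their results flattened, as a list of strings.

By default, character entity references are replaced by their corresponding character (except for the &amp; and &lt; entities). Passing replace_entities as False switches off these replacements.

re_first(regex: str | Pattern[str], default: None = None, replace_entities: bool = True) → str | None

re_first(regex: str | Pattern[str], default: str, replace_entities: bool = True) → str

Call the .re() method for the first element in this list and return the result as a string. If the list is empty or the regex doesn't match anything, return the default value (None if the argument is not provided).

By default, character entity references are replaced by their corresponding character (except for the &amp; and &lt; entities). Passing replace_entities as False switches off these replacements.

attrib

Return the attributes dictionary for the first element. If the list is empty, return an empty dict.

See also: Selecting element attributes.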

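Examples

Selector examples on HTML response

Here are some Selector examples to illustrate several concepts. In all cases, we assume there is already a Selector instantiated with an HtmlResponse object like this: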

sel = Selector(html_response)

Select all <h1> elements from an HTML response body, returning a list of Selector objects (i.e. a SelectorList object):
```
sel.xpath("//h1")
```

Extract the text of all <h1> elements from an HTML response body, returning a list of strings:

sel.xpath("//h1").getall()  # this includes the h1 tag
sel.xpath("//h1/text()").getall()  # this excludes the h1 tag

Iterate over all <p> tags and print their class attribute:

for node in sel.xpath("//p"):
    print(node.attrib["class"])

Selector examples on XML response

Here are some examples to illustrate concepts for Selector objects instantiated with an XmlResponse object:

sel = Selector(xml_response)

Select all <product> elements from an XML response body, returning a list of Selector objects (i.e. a SelectorList object):
```
sel.xpath("//product")
```

Extract all prices from a Google Base XML feed, which requires registering a namespace:

sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()