Common Practices

This section documents common practices when using Scrapy. They span many topics and do not fit naturally into any other specific section.

Run Scrapy from a script

You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl.

Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.

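The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class starts a Twisted reactor for you, configures logging, and sets shutdown handlers. It is the class used by all Scrapy commands.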

Here is an example showing how to run a single spider with it.

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


process = CrawlerProcess(
    settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    }
)

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

Settings are passed to CrawlerProcess as a dictionary. Check the CrawlerProcess documentation to get acquainted with its usage details.

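If you are inside a Scrapy project, there are some additional helpers you can use to import those components within the project. You can automatically import your spiders by passing their names to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.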

What follows is a working example of how to do that, using the testspiders project as an example.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl("followall", domain="scrapy.org")
process.start()  # the script will block here until the crawling is finished

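There is another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won't start or interfere with existing reactors in any way.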

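When using this class, the reactor should be explicitly run after scheduling your spiders. It is recommended that you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.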

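Note that you will also have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.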

Here is an example of its usage, along with a callback to manually stop the reactor after MySpider has finished running.

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
runner = CrawlerRunner()

d = runner.crawl(MySpider)

from twisted.internet import reactor

d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

Same example, but using a non-default reactor. Calling install_reactor is only necessary when you use CrawlerRunner, since CrawlerProcess already does this automatically.

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})

from scrapy.utils.reactor import install_reactor

install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
runner = CrawlerRunner()
d = runner.crawl(MySpider)

from twisted.internet import reactor

d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

See also

Reactor Overview

Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

Here is an example that runs multiple spiders simultaneously:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished

Same example using CrawlerRunner:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()

from twisted.internet import reactor

d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished

Same example, but running the spiders sequentially by chaining the deferreds:

import scrapy
from twisted.internet import defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


settings = get_project_settings()
configure_logging(settings)
runner = CrawlerRunner(settings)


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()


from twisted.internet import reactor

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

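Different spiders can set different values for the same setting, but when they run in the same process it may be impossible, by design or because of some limitations, to use these different values. What happens in practice depends on the setting: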

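SPIDER_LOADER_CLASS and the settings used by its value (SPIDER_MODULES and SPIDER_LOADER_WARN_ONLY for the default one) cannot be read from the per-spider settings. These are applied when the CrawlerRunner or CrawlerProcess object is created.
For TWISTED_REACTOR and ASYNCIO_EVENT_LOOP, the first available value is used, and if a spider requests a different reactor an exception is raised. These are applied when the reactor is installed.
For REACTOR_THREADPOOL_MAXSIZE, DNS_RESOLVER and the settings used by the resolver (DNSCACHE_ENABLED, DNSCACHE_SIZE and DNS_TIMEOUT for the resolvers included in Scrapy), the first available value is used. These are applied when the reactor is started.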

See also

Run Scrapy from a script.

Distributed crawls

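Scrapy doesn't provide any built-in facility for running crawls in a distributed (multi-server) manner. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.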

If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among them.

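If you instead want to run a single (big) spider across many machines, what you usually do is partition the URLs to crawl and send each partition to a separate spider run. Here is a concrete example: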

First, you prepare the list of URLs to crawl and put them into separate files/URLs:

http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
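
One possible way to produce such partition files is sketched below. It uses round-robin assignment, which is an assumption: any scheme that gives each URL to exactly one partition works. The function names and the part<N>.list file naming are illustrative only.

```python
def partition_urls(urls, parts):
    """Split urls into `parts` lists of roughly equal size (round-robin)."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    buckets = [[] for _ in range(parts)]
    for index, url in enumerate(urls):
        buckets[index % parts].append(url)
    return buckets


def write_partitions(urls, parts, prefix="part"):
    """Write each partition to <prefix><n>.list, one URL per line.

    Returns the list of file names that were written.
    """
    names = []
    for number, bucket in enumerate(partition_urls(urls, parts), start=1):
        name = f"{prefix}{number}.list"
        with open(name, "w", encoding="utf-8") as handle:
            handle.writelines(f"{url}\n" for url in bucket)
        names.append(name)
    return names
```

The resulting files would then be uploaded to wherever the spiders can fetch them.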

Then you start a spider run on 3 different Scrapyd servers. The spider receives a spider argument, part, with the number of the partition to crawl:

curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3

Avoiding getting banned

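Some websites implement measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.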

Here are some tips to keep in mind when dealing with these kinds of sites:

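Rotate your user agent from a pool of well-known browser user agents (search around to get a list of them)
Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour
Use download delays (2 seconds or higher). See the DOWNLOAD_DELAY setting.
If possible, use Common Crawl to fetch pages, instead of hitting the sites directly
Use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
Use a ban avoidance service, such as Zyte API, which provides a Scrapy plugin and additional features, like AI web scraping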
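
As an illustration of the first tip, user agent rotation could be implemented as a small downloader middleware. This is only a sketch: the class name is hypothetical, the user agent strings are placeholders to be replaced with real browser values, and the spider parameter is given a default so the method can be called with or without it.

```python
import random


class RotateUserAgentMiddleware:
    """Downloader middleware sketch: pick a random User-Agent per request."""

    # Placeholder values: replace with real, current browser user agents.
    USER_AGENTS = [
        "ExampleBrowser/1.0 (placeholder user agent A)",
        "ExampleBrowser/2.0 (placeholder user agent B)",
    ]

    def process_request(self, request, spider=None):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # returning None lets Scrapy continue handling the request
```

To try it, the class would be registered in the DOWNLOADER_MIDDLEWARES setting under its import path (for example a hypothetical myproject.middlewares.RotateUserAgentMiddleware), typically alongside COOKIES_ENABLED = False and DOWNLOAD_DELAY = 2 from the tips above.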

If you are still unable to prevent your bot from getting banned, consider contacting commercial support.