Scrapy Crawler Learning Series 13: Settings

Preface

This is one of the articles in my Scrapy learning series; this chapter covers Settings.

This article is the author's original work; please credit the source when reposting.

Introduction

The purpose of Scrapy settings is to let you customize the behavior of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.

Internally, settings provide a global namespace of key-value mappings from which code can pull configuration values. There are several mechanisms for populating the settings, described below.

For a list of available built-in settings see: Built-in settings reference.

Specifying the settings

When you use Scrapy, you have to tell it which settings module you want to use. You do this through the environment variable SCRAPY_SETTINGS_MODULE; note that the value of SCRAPY_SETTINGS_MODULE must be on Python's import search path.

Populating the settings

There are several ways to populate the settings:

  1. Command line options (most precedence)
  2. Settings per-spider
  3. Project settings module
  4. Default settings per-command
  5. Default global settings (less precedence)

The list above is ordered from highest to lowest precedence.

Command line options

Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option.

scrapy crawl myspider -s LOG_FILE=scrapy.log

Settings per-spider

Spiders (See the Spiders chapter for reference) can define their own settings that will take precedence and override the project ones. They can do so by setting their custom_settings attribute:

Note that settings defined on a spider override the project-level settings.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

It is worth recalling here the example from an earlier article in this series on how to define different Item Pipelines for different Spiders.
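As a quick illustration of that use case, here is a minimal sketch (the spider name and pipeline path are hypothetical, not taken from that article) that enables a dedicated pipeline for a single spider through custom_settings:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'

    # Only this spider runs NewsPipeline; other spiders keep the project-wide pipelines.
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.NewsPipeline': 300,
        },
    }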

Project settings module

The project settings module is the standard configuration file for your Scrapy project, it’s where most of your custom settings will be populated. For a standard Scrapy project, this means you’ll be adding or changing the settings in the settings.py file created for your project.
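As a minimal sketch (the project name, module paths, and values are illustrative, not from this article), a typical settings.py might contain entries like these:

# settings.py -- project-level settings (illustrative values)
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Throttle the crawl slightly
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5

# Enable a project-wide item pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.MyProjectPipeline': 300,
}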

Default settings per-command

Each Scrapy tool command can have its own default settings, which override the global default settings. Those custom command settings are specified in the default_settings attribute of the command class.
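For instance, a custom command can declare its own defaults this way; the command class below is a hypothetical sketch, not part of Scrapy itself:

from scrapy.commands import ScrapyCommand

class MyCommand(ScrapyCommand):
    """A hypothetical custom command with its own default settings."""

    # Applied whenever this command runs, overriding the global defaults
    default_settings = {
        'LOG_ENABLED': False,
    }

    def run(self, args, opts):
        pass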

Default global settings

The global defaults are located in the scrapy.settings.default_settings module and documented in the Built-in settings reference section.

Accessing settings

In a spider, the settings are available through self.settings:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())

Note:

The settings attribute is set in the base Spider class after the spider is initialized. If you want to use the settings before the initialization (e.g., in your spider’s __init__() method), you’ll need to override the from_crawler() method.

In other words, the settings attribute is set on the base Spider class only after the spider has been initialized; if you need the settings before initialization, you must override the from_crawler() method.
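A minimal sketch of that pattern (the constructor argument and the setting read here are purely illustrative): from_crawler() already has access to crawler.settings, so it can pass setting values into __init__():

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, user_agent=None, **kwargs):
        super().__init__(*args, **kwargs)
        # self.settings is not available yet at this point,
        # so the value is handed in explicitly by from_crawler()
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # crawler.settings is already populated here
        kwargs['user_agent'] = crawler.settings.get('USER_AGENT')
        return super().from_crawler(crawler, *args, **kwargs)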

Settings can also be accessed from extensions, middlewares, and item pipelines through the Crawler object passed to the from_crawler() method. Here is an extension-related example:

class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))
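Note that an extension like this only takes effect once it is registered in the EXTENSIONS setting; a short sketch, assuming a hypothetical module path:

# settings.py
EXTENSIONS = {
    'myproject.extensions.MyExtension': 500,
}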

Choosing sensible setting names

Setting names are usually prefixed with the component that they configure. For example, proper setting names for a fictional robots.txt extension would be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, etc.

In general, a good setting name is prefixed with the name of the component it configures.

Built-in settings

There are a great many built-in settings; rather than listing them all here, I will describe some of the core ones.

AWS_ACCESS_KEY_ID

AWS_SECRET_ACCESS_KEY

BOT_NAME

Default: 'scrapybot'

The name of the bot implemented by this Scrapy project (also known as the project name). This will be used to construct the User-Agent by default, and also for logging.

Generally speaking, this is just the name of your project.

CONCURRENT_ITEMS

Default: 100

The maximum number of items that can be processed concurrently (per response) in the Item Pipeline.

CONCURRENT_REQUESTS

Default: 16

The maximum number of concurrent requests that the Scrapy downloader will perform.

CONCURRENT_REQUESTS_PER_DOMAIN

Default: 8

The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.

See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.

CONCURRENT_REQUESTS_PER_IP

Default: 0

The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.

This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.

Note that when this value is non-zero it replaces the CONCURRENT_REQUESTS_PER_DOMAIN setting, and the concurrency limits become per-IP rather than per-domain.
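The sketch below puts these concurrency settings side by side (values are illustrative, not recommendations):

# settings.py -- concurrency-related settings (illustrative values)
CONCURRENT_REQUESTS = 32             # global cap enforced by the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per domain
CONCURRENT_REQUESTS_PER_IP = 0       # 0 = disabled; the per-domain cap applies
CONCURRENT_ITEMS = 100               # items processed in parallel in the item pipeline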

DEFAULT_ITEM_CLASS

DEFAULT_REQUEST_HEADERS

Default:

{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

The default headers used for Scrapy HTTP Requests. They’re populated in the DefaultHeadersMiddleware.

DEPTH_LIMIT

Default: 0

Scope: scrapy.spidermiddlewares.depth.DepthMiddleware

The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.

DEPTH_PRIORITY

Default: 0

Scope: scrapy.spidermiddlewares.depth.DepthMiddleware

An integer that is used to adjust the request priority based on its depth:

  • if zero (default), no priority adjustment is made from depth
  • a positive value will decrease the priority, i.e. higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO)
  • a negative value will increase priority, i.e., higher depth requests will be processed sooner (DFO)

This value adjusts the priority of requests according to their crawl depth: if it is greater than 0, deeper requests get a lower priority (breadth-first); if it is less than 0, deeper requests get a higher priority (depth-first).
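For example, the Scrapy documentation suggests a combination along these lines to crawl in breadth-first order:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'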

Others

There are many more useful settings; see https://doc.scrapy.org/en/latest/topics/settings.html#default-global-settings