Scrapy Crawler Learning Series 13: Settings

Preface

This is one of the articles in my Scrapy learning series; this chapter covers Settings.

This article is the author's original work; please credit the source when reposting.

Introduction

The purpose of Scrapy settings is to let you customize the behavior of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.

Internally, settings provide a global namespace of key-value mappings from which code can pull configuration values. There are several mechanisms for populating the settings, described below.

For a list of available built-in settings see: Built-in settings reference.

Specifying the settings

When you use Scrapy, you have to tell it which settings module you want to use. You do this through the environment variable SCRAPY_SETTINGS_MODULE; note that the value of SCRAPY_SETTINGS_MODULE must be on Python's import search path.

Populating the settings

There are several ways to populate the settings:

  1. Command line options (most precedence)
  2. Settings per-spider
  3. Project settings module
  4. Default settings per-command
  5. Default global settings (less precedence)

The list above is ordered from highest to lowest precedence.

Command line options

Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option.

scrapy crawl myspider -s LOG_FILE=scrapy.log

Settings per-spider

Spiders (See the Spiders chapter for reference) can define their own settings that will take precedence and override the project ones. They can do so by setting their custom_settings attribute:

Note that settings defined on a spider override the project-level settings.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

It is worth recalling here the example from an earlier article in this series on how to define different Item Pipelines for different Spiders.
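As a quick illustration of that use case, here is a minimal sketch (the spider name and pipeline path are hypothetical, not taken from that article) that enables a dedicated pipeline for a single spider through custom_settings:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'

    # Only this spider runs NewsPipeline; other spiders keep the project-wide pipelines.
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.NewsPipeline': 300,
        },
    }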

Project settings module

The project settings module is the standard configuration file for your Scrapy project, it’s where most of your custom settings will be populated. For a standard Scrapy project, this means you’ll be adding or changing the settings in the settings.py file created for your project.
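As a minimal sketch (the project name, module paths, and values are illustrative, not from this article), a typical settings.py might contain entries like these:

# settings.py -- project-level settings (illustrative values)
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Throttle the crawl slightly
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5

# Enable a project-wide item pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.MyProjectPipeline': 300,
}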

Default settings per-command

Each Scrapy tool command can have its own default settings, which override the global default settings. Those custom command settings are specified in the default_settings attribute of the command class.
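For instance, a custom command can declare its own defaults this way; the command class below is a hypothetical sketch, not part of Scrapy itself:

from scrapy.commands import ScrapyCommand

class MyCommand(ScrapyCommand):
    """A hypothetical custom command with its own default settings."""

    # Applied whenever this command runs, overriding the global defaults
    default_settings = {
        'LOG_ENABLED': False,
    }

    def run(self, args, opts):
        pass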

Default global settings

The global defaults are located in the scrapy.settings.default_settings module and documented in the Built-in settings reference section.

Accessing settings

In a spider, the settings are available through self.settings:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())

Note:

The settings attribute is set in the base Spider class after the spider is initialized. If you want to use the settings before the initialization (e.g., in your spider’s __init__() method), you’ll need to override the from_crawler() method.

In other words, the settings attribute is set on the base Spider class only after the spider has been initialized; if you need the settings before initialization, you must override the from_crawler() method.
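A minimal sketch of that pattern (the constructor argument and the setting read here are purely illustrative): from_crawler() already has access to crawler.settings, so it can pass setting values into __init__():

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, user_agent=None, **kwargs):
        super().__init__(*args, **kwargs)
        # self.settings is not available yet at this point,
        # so the value is handed in explicitly by from_crawler()
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # crawler.settings is already populated here
        kwargs['user_agent'] = crawler.settings.get('USER_AGENT')
        return super().from_crawler(crawler, *args, **kwargs)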

Settings can also be accessed from extensions, middlewares, and item pipelines through the Crawler object passed to the from_crawler() method. Here is an extension-related example:

class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))
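Note that an extension like this only takes effect once it is registered in the EXTENSIONS setting; a short sketch, assuming a hypothetical module path:

# settings.py
EXTENSIONS = {
    'myproject.extensions.MyExtension': 500,
}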

Choosing sensible setting names

Setting names are usually prefixed with the component that they configure. For example, proper setting names for a fictional robots.txt extension would be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, etc.

In general, a good setting name is prefixed with the name of the component it configures.

Built-in settings

There are a great many built-in settings; rather than listing them all here, I will describe some of the core ones.

AWS_ACCESS_KEY_ID

AWS_SECRET_ACCESS_KEY

BOT_NAME

Default: 'scrapybot'

The name of the bot implemented by this Scrapy project (also known as the project name). This will be used to construct the User-Agent by default, and also for logging.

Generally speaking, this is just the name of your project.

CONCURRENT_ITEMS

Default: 100

The maximum number of items that can be processed concurrently (per response) in the Item Pipeline.

CONCURRENT_REQUESTS

Default: 16

The maximum number of concurrent requests that the Scrapy downloader will perform.

CONCURRENT_REQUESTS_PER_DOMAIN

Default: 8

The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.

See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.

CONCURRENT_REQUESTS_PER_IP

Default: 0

The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.

This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.

Note that when this value is non-zero it replaces the CONCURRENT_REQUESTS_PER_DOMAIN setting, and the concurrency limits become per-IP rather than per-domain.
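The sketch below puts these concurrency settings side by side (values are illustrative, not recommendations):

# settings.py -- concurrency-related settings (illustrative values)
CONCURRENT_REQUESTS = 32             # global cap enforced by the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per domain
CONCURRENT_REQUESTS_PER_IP = 0       # 0 = disabled; the per-domain cap applies
CONCURRENT_ITEMS = 100               # items processed in parallel in the item pipeline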

DEFAULT_ITEM_CLASS

DEFAULT_REQUEST_HEADERS

Default:

{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

The default headers used for Scrapy HTTP Requests. They’re populated in the DefaultHeadersMiddleware.

DEPTH_LIMIT

Default: 0

Scope: scrapy.spidermiddlewares.depth.DepthMiddleware

The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.

DEPTH_PRIORITY

Default: 0

Scope: scrapy.spidermiddlewares.depth.DepthMiddleware

An integer that is used to adjust the request priority based on its depth:

  • if zero (default), no priority adjustment is made from depth
  • a positive value will decrease the priority, i.e. higher depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO)
  • a negative value will increase priority, i.e., higher depth requests will be processed sooner (DFO)

This value adjusts the priority of requests according to their crawl depth: if it is greater than 0, deeper requests get a lower priority (breadth-first); if it is less than 0, deeper requests get a higher priority (depth-first).
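For example, the Scrapy documentation suggests a combination along these lines to crawl in breadth-first order:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'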

Others

There are many more useful settings; see https://doc.scrapy.org/en/latest/topics/settings.html#default-global-settings