Preface
This is one of a series of articles on learning Scrapy; this chapter mainly covers the Settings-related material.
This article is the author's original work; please credit the source when reproducing it.
Overview
Scrapy settings are designed to let you customize the behaviour of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
Internally, settings provide a global namespace of key-value mappings from which code can pull configuration values; there are several mechanisms for populating settings, introduced later in this article.
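As a quick illustration of this key-value namespace, here is a minimal sketch that exercises the Settings class directly; the keys used are ordinary built-in setting names except MYEXT_ENABLED, which is hypothetical, and the values are made up:

```python
from scrapy.settings import Settings

# a throwaway Settings object populated from a plain dict;
# Settings also pre-loads Scrapy's built-in defaults
settings = Settings({'BOT_NAME': 'demo', 'RETRY_TIMES': '5'})

print(settings.get('BOT_NAME'))        # 'demo'
print(settings.getint('RETRY_TIMES'))  # 5 -- typed accessors coerce stored values
print(settings.getbool('MYEXT_ENABLED', False))  # hypothetical key -> fallback False
```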
For a list of available built-in settings see: Built-in settings reference.
Designating the settings
When you use Scrapy, you have to tell it which settings you're using. You can do this through the environment variable SCRAPY_SETTINGS_MODULE; note that the value of SCRAPY_SETTINGS_MODULE must be on Python's import search path.
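For example, when driving Scrapy from a plain Python script rather than the scrapy command, a common pattern is to point SCRAPY_SETTINGS_MODULE at your settings module before loading it; myproject.settings below is a hypothetical module name:

```python
import os

# the module must be importable, i.e. on Python's import search path
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'myproject.settings')

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('BOT_NAME'))
```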
Populating the settings
Settings can be populated through several different mechanisms:
- Command line options (most precedence)
- Settings per-spider
- Project settings module
- Default settings per-command
- Default global settings (less precedence)
The mechanisms above are listed in decreasing order of precedence, from highest (top) to lowest (bottom).
Command line options
Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option.
```bash
scrapy crawl myspider -s LOG_FILE=scrapy.log
```
Settings per-spider
Spiders (see the Spiders chapter for reference) can define their own settings that will take precedence over and override the project-level ones. They can do so by setting their custom_settings attribute:
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
```
This recalls the example the author discussed in an earlier article in this series, "How to define different Item Pipelines for different Spiders".
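A minimal sketch of that pattern, assuming a hypothetical pipeline class myproject.pipelines.MySpiderPipeline:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    # only this spider runs through MySpiderPipeline;
    # other spiders keep the project-level ITEM_PIPELINES
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.MySpiderPipeline': 300,
        },
    }
```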
Project settings module
The project settings module is the standard configuration file for your Scrapy project; it's where most of your custom settings will be populated. For a standard Scrapy project, this means you'll be adding or changing the settings in the settings.py file created for your project.
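For instance, after running scrapy startproject, a few typical edits to settings.py might look like this (the project name myproject and the values shown are illustrative):

```python
# settings.py -- project-level settings, overriding the built-in defaults
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

DOWNLOAD_DELAY = 0.5  # throttle requests project-wide
```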
Default settings per-command
Each Scrapy tool command can have its own default settings, which override
the global default settings. Those custom command settings are specified in the default_settings attribute of the command class.
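As a sketch, a hypothetical custom command could ship its own defaults like this (the class and its values are assumptions, not part of Scrapy):

```python
from scrapy.commands import ScrapyCommand

class QuietCrawlCommand(ScrapyCommand):
    # per-command defaults: override the global defaults, but are still
    # overridden by the project settings module and the command line
    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Crawl quietly (logging disabled by default)"

    def run(self, args, opts):
        pass  # command logic would go here
```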
Default global settings
The global defaults are located in the scrapy.settings.default_settings module and documented in the Built-in settings reference section.
Accessing settings
In a spider, the settings are available through self.settings:
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
```
Note:
The settings attribute is set in the base Spider class after the spider is initialized. If you want to use the settings before the initialization (e.g., in your spider's __init__() method), you'll need to override the from_crawler() method.
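A minimal sketch of such an override; the spider name and the passed-through USER_AGENT value are illustrative:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, user_agent=None, **kwargs):
        super().__init__(*args, **kwargs)
        # self.settings is NOT available yet at this point
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # read the setting here and hand it to __init__() explicitly
        kwargs['user_agent'] = crawler.settings.get('USER_AGENT')
        return super().from_crawler(crawler, *args, **kwargs)
```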
Settings can also be accessed in extensions, middlewares and item pipelines through the Crawler object passed to the from_crawler() method. Here is an extension-related example:
```python
class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))
```
Rationale for setting names
Setting names are usually prefixed with the component that they configure. For example, proper setting names for a fictional robots.txt extension would be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, etc.
In other words, a well-chosen setting name carries a prefix naming the component it configures.
Built-in settings reference
There are far too many built-in settings to list one by one; this section instead describes a few of the core ones.
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
BOT_NAME
Default: 'scrapybot'
The name of the bot implemented by this Scrapy project (also known as the project name). This will be used to construct the User-Agent by default, and also for logging.
Generally speaking, this is just the name of your project.
CONCURRENT_ITEMS
Default: 100
The maximum number of items that may be processed concurrently in the Item Pipeline.
CONCURRENT_REQUESTS
Default: 16
The maximum number of requests that the Scrapy downloader will perform concurrently.
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.
CONCURRENT_REQUESTS_PER_IP
Default: 0
The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.
This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
Note that when this value is non-zero, it takes over from CONCURRENT_REQUESTS_PER_DOMAIN: the concurrency limit is applied per IP rather than per domain; the sketch below puts these concurrency settings side by side.
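A project might tune these knobs in settings.py like this (the values are illustrative, not recommendations):

```python
# settings.py -- illustrative concurrency tuning
CONCURRENT_ITEMS = 100              # items processed in parallel by the pipeline
CONCURRENT_REQUESTS = 32            # global downloader concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 0      # non-zero would switch the cap to per-IP
```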
DEFAULT_ITEM_CLASS
DEFAULT_REQUEST_HEADERS
Default:

```python
{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```
The default headers used for Scrapy HTTP Requests. They're populated in the DefaultHeadersMiddleware.
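A project can override these defaults in its settings.py; the Accept-Language value below is just an illustrative choice:

```python
# settings.py -- overriding the default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}
```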
DEPTH_LIMIT
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_PRIORITY
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
An integer that is used to adjust the request priority based on its depth:
- if zero (default), no priority adjustment is made from depth
- a positive value will decrease the priority, i.e. higher-depth requests will be processed later; this is commonly used when doing breadth-first crawls (BFO)
- a negative value will increase the priority, i.e. higher-depth requests will be processed sooner (DFO)
In short, this value tunes the priority of deeper links: a value > 0 lowers the priority of deeper requests (breadth-first tendency), while a value < 0 raises it (depth-first tendency); a breadth-first configuration is sketched below.
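For instance, the breadth-first (BFO) setup described in the Scrapy FAQ combines a positive DEPTH_PRIORITY with FIFO scheduler queues:

```python
# settings.py -- crawl in breadth-first order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```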
Others
There are many more useful settings; for the full list, see https://doc.scrapy.org/en/latest/topics/settings.html#default-global-settings