爬虫 Scrapy 学习系列之三:Command Line Tool

前言

这是 Scrapy 系列学习文章之一,本章主要介绍 Command Line Tool 相关的内容;

本文为作者的原创作品,转载需注明出处;

Command Line Tool 简介

Scrapy 的命令行工具是在 0.10 版本中引入的;

Scrapy 是由 scrapy command-line tool 所控制的,也称作 Scrapy tool;Scrapy tool 提供了大量的命令来完成相应的操作;

scrapy.cfg

载入方式

scrapy.cfg中配置的是 Scrapy tool 执行时的环境参数,Scrapy 命令启动的时候,会从如下的路径中查找scrapy.cfg文件,

  1. /etc/scrapy.cfg或者c:\scrapy\scrapy.cfg,即首先在系统范围内查找;
  2. 在用户级别的路径中查找~/.config/scrapy.cfg ($XDG_CONFIG_HOME) 和~/.scrapy.cfg($HOME);
  3. 最后在项目的根目录中查找scrapy.cfg;

相同属性如果在不同文件中有所定义,那么将会被合并;合并时的优先级顺序是:用户级别的属性高于系统级别的属性,而项目级别的属性优先级最高;也就是说,如果在项目的scrapy.cfg中定义了某个属性,那么该属性会覆盖其它位置的定义;

内容

笔者搜索了本地所有的 scrapy.cfg 文件,

$ find / -name "scrapy.cfg"
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project/scrapy.cfg
/Users/mac/workspace/python/scrapy/QuotesBot/quotesbot/scrapy.cfg
/Users/mac/workspace/python/scrapy/tutorial/scrapy.cfg

总共找到了三处scrapy.cfg,一处是系统默认的模板配置,另外两处是各自工程所独有的配置;

  1. python3.6/site-packages/scrapy/templates/project/scrapy.cfg

    # Automatically created by: scrapy startproject
    #
    # For more information about the [deploy] section see:
    # https://scrapyd.readthedocs.org/en/latest/deploy.html

    [settings]
    default = ${project_name}.settings

    [deploy]
    #url = http://localhost:6800/
    project = ${project_name}

    模板中包含两部分配置,一部分是 [deploy],一部分是 [settings];[deploy] 部分很简单,只声明了工程名称;[settings] 部分的 default 属性则指向工程模块中的 settings.py;

  2. QuotesBot/quotesbot/scrapy.cfg

    # Automatically created by: scrapy startproject
    #
    # For more information about the [deploy] section see:
    # https://scrapyd.readthedocs.org/en/latest/deploy.html

    [settings]
    default = quotesbot.settings

    [deploy]
    #url = http://localhost:6800/
    project = quotesbot

    scrapy.cfg 中的 [settings] 配置指向的是 quotesbot 工程中的 settings.py 文件,下面看看该文件的内容,

    # -*- coding: utf-8 -*-

    # Scrapy settings for quotesbot project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    # http://doc.scrapy.org/en/latest/topics/settings.html
    # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'quotesbot'

    SPIDER_MODULES = ['quotesbot.spiders']
    NEWSPIDER_MODULE = 'quotesbot.spiders'


    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    # 'quotesbot.middlewares.MyCustomSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    # 'quotesbot.middlewares.MyCustomDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    # 'quotesbot.pipelines.SomePipeline': 300,
    #}

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    上述文件中定义了大量 Scrapy 的配置属性,包括是否启用 Cookie 的 COOKIES_ENABLED、两次下载之间的间隔时间 DOWNLOAD_DELAY 等等;可以使用 settings 命令快速查询某个指定参数的当前取值;
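
    除了 settings 命令之外,也可以在 Python 代码中读取当前工程生效的配置;下面是一个最小的示意(假设脚本在 quotesbot 工程目录内运行,注释中的取值来自上面的 settings.py),效果与 scrapy settings --get <NAME> 类似,

    # 读取当前工程的 Scrapy 设置并打印其中几个字段的取值
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    print(settings.get('BOT_NAME'))                 # quotesbot
    print(settings.getbool('ROBOTSTXT_OBEY'))       # True
    print(settings.getint('CONCURRENT_REQUESTS'))   # 未显式配置时为默认值 16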

Scrapy 项目的默认结构

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

scrapy.cfg放置在项目的根目录中;注意,scrapy.cfg与工程模块目录 myproject 同级,项目的根目录是 myproject 的父目录;scrapy.cfg中的内容如下,

[settings]
default = myproject.settings

很明显,它将 settings.py 中的配置内容作为默认配置;

使用 scrapy tool

可以直接在命令行中输入 scrapy 命令,根据执行位置是否在工程目录内,输出会有所区别,

  1. 在任意地方输入(Scrapy 工程目录以外)

    $ scrapy
    Scrapy 1.4.0 - no active project

    Usage:
    scrapy <command> [options] [args]

    Available commands:
    bench Run quick benchmark test
    fetch Fetch a URL using the Scrapy downloader
    genspider Generate new spider using pre-defined templates
    runspider Run a self-contained spider (without creating a project)
    settings Get settings values
    shell Interactive scraping console
    startproject Create new project
    version Print Scrapy version
    view Open URL in browser, as seen by Scrapy

    [ more ] More commands available when run from project directory

    会打印出 scrapy 当前的版本,以及 scrapy 的命令提示;

  2. Scrapy 工程目录内,这里模拟使用的是 quotesbot 工程;

    quotesbot mac$ scrapy
    Scrapy 1.4.0 - project: quotesbot

    Usage:
    scrapy <command> [options] [args]

    Available commands:
    bench Run quick benchmark test
    check Check spider contracts
    crawl Run a spider
    edit Edit spider
    fetch Fetch a URL using the Scrapy downloader
    genspider Generate new spider using pre-defined templates
    list List available spiders
    parse Parse URL (using its spider) and print the results
    runspider Run a self-contained spider (without creating a project)
    settings Get settings values
    shell Interactive scraping console
    startproject Create new project
    version Print Scrapy version
    view Open URL in browser, as seen by Scrapy

    Use "scrapy <command> -h" to see more info about a command

    可以看到,不仅打印出了 scrapy 的当前版本信息和帮助提示,还输出了当前项目的信息;并且与在项目外执行相比,多出了 check、crawl、edit、list、parse 等只能在项目内使用的命令;

创建项目

使用命令,

$ scrapy startproject myproject [project_dir]

如果省略 project_dir 参数,则会在当前目录下创建一个与 myproject 同名的目录作为项目目录;参考例子 创建 tutorial 项目;

项目控制

有些 scrapy 命令(Project-only commands)必须在 Scrapy 项目内部执行,也就是说,执行命令时的当前路径必须位于项目目录之内;比如,要在当前工程中创建一个名为 mydomain 的爬虫,可以使用如下的命令,

$ scrapy genspider mydomain mydomain.com

生成一个名字为 mydomain 的爬虫且它可爬取的域是 mydomain.com;

当然,还有一些命令比如crawl只能在一个项目的上下文中通过命令行执行;还要注意的是,有些命令在项目中和项目外执行是有些许区别的,比如,fetch命令如果在项目中执行,会有 spider-overridden 的行为(例如使用 spider 中自定义的user_agent属性去覆盖默认的 User-Agent);

命令

命令分为两类:Global commands 和 Project-only commands;

查看帮助

查看某个特殊命令的帮助,

$ scrapy <command> -h

比如执行

$ scrapy bench -h
Usage
=====
scrapy bench

Run quick benchmark test

Options
=======
--help, -h show this help message and exit

Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: INFO)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure

查看所有可用的命令

$ scrapy -h
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

Global 命令 (Global commands)

startproject

  • 语法: scrapy startproject <project_name> [project_dir]
  • 是否需要在项目中执行:否

创建一个新的工程;

genspider

  • 语法: scrapy genspider [-t template] <name> <domain>
  • 是否需要在项目中执行:否

在当前目录或者当前工程的 spiders 目录中创建爬虫;<name>是爬虫的名称,<domain>用来生成爬虫的allowed_domains和start_urls属性;使用如下用例,

$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

上述的例子说明了如何通过模板创建爬虫;
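
以 basic 模板为例,生成的 spiders/example.py 大致类似下面的内容(不同 Scrapy 版本的模板可能略有差异,这里仅作示意),

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # 模板生成的 parse 方法默认为空,解析逻辑需要自行填充
        pass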

settings

  • 语法: scrapy settings [options]
  • 是否需要在项目中执行:否

获得一个 Scrapy 的相关配置数据;如果是在一个工程的上下文中执行,则返回的是当前工程的设置值,如果不是,则会显示默认的设置值;

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0

runspider

  • 语法: scrapy runspider <spider_file.py>
  • 是否需要在项目中执行:否

从一个单独的 Python 文件中执行 spider,而不需要一个工程;使用用例,

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
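
上面的 myspider.py 需要是一个自包含的 spider 文件;下面是一个最小的示意(以 quotes.toscrape.com 为例,文件名与字段名均为假设),保存后无需创建工程即可直接用 scrapy runspider myspider.py 运行,

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # 提取每条名言的文本与作者
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }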

shell

  • 语法: scrapy shell [url]
  • 是否需要在项目中执行:否

在前文 tutorial 的 通过命令行的方式提取 一节中,已经有大篇幅介绍如何使用 scrapy shell 的相关内容,这里不再赘述;

fetch

  • 语法: scrapy fetch <url>
  • 是否需要在项目中执行:否

从给定的 URL 中通过 Scrapy downloader 进行下载并且将下载后的内容写入标准输出;

有意思的是,它会和 spider 一样的去进行下载,比如,如果 spider 自定义了 USER_AGENT,那么 fetch 也将会使用 spider 自定义的 USER_AGENT;所以,这个命令可以帮助我们去检查 spider 是如何去获取一个指定的页面的;

如果在项目之外使用它,那么就没有特别的 spider 可供使用,这个时候它将会使用默认的 Scrapy downloader 的配置;

支持的 options 有,

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider
  • --headers: print the response’s HTTP headers instead of the response’s body
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
'Age': ['1263 '],
'Connection': ['close '],
'Content-Length': ['596'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
'Etag': ['"573c1-254-48c9c87349680"'],
'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
'Server': ['Apache/2.2.3 (CentOS)']}

view

  • 语法: scrapy view <url>
  • 是否需要在项目中执行:否

与通过浏览器直接观察页面不同,spider 所看到的页面可能与普通用户看到的并不一样;view命令会在浏览器中打开给定的 URL,展示的正是 Scrapy spider 所获取到的页面内容,因此可以用它来检查 spider 究竟看到了什么;

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

version

  • 语法: scrapy version [-v]
  • 是否需要在项目中执行:否
$ scrapy version
Scrapy 1.4.0
$ scrapy version -v
Scrapy : 1.4.0
lxml : 3.8.0.0
libxml2 : 2.9.4
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.17.0
Twisted : 17.5.0
Python : 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
pyOpenSSL : 17.0.0 (OpenSSL 1.1.0f 25 May 2017)
Platform : Darwin-16.6.0-x86_64-i386-64bit

bench

  • 语法: scrapy bench
  • 是否需要在项目中执行:否

执行一个快速的性能基准测试(Benchmarking);

项目内部命令 (Project-only commands)

crawl

  • 语法: scrapy crawl <spider>
  • 是否需要在项目中执行:是

启动一个 spider 并开始爬取;看一个例子,

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
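
crawl 还可以通过-a向 spider 传递参数(也可以通过-o将抓取结果写入文件),例如 scrapy crawl myspider -a category=books;spider 可以在构造函数中接收-a传入的参数,下面是一个示意(category 参数与 URL 均为假设),

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a category=books 传入的值会作为构造函数的同名参数
        self.start_urls = ['http://www.example.com/categories/%s' % category]

    def parse(self, response):
        pass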

check

  • 语法: scrapy check [-l] <spider>
  • 是否需要在项目中执行:是

执行 spider 的 Contract 检查;补充一点,Contract 的作用是通过写在 spider 回调方法 docstring 中的一系列简单约定来替代单元测试;
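
check 命令会按照这些约定实际请求页面并校验回调的返回结果;下面是一个最小的示意(以 quotes.toscrape.com 为例,约定中的数量区间为假设),

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes-check-demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        """ 解析列表页并返回名言 items

        @url http://quotes.toscrape.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes text author
        """
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }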

list

  • 语法: scrapy list
  • 是否需要在项目中执行:是

将当前项目中的所有 spiders 全部罗列出来;例子

$ scrapy list
spider1
spider2

edit

  • 语法: scrapy edit <spider>
  • 是否需要在项目中执行:是

使用在环境变量中所定义的EDITOR来对当前的 spider 进行编辑;

$ scrapy edit my_spider

parse

  • 语法: scrapy parse <url> [options]
  • 是否需要在项目中执行:是

获取指定的 URL,然后使用处理该 URL 的爬虫对响应进行解析(parse);解析所使用的方法通过--callback指定,如果不指定则默认使用 parse 方法;

Options

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider
    绕过自动查找 spider 的方式而强制使用某个特定的爬虫;
  • --a NAME=VALUE: set spider argument (may be repeated)
    设置 spider 的参数
  • --callback or -c: spider method to use as callback for parsing the response
    指定用来解析 response 的 spider 回调方法;
  • --pipelines: process items through pipelines
    让抓取到的 items 经过 pipelines 处理;
  • --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
    使用 CrawlSpider 规则来发现 callback
  • --noitems: don’t show scraped items
    不显示抓取到的 items;
  • --nolinks: don’t show extracted links
    不显示提取出来的链接;
  • --nocolour: avoid using pygments to colorize the output
    不使用 pygments 对输出着色;
  • --depth or -d: depth level for which the requests should be followed recursively (default: 1)
    设置递归跟进 requests 的深度,默认是 1;
  • --verbose or -v: display information for each depth level
    为每一层深度查询显示信息;

一个例子,

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': u'Example item',
'category': u'Furniture',
'length': u'12 cm'}]

# Requests -----------------------------------------------------------------
[]
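
上面输出中的 Scraped Items 来自 spider 中名为 parse_item 的回调;一个能产生类似输出的回调大致如下(类名、选择器与字段名均为假设,仅作示意),

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse_item(self, response):
        # 从页面中提取 name / category / length 三个字段
        yield {
            'name': response.css('h1.item-name::text').extract_first(),
            'category': response.css('span.category::text').extract_first(),
            'length': response.css('span.length::text').extract_first(),
        }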

Custom project commands

可以通过COMMANDS_MODULE设置来添加项目特有的自定义命令;Scrapy 源码中 scrapy/commands 目录下自带命令的实现可以作为编写自定义命令的参考样例;

COMMANDS_MODULE

默认: ''(空字符串)

指定一个用来查找自定义 Scrapy 命令的模块;样例,

COMMANDS_MODULE = 'mybot.commands'
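
COMMANDS_MODULE 指向的模块中,每个命令对应一个继承自 ScrapyCommand 的类,命令名即模块(文件)名;下面是一个最小的示意(命令名 hello 与文件路径均为假设),

# mybot/commands/hello.py
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):

    requires_project = True

    def short_desc(self):
        # 显示在 `scrapy -h` 命令列表中的一行简介
        return 'Print a hello message'

    def run(self, args, opts):
        print('hello from a custom scrapy command')

配置好 COMMANDS_MODULE = 'mybot.commands' 之后,在该工程内执行 scrapy hello 即可触发这个命令;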

通过 setup.py 注册命令的执行入口

你同样可以在一个外部 library 的setup.py文件中,通过在entry_points里添加scrapy.commands部分来注册 Scrapy 命令;来看一个例子,

from setuptools import setup, find_packages

setup(
    name='scrapy-mymodule',
    entry_points={
        'scrapy.commands': [
            'my_command=my_scrapy_module.commands:MyCommand',
        ],
    },
)

Reference

https://doc.scrapy.org/en/latest/topics/commands.html