爬虫 Scrapy 学习系列之三:Command Line Tool

前言

这是 Scrapy 系列学习文章之一,本章主要介绍 Command Line Tool 相关的内容;

本文为作者的原创作品,转载需注明出处;

Command Line Tool 简介

Scrapy 的命令行工具是在 0.10 版本中引入的;

Scrapy 是由 scrapy command-line tool 所控制的,也称作 Scrapy tool;Scrapy tool 提供了大量的命令来完成相应的操作;

scrapy.cfg

载入方式

scrapy.cfg中配置的是 Scrapy tool 执行时的环境参数,Scrapy 命令启动的时候,会从如下的路径中查找scrapy.cfg文件,

  1. /etc/scrapy.cfg或者c:\scrapy\scrapy.cfg,即首先在系统范围内查找;
  2. 在用户级别的路径中查找~/.config/scrapy.cfg ($XDG_CONFIG_HOME) 和~/.scrapy.cfg($HOME);
  3. 最后在项目的根目录中查找scrapy.cfg;

相同属性如果在不同文件中有所定义,那么将会被合并;合并时的优先级顺序是:用户级别的属性高于系统级别的属性,而项目级别的属性优先级最高;也就是说,如果在项目的scrapy.cfg中定义了某个属性,那么该属性会覆盖其它位置的定义;

内容

笔者搜索了本地所有的 scrapy.cfg 文件,

$ find / -name "scrapy.cfg"
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project/scrapy.cfg
/Users/mac/workspace/python/scrapy/QuotesBot/quotesbot/scrapy.cfg
/Users/mac/workspace/python/scrapy/tutorial/scrapy.cfg

总共找到了三处scrapy.cfg,一处是系统默认的模板配置,另外两处是各自工程所独有的配置;

  1. python3.6/site-packages/scrapy/templates/project/scrapy.cfg

    # Automatically created by: scrapy startproject
    #
    # For more information about the [deploy] section see:
    # https://scrapyd.readthedocs.org/en/latest/deploy.html

    [settings]
    default = ${project_name}.settings

    [deploy]
    #url = http://localhost:6800/
    project = ${project_name}

    模板中包含两部分配置,一部分是 [deploy],一部分是 [settings];[deploy] 部分很简单,只声明了工程名称;[settings] 部分的 default 属性则指向工程模块中的 settings.py;

  2. QuotesBot/quotesbot/scrapy.cfg

    # Automatically created by: scrapy startproject
    #
    # For more information about the [deploy] section see:
    # https://scrapyd.readthedocs.org/en/latest/deploy.html

    [settings]
    default = quotesbot.settings

    [deploy]
    #url = http://localhost:6800/
    project = quotesbot

    scrapy.cfg 中的 [settings] 配置指向的是 quotesbot 工程中的 settings.py 文件,下面看看该文件的内容,

    # -*- coding: utf-8 -*-

    # Scrapy settings for quotesbot project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    # http://doc.scrapy.org/en/latest/topics/settings.html
    # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'quotesbot'

    SPIDER_MODULES = ['quotesbot.spiders']
    NEWSPIDER_MODULE = 'quotesbot.spiders'


    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    # 'quotesbot.middlewares.MyCustomSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    # 'quotesbot.middlewares.MyCustomDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    # 'quotesbot.pipelines.SomePipeline': 300,
    #}

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    上述文件中定义了大量 Scrapy 的配置属性,包括是否启用 Cookie 的 COOKIES_ENABLED、两次下载之间的间隔时间 DOWNLOAD_DELAY 等等;可以使用 settings 命令快速查询某个指定参数的当前取值;
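
    除了 settings 命令之外,也可以在 Python 代码中读取当前工程生效的配置;下面是一个最小的示意(假设脚本在 quotesbot 工程目录内运行,注释中的取值来自上面的 settings.py),效果与 scrapy settings --get <NAME> 类似,

    # 读取当前工程的 Scrapy 设置并打印其中几个字段的取值
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    print(settings.get('BOT_NAME'))                 # quotesbot
    print(settings.getbool('ROBOTSTXT_OBEY'))       # True
    print(settings.getint('CONCURRENT_REQUESTS'))   # 未显式配置时为默认值 16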

Scrapy 项目的默认结构

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

scrapy.cfg放置在项目的根目录中;注意,scrapy.cfg与工程模块目录 myproject 同级,项目的根目录是 myproject 的父目录;scrapy.cfg中的内容如下,

[settings]
default = myproject.settings

很明显,它将 settings.py 中的配置内容作为默认配置;

使用 scrapy tool

可以直接在命令行中输入 scrapy 命令,根据执行位置是否在工程目录内,输出会有所区别,

  1. 在任意地方输入(Scrapy 工程目录以外)

    $ scrapy
    Scrapy 1.4.0 - no active project

    Usage:
    scrapy <command> [options] [args]

    Available commands:
    bench Run quick benchmark test
    fetch Fetch a URL using the Scrapy downloader
    genspider Generate new spider using pre-defined templates
    runspider Run a self-contained spider (without creating a project)
    settings Get settings values
    shell Interactive scraping console
    startproject Create new project
    version Print Scrapy version
    view Open URL in browser, as seen by Scrapy

    [ more ] More commands available when run from project directory

    会打印出 scrapy 当前的版本,以及 scrapy 的命令提示;

  2. Scrapy 工程目录内,这里模拟使用的是 quotesbot 工程;

    quotesbot mac$ scrapy
    Scrapy 1.4.0 - project: quotesbot

    Usage:
    scrapy <command> [options] [args]

    Available commands:
    bench Run quick benchmark test
    check Check spider contracts
    crawl Run a spider
    edit Edit spider
    fetch Fetch a URL using the Scrapy downloader
    genspider Generate new spider using pre-defined templates
    list List available spiders
    parse Parse URL (using its spider) and print the results
    runspider Run a self-contained spider (without creating a project)
    settings Get settings values
    shell Interactive scraping console
    startproject Create new project
    version Print Scrapy version
    view Open URL in browser, as seen by Scrapy

    Use "scrapy <command> -h" to see more info about a command

    可以看到,不仅打印出了 scrapy 的当前版本信息和帮助提示,还输出了当前项目的信息;并且与在项目外执行相比,多出了 check、crawl、edit、list、parse 等只能在项目内使用的命令;

创建项目

使用命令,

$ scrapy startproject myproject [project_dir]

如果省略 project_dir 参数,则会在当前目录下创建一个与 myproject 同名的目录作为项目目录;参考例子 创建 tutorial 项目;

项目控制

有些 scrapy 命令(Project-only commands)必须在 Scrapy 项目内部执行,也就是说,执行命令时的当前路径必须位于项目目录之内;比如,要在当前工程中创建一个名为 mydomain 的爬虫,可以使用如下的命令,

$ scrapy genspider mydomain mydomain.com

生成一个名字为 mydomain 的爬虫且它可爬取的域是 mydomain.com;

当然,还有一些命令比如crawl只能在一个项目的上下文中通过命令行执行;还要注意的是,有些命令在项目中和项目外执行是有些许区别的,比如,fetch命令如果在项目中执行,会有 spider-overridden 的行为(例如使用 spider 中自定义的user_agent属性去覆盖默认的 User-Agent);

命令

命令分为两类:Global commands 和 Project-only commands;

查看帮助

查看某个特殊命令的帮助,

$ scrapy <command> -h

比如执行

$ scrapy bench -h
Usage
=====
scrapy bench

Run quick benchmark test

Options
=======
--help, -h show this help message and exit

Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: INFO)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure

查看所有可用的命令

$ scrapy -h
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

Global 命令 (Global commands)

startproject

  • 语法: scrapy startproject <project_name> [project_dir]
  • 是否需要在项目中执行:否

创建一个新的工程;

genspider

  • 语法: scrapy genspider [-t template] <name> <domain>
  • 是否需要在项目中执行:否

在当前目录或者当前工程的 spiders 目录中创建爬虫;<name>是爬虫的名称,<domain>用来生成爬虫的allowed_domains和start_urls属性;使用如下用例,

$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

上述的例子说明了如何通过模板创建爬虫;
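
以 basic 模板为例,生成的 spiders/example.py 大致类似下面的内容(不同 Scrapy 版本的模板可能略有差异,这里仅作示意),

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # 模板生成的 parse 方法默认为空,解析逻辑需要自行填充
        pass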

settings

  • 语法: scrapy settings [options]
  • 是否需要在项目中执行:否

获得一个 Scrapy 的相关配置数据;如果是在一个工程的上下文中执行,则返回的是当前工程的设置值,如果不是,则会显示默认的设置值;

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0

runspider

  • 语法: scrapy runspider <spider_file.py>
  • 是否需要在项目中执行:否

从一个单独的 Python 文件中执行 spider,而不需要一个工程;使用用例,

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
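
上面的 myspider.py 需要是一个自包含的 spider 文件;下面是一个最小的示意(以 quotes.toscrape.com 为例,文件名与字段名均为假设),保存后无需创建工程即可直接用 scrapy runspider myspider.py 运行,

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # 提取每条名言的文本与作者
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }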

shell

  • 语法: scrapy shell [url]
  • 是否需要在项目中执行:否

在前文 tutorial 的 通过命令行的方式提取 一节中,已经有大篇幅介绍如何使用 scrapy shell 的相关内容,这里不再赘述;

fetch

  • 语法: scrapy fetch <url>
  • 是否需要在项目中执行:否

从给定的 URL 中通过 Scrapy downloader 进行下载并且将下载后的内容写入标准输出;

有意思的是,它会和 spider 一样的去进行下载,比如,如果 spider 自定义了 USER_AGENT,那么 fetch 也将会使用 spider 自定义的 USER_AGENT;所以,这个命令可以帮助我们去检查 spider 是如何去获取一个指定的页面的;

如果在项目之外使用它,那么就没有特别的 spider 可供使用,这个时候它将会使用默认的 Scrapy downloader 的配置;

支持的 options 有,

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider
  • --headers: print the response’s HTTP headers instead of the response’s body
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
'Age': ['1263 '],
'Connection': ['close '],
'Content-Length': ['596'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
'Etag': ['"573c1-254-48c9c87349680"'],
'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
'Server': ['Apache/2.2.3 (CentOS)']}

view

  • 语法: scrapy view <url>
  • 是否需要在项目中执行:否

与通过浏览器直接观察页面不同,spider 所看到的页面可能与普通用户看到的并不一样;view命令会在浏览器中打开给定的 URL,展示的正是 Scrapy spider 所获取到的页面内容,因此可以用它来检查 spider 究竟看到了什么;

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

version

  • 语法: scrapy version [-v]
  • 是否需要在项目中执行:否
$ scrapy version
Scrapy 1.4.0
$ scrapy version -v
Scrapy : 1.4.0
lxml : 3.8.0.0
libxml2 : 2.9.4
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.17.0
Twisted : 17.5.0
Python : 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
pyOpenSSL : 17.0.0 (OpenSSL 1.1.0f 25 May 2017)
Platform : Darwin-16.6.0-x86_64-i386-64bit

bench

  • 语法: scrapy bench
  • 是否需要在项目中执行:否

执行一个快速的性能基准测试(Benchmarking);

项目内部命令 (Project-only commands)

crawl

  • 语法: scrapy crawl <spider>
  • 是否需要在项目中执行:是

启动一个 spider 并开始爬取;看一个例子,

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
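
crawl 还可以通过-a向 spider 传递参数(也可以通过-o将抓取结果写入文件),例如 scrapy crawl myspider -a category=books;spider 可以在构造函数中接收-a传入的参数,下面是一个示意(category 参数与 URL 均为假设),

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a category=books 传入的值会作为构造函数的同名参数
        self.start_urls = ['http://www.example.com/categories/%s' % category]

    def parse(self, response):
        pass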

check

  • 语法: scrapy check [-l] <spider>
  • 是否需要在项目中执行:是

执行 spider 的 Contract 检查;补充一点,Contract 的作用是通过写在 spider 回调方法 docstring 中的一系列简单约定来替代单元测试;
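
check 命令会按照这些约定实际请求页面并校验回调的返回结果;下面是一个最小的示意(以 quotes.toscrape.com 为例,约定中的数量区间为假设),

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes-check-demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        """ 解析列表页并返回名言 items

        @url http://quotes.toscrape.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes text author
        """
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }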

list

  • 语法: scrapy list
  • 是否需要在项目中执行:是

将当前项目中的所有 spiders 全部罗列出来;例子

$ scrapy list
spider1
spider2

edit

  • 语法: scrapy edit <spider>
  • 是否需要在项目中执行:是

使用在环境变量中所定义的EDITOR来对当前的 spider 进行编辑;

$ scrapy edit my_spider

parse

  • 语法: scrapy parse <url> [options]
  • 是否需要在项目中执行:是

获取指定的 URL,然后使用处理该 URL 的爬虫对响应进行解析(parse);解析所使用的方法通过--callback指定,如果不指定则默认使用 parse 方法;

Options

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider
    绕过自动查找 spider 的方式而强制使用某个特定的爬虫;
  • --a NAME=VALUE: set spider argument (may be repeated)
    设置 spider 的参数
  • --callback or -c: spider method to use as callback for parsing the response
    指定用来解析 response 的 spider 回调方法;
  • --pipelines: process items through pipelines
    让抓取到的 items 经过 pipelines 处理;
  • --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
    使用 CrawlSpider 规则来发现 callback
  • --noitems: don’t show scraped items
    不显示抓取到的 items;
  • --nolinks: don’t show extracted links
    不显示提取出来的链接;
  • --nocolour: avoid using pygments to colorize the output
    不使用 pygments 对输出着色;
  • --depth or -d: depth level for which the requests should be followed recursively (default: 1)
    设置递归跟进 requests 的深度,默认是 1;
  • --verbose or -v: display information for each depth level
    为每一层深度查询显示信息;

一个例子,

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': u'Example item',
'category': u'Furniture',
'length': u'12 cm'}]

# Requests -----------------------------------------------------------------
[]
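
上面输出中的 Scraped Items 来自 spider 中名为 parse_item 的回调;一个能产生类似输出的回调大致如下(类名、选择器与字段名均为假设,仅作示意),

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse_item(self, response):
        # 从页面中提取 name / category / length 三个字段
        yield {
            'name': response.css('h1.item-name::text').extract_first(),
            'category': response.css('span.category::text').extract_first(),
            'length': response.css('span.length::text').extract_first(),
        }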

Custom project commands

可以通过COMMANDS_MODULE设置来添加项目特有的自定义命令;Scrapy 源码中 scrapy/commands 目录下自带命令的实现可以作为编写自定义命令的参考样例;

COMMANDS_MODULE

默认: ''(空字符串)

指定一个用来查找自定义 Scrapy 命令的模块;样例,

COMMANDS_MODULE = 'mybot.commands'
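
COMMANDS_MODULE 指向的模块中,每个命令对应一个继承自 ScrapyCommand 的类,命令名即模块(文件)名;下面是一个最小的示意(命令名 hello 与文件路径均为假设),

# mybot/commands/hello.py
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):

    requires_project = True

    def short_desc(self):
        # 显示在 `scrapy -h` 命令列表中的一行简介
        return 'Print a hello message'

    def run(self, args, opts):
        print('hello from a custom scrapy command')

配置好 COMMANDS_MODULE = 'mybot.commands' 之后,在该工程内执行 scrapy hello 即可触发这个命令;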

通过 setup.py 注册命令的执行入口

你同样可以在一个外部 library 的setup.py文件中,通过在entry_points里添加scrapy.commands部分来注册 Scrapy 命令;来看一个例子,

from setuptools import setup, find_packages

setup(
    name='scrapy-mymodule',
    entry_points={
        'scrapy.commands': [
            'my_command=my_scrapy_module.commands:MyCommand',
        ],
    },
)

Reference

https://doc.scrapy.org/en/latest/topics/commands.html