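Preface

This article is part of a series of study notes on Scrapy. This chapter covers Scrapy Shell.

This is the author's original work; please credit the source when reposting.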

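Introduction

Scrapy Shell lets you quickly test your crawling and extraction logic without running your spider. It is a great help when developing and debugging spiders, and it makes errors quick to locate.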

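Configuration

If IPython is installed, Scrapy shell uses it in place of the standard Python console. IPython is a more powerful Python console that offers auto-completion, colorized output, and other conveniences, so using it with Scrapy shell is highly recommended. For installation and setup, see the IPython installation and configuration steps in the appendix.

With IPython enabled, the Scrapy Shell output is colorized.

Scrapy also supports bpython, which serves as an alternative when IPython is unavailable.

You can choose between ipython, bpython, and the standard python shell through configuration: either set the environment variable SCRAPY_PYTHON_SHELL, or set it in scrapy.cfg as follows: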

[settings]
shell = bpython
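
To make the two configuration routes concrete, here is a minimal sketch of how a shell name could be resolved from them. The helper `resolve_shell` is hypothetical and not part of Scrapy; it also assumes that the environment variable takes precedence over scrapy.cfg and that the standard python shell is the final fallback.

```python
import configparser


def resolve_shell(environ, cfg_text, default="python"):
    """Hypothetical helper (not Scrapy's API): pick a shell name.

    Assumed order: SCRAPY_PYTHON_SHELL, then [settings] shell in
    scrapy.cfg, then the standard python shell.
    """
    from_env = environ.get("SCRAPY_PYTHON_SHELL", "").strip().lower()
    if from_env:
        return from_env

    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if parser.has_option("settings", "shell"):
        return parser.get("settings", "shell").strip().lower()

    return default


CFG = """
[settings]
shell = bpython
"""

print(resolve_shell({}, CFG))
print(resolve_shell({"SCRAPY_PYTHON_SHELL": "ipython"}, CFG))
print(resolve_shell({}, ""))
```

Passing the environment as a plain dict keeps the sketch easy to try out; in real use you would pass `os.environ` instead.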

Launching

The command to launch the shell has the following form:

scrapy shell <url>

The shell also works with local files, which can be specified as follows:

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

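Note that when using a local file, a relative path must be explicitly prefixed with ./ or ../; otherwise it will not be treated as a local file. For example: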

$ scrapy shell index.html
[ ... scrapy shell starts ... ]
[ ... traceback ... ]
twisted.internet.error.DNSLookupError: DNS lookup failed:
address 'index.html' not found: [Errno -5] No address associated with hostname.

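As you can see, when given a bare index.html, Scrapy shell does not treat it as a local file; instead it tries to resolve it through DNS as if it were a host name. The shell favors HTTP URLs over files, and index.html looks syntactically like a domain name such as example.com, so when the path format is not stated explicitly the argument is handled as a URL.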
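
The rule described above can be sketched with the standard library alone. The function `guess_target` is a hypothetical illustration written for this article, not Scrapy's actual implementation: arguments with an explicit path marker become File URIs, and everything else is assumed to be a host name.

```python
from pathlib import Path
from urllib.parse import urlparse


def guess_target(arg):
    """Hypothetical helper: turn a command-line argument into a URL."""
    if urlparse(arg).scheme in ("http", "https", "file"):
        # Already a full URL or File URI: use it unchanged.
        return arg
    if arg.startswith(("./", "../", "/")):
        # Explicit path marker: treat the argument as a local file.
        return Path(arg).resolve().as_uri()
    # Anything else is assumed to be a host name, hence the DNS lookup.
    return "http://" + arg


for candidate in ("index.html", "./index.html", "file:///absolute/path/to/file.html"):
    print(candidate, "->", guess_target(candidate))
```

Under these assumptions a bare index.html ends up as an http:// URL, which is consistent with the DNS lookup error shown in the traceback.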

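Usage

Scrapy Shell is just a regular Python console (or an IPython console) that also provides some additional shortcut functions.

Shortcuts

shelp()
Prints a list of the available objects and shortcuts.
fetch(url[, redirect=True])
Fetches a new response from the given url and updates all related objects accordingly. The redirect parameter controls whether 3xx redirects are followed.
fetch(request)
Fetches a new response from the given request and updates all related objects accordingly.
view(response)
Opens the given response in your local browser. This adds a <base> tag to the response body so that external resources such as images and CSS style sheets display correctly. Note, however, that this creates a temporary file on your machine that is not removed automatically.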
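
To see why view(response) adds a <base> tag, consider the following minimal sketch. The function `add_base_tag` is hypothetical and only illustrates the idea; it is not the code Scrapy runs. Once the page is saved to a local temporary file, relative links such as img/logo.png would no longer point at the original site unless a base URL is declared.

```python
from urllib.parse import urljoin


def add_base_tag(html, url):
    """Hypothetical helper: declare url as the base for relative links."""
    if "<base" in html:
        return html  # the page already declares a base URL
    return html.replace("<head>", '<head><base href="%s">' % url, 1)


PAGE = (
    "<html><head><title>demo</title></head>"
    '<body><img src="img/logo.png"></body></html>'
)

patched = add_base_tag(PAGE, "https://scrapy.org/")
print(patched)

# With the base URL in place, a browser resolves the relative link like this:
print(urljoin("https://scrapy.org/", "img/logo.png"))
```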

Scrapy objects

Scrapy shell automatically creates some convenient objects from the downloaded page, such as the Response and Selector objects. The full list is as follows:

crawler
spider
request
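The Request object of the last fetched page. You can modify it with replace(), or fetch a new request with the fetch shortcut.
response
The Response object of the last fetched page.
settings
The current Scrapy settings; see the Settings documentation.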
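
The idea of a console that starts with ready-made objects can be sketched with Python's standard `code` module. This is only an analogy under stated assumptions: the dictionaries below are stand-ins for the real request and response objects, and nothing here claims to be how Scrapy builds its own shell.

```python
import code

# Stand-in objects; in Scrapy Shell these would be real Request/Response objects.
namespace = {
    "request": {"method": "GET", "url": "http://scrapy.org"},
    "response": {"status": 200, "url": "https://scrapy.org/"},
}

# A console whose namespace is populated before the first command is typed.
console = code.InteractiveConsole(locals=namespace)

# push() feeds one line of source, as if it had been typed at the prompt.
# It returns False when the statement is complete and has been executed.
needs_more = console.push("status = response['status']")

print(needs_more, namespace["status"])
```

Calling `console.interact()` instead of `push()` would open a real prompt with the same names available.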

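An example Scrapy Shell session

The following typical example shows how a Scrapy Shell session goes. We first scrape the http://scrapy.org page, then switch to the https://reddit.com page. Along the way we change the (Reddit) request method to POST, and fetching it again results in a 404 error. Keep in mind that the pages scraped in this example may change at any time as the sites evolve.

First, launch Scrapy Shell:

scrapy shell 'http://scrapy.org' --nolog

The shell then fetches the URL with the Scrapy downloader and prints a list of available objects and useful shortcuts, as shown below: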

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f07395dd690>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x7f07395dd710>
[s]   spider     <DefaultSpider 'default' at 0x7f0735891690>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

>>>

Next, we run through the steps described above:

>>> response.xpath('//title/text()').extract_first()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("http://reddit.com")

>>> response.xpath('//title/text()').extract()
['reddit: the front page of the internet']

>>> request = request.replace(method="POST")

>>> fetch(request)

>>> response.status
404

>>> from pprint import pprint

>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
 'Cache-Control': ['max-age=0, must-revalidate'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
 'Server': ['snooserv'],
 'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
 'Vary': ['accept-encoding'],
 'Via': ['1.1 varnish'],
 'X-Cache': ['MISS'],
 'X-Cache-Hits': ['0'],
 'X-Content-Type-Options': ['nosniff'],
 'X-Frame-Options': ['SAMEORIGIN'],
 'X-Moose': ['majestic'],
 'X-Served-By': ['cache-cdg8730-CDG'],
 'X-Timer': ['S1481214079.394283,VS0,VE159'],
 'X-Ua-Compatible': ['IE=edge'],
 'X-Xss-Protection': ['1; mode=block']}
>>>
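
For readers without the shell at hand, the difference between extract_first() and extract() in the session above can be approximated with the standard library. This is a rough analogy only: `xml.etree.ElementTree` supports just a subset of XPath and needs well-formed markup, unlike Scrapy's selectors, and the page below is a made-up fragment.

```python
import xml.etree.ElementTree as ET

# A small, well-formed stand-in for a downloaded page.
PAGE = """<html>
  <head><title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title></head>
  <body><h1>Hello</h1></body>
</html>"""

root = ET.fromstring(PAGE)

# Like extract_first(): the text of the first match, or None if nothing matches.
title = root.findtext(".//title")
missing = root.findtext(".//h2")

# Like extract(): a list with the text of every match.
titles = [element.text for element in root.iter("title")]

print(title)
print(missing)
print(titles)
```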

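A few points to note

fetch("http://reddit.com")
Switches to fetching the Reddit page.
request = request.replace(method="POST")
Changes the current request to a POST request and fetches again using that request object. The result is a 404 error, which suggests that the Reddit page does not accept POST requests.
pprint
Prints a Python object in a readable layout; here it is used to display the headers of the current response.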
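
As a small standalone illustration of what pprint does, the sketch below formats a headers-like dictionary. The three entries are copied from the session output above; the width of 60 is an arbitrary choice made for this example.

```python
from pprint import pformat

# A few headers taken from the session output, in arbitrary order.
headers = {
    "Vary": ["accept-encoding"],
    "Server": ["snooserv"],
    "Content-Type": ["text/html; charset=UTF-8"],
}

# pformat() returns the same text that pprint() would print.
text = pformat(headers, width=60)
print(text)
```

Because the dictionary does not fit within the requested width, each key is placed on its own line, with keys in sorted order, which is what makes long header dumps readable.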

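Launching Scrapy Shell from a spider to inspect a response

Calling scrapy.shell.inspect_response inside a spider opens a Scrapy Shell on the current response so that you can inspect it. Consider the following example.

Create a new Scrapy project named tutorial

$ scrapy startproject tutorial
$ cd tutorial

Create MySpider
First, create MySpider with genspider: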


$ scrapy genspider MySpider scrapy.org
Created spider 'MySpider' using template 'basic' in module:
  tutorial.spiders.MySpider

Then open MySpider and write the following into it:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://scrapy.org",
        "http://www.baidu.com"
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

Note that the spider is named "myspider".

Run the spider

   $ scrapy crawl myspider
   2017-07-17 11:19:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tutorial)
   ...
   2017-07-17 11:19:24 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://scrapy.org/> from <GET http://scrapy.org>
   2017-07-17 11:19:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org/> (referer: None)
   ....
   >>> response.url
   'https://scrapy.org/'

As shown, when the spider is run from a shell inside the tutorial project directory, execution is intercepted and drops automatically into a Scrapy Shell prompt, where we can do whatever debugging we like.

Inspecting the response
Does it contain this XPath element?
>>> response.xpath('//h1[@class="fn"]')
[]
View the scraped page in the local browser
>>> view(response)
True
Exit
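Press Ctrl-D (Ctrl-Z on Windows) to exit. After exiting, the spider resumes from where it was interrupted, using its own request and response. Anything done to request inside the shell, for example through replace(), has no effect on the original spider. Note that, according to the Scrapy documentation, the fetch shortcut cannot be used in a shell opened this way, because the Scrapy engine is blocked while the shell is open.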
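
One reason changes made in the shell leave the spider untouched is that replace() returns a new object instead of editing the existing one. The sketch below illustrates that copy-on-replace behavior with a hypothetical `FakeRequest` class built on the standard library; it is a stand-in for illustration, not Scrapy's Request.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeRequest:
    """Hypothetical stand-in for a request; frozen, so it cannot be edited in place."""
    url: str
    method: str = "GET"


original = FakeRequest("http://reddit.com")

# replace() builds a copy with the given fields changed.
modified = replace(original, method="POST")

print(original)
print(modified)
```

The original object still reports GET after the call, so any code that holds a reference to it, such as the paused spider in the analogy, sees no change.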

Appendix

Installing and configuring IPython

Installation:

$ pip3 install ipython

Configuration, shown here using the environment variable:

$ vim ~/.bash_profile

Add the following line:

export SCRAPY_PYTHON_SHELL=ipython

Reload the profile:

$ source ~/.bash_profile