Scrapy Crawler Learning Series 11: Requests and Responses

Preface

This is one article in a series on learning Scrapy; this chapter covers Requests and Responses.

This article is the author's original work; please credit the source when reposting.

Overview

Scrapy uses Request and Response objects to crawl web pages.

Typically, a Request object is generated in a spider and passed through the system until it reaches the Downloader component; the Downloader executes the request and returns a Response to the spider that issued the Request.

Both Request and Response have their own subclasses; see [Request subclasses] and [Response subclasses] for details.

Request objects

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])

A Request object represents an HTTP request; it is usually generated in a spider and executed by the Downloader, which in turn produces the corresponding Response. See parameters for the full description of every argument; a few important ones are summarized below, and a combined sketch follows the list,

  • callback (callable)
    This callable is invoked with the response as its first argument once the Downloader has successfully downloaded the request; note that if no callback is specified when the Request is constructed, the spider's default parse() method is used instead.
  • meta (dict)

    the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.

    Request.meta is an interesting attribute: every Request carries its own dict, which you can use to store data associated with that particular Request;

  • headers (dict)

    the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.

  • cookies (dict or list)
    The cookies to send with the request; they can be specified in two ways:

    • Using a dict

      request_with_cookies = Request(url="http://www.example.com",
                                     cookies={'currency': 'USD', 'country': 'UY'})
    • Using a list of dicts

      request_with_cookies = Request(url="http://www.example.com",
                                     cookies=[{'name': 'currency',
                                               'value': 'USD',
                                               'domain': 'example.com',
                                               'path': '/currency'}])

      The difference from the former form is that it allows customizing the domain and path attributes of each cookie; this is only useful if the cookies are saved for use in later requests;

    Normally, the web server returns new cookies in its responses and those cookies are applied to subsequent Requests; if you do not want the current cookies to be merged with (and overridden by) the new ones, set dont_merge_cookies to True in Request.meta to disable that behaviour;

    For more on this, see CookiesMiddleware

  • priority (int)

    the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.

  • errback (callable)

    a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
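
Putting these parameters together, here is a minimal sketch (the URL, callback and errback names are hypothetical) of building a fully configured Request inside a spider,

import scrapy

class RequestParamsSpider(scrapy.Spider):
    name = 'request_params_example'

    def start_requests(self):
        yield scrapy.Request(
            url='http://www.example.com/page',   # hypothetical URL
            callback=self.parse_page,            # called with the downloaded response
            errback=self.handle_error,           # called if the download fails
            headers={'Accept-Language': 'en'},
            cookies={'currency': 'USD'},
            meta={'dont_merge_cookies': True},   # keep response cookies from being merged in
            priority=10,                         # higher values are scheduled earlier
            dont_filter=True,                    # bypass the duplicate filter
        )

    def parse_page(self, response):
        self.logger.info('got %s', response.url)

    def handle_error(self, failure):
        self.logger.error(repr(failure))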

Passing additional data to callback functions

As mentioned above, the callback is invoked with the response as its first argument once the Downloader has finished downloading the Request; sometimes, however, you want to pass extra data to that callback, and you can do so through Request.meta; consider the following example,

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item

The flow is: parse_page1() creates a new Request whose callback is parse_page2(), and passes the item along to that callback through request.meta;

Using errbacks to catch exceptions in request processing

If an exception is raised while processing the request, the errback is called; it receives a Twisted Failure instance as its first argument, through which you can detect connection problems, DNS errors and so on; consider the following example,

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

The errback errback_httpbin() handles the various failures; note that its first argument, failure, is the Twisted Failure instance mentioned above (as usual in Python, "first argument" here does not count the implicit self).

Special keys of Request.meta

As shown above, the Request.meta attribute is quite important: it is a dict through which you can, for example, set dont_merge_cookies to keep cookies from being merged, or pass extra data to a callback; this section describes its built-in keys; see special keys for the full details; here is an overview,

  • dont_redirect
  • dont_retry
  • handle_httpstatus_list
  • handle_httpstatus_all
  • dont_merge_cookies (see cookies parameter of Request constructor)
  • cookiejar
  • dont_cache
  • redirect_urls
  • bindaddress
  • dont_obey_robotstxt
  • download_timeout
  • download_maxsize
  • download_latency
  • download_fail_on_dataloss
  • proxy
  • ftp_user (See FTP_USER for more info)
  • ftp_password (See FTP_PASSWORD for more info)
  • referrer_policy
  • max_retry_times

A few of the more important keys are described below;

bindaddress

The IP of the outgoing IP address to use for the performing the request.

This sets the outgoing (source) IP address used to perform the request; it is mainly useful on hosts with more than one network interface, to choose which local address the connection is bound to; since the server replies over that same connection, the Response is still received normally.

download_timeout

The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.

This sets the timeout for a single Request; the global default can be configured via the DOWNLOAD_TIMEOUT setting.
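
For example, a minimal sketch (hypothetical URL) of tightening the timeout for one request only,

from scrapy import Request

# 'download_timeout' is in seconds and applies only to this request;
# other requests still use the global DOWNLOAD_TIMEOUT
request = Request('http://www.example.com/slow-endpoint',
                  meta={'download_timeout': 10})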

download_latency

The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.

That is, how long it took to fetch the response after the request was started; it only becomes available once the response has been downloaded, and it is meant to be read-only.

DOWNLOAD_MAXSIZE

Default: 1073741824 (1024MB)

The maximum response size (in bytes) that downloader will download.

If you want to disable it set to 0.

Note that this limit can be set per spider as well as per Request, as described above; for the feature to take effect, Twisted >= 11.1 is required.
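
A minimal sketch (hypothetical spider and URLs) of both ways of setting the limit: the download_maxsize spider attribute for a whole spider, and the download_maxsize meta key for a single Request,

import scrapy

class MaxsizeSpider(scrapy.Spider):
    name = 'maxsize_example'
    # per-spider limit, in bytes, overriding the global DOWNLOAD_MAXSIZE
    download_maxsize = 10 * 1024 * 1024

    def start_requests(self):
        # per-request limit via Request.meta (hypothetical URL)
        yield scrapy.Request('http://www.example.com/big-file',
                             meta={'download_maxsize': 1024 * 1024},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('downloaded %d bytes', len(response.body))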

DOWNLOAD_WARNSIZE

Default: 33554432 (32MB)

The response size (in bytes) that downloader will start to warn.

If you want to disable it set to 0.

The size above which the downloader starts to issue warnings; like the previous setting, it can be configured per spider or per Request.

DOWNLOAD_FAIL_ON_DATALOSS

Default: True

Whether or not to fail on broken responses, that is, declared Content-Length does not match content sent by the server or chunked response was not properly finish. If True, these responses raise a ResponseFailed([_DataLoss]) error. If False, these responses are passed through and the flag dataloss is added to the response, i.e.: ‘dataloss’ in response.flags is True.

Whether a broken response should be treated as a failure, i.e. when the Content-Length declared by the server does not match the content actually returned, or a chunked response was not properly finished; again, this can be configured globally or for a particular Request.

Note,

If RETRY_ENABLED is True and this setting is set to True, the ResponseFailed([_DataLoss]) failure will be retried as usual.

In other words, if RETRY_ENABLED is set to True, this kind of failure triggers the usual automatic retry.

max_retry_times

This meta key is used to set the retry times per request. When initialized, the max_retry_times meta key takes higher precedence over the RETRY_TIMES setting.

That is, it sets the retry count for an individual request.
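
For example, a minimal sketch (hypothetical URL) of capping the retries for a single request,

from scrapy import Request

# retry this particular request at most twice, regardless of the global RETRY_TIMES
request = Request('http://www.example.com/flaky-endpoint',
                  meta={'max_retry_times': 2})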

Request subclasses

The built-in request subclasses that ship with Scrapy are introduced below;

FormRequest objects

As the name suggests, this subclass is used for dealing with HTML forms;

It uses lxml.html forms to pre-populate form fields with form data from Response objects.

That is, it uses lxml.html forms to pre-populate form fields with form data taken from Response objects;

class scrapy.http.FormRequest(url[, formdata, ...])

As you can see, FormRequest merely adds a new formdata argument to the constructor; the remaining arguments are the same as those of Request;

formdata (dict or iterable of tuples) – is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.

It additionally provides the following class method,

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

This method returns a new FormRequest object whose form values are pre-populated from the <form> element found in the given response;

By default, this class method simulates a click on any <input type="submit"/>-like element; sometimes this is not what you want, and you can avoid it by setting the dont_click argument to True;

See FormRequest objects args for a detailed description of the parameters

Examples

Using FormRequest to send data via HTTP POST

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]

Returning a FormRequest object like this from a spider simulates an HTML form POST;

Using FormRequest.from_response() to simulate a user login

Website forms usually contain <input type='hidden' /> elements carrying session-related data or authentication tokens such as a CSRF token; when scraping, you want from_response() to pre-populate those fields automatically and only fill in some of them by hand, such as the user name, password or an image captcha; consider the following example,

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        # (response.body is bytes, hence the b"" literal)
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

As you can see, from_response() fills in the <input type='hidden' /> form fields automatically, and you only have to supply the user name, password and similar values by hand;

Response objects

class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])

A Response object represents an HTTP response: the Downloader executes a Request, obtains the HTTP response from the server, builds this Scrapy Response object from it, and feeds it back to the spider that issued the Request;

The parameters are as follows,

  • url (string) – the URL of this response
  • status (integer) – the HTTP status of the response. Defaults to 200.
  • headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
  • body (str) – the response body. It must be str, not unicode, unless you’re using an encoding-aware Response subclass, such as TextResponse.
  • flags (list) – is a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
  • request (Request object) – the initial value of the Response.request attribute. This represents the Request that generated this response.
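
To make the parameters concrete, here is a minimal sketch that builds a Response by hand (hypothetical values; during a crawl the Downloader constructs these objects for you),

from scrapy.http import Request, Response

req = Request('http://www.example.com/')
resp = Response(url='http://www.example.com/',
                status=200,
                headers={'Content-Type': 'text/html'},
                body=b'<html><body>ok</body></html>',
                request=req)
print(resp.status, resp.headers.get('Content-Type'), len(resp.body))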

Response subclasses

TextResponse objects

class scrapy.http.TextResponse(url[, encoding[, ...]])

TextResponse objects adds encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.

TextResponse therefore provides one new constructor argument, encoding; the remaining arguments are the same as in the Response constructor;

encoding (string) – is a string which contains the encoding to use for this response. If you create a TextResponse object with a unicode body, it will be encoded using this encoding (remember the body attribute is always a string). If encoding is None (default value), the encoding will be looked up in the response headers and body instead.

In short, this argument sets the encoding used for the response body; if it is None (the default), the encoding is looked up in the response headers and body instead;
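
A minimal sketch (hypothetical content) showing that an explicit encoding argument wins over header/body detection,

from scrapy.http import TextResponse

resp = TextResponse(url='http://www.example.com',
                    body='价格: 100 元'.encode('gbk'),   # bytes encoded as GBK
                    encoding='gbk')
print(resp.text)   # decoded back to unicode using the declared encoding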

Attributes

text

Returns the response body as unicode; the result is cached after the first call, so you can access Response.text repeatedly without penalty;

encoding

The encoding of this response; it is resolved by trying the following mechanisms, in order of decreasing precedence,

  • the encoding passed in the constructor encoding argument
  • the encoding declared in the Content-Type HTTP header. If this encoding is not valid (ie. unknown), it is ignored and the next resolution mechanism is tried.
  • the encoding declared in the response body. The TextResponse class doesn’t provide any special functionality for this. However, the HtmlResponse and XmlResponse classes do.
  • the encoding inferred by looking at the response body. This is the more fragile method but also the last one tried.

In other words, the constructor argument is checked first; if it is not set, the Content-Type HTTP header is consulted; failing that, the response body is examined;

selector

A Selector instance using the response as target. The selector is lazily instantiated on first access.
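
A minimal sketch (hypothetical markup) of querying the lazily created selector,

from scrapy.http import HtmlResponse

resp = HtmlResponse(url='http://www.example.com',
                    body=b'<html><head><title>demo</title></head></html>')
# resp.selector is instantiated here, on first access
print(resp.selector.xpath('//title/text()').extract_first())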

Methods

xpath(query)

A shortcut to TextResponse.selector.xpath(query)

For example,

response.xpath('//p')

css(query)
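
A shortcut to TextResponse.selector.css(query)

For example,

response.css('p')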

follow(url, callback=None, method='GET', ...)

follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None)

Return a Request instance to follow a link url. It accepts the same arguments as Request.__init__ method, but url can be not only an absolute URL, but also

  • a relative URL;
  • a scrapy.link.Link object (e.g. a link extractor result);
  • an attribute Selector (not SelectorList) - e.g. response.css('a::attr(href)')[0] or response.xpath('//img/@src')[0].
  • a Selector for <a> element, e.g. response.css('a.my_link')[0].

See A shortcut for creating Requests for usage examples.

See the related examples of using response.follow
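
A minimal sketch (hypothetical site) of following links from a callback with response.follow,

import scrapy

class FollowLinksSpider(scrapy.Spider):
    name = 'follow_example'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # relative URLs and attribute Selectors are resolved against response.url
        for href in response.css('a::attr(href)'):
            yield response.follow(href, callback=self.parse)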

body_as_unicode()

The same as text, but available as a method. This method is kept for backwards compatibility; please prefer response.text.

HtmlResponse objects

class scrapy.http.HtmlResponse(url[, ...])

The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the HTML meta http-equiv attribute. See TextResponse.encoding.

It subclasses TextResponse and adds encoding auto-discovery by looking at the HTML meta http-equiv attribute;

XmlResponse objects

class scrapy.http.XmlResponse(url[, ...])

The XmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the XML declaration line. See TextResponse.encoding.

It subclasses TextResponse and adds encoding auto-discovery by looking at the XML declaration line;