Scrapy Crawler Learning Series 15: Logging

Preface

This is one of the articles in my Scrapy learning series; this chapter covers logging.

This article is the author's original work; please credit the source when reposting.

Introduction

Note that scrapy.log has been deprecated; logging is now done through Python's built-in standard logging module, and Scrapy's logging can be further configured through the Logging settings.

Scrapy calls scrapy.utils.log.configure_logging() to set some reasonable defaults and handle those settings in Logging settings when running commands, so it’s recommended to manually call it if you’re running Scrapy from scripts as described in Run Scrapy from a script.

Log levels

Python’s builtin logging defines 5 different levels to indicate the severity of a given log message. Here are the standard ones, listed in decreasing order:

  1. logging.CRITICAL - for critical errors (highest severity)
  2. logging.ERROR - for regular errors
  3. logging.WARNING - for warning messages
  4. logging.INFO - for informational messages
  5. logging.DEBUG - for debugging messages (lowest severity)
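
As a quick illustration of how severity filtering works (a minimal sketch using only the standard library; the logger name mylogger is made up for this example), a logger whose level is set to WARNING will drop INFO and DEBUG messages:

import logging

# Let the root handler accept everything, so the filtering below comes
# from the logger's own level
logging.basicConfig(level=logging.DEBUG)

logger = logging.getLogger('mylogger')  # example name
logger.setLevel(logging.WARNING)

logger.debug("filtered out: DEBUG is below WARNING")
logger.info("filtered out: INFO is below WARNING")
logger.warning("printed: WARNING meets the threshold")
logger.error("printed: ERROR is above WARNING")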

Logging messages

import logging
logging.warning("This is a warning")

You can also use the logging.log method, which takes the log level as an argument:

import logging
logging.log(logging.WARNING, "This is a warning")

The calls above use the root logger; they are equivalent to:

import logging
logger = logging.getLogger()
logger.warning("This is a warning")

Use logging.getLogger to obtain different logger instances by name:

import logging
logger = logging.getLogger('mycustomlogger')
logger.warning("This is a warning")

Instead of hard-coding a name, you can use the built-in __name__ variable, which holds the current module's path:

import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")

For more about Python logging, see the Basic Logging Tutorial and the further documentation on loggers in the Python docs.

Logging from Spiders

Scrapy provides a logger within each Spider instance, which can be accessed and used like this:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['http://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)

The logger above defaults to the name of the current Spider, i.e. myspider; of course, you can also use a custom logger of your own:

import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['http://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)

Logging Configuration

Loggers themselves do not care how a message is delivered or stored; that work is done by handlers, which forward messages to their destinations, such as standard output, files, emails, and so on.

By default, Scrapy sets up and configures a handler for the root logger based on the settings below.

Logging settings

  • LOG_FILE
  • LOG_ENABLED
  • LOG_ENCODING
  • LOG_LEVEL
  • LOG_FORMAT
  • LOG_DATEFORMAT
  • LOG_STDOUT
  • LOG_SHORT_NAMES

The first couple of settings define a destination for log messages. If LOG_FILE is set, messages sent through the root logger will be redirected to a file named LOG_FILE with encoding LOG_ENCODING. If unset and LOG_ENABLED is True, log messages will be displayed on the standard error. Lastly, if LOG_ENABLED is False, there won’t be any visible log output.

LOG_LEVEL determines the minimum level of severity to display, those messages with lower severity will be filtered out. It ranges through the possible levels listed in Log levels.

LOG_FORMAT and LOG_DATEFORMAT specify formatting strings used as layouts for all messages. Those strings can contain any placeholders listed in logging’s logrecord attributes docs and datetime’s strftime and strptime directives respectively.

If LOG_SHORT_NAMES is set, then the logs will not display the scrapy component that prints the log. It is unset by default, hence logs contain the scrapy component responsible for that log output.
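
For illustration, these settings usually go in the project's settings.py; the values below are example choices, not Scrapy's defaults:

# settings.py -- example values, not Scrapy's defaults
LOG_ENABLED = True
LOG_FILE = 'scrapy.log'        # write log output to this file instead of stderr
LOG_ENCODING = 'utf-8'
LOG_LEVEL = 'INFO'             # drop DEBUG messages
LOG_FORMAT = '%(asctime)s %(levelname)s %(name)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'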

Command-line options

There are command-line arguments, available for all commands, that you can use to override some of the Scrapy settings regarding logging.

  • --logfile FILE
    Overrides LOG_FILE
  • --loglevel/-L LEVEL
    Overrides LOG_LEVEL
  • --nolog
    Sets LOG_ENABLED to False

Logging handlers

See the official Python documentation: https://docs.python.org/3/library/logging.handlers.html
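
As an illustration (a minimal sketch; the file name, size limit, and backup count are arbitrary example values), one of these handlers can be attached to the 'scrapy' logger to rotate its log output by file size:

import logging
from logging.handlers import RotatingFileHandler

# Example only: rotate 'scrapy.log' once it reaches ~5 MB, keeping 3 backups
handler = RotatingFileHandler('scrapy.log', maxBytes=5 * 1024 * 1024, backupCount=3)
handler.setFormatter(logging.Formatter('%(asctime)s [%(name)s] %(levelname)s: %(message)s'))

# Every Scrapy component logs under the 'scrapy' namespace, so attach it there
logging.getLogger('scrapy').addHandler(handler)

Note that messages still propagate to the root logger, so with Scrapy's default root handler installed you may see them twice; combine this with configure_logging(install_root_handler=False), described below, if that is not what you want.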

Advanced customization

Suppose you want to silence noisy INFO messages like the following:

2016-12-16 22:00:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring
response <500 http://quotes.toscrape.com/page/1-34/>: HTTP status code
is not handled or not allowed

First, note from the log that the logger's name is [scrapy.spidermiddlewares.httperror] (if you only see the short name [scrapy], set LOG_SHORT_NAMES to False).

Then simply raise that logger's level in your code to something higher, for example WARNING, so that lower-severity messages are no longer printed:

import logging
import scrapy


class MySpider(scrapy.Spider):
    # ...
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)

scrapy.utils.log module

scrapy.utils.log.configure_logging(settings=None, install_root_handler=True)

Initialize logging defaults for Scrapy.

Parameters

  • settings (dict, Settings object or None) – settings used to create and configure a handler for the root logger (default: None).
  • install_root_handler (bool) – whether to install root logging handler (default: True)

This function does the following:

  • Route warnings and twisted logging through Python standard logging
  • Assign DEBUG and ERROR level to Scrapy and Twisted loggers respectively
  • Route stdout to log if LOG_STDOUT setting is True

When install_root_handler is True (default), this function also creates a handler for the root logger according to the given settings (see Logging settings). You can override the default options via the settings argument; when settings is empty or None, defaults are used.

If you plan on configuring the handlers yourself, it is still recommended that you call this function, passing install_root_handler=False. Bear in mind there won't be any log output set by default in that case.

To get started with manually configuring logging's output, you can use logging.basicConfig() to set a basic root handler. Here is an example of how to redirect INFO or higher messages to a file:

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

For a more complete example, see https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script

My Summary

Real-world logging scenarios are more complex than the few examples in the official tutorial. A typical case is to group Spiders and use a different handler for each group, for example routing each group's logs to a different log file; a sketch of this idea follows.
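
Below is a minimal sketch of that idea, assuming a made-up group-to-file mapping (GROUPS, the group names, and the log paths are all illustrative): each spider attaches a FileHandler chosen by its group when it is initialized.

import logging
import os

import scrapy

# Hypothetical mapping from spider group to log file; names and paths are examples
GROUPS = {
    'news': 'logs/news.log',
    'shop': 'logs/shop.log',
}


class GroupedSpider(scrapy.Spider):
    # Spiders inheriting from this base class declare a `group` attribute
    group = 'news'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        os.makedirs('logs', exist_ok=True)
        handler = logging.FileHandler(GROUPS[self.group], encoding='utf-8')
        handler.setFormatter(
            logging.Formatter('%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
        # self.logger writes to the logger named after the spider,
        # so attach the group's file handler there
        logging.getLogger(self.name).addHandler(handler)

Each concrete spider then sets group = 'news' or group = 'shop', and its self.logger output ends up in the corresponding file; combine this with configure_logging(install_root_handler=False) when running from a script if you do not also want the default console output.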