Scrapy Crawler Learning Series, Part 16: Stats Collection

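Preface

This article is part of a series on learning Scrapy; this chapter covers Stats Collection.

This article is the author's original work; please credit the source when reposting.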

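Introduction

Scrapy collects statistics as key/value pairs, where the values are usually counters. This facility is called the Stats Collector, and it can be accessed through the stats attribute of the Crawler API; see Common Stats Collector uses for examples.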

The Stats Collector keeps a stats table per open spider which is automatically opened when the spider is opened, and closed when the spider is closed.

Common Stats Collector uses

Access the stats collector through the stats attribute. Here is an example of an extension that accesses stats:

class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)
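
To see the wiring without running a crawl, the extension can be built by hand. This is only a minimal sketch: DictStats and fake_crawler are hypothetical stand-ins invented for illustration, not Scrapy classes. In a real project Scrapy passes its own crawler object to from_crawler.

```python
from types import SimpleNamespace


class DictStats:
    """Hypothetical stand-in for a stats collector (not a Scrapy class)."""

    def __init__(self):
        self._stats = {}

    def set_value(self, key, value):
        self._stats[key] = value

    def get_value(self, key, default=None):
        return self._stats.get(key, default)


class ExtensionThatAccessStats:

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # The only thing taken from the crawler is its stats attribute.
        return cls(crawler.stats)


# Stand-in for the crawler object that Scrapy would normally supply.
fake_crawler = SimpleNamespace(stats=DictStats())
ext = ExtensionThatAccessStats.from_crawler(fake_crawler)

# The extension and the crawler now share one collector object.
ext.stats.set_value("hostname", "example-host")
shared = ext.stats is fake_crawler.stats
```

Because the extension holds a reference to the crawler's collector rather than a copy, anything it records is visible to every other component that reads crawler.stats.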

As the extension above shows, the stats collector has to be reached through the crawler instance.

Set stat value:

import socket
stats.set_value('hostname', socket.gethostname())

Increment stat value:

stats.inc_value('custom_count')

Set stat value only if greater than previous:

stats.max_value('max_items_scraped', value)

Set stat value only if lower than previous:

stats.min_value('min_free_memory_percent', value)

Get stat value:

>>> stats.get_value('custom_count')
1

Get all stats:

>>> stats.get_stats()
{'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}
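
The semantics of these calls can be illustrated with a small dict-backed class. This is a simplified sketch, not Scrapy's implementation: SketchStatsCollector is a hypothetical name, and the behaviour for missing keys (counters start at 0, max/min take the first value seen) is an assumption modelled on the descriptions above.

```python
class SketchStatsCollector:
    """Hypothetical dict-backed stand-in for a stats collector."""

    def __init__(self):
        self._stats = {}

    def set_value(self, key, value):
        self._stats[key] = value

    def inc_value(self, key, count=1, start=0):
        # A missing key is treated as `start` before incrementing.
        self._stats[key] = self._stats.get(key, start) + count

    def max_value(self, key, value):
        # Keep the larger of the stored value and the new one.
        self._stats[key] = max(self._stats.get(key, value), value)

    def min_value(self, key, value):
        # Keep the smaller of the stored value and the new one.
        self._stats[key] = min(self._stats.get(key, value), value)

    def get_value(self, key, default=None):
        return self._stats.get(key, default)

    def get_stats(self):
        return self._stats


stats = SketchStatsCollector()
stats.inc_value("custom_count")
stats.inc_value("custom_count", count=2)
stats.max_value("max_items_scraped", 10)
stats.max_value("max_items_scraped", 7)    # lower than 10, so ignored
stats.min_value("min_free_memory_percent", 40)
stats.min_value("min_free_memory_percent", 25)
```

After these calls the sketch holds custom_count at 3, max_items_scraped at 10, and min_free_memory_percent at 25.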

Available Stats Collectors

Besides the basic StatsCollector there are other Stats Collectors available in Scrapy which extend the basic Stats Collector. You can select which Stats Collector to use through the STATS_CLASS setting. The default Stats Collector used is the MemoryStatsCollector.
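
As a sketch of how the choice is made, the fragment below shows what a project's settings.py might contain. The class paths are the ones listed in this article; writing the default out explicitly is optional and is done here only for illustration.

```python
# settings.py (project settings module)

# Explicitly select the default collector, which keeps stats in memory.
STATS_CLASS = "scrapy.statscollectors.MemoryStatsCollector"

# Alternative: swap in the no-op collector to turn stats collection off.
# STATS_CLASS = "scrapy.statscollectors.DummyStatsCollector"
```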

MemoryStatsCollector

class scrapy.statscollectors.MemoryStatsCollector

A simple stats collector that keeps the stats of the last scraping run (for each spider) in memory, so they remain accessible after the spider is closed. The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider name.

This is the default Stats Collector used in Scrapy.

  • spider_stats
    A dict of dicts (keyed by spider name) containing the stats of the last scraping run for each spider.
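
The idea of keeping the last run per spider can be sketched as follows. SketchMemoryStats is a hypothetical stand-in, and its open_spider and close_spider methods take a plain spider name for brevity; this is a simplification and should not be read as the signatures Scrapy itself uses.

```python
class SketchMemoryStats:
    """Hypothetical stand-in that remembers the last run of each spider."""

    def __init__(self):
        self._stats = {}
        self.spider_stats = {}  # spider name -> stats dict of its last run

    def open_spider(self, spider_name):
        # Start each run with a fresh stats table.
        self._stats = {}

    def inc_value(self, key, count=1, start=0):
        self._stats[key] = self._stats.get(key, start) + count

    def close_spider(self, spider_name):
        # Keep the finished run in memory, replacing any earlier run.
        self.spider_stats[spider_name] = self._stats


collector = SketchMemoryStats()

collector.open_spider("example_spider")
collector.inc_value("custom_count")
collector.close_spider("example_spider")

collector.open_spider("example_spider")
collector.inc_value("custom_count", count=5)
collector.close_spider("example_spider")

last_run = collector.spider_stats["example_spider"]
```

Only the second run survives in spider_stats, which matches the "last scraping run" wording above.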

DummyStatsCollector

class scrapy.statscollectors.DummyStatsCollector

A Stats collector which does nothing but is very efficient (because it does nothing). This stats collector can be set via the STATS_CLASS setting to disable stats collection in order to improve performance. However, the performance penalty of stats collection is usually marginal compared to other Scrapy workloads such as parsing pages.
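
What "does nothing" means for calling code can be sketched with a no-op class. SketchDummyStats is a hypothetical stand-in written for this article, on the assumption that a no-op collector accepts the same calls and simply stores nothing.

```python
class SketchDummyStats:
    """Hypothetical no-op stand-in: same calls, nothing is stored."""

    def set_value(self, key, value):
        pass

    def inc_value(self, key, count=1, start=0):
        pass

    def get_value(self, key, default=None):
        # Nothing was recorded, so only the default can be returned.
        return default

    def get_stats(self):
        return {}


stats = SketchDummyStats()
stats.set_value("hostname", "example-host")
stats.inc_value("custom_count")

count = stats.get_value("custom_count")
count_with_default = stats.get_value("custom_count", 0)
```

Code that reads stats back should therefore pass a default and tolerate it, so that switching collectors does not break extensions.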