Scrapy Crawler Learning Series, Part 10: Feed Exports

Preface

This is one article in a series on learning Scrapy; this installment covers feed exports.

This article is the author's original work; please credit the source when reposting.

Introduction

Feed exports are Scrapy's mechanism for serializing the scraped data and storing it as a file at a location addressed by a URI, which may point to the local filesystem or to a remote one. Let's first look at the serialization formats Scrapy provides.

Serialization formats

Scrapy serializes the scraped data through Item Exporters. The commonly used serialization formats are:

JSON

  • FEED_FORMAT: json
  • Exporter used: JsonItemExporter
  • Note that plain JSON is a poor fit for large feeds, since the whole file has to be parsed in one go by the consumer; prefer JSON lines in that case.
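
For instance, a minimal settings.py fragment for a JSON feed might look like the sketch below (the output path is only an example):

# settings.py (example values)
FEED_FORMAT = 'json'
FEED_URI = 'file:///tmp/export.json'

The same can usually be done from the command line with scrapy crawl myspider -o /tmp/export.json (the format is inferred from the extension, or forced with -t json); myspider is a placeholder for your spider's name.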

JSON lines

  • FEED_FORMAT: jsonlines
  • Exporter used: JsonLinesItemExporter
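
Because each item is written as a separate JSON object on its own line, a JSON lines feed can be consumed incrementally; a small sketch (the path is an example):

import json

# Stream a .jl feed one item at a time instead of loading the whole file,
# which is what makes JSON lines preferable to plain JSON for large feeds.
with open('/tmp/export.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        print(item)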

CSV

  • FEED_FORMAT: csv
  • Exporter used: CsvItemExporter
  • To specify columns to export and their order use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
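
For example, to pin the CSV header and column order (the field names here are hypothetical):

# settings.py -- fix the CSV columns and their order (hypothetical fields)
FEED_FORMAT = 'csv'
FEED_URI = 'file:///tmp/export.csv'
FEED_EXPORT_FIELDS = ['title', 'price', 'url']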

XML

  • FEED_FORMAT: xml
  • Exporter used: XmlItemExporter

Pickle

  • FEED_FORMAT: pickle
  • Exporter used: PickleItemExporter

Marshal

  • FEED_FORMAT: marshal
  • Exporter used: MarshalItemExporter
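
Both Pickle and Marshal feeds are binary, and the exporters write one serialized record per item into the same file. Assuming a feed produced by PickleItemExporter, it can be read back with a sketch like this:

import pickle

def read_pickle_feed(path):
    # PickleItemExporter pickles each item consecutively into the file,
    # so keep loading records until the end of the file is reached.
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

for item in read_pickle_feed('/tmp/export.pickle'):
    print(item)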

Storage

Storage backends

The serialized scraped data (known as the export feed) is written to a location identified by a URI, set through the FEED_URI setting. Feed exports support several storage backends, selected by the scheme of the FEED_URI:

Local filesystem

The feeds are stored in the local filesystem.

  • URI scheme: file
  • Example URI: file:///tmp/export.csv
  • Required external libraries: none

Note that for the local filesystem storage (only) you can omit the scheme if you specify an absolute path like /tmp/export.csv. This only works on Unix systems though.

FTP

The feeds are stored on an FTP server.

  • URI scheme: ftp
  • Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv
  • Required external libraries: none

S3

The feeds are stored on Amazon S3.

  • URI scheme: s3
  • Example URIs:
    • s3://mybucket/path/to/export.csv
    • s3://aws_key:aws_secret@mybucket/path/to/export.csv
  • Required external libraries: botocore or boto

The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
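
For instance, to keep the credentials out of the URI (the bucket name and keys below are placeholders):

# settings.py -- store the feed on S3, passing credentials via settings
FEED_URI = 's3://mybucket/path/to/export.csv'
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'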

Standard output

The feeds are written to the standard output of the Scrapy process.

  • URI scheme: stdout
  • Example URI: stdout:
  • Required external libraries: none

Storage URI parameters

Let's look at a couple of examples:

  • Store the feed over FTP, using one directory per spider:

    ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json

  • Store the feed on S3, using one directory per spider:

    s3://mybucket/scraping/feeds/%(name)s/%(time)s.json

Note that these URIs make use of some built-in parameters:

  • %(time)s - gets replaced by a timestamp when the feed is being created
  • %(name)s - gets replaced by the spider name
  • %(site_id)s - would get replaced by the spider.site_id attribute the moment the feed is being created
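
Any other %(key)s placeholder is resolved against the spider attribute of the same name, which is where %(site_id)s above comes from. A hypothetical spider that supplies such an attribute:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    site_id = 42  # consumed by the %(site_id)s placeholder in FEED_URI

    # Per-spider feed settings; the FTP credentials are placeholders.
    custom_settings = {
        'FEED_URI': 'ftp://user:password@ftp.example.com/feeds/%(site_id)s/%(time)s.json',
        'FEED_FORMAT': 'json',
    }

    def parse(self, response):
        pass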

Settings

FEED_URI (required): the URI of the export feed

FEED_FORMAT: the serialization format to be used

FEED_EXPORT_ENCODING: the encoding to be used for the feed

FEED_EXPORT_FIELDS: the list of fields to export and their order

FEED_EXPORT_INDENT: the amount of indentation used to pretty-print the output

FEED_STORE_EMPTY: whether to export empty feeds (feeds with no items)
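
A sketch combining several of these settings, with example values:

# settings.py -- common feed export settings (example values)
FEED_URI = 'file:///tmp/export.json'   # required: where the feed is stored
FEED_FORMAT = 'json'                   # which serialization format to use
FEED_EXPORT_ENCODING = 'utf-8'         # encoding of the exported feed
FEED_EXPORT_INDENT = 4                 # pretty-print nested output
FEED_STORE_EMPTY = False               # do not export feeds with no items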

FEED_STORAGES

FEED_STORAGES_BASE

Default:

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}

A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py:

FEED_STORAGES = {
    'ftp': None,
}
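
Conversely, a new backend can be registered by mapping its URI scheme to a storage class; the dotted path below is hypothetical:

# settings.py -- map a custom URI scheme to your own storage class
# ('myproject.storages.GoogleCloudFeedStorage' is a made-up path)
FEED_STORAGES = {
    'gs': 'myproject.storages.GoogleCloudFeedStorage',
}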

FEED_EXPORTERS

FEED_EXPORTERS_BASE

Default:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py:

FEED_EXPORTERS = {
    'csv': None,
}