Preface
This is one of a series of articles on learning Scrapy; this chapter covers feed exports.
This article is the author's original work; please credit the source when reposting.
Introduction
Feed exports are Scrapy's mechanism for serializing the scraped data and storing it as a file (addressed by a URI, which may point to the local filesystem or to a remote one). A minimal setup looks like the sketch below; after that, let's walk through the serialization formats Scrapy provides.
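For instance, the following settings would serialize all scraped items to a local JSON file (a minimal sketch; the file name and format are arbitrary choices, using the FEED_URI/FEED_FORMAT settings described in this article):

```python
# settings.py - a minimal sketch: export all scraped items to a local JSON file
FEED_URI = 'file:///tmp/items.json'  # where to store the feed
FEED_FORMAT = 'json'                 # which serialization format to use
```

The same can be achieved from the command line with `scrapy crawl myspider -o /tmp/items.json` (myspider being a placeholder spider name).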
序列化格式
Scrapy uses Item Exporters to serialize the scraped data; the commonly used serialization formats are:
- JSON
- JSON lines
- CSV
- XML
JSON
- FEED_FORMAT: json
- Exporter used: JsonItemExporter
- Note the warning in the Scrapy docs about using JSON with large feeds: JSON doesn't scale well for large amounts of data, since incremental (stream-mode) parsing is poorly supported among JSON parsers; consider JSON lines instead.
JSON lines
- FEED_FORMAT: jsonlines
- Exporter used: JsonLinesItemExporter
CSV
- FEED_FORMAT: csv
- Exporter used: CsvItemExporter
- To specify columns to export and their order use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
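For example, to pin the CSV columns and their order (a sketch; the field names are hypothetical and must match your Item's fields):

```python
# settings.py - hypothetical field names; they must match your Item's fields
FEED_FORMAT = 'csv'
FEED_EXPORT_FIELDS = ['title', 'price', 'url']  # becomes the CSV header, in this order
```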
XML
- FEED_FORMAT: xml
- Exporter used: XmlItemExporter
Pickle
- FEED_FORMAT: pickle
- Exporter used: PickleItemExporter
Marshal
- FEED_FORMAT: marshal
- Exporter used: MarshalItemExporter
Storage
Storage types
The serialized scraped data (known as the export feed) is stored at a location specified by a URI (set through the FEED_URI setting). The feed exports support several types of storage backends, selected by the scheme of the FEED_URI:
- Local filesystem
- FTP
- S3 (requires botocore or boto)
- Standard output
Local filesystem
The feeds are stored in the local filesystem.
- URI scheme: file
- Example URI: file:///tmp/export.csv
- Required external libraries: none
Note that for the local filesystem storage (only) you can omit the scheme if you specify an absolute path like /tmp/export.csv. This only works on Unix systems though.
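In other words, both of the following are equivalent on a Unix system (a sketch; the path is an arbitrary example):

```python
# settings.py
FEED_URI = 'file:///tmp/export.csv'
# On Unix, the scheme may be omitted when the path is absolute:
# FEED_URI = '/tmp/export.csv'
```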
FTP
The feeds are stored on an FTP server.
- URI scheme: ftp
- Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv
- Required external libraries: none
S3
The feeds are stored on Amazon S3.
- URI scheme: s3
- Example URIs:
  - s3://mybucket/path/to/export.csv
  - s3://aws_key:aws_secret@mybucket/path/to/export.csv
- Required external libraries: botocore or boto
The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
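For example, instead of embedding the credentials in the URI, they can go in the project settings (a sketch; the key values and bucket name are placeholders):

```python
# settings.py - placeholder credentials, never commit real ones
AWS_ACCESS_KEY_ID = 'AKIA...'   # placeholder
AWS_SECRET_ACCESS_KEY = '...'   # placeholder
FEED_URI = 's3://mybucket/path/to/export.json'
```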
Standard output
The feeds are written to the standard output of the Scrapy process.
- URI scheme: stdout
- Example URI: stdout:
- Required external libraries: none
Storage URI parameters
The storage URI can contain parameters that get replaced when the feed is being created. Let's look at a few examples.
Store in FTP using one directory per spider:
```
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
```
Store in S3 using one directory per spider:
```
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
```
Note that a few built-in parameters are used here:
- %(time)s - gets replaced by a timestamp when the feed is being created
- %(name)s - gets replaced by the spider name
- %(site_id)s - would get replaced by the spider.site_id attribute the moment the feed is being created
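Any other named parameter gets replaced by a spider attribute of the same name. A minimal sketch (the spider name, the site_id value, and the parse body are arbitrary placeholders):

```python
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'  # substituted for %(name)s in the feed URI
    site_id = 42    # custom attribute, substituted for %(site_id)s

    def parse(self, response):
        ...  # extraction logic omitted in this sketch
```

With FEED_URI = 's3://mybucket/scraping/feeds/%(name)s/%(time)s.json', this spider's feed would land under scraping/feeds/books/.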
Settings
FEED_URI (required)
FEED_FORMAT
FEED_EXPORT_ENCODING
FEED_EXPORT_FIELDS
FEED_EXPORT_INDENT
FEED_STORE_EMPTY
FEED_STORAGES
FEED_STORAGES_BASE
Default:
```python
{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
```
A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py:
```python
FEED_STORAGES = {
    'ftp': None,
}
```
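FEED_STORAGES works the other way around too: it can map a new URI scheme to your own storage backend. A hypothetical sketch (the gs scheme and the GoogleCloudFeedStorage class are assumptions, not part of Scrapy):

```python
# settings.py - hypothetical: route gs:// feed URIs to a custom backend
FEED_STORAGES = {
    'gs': 'myproject.feedstorages.GoogleCloudFeedStorage',  # hypothetical class
}
```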
FEED_EXPORTERS
FEED_EXPORTERS_BASE
Default:
```python
{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
```
A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py:
```python
FEED_EXPORTERS = {
    'csv': None,
}
```
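Likewise, FEED_EXPORTERS can register a custom exporter for a new format. A minimal sketch for a hypothetical tab-separated format (TsvItemExporter and the myproject module are assumptions; it builds on Scrapy's real BaseItemExporter):

```python
# myproject/exporters.py - hypothetical exporter writing one tab-separated line per item
from scrapy.exporters import BaseItemExporter

class TsvItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        super().__init__(**kwargs)
        self.file = file  # binary file object supplied by the feed export machinery

    def export_item(self, item):
        # _get_serialized_fields comes from BaseItemExporter and honors FEED_EXPORT_FIELDS
        values = [str(value) for _, value in self._get_serialized_fields(item)]
        self.file.write(('\t'.join(values) + '\n').encode('utf-8'))
```

It would then be enabled with:

```python
# settings.py
FEED_EXPORTERS = {
    'tsv': 'myproject.exporters.TsvItemExporter',  # then set FEED_FORMAT = 'tsv'
}
```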