Preface
This is one of a series of articles on learning Scrapy; this chapter covers feed exports.
This article is the author's original work; please credit the source when reposting.
Introduction
Feed exports are Scrapy's mechanism for serializing the scraped data and storing it as a file (addressed by a URI, which may point to the local filesystem or to a remote one). A minimal setup looks like the sketch below; after that, let's walk through the serialization formats Scrapy provides.
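For instance, the following settings would serialize all scraped items to a local JSON file (a minimal sketch; the file name and format are arbitrary choices, using the FEED_URI/FEED_FORMAT settings described in this article):

```python
# settings.py - a minimal sketch: export all scraped items to a local JSON file
FEED_URI = 'file:///tmp/items.json'  # where to store the feed
FEED_FORMAT = 'json'                 # which serialization format to use
```

The same can be achieved from the command line with `scrapy crawl myspider -o /tmp/items.json` (myspider being a placeholder spider name).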
序列化格式
Scrapy uses Item Exporters to serialize the scraped data; the commonly used serialization formats are:
- JSON
- JSON lines
- CSV
- XML
JSON
- FEED_FORMAT: json
- Exporter used: JsonItemExporter
- Note the warning in the Scrapy docs about using JSON with large feeds: JSON doesn't scale well for large amounts of data, since incremental (stream-mode) parsing is poorly supported among JSON parsers; consider JSON lines instead.
JSON lines
- FEED_FORMAT: jsonlines
- Exporter used: JsonLinesItemExporter
CSV
- FEED_FORMAT: csv
- Exporter used: CsvItemExporter
- To specify columns to export and their order use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
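For example, to pin the CSV columns and their order (a sketch; the field names are hypothetical and must match your Item's fields):

```python
# settings.py - hypothetical field names; they must match your Item's fields
FEED_FORMAT = 'csv'
FEED_EXPORT_FIELDS = ['title', 'price', 'url']  # becomes the CSV header, in this order
```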
XML
- FEED_FORMAT: xml
- Exporter used: XmlItemExporter
Pickle
- FEED_FORMAT: pickle
- Exporter used: PickleItemExporter
Marshal
- FEED_FORMAT: marshal
- Exporter used: MarshalItemExporter
Storage
Storage types
The serialized scraped data (known as the export feed) is stored at a location specified by a URI (set through the FEED_URI setting). The feed exports support several types of storage backends, selected by the scheme of the FEED_URI:
- Local filesystem
- FTP
- S3 (requires botocore or boto)
- Standard output
Local filesystem
The feeds are stored in the local filesystem.
- URI scheme: file
- Example URI: file:///tmp/export.csv
- Required external libraries: none
Note that for the local filesystem storage (only) you can omit the scheme if you specify an absolute path like /tmp/export.csv. This only works on Unix systems though.
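In other words, both of the following are equivalent on a Unix system (a sketch; the path is an arbitrary example):

```python
# settings.py
FEED_URI = 'file:///tmp/export.csv'
# On Unix, the scheme may be omitted when the path is absolute:
# FEED_URI = '/tmp/export.csv'
```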
FTP
The feeds are stored on an FTP server.
- URI scheme: ftp
- Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv
- Required external libraries: none
S3
The feeds are stored on Amazon S3.
- URI scheme: s3
- Example URIs:
  - s3://mybucket/path/to/export.csv
  - s3://aws_key:aws_secret@mybucket/path/to/export.csv
- Required external libraries: botocore or boto
The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
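For example, instead of embedding the credentials in the URI, they can go in the project settings (a sketch; the key values and bucket name are placeholders):

```python
# settings.py - placeholder credentials, never commit real ones
AWS_ACCESS_KEY_ID = 'AKIA...'   # placeholder
AWS_SECRET_ACCESS_KEY = '...'   # placeholder
FEED_URI = 's3://mybucket/path/to/export.json'
```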
Standard output
The feeds are written to the standard output of the Scrapy process.
- URI scheme: stdout
- Example URI: stdout:
- Required external libraries: none
Storage URI parameters
The storage URI can contain parameters that get replaced when the feed is being created. Let's look at a few examples.
Store in FTP using one directory per spider:
```
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
```
Store in S3 using one directory per spider:
```
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
```
Note that a few built-in parameters are used here:
- %(time)s - gets replaced by a timestamp when the feed is being created
- %(name)s - gets replaced by the spider name
- %(site_id)s - would get replaced by the spider.site_id attribute the moment the feed is being created
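Any other named parameter gets replaced by a spider attribute of the same name. A minimal sketch (the spider name, the site_id value, and the parse body are arbitrary placeholders):

```python
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'  # substituted for %(name)s in the feed URI
    site_id = 42    # custom attribute, substituted for %(site_id)s

    def parse(self, response):
        ...  # extraction logic omitted in this sketch
```

With FEED_URI = 's3://mybucket/scraping/feeds/%(name)s/%(time)s.json', this spider's feed would land under scraping/feeds/books/.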
Settings
FEED_URI (required)
FEED_FORMAT
FEED_EXPORT_ENCODING
FEED_EXPORT_FIELDS
FEED_EXPORT_INDENT
FEED_STORE_EMPTY
FEED_STORAGES
FEED_STORAGES_BASE
Default:
```python
{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
```
A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py:
```python
FEED_STORAGES = {
    'ftp': None,
}
```
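FEED_STORAGES works the other way around too: it can map a new URI scheme to your own storage backend. A hypothetical sketch (the gs scheme and the GoogleCloudFeedStorage class are assumptions, not part of Scrapy):

```python
# settings.py - hypothetical: route gs:// feed URIs to a custom backend
FEED_STORAGES = {
    'gs': 'myproject.feedstorages.GoogleCloudFeedStorage',  # hypothetical class
}
```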
FEED_EXPORTERS
FEED_EXPORTERS_BASE
Default:
```python
{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
```
A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py:
```python
FEED_EXPORTERS = {
    'csv': None,
}
```
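Likewise, FEED_EXPORTERS can register a custom exporter for a new format. A minimal sketch for a hypothetical tab-separated format (TsvItemExporter and the myproject module are assumptions; it builds on Scrapy's real BaseItemExporter):

```python
# myproject/exporters.py - hypothetical exporter writing one tab-separated line per item
from scrapy.exporters import BaseItemExporter

class TsvItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        super().__init__(**kwargs)
        self.file = file  # binary file object supplied by the feed export machinery

    def export_item(self, item):
        # _get_serialized_fields comes from BaseItemExporter and honors FEED_EXPORT_FIELDS
        values = [str(value) for _, value in self._get_serialized_fields(item)]
        self.file.write(('\t'.join(values) + '\n').encode('utf-8'))
```

It would then be enabled with:

```python
# settings.py
FEED_EXPORTERS = {
    'tsv': 'myproject.exporters.TsvItemExporter',  # then set FEED_FORMAT = 'tsv'
}
```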