Debugging Spiders

This document explains the most common techniques for debugging spiders. Consider the following Scrapy spider:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # <processing code not shown>
        # collect `item_urls`
        for item_url in item_urls:
            yield scrapy.Request(item_url, self.parse_item)

    def parse_item(self, response):
        # <processing code not shown>
        item = MyItem()
        # populate `item` fields
        # and extract item_details_url
        yield scrapy.Request(item_details_url, self.parse_details, cb_kwargs={'item': item})

    def parse_details(self, response, item):
        # populate more `item` fields
        return item

Basically, this is a simple spider which parses two pages of items (the start_urls). Items also have a details page with additional information, so we use the cb_kwargs functionality of Request to pass a partially populated item.
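
To follow along, you can run the spider as usual; -o is the standard Scrapy option for exporting scraped items, and items.json is just an example filename:

$ scrapy crawl myspider -o items.json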

Parse Command

The most basic way of checking the output of your spider is to use the parse command. It lets you check the behaviour of different parts of the spider at the method level. It has the advantage of being flexible and simple to use, but it does not allow debugging code inside a method.

In order to see the item scraped from a specific url:

$ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 2 <<<
# Scraped Items  ------------------------------------------------------------
[{'url': <item_url>}]

# Requests  -----------------------------------------------------------------
[]

Using the --verbose or -v option, we can see the status at each depth level:

$ scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>
[ ... scrapy log lines crawling example.com spider ... ]

>>> DEPTH LEVEL: 1 <<<
# Scraped Items  ------------------------------------------------------------
[]

# Requests  -----------------------------------------------------------------
[<GET item_details_url>]


>>> DEPTH LEVEL: 2 <<<
# Scraped Items  ------------------------------------------------------------
[{'url': <item_url>}]

# Requests  -----------------------------------------------------------------
[]

Checking items scraped from a single start_url can also be easily achieved using:

$ scrapy parse --spider=myspider -d 3 'http://example.com/page1'

Scrapy Shell

While the parse command is very useful for checking the behaviour of a spider, it is of little help for checking what happens inside a callback, beyond showing the response received and the output. How do you debug a situation where parse_details sometimes receives no item?

Fortunately, the shell is your bread and butter in this case (see Invoking the shell from spiders to inspect responses):

from scrapy.shell import inspect_response

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        inspect_response(response, self)
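
When the else branch is hit, inspect_response drops you into a Scrapy shell with the problematic response already loaded. A sketch of such a session (the selector is a placeholder for whatever your item extraction relies on):

$ scrapy crawl myspider
[ ... scrapy log lines until the shell opens ... ]
>>> response.url                     # which page failed to yield an item?
>>> response.css('h1::text').get()   # probe the markup you expected
>>> view(response)                   # open the response in your browser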

See also: Invoking the shell from spiders to inspect responses.

Open in browser

Sometimes you just want to see how a certain response looks in a browser; you can use the open_in_browser function for that. Here is an example of how you would use it:

from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    if "item name" not in response.body:
        open_in_browser(response)

open_in_browser will open a browser with the response received by Scrapy at that point, adjusting the base tag so that images and styles are displayed properly.
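
open_in_browser is not limited to spider callbacks; it works with any response object. For instance, you can try it from a Scrapy shell session (where the built-in view(response) shortcut does the same thing):

$ scrapy shell 'http://example.com/page1'
[ ... scrapy log lines ... ]
>>> from scrapy.utils.response import open_in_browser
>>> open_in_browser(response)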

Logging

Logging is another useful option for getting information about your spider run. Although not as convenient, it comes with the advantage that the logs will be available in all future runs should they be necessary again:

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        self.logger.warning('No item received for %s', response.url)
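
For the logs to actually survive across runs, write them to a file. LOG_FILE and LOG_LEVEL are standard Scrapy settings; the file name below is just an example:

# settings.py
LOG_FILE = 'myspider.log'    # write log output to this file
LOG_LEVEL = 'WARNING'        # record warnings and above

The same can be set per run with the --logfile and --loglevel command-line options.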

For more information, check the Logging section.