Scrapy study notes

A summary of Scrapy's features:

Below is part of the original documentation text for the content above (you can skip this part if you like).

Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

• Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.

• An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.

• Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).

• Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.

• Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).

• Wide range of built-in extensions and middlewares for handling: cookies and session handling; HTTP features like compression, authentication, caching; user-agent spoofing; robots.txt; crawl depth restriction; and more (a settings sketch touching a few of these follows this list).

• A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler.

• Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
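
Many of these built-in behaviours are switched on or off through project settings. A minimal sketch of a settings.py excerpt exercising a few of them (the values and the 'myproject' name are only examples, not from the original notes):

    # settings.py (excerpt) -- illustrative values only
    USER_AGENT = "myproject (+http://www.example.com)"  # send a custom User-Agent string ("user-agent spoofing")
    ROBOTSTXT_OBEY = True       # respect robots.txt
    COOKIES_ENABLED = True      # cookies and session handling
    HTTPCACHE_ENABLED = True    # HTTP caching
    DEPTH_LIMIT = 3             # crawl depth restriction (0 means no limit)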

Workflow (summarized from the official tutorial):

  1. First, create a Scrapy project: pick a directory and run the command scrapy startproject 'project_name'.
  2. Define the Item. An Item is a container for the scraped data and behaves much like a Python dict. You can use plain Python dicts with Scrapy directly, but Items provide extra protection against populating undeclared fields, which catches typos, and they can be used together with Item Loaders to fill them in conveniently. Declare an Item by creating a scrapy.Item subclass and defining its attributes as scrapy.Field objects; edit the items.py file in the project directory to define it.
  3. Define the spiders. Spiders are classes that you define and that Scrapy uses to scrape information from a domain (or group of domains). They specify an initial list of URLs, how to follow links, and how to parse pages and extract content. Create a Spider by subclassing scrapy.Spider and defining attributes such as name and start_urls, plus a parse() method (a sketch of steps 2 and 3 follows this list).
  4. Crawling. From the top level of the project directory (the first level inside the project folder!) start the crawler with the command scrapy crawl 'spider_name', which runs the spider whose name is 'spider_name'.
  5. What happens when it runs: Scrapy creates a scrapy.Request object for each URL in start_urls and assigns the parse method to each of them as its callback. These Requests are scheduled and executed, and scrapy.http.Response objects are then fed back to the spider through the parse() method.
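
A minimal sketch of steps 2 and 3, loosely following the official tutorial (ArticleItem, ArticleSpider, the field names and the URL are placeholders, not names from these notes):

    import scrapy

    # Step 2: declare an Item as a scrapy.Item subclass with scrapy.Field attributes
    class ArticleItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()

    # Step 3: declare a Spider as a scrapy.Spider subclass
    class ArticleSpider(scrapy.Spider):
        name = "articles"                                # used by `scrapy crawl articles`
        start_urls = ["http://www.example.com/articles"]

        def parse(self, response):
            # parse() is the default callback: it receives one Response per request
            for article in response.xpath("//article"):
                item = ArticleItem()
                item["title"] = article.xpath(".//h2/text()").extract_first()
                item["link"] = article.xpath(".//a/@href").extract_first()
                yield item

(extract_first() exists in newer Scrapy versions; with older ones use .extract()[0] instead.)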

About Selectors

Scrapy has its own mechanism for extracting data, based on XPath or CSS expressions, called selectors. Scrapy selectors are built on top of the lxml library. A selector has four basic methods: xpath(), css(), extract() and re().
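
A self-contained sketch of these four methods in action (the HTML string, the id and the regex are made-up examples):

    from scrapy.selector import Selector

    html = "<html><body><h1 id='title'>Scrapy notes v1.0</h1></body></html>"
    sel = Selector(text=html)

    print(sel.xpath("//h1/text()").extract())           # xpath() + extract() -> ['Scrapy notes v1.0']
    print(sel.css("h1::text").extract())                # css() + extract()   -> ['Scrapy notes v1.0']
    print(sel.css("h1::attr(id)").extract())            # attribute extraction -> ['title']
    print(sel.xpath("//h1/text()").re(r"v(\d+\.\d+)"))  # re() with a capture group -> ['1.0']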

A few common XPath patterns:
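
For reference, some typical expressions (the element and attribute names are just illustrative):

    //div                    all <div> elements anywhere in the document
    /html/head/title         the <title> element inside <head> inside <html>
    //td/text()              the text content of all <td> elements
    //a/@href                the href attribute of all <a> elements
    //div[@class="box"]      <div> elements whose class attribute equals "box"
    //ul/li[1]               the first <li> child of each <ul>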

Regarding XPath, the official Scrapy documentation recommends an article; roughly speaking, it says that if you want to understand the XPath model, that article is worth reading.

Scrapy's mechanism for following links:

Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

A common pattern is a callback method that extracts some items, looks for a link to follow to the next page and then yields a Request with the same callback for it:

For example:

    def parse_articles_follow_next_page(self, response):
        for article in response.xpath("//article"):
            item = ArticleItem()
            # ... extract article data here
            yield item

        next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            # note the second argument to Request: the same method is registered as the callback
            yield scrapy.Request(url, self.parse_articles_follow_next_page)

Saving the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl dmoz -o items.json

That will generate an items.json file containing all scraped items, serialized in JSON.

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline.

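A minimal sketch of what such a pipeline can look like, assuming we want to write every item as one line of JSON (JsonWriterPipeline and 'myproject' are placeholder names):

    # pipelines.py -- write each scraped item as one JSON line
    import json

    class JsonWriterPipeline(object):

        def open_spider(self, spider):
            # called once when the spider starts
            self.file = open("items.jl", "w")

        def close_spider(self, spider):
            # called once when the spider finishes
            self.file.close()

        def process_item(self, item, spider):
            # called for every scraped item; must return the item (or raise DropItem)
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

To activate it, the pipeline also has to be registered in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.JsonWriterPipeline': 300} (the number controls the order in which pipelines run).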

That's all for now; I'll keep updating these notes, and will post a small example project in a few days.