9.3.3 The Scrapy Framework

avention 2018-05-04

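Scrapy is a handy web crawling framework. It is well suited to crawling websites and extracting structured data from their pages, and it can be customized to fit your own requirements. To scrape a page with Scrapy, knowing HTML tags is not enough: you also need to understand how the target page organizes its data and decide which information you want to extract. Only then can you write a spider aimed at exactly that content.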

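Install the Scrapy package with pip (pip install scrapy). If the installation reports errors, search for the error message (on Baidu, for example) and fix the problem yourself; this is good practice for solving problems hands-on.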
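After installing, it can help to confirm that the interpreter you plan to use can actually see the package. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for this illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by this interpreter."""
    # find_spec returns None when no such top-level package exists
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("scrapy"):
        print("scrapy is available")
    else:
        print("scrapy not found; try: pip install scrapy")
```

If the check fails even though pip reported success, pip may have installed into a different Python environment than the one running the script.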

import os
import urllib.request

import scrapy


# A custom spider class
class MySpider(scrapy.spiders.Spider):
    # The spider's name (a class attribute); every spider must have a different name
    name = 'mySpider'
    allowed_domains = ['www.sdibt.edu.cn']

    # The start pages to crawl; must be a list and may contain several URLs
    start_urls = ['http://www.sdibt.edu.cn/info/1026/11238.htm']

    # Scrapy calls this method automatically for each page to be crawled
    def parse(self, response):
        self.downloadWebpage(response)
        self.downloadImages(response)

        # Look for hyperlinks in the page and continue crawling
        hxs = scrapy.Selector(response)
        sites = hxs.xpath('//ul/li')

        for site in sites:
            # extract_first() returns None when the <li> contains no link
            link = site.xpath('a/@href').extract_first()
            if not link or link == '#':
                continue
            # Convert a relative address to an absolute one
            elif link.startswith('..'):
                next_url = os.path.dirname(response.url)
                next_url += '/' + link
            else:
                next_url = link

            # Create a Request object and specify its callback
            yield scrapy.Request(url=next_url, callback=self.parse_item)

    # Callback applied to every hyperlink found on the start page
    def parse_item(self, response):
        self.downloadWebpage(response)
        self.downloadImages(response)

    # Download all images in the current page
    def downloadImages(self, response):
        hxs = scrapy.Selector(response)
        images = hxs.xpath('//img/@src').extract()

        for image_url in images:
            imageFilename = image_url.split('/')[-1]
            # Skip images that were already saved
            if os.path.exists(imageFilename):
                continue

            # Convert a relative address to an absolute one
            if image_url.startswith('..'):
                image_url = os.path.dirname(response.url) + '/' + image_url

            # Open the remote image and write it to a local file
            with urllib.request.urlopen(image_url) as fp:
                with open(imageFilename, 'wb') as f:
                    f.write(fp.read())

    # Save the page content to a local file
    def downloadWebpage(self, response):
        filename = response.url.split('/')[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
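The spider builds absolute addresses by joining os.path.dirname(response.url) with the link, and only does so for links that start with '..'. Other relative forms (a bare file name, or a path starting with '/') would pass through unchanged. One possible alternative, sketched below with the standard library only, is urllib.parse.urljoin. The helper names to_absolute and local_filename, the default file name, and the example link paths are assumptions made for this illustration; only the start URL comes from the listing above.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def to_absolute(page_url, link):
    """Resolve a link against the URL of the page it was found on."""
    # urljoin handles '../x', 'x', '/x' and already-absolute links alike
    return urljoin(page_url, link)


def local_filename(url, default="index.html"):
    """Derive a local file name from the last path segment of a URL."""
    # urlsplit(...).path drops the query string and fragment
    name = posixpath.basename(urlsplit(url).path)
    # Fall back to a default when the URL ends with '/'
    return name or default


if __name__ == "__main__":
    page = "http://www.sdibt.edu.cn/info/1026/11238.htm"
    # The link values below are hypothetical examples
    for link in ["../1027/example.htm", "images/logo.png", "/index.htm"]:
        print(to_absolute(page, link))
    print(local_filename(page))
```

Inside a spider, Scrapy responses also provide a urljoin method that resolves a link against the response's own URL, so the same idea can be applied without importing urllib.parse.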