Collecting website data with Python; this tutorial uses a Scrapy spider.

1. Install the Scrapy framework

Run from the command line:

    pip install scrapy

If Scrapy's dependencies conflict with other Python packages you already have installed, installing inside a virtualenv is recommended.
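For example, a minimal virtualenv setup (the environment name venv here is arbitrary):

    pip install virtualenv
    virtualenv venv
    venv\Scripts\activate      # Windows; on Linux/macOS use: source venv/bin/activate
    pip install scrapy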

Once installation completes, create a crawler project in any folder:

    scrapy startproject your_spider_name

Project directory layout
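The standard layout generated by scrapy startproject looks like this (assuming the project name bookspider):

    bookspider/
        scrapy.cfg
        bookspider/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py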

Crawl rules are written in the spiders directory.

items.py — defines the data to be crawled

pipelines.py — handles saving the data

settings.py — configuration

middlewares.py — downloader/spider middlewares

Below is the source code for scraping a novel site.

First, define the data to collect in items.py:

    # 2019-08-12 17:41:08
    # author zhangxi<1638844034@qq.com>
    import scrapy

    class BookspiderItem(scrapy.Item):
        # define the fields for your item here:
        i = scrapy.Field()                   # crawl order of the chapter
        book_name = scrapy.Field()           # book title
        book_img = scrapy.Field()            # cover image URL
        book_author = scrapy.Field()         # author
        book_last_chapter = scrapy.Field()   # latest chapter
        book_last_time = scrapy.Field()      # last update time
        book_list_name = scrapy.Field()      # chapter title
        book_content = scrapy.Field()        # chapter text
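Item fields behave like dictionary keys, which is how both the spider and the pipeline below access them; a quick sketch (the value is hypothetical):

    item = BookspiderItem()
    item['book_name'] = '示例'    # hypothetical value, for illustration only
    print(item['book_name'])
    print(dict(item))             # an Item converts cleanly to a plain dict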

Write the crawl rules:

    # 2019-08-12 17:41:08
    # author zhangxi<1638844034@qq.com>
    import scrapy
    from ..items import BookspiderItem

    class Book(scrapy.Spider):
        name = "BookSpider"
        start_urls = [
            'http://www.xbiquge.la/xiaoshuodaquan/'
        ]

        def parse(self, response):
            # the full book list page: one <li> per book
            bookAllList = response.css('.novellist:first-child>ul>li')
            for book in bookAllList:
                booklist = book.css('a::attr(href)').extract_first()
                yield scrapy.Request(booklist, callback=self.list)

        def list(self, response):
            # book detail page: title, cover, author, latest chapter info
            book_name = response.css('#info>h1::text').extract_first()
            book_img = response.css('#fmimg>img::attr(src)').extract_first()
            book_author = response.css('#info p:nth-child(2)::text').extract_first()
            book_last_chapter = response.css('#info p:last-child::text').extract_first()
            book_last_time = response.css('#info p:nth-last-child(2)::text').extract_first()
            bookInfo = {
                'book_name': book_name,
                'book_img': book_img,
                'book_author': book_author,
                'book_last_chapter': book_last_chapter,
                'book_last_time': book_last_time
            }
            chapter_links = response.css('#list>dl>dd>a::attr(href)').extract()
            i = 0
            for var in chapter_links:
                i += 1
                bookInfo['i'] = i  # record the crawl order so chapters can be saved in sequence
                yield scrapy.Request('http://www.xbiquge.la' + var, meta=bookInfo, callback=self.info)

        def info(self, response):
            # chapter page: pull the chapter title and body text
            self.log(response.meta['book_name'])
            content = response.css('#content::text').extract()
            item = BookspiderItem()
            item['i'] = response.meta['i']
            item['book_name'] = response.meta['book_name']
            item['book_img'] = response.meta['book_img']
            item['book_author'] = response.meta['book_author']
            item['book_last_chapter'] = response.meta['book_last_chapter']
            item['book_last_time'] = response.meta['book_last_time']
            item['book_list_name'] = response.css('.bookname h1::text').extract_first()
            item['book_content'] = ''.join(content)
            yield item
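One caveat: the default project template sets ROBOTSTXT_OBEY = True, so if the target site's robots.txt disallows crawling, every request will be filtered out before it is sent. Whether that applies to this site is an assumption you should verify yourself; the relevant switches in settings.py look like this:

    # settings.py
    ROBOTSTXT_OBEY = False    # assumption: only disable after checking the site's policy
    DOWNLOAD_DELAY = 1        # optional: throttle requests to be polite to the server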

Save the data:

    import os

    class BookspiderPipeline(object):
        def process_item(self, item, spider):
            curPath = 'E:/小说/'                  # root folder for saved novels
            tempPath = str(item['book_name'])
            targetPath = curPath + tempPath       # one folder per book
            if not os.path.exists(targetPath):
                os.makedirs(targetPath)
            # prefix the chapter title with its crawl order
            book_list_name = str(item['i']) + item['book_list_name']
            filename_path = targetPath + '/' + book_list_name + '.txt'
            print('------------')
            print(filename_path)
            with open(filename_path, 'a', encoding='utf-8') as f:
                f.write(item['book_content'])
            return item
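Note that Scrapy only runs a pipeline that has been enabled in settings.py; assuming the project was created with scrapy startproject bookspider, the entry would be:

    # settings.py
    ITEM_PIPELINES = {
        'bookspider.pipelines.BookspiderPipeline': 300,  # 300 = execution order (lower runs first)
    }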

Run:

    scrapy crawl BookSpider

and the novel scraper is complete.

While writing selectors, it is recommended to run

    scrapy shell <URL to crawl>

and then test your rules with response.css('...') in the interactive shell to check that they are correct.
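For example, a quick session checking the book-list selector from the spider above (the output depends on the live page):

    scrapy shell 'http://www.xbiquge.la/xiaoshuodaquan/'
    >>> response.css('.novellist:first-child>ul>li a::attr(href)').extract_first()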

Source code for this tutorial is on GitHub: https://github.com/zhangxi-key/py-book.git

 

Copyright notice: this is an original article by zhangxi1, licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://www.cnblogs.com/zhangxi1/p/11341641.html