Scraping Image Data with a Python Crawler

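The previous post covered scraping text data; this one walks through scraping images in detail.

First, a look at the project directory:

As usual, the explanation follows the code files one by one. (This post covers a single page only; to crawl multiple pages, uncomment the pagination code at the end of parse.)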

kgcspider.py

# -*- coding: utf-8 -*-
import scrapy
from kgc.items import KgcItem


class KgcspiderSpider(scrapy.Spider):
    name = 'kgcspider'
    # allowed_domains = ['www.kgc.cn']  # domain names only, not full URLs
    start_urls = ['http://www.kgc.cn/list/230-1-6-9-9-0.shtml']

    def parse(self, response):
        # print(response.body.decode())
        # Each selector returns one list per field, in page order.
        title = response.css('a.yui3-u.course-title-a.ellipsis::text').extract()
        price = response.css('div.right.align-right>span::text').extract()
        persons = response.css('span.course-pepo::text').extract()
        image_urls = response.css('a.kgc-w>img::attr("src")').extract()
        # print(title)
        # Pair the parallel lists so each tuple describes one course.
        datas = zip(title, price, persons, image_urls)
        for d in datas:
            item = KgcItem()
            item['title'] = d[0]
            item['price'] = d[1]
            item['persons'] = d[2]
            # ImagesPipeline expects a list of URLs, even for a single image.
            item['image_urls'] = [d[3]]
            yield item
        # Uncomment to follow the "next page" link and crawl every page.
        # next_url = response.css('li.next>a::attr("href")').extract_first()
        #
        # if next_url is not None:
        #     yield response.follow(next_url, self.parse)

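Explanation: the text-scraping code from the previous post is unchanged. The new part is image_urls, the list of image source URLs. It is zipped with the other fields in the same loop and stored on each item. The value is wrapped in a list because ImagesPipeline expects image_urls to be a list of URLs.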
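The pairing step can be sketched outside Scrapy. The helper below is hypothetical (build_records and its parameters do not appear in the spider above), and it assumes the page might use relative src values. It resolves each one against the page URL with the standard-library urljoin, since ImagesPipeline needs absolute URLs. Note that zip stops at the shortest list, so if one course is missing a field, rows can be dropped or misaligned without any error.

```python
from urllib.parse import urljoin


def build_records(titles, prices, persons, image_srcs, page_url):
    """Pair parallel field lists into item-like dicts (hypothetical helper)."""
    records = []
    # zip() stops at the shortest input, so extra entries are dropped.
    for title, price, person, src in zip(titles, prices, persons, image_srcs):
        records.append({
            "title": title.strip(),
            "price": price.strip(),
            "persons": person.strip(),
            # Resolve relative paths; absolute URLs pass through unchanged.
            "image_urls": [urljoin(page_url, src)],
        })
    return records


page = "http://www.kgc.cn/list/230-1-6-9-9-0.shtml"
rows = build_records(
    ["Course A", "Course B"],          # made-up sample data
    ["free", "99"],
    ["120", "45"],
    ["/img/a.jpg", "http://example.com/b.jpg"],
    page,
)
```

Inside the spider, response.urljoin(src) would play the same role as urljoin(page_url, src) here.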

items.py

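import scrapy


class KgcItem(scrapy.Item):
    # Text fields carried over from the previous post.
    title = scrapy.Field()
    price = scrapy.Field()
    persons = scrapy.Field()

    # image_urls: URLs for ImagesPipeline to download (set by the spider).
    image_urls = scrapy.Field()
    # images: download results, filled in by ImagesPipeline.
    images = scrapy.Field()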

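Explanation: the item class gains two fields for images. image_urls holds the URLs to download, and images is filled in by ImagesPipeline with information about each downloaded image, such as its storage path (path) and source URL (url), so the file can be looked up later by its path.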
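To make the shape of images concrete, here is a sketch of what one populated entry looks like and how a saved file could be located from it. It assumes Scrapy's default file naming, where the path is full/ followed by the SHA-1 hex digest of the image URL and a .jpg extension; a customized pipeline or a different Scrapy version may name files differently. expected_image_path and local_image_files are hypothetical helpers, and the sample URL is made up.

```python
import hashlib
import os


def expected_image_path(url):
    """Relative path under IMAGES_STORE, assuming the default naming scheme."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(digest)


def local_image_files(item, images_store):
    """Map each entry in item['images'] to a file path under images_store."""
    return [
        os.path.join(images_store, entry["path"])
        for entry in item.get("images", [])
    ]


url = "http://example.com/cover.png"  # made-up URL for illustration
item = {
    "image_urls": [url],
    # Approximate shape of what ImagesPipeline adds after a download;
    # the real entries also carry a checksum, omitted here.
    "images": [{"url": url, "path": expected_image_path(url)}],
}
files = local_image_files(item, "/home/yzhl/kgcimages")
```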

pipelines.py

class KgcPipeline(object):
    def open_spider(self, spider):
        # Runs automatically when the spider starts.
        self.file = open("/home/yzhl/IdeaProjects/kgc/kgc.csv", "w", encoding='utf8')

    def process_item(self, item, spider):
        # Runs once for every item the spider yields.
        # image_urls is a list, so join it into a string before concatenating.
        line = item["title"] + "," + item["price"] + ',' + item["persons"] + ',' + ";".join(item["image_urls"]) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Runs when the spider finishes its work.
        self.file.close()

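Explanation: a CSV file like kgc.csv can hold the text fields but not the images themselves. This pipeline only writes one line for each item the spider yields, so the image download has to be configured separately in the settings file.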
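Building a CSV line by string concatenation breaks as soon as a title contains a comma. As an alternative sketch (not the pipeline used in this post), the standard-library csv module takes care of quoting. item_to_csv_line and CsvPipeline are hypothetical names, the default output path is a placeholder, and joining multiple image URLs with ';' is an assumption about the desired format.

```python
import csv
import io


def item_to_csv_line(item):
    """Serialize one item to a single CSV line with proper quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([
        item["title"],
        item["price"],
        item["persons"],
        ";".join(item["image_urls"]),  # image_urls is a list of strings
    ])
    return buffer.getvalue()


class CsvPipeline:
    """Same three hooks as a Scrapy item pipeline, with csv-safe lines."""

    def __init__(self, path="kgc.csv"):  # placeholder path
        self.path = path

    def open_spider(self, spider):
        # newline="" keeps the line endings exactly as csv produced them.
        self.file = open(self.path, "w", encoding="utf8", newline="")

    def process_item(self, item, spider):
        self.file.write(item_to_csv_line(item))
        return item

    def close_spider(self, spider):
        self.file.close()


sample = {  # made-up sample item
    "title": "Java, basics",
    "price": "free",
    "persons": "120",
    "image_urls": ["http://example.com/a.jpg"],
}
line = item_to_csv_line(sample)
```

The title is wrapped in double quotes in the output, so a spreadsheet or csv.reader still sees four columns.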

settings.py

ITEM_PIPELINES = {
    'kgc.pipelines.KgcPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/yzhl/kgcimages'

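Explanation: add Scrapy's built-in images pipeline to ITEM_PIPELINES, and set IMAGES_STORE in the same settings file (not in the item class) to the folder where the images should be saved. Downloaded images are then written under that folder. Pipelines with lower numbers run first, so with priority 1 the images pipeline runs before KgcPipeline.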
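For reference, here is a slightly fuller settings sketch. The first two settings come from this post; the rest are optional Scrapy settings, shown with what I believe are their default values, plus an illustrative thumbnail size that is purely an assumption. ImagesPipeline also needs the Pillow library installed.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,  # lower number runs first
    "kgc.pipelines.KgcPipeline": 300,
}
IMAGES_STORE = "/home/yzhl/kgcimages"

# Optional: item fields the pipeline reads from and writes to.
IMAGES_URLS_FIELD = "image_urls"
IMAGES_RESULT_FIELD = "images"

# Optional: skip re-downloading images fetched within this many days.
IMAGES_EXPIRES = 90

# Optional, illustrative only: also generate a small thumbnail per image.
IMAGES_THUMBS = {"small": (50, 50)}
```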

Getting the path: cd into the target directory and run pwd to print its absolute path.
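The same absolute path can also be computed in Python rather than copied from pwd. A small sketch, where images_store_for and the default kgcimages folder name are assumptions for illustration:

```python
from pathlib import Path


def images_store_for(base_dir, folder="kgcimages"):
    """Return an absolute path for the image folder under base_dir."""
    return str(Path(base_dir).resolve() / folder)


# Path.cwd() is the Python counterpart of running pwd in the shell.
store = images_store_for(Path.cwd())
```

In settings.py the result could be assigned to IMAGES_STORE so the project does not depend on one machine's home directory.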

Article link: https://www.cnblogs.com/qianshuixianyu/p/9233919.html
