Scraping Lianjia Housing Listings with Python (Part 3)

I previously wrote a scraper for Lianjia's Beijing second-hand housing data. I had planned to finish everything today, but an errand took me out for a while and delayed things. The plan now is to scrape all of Beijing's second-hand listings and store them in MongoDB.

First, starting from 'https://bj.lianjia.com', collect and store all the listing URLs grouped by district and by subway line, then move on to the next stage of analysis. Each listing carries a data-housecode attribute that identifies the property. To avoid duplicate listings, that data-housecode is stored as each property's unique key: before writing a listing to MongoDB, look it up by data-housecode — if it is already stored, skip the insert; otherwise insert the listing.
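The data-housecode sits on each listing card's &lt;a class="img"&gt; element. A minimal sketch of pulling those codes out with only the standard library — the sample HTML fragment and the code values in it are illustrative, not real listings:

```python
import re

# Illustrative fragment of a listing page; the attribute layout follows
# the <a class="img" data-housecode="..."> pattern the spider relies on.
sample = '''
<a class="img" data-housecode="101102330087" href="https://bj.lianjia.com/ershoufang/101102330087.html"></a>
<a class="img" data-housecode="101102445566" href="https://bj.lianjia.com/ershoufang/101102445566.html"></a>
'''

def house_codes(html):
    """Return every data-housecode value in document order."""
    return re.findall(r'data-housecode="(\d+)"', html)

print(house_codes(sample))  # → ['101102330087', '101102445566']
```

In the spider itself the same attribute is read via BeautifulSoup, but the regex version makes the dedup key easy to see in isolation.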

The scraper still uses the Scrapy framework; the only change is that spider.py now collects listings both by district and by subway line (and either grouping could of course be subdivided further).

The rough structure of the crawler is:

The flow in Scrapy is: spider.py hands the URLs to be fetched to the scheduler, and the downloader issues the Requests. If a download fails, the result is reported to the Scrapy engine, which retries the request later. The downloader passes each fetched page back to the spider, which extracts the district or subway-line URLs worth following and hands them to the engine, and the same cycle repeats on those pages. The spider records each listing's identifier and attributes, yields them as an item, and the item is then written to MongoDB.
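The item-to-MongoDB step lives in pipelines.py, which the post does not show. A sketch of what such a pipeline could look like — the URI, database name, and collection name here are assumptions, and the dedup key is the data-housecode stored in item['flag']:

```python
class MongoPipeline:
    """Write items to MongoDB, skipping listings whose flag already exists."""

    def __init__(self, mongo_uri='mongodb://localhost:27017', db_name='house_db'):
        # URI and database name are illustrative defaults, not from the post.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the sketch can be read without pymongo installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name]['houses']

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # data-housecode (item['flag']) is the dedup key: insert only once.
        if self.collection.find_one({'flag': item['flag']}) is None:
            self.collection.insert_one(dict(item))
        return item
```

It would be enabled in settings.py with ITEM_PIPELINES = {'House.pipelines.MongoPipeline': 300} (module path assumed).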

The code in spider.py is as follows:

# -*- coding: utf-8 -*-
import re

import scrapy
from bs4 import BeautifulSoup
from lxml import etree
from scrapy.http import Request

from House.items import HouseItem


class HouseSpider(scrapy.Spider):
    name = 'House'
    url = 'https://bj.lianjia.com'
    base_url = 'https://bj.lianjia.com/ershoufang'
    def start_requests(self):
        yield Request(self.base_url, self.get_area_url, dont_filter=True)


    def get_area_url(self, response):
        # Collect the listing URLs grouped by district and by subway line.
        selector = etree.HTML(response.text)
        results = selector.xpath('//dd/div/div/a/@href')
        for each in results:
            # Some hrefs are relative paths, some are absolute lianjia URLs.
            url = each if 'lianjia' in each else self.url + each
            yield Request(url, self.get_total_page, dont_filter=True)
    def get_total_page(self, response):
        # The pager div carries page-data='{"totalPage":N,"curPage":M}'.
        soup = BeautifulSoup(response.text, 'lxml')
        page_box = soup.find_all('div', class_='page-box house-lst-page-box')
        total_num = re.findall(r'"totalPage":(\d+)', str(page_box))
        if not total_num:
            return
        # range() is end-exclusive, so add 1 to include the last page.
        for i in range(1, int(total_num[0]) + 1):
            url = response.url + 'pg' + str(i)
            yield Request(url, self.parse, dont_filter=True)

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')

        message1 = soup.find_all('div', class_='houseInfo')
        message2 = soup.find_all('div', class_='followInfo')
        message3 = soup.find_all('div', class_='positionInfo')
        message4 = soup.find_all('div', class_='title')
        message5 = soup.find_all('div', class_='totalPrice')
        message6 = soup.find_all('div', class_='unitPrice')
        message7 = soup.find_all('a', class_='img')
        # data-housecode uniquely identifies a listing and is the dedup key.
        flags = [each.get('data-housecode') for each in message7]

        for flag, each, each1, each2, each3, each4, each5 in zip(
                flags, message1, message2, message3, message4, message5, message6):
            item = HouseItem()
            item['flag'] = flag
            # houseInfo: address | layout | area | orientation | decoration [| elevator]
            fields = each.get_text().split('|')
            item['address'] = fields[0].strip()
            item['house_type'] = fields[1].strip()
            item['area'] = fields[2].strip()
            item['toward'] = fields[3].strip()
            item['decorate'] = fields[4].strip()
            # Older listings omit the elevator field.
            item['elevate'] = fields[5].strip() if len(fields) > 5 else 'None'
            # followInfo: interested / viewings / published
            fields = each1.get_text().split('/')
            item['interest'] = fields[0].strip()
            item['watch'] = fields[1].strip()
            item['publish'] = fields[2].strip()
            # positionInfo: building - location
            fields = each2.get_text().split('-')
            item['build'] = fields[0].strip()
            item['local'] = fields[1].strip()
            item['advantage'] = each3.get_text().strip()
            item['price'] = each4.get_text().strip()
            item['unit'] = each5.get_text().strip()
            yield item
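The page-count step in get_total_page hinges on the page-data attribute of the pager div. A compact, runnable version of just that extraction — the sample markup below mirrors the pattern the spider matches but is written by hand for illustration:

```python
import re

# Illustrative pager markup in the shape Lianjia's listing pages use.
pager = ('<div class="page-box house-lst-page-box" '
         'page-data=\'{"totalPage":100,"curPage":1}\' '
         'page-url="/ershoufang/pg{page}"></div>')

def total_pages(html):
    """Extract totalPage from the page-data JSON attribute; 0 if absent."""
    match = re.search(r'"totalPage":\s*(\d+)', html)
    return int(match.group(1)) if match else 0

print(total_pages(pager))  # → 100
```

Anchoring on the "totalPage" key alone is more robust than matching the whole div, since it does not break if Lianjia reorders the surrounding attributes.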


Copyright notice: this is an original article by chenyang920, released under the CC 4.0 BY-SA license; please include the original source link and this notice when republishing.
Original link: https://www.cnblogs.com/chenyang920/p/8035566.html