Multithreaded crawler template with XPath and etree

dreamer-zhang 2021-12-20 Original post


import threading
import time
from queue import Queue

import requests
from lxml import etree


class QuiShi:
    def __init__(self):
        self.temp_url = "http://www.lovehhy.net/Joke/Detail/QSBK/{0}"
        self.headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}
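        # 1. Queue of page URLs to download
        self.url_query = Queue()
        # 2. Queue of downloaded HTML pages
        self.html_query = Queue()
        # 3. Queue of extracted content lists
        self.content_query = Queue()

    def get_url_list(self):
        # Queue the URLs of pages 1 to 4
        for i in range(1, 5):
            self.url_query.put(self.temp_url.format(i))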

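    def parse_url(self):
        # Download each queued URL and pass the decoded HTML downstream
        while True:
            url = self.url_query.get()
            response = requests.get(url, headers=self.headers)
            # The response body is decoded as GBK
            self.html_query.put(response.content.decode("gbk"))
            self.url_query.task_done()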

    def get_content_list(self):
        while True:
            html_str = self.html_query.get()
            # Drop <br /> tags so the text is not split into separate nodes at line breaks
            html_str = html_str.replace("<br />", "").strip()
            # etree.HTML parses the string into an element tree
            html = etree.HTML(html_str)
            h3_list = html.xpath('//div[@id="footzoon"]/h3')
            content_list = []
            for h3 in h3_list:
                item = {}
                item["title"] = h3.xpath("./a/text()")
                item["title_href"] = h3.xpath("./a/@href")
                # Text of the <div> siblings that follow this <h3>
                s = h3.xpath('./following-sibling::div/text()')
                # Remove full-width (ideographic) spaces
                item["content"] = [i.replace("\u3000", "") for i in s]
                content_list.append(item)
            self.content_query.put(content_list)
            self.html_query.task_done()

    def save_content_list(self):
        while True:
            cons = self.content_query.get()
            print(cons)
            self.content_query.task_done()


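    def run(self):
        # 1. Fill the URL queue before any join() call; otherwise
        #    url_query.join() could return while the queue is still empty
        self.get_url_list()

        # 2. Three download threads, one parsing thread, one saving thread
        thread_list = [threading.Thread(target=self.parse_url) for _ in range(3)]
        thread_list.append(threading.Thread(target=self.get_content_list))
        thread_list.append(threading.Thread(target=self.save_content_list))
        for t in thread_list:
            # The workers loop forever, so run them as daemon threads;
            # non-daemon threads would keep the process from exiting
            t.daemon = True
            t.start()

        # 3. Wait, in pipeline order, until every queued item is marked done
        for q in (self.url_query, self.html_query, self.content_query):
            q.join()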


if __name__ == '__main__':
    t1 = time.time()
    quishi = QuiShi()
    quishi.run()
    print(time.time() - t1)
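
The template above relies on `Queue.join()` together with `task_done()` to decide when the pipeline has finished. The following is a minimal offline sketch of that pattern, not the original crawler: `run_pipeline`, `fetch_worker`, and `parse_worker` are hypothetical names, and a simple string transformation stands in for the HTTP request and the XPath extraction so that it runs without network access.

```python
import threading
from queue import Queue


def run_pipeline(urls, n_fetchers=3):
    """Push urls through two queue-connected stages and return the results."""
    url_queue = Queue()
    html_queue = Queue()
    results = []
    results_lock = threading.Lock()

    def fetch_worker():
        while True:
            url = url_queue.get()
            try:
                # Stand-in for requests.get(url).content.decode(...)
                html_queue.put("<html>" + url + "</html>")
            finally:
                # Always mark the item done, or url_queue.join() would hang
                url_queue.task_done()

    def parse_worker():
        while True:
            html = html_queue.get()
            try:
                # Stand-in for etree.HTML(...) plus XPath extraction
                with results_lock:
                    results.append(html[len("<html>"):-len("</html>")])
            finally:
                html_queue.task_done()

    # Fill the first queue before joining it
    for url in urls:
        url_queue.put(url)

    # Daemon threads do not keep the process alive once the work is done
    workers = [threading.Thread(target=fetch_worker, daemon=True)
               for _ in range(n_fetchers)]
    workers.append(threading.Thread(target=parse_worker, daemon=True))
    for w in workers:
        w.start()

    # Join in pipeline order: each stage puts its output on the next
    # queue before calling task_done() on its input
    url_queue.join()
    html_queue.join()
    return results
```

Wrapping the work in `try`/`finally` is a design choice of this sketch: if a stage raises, the item is still marked done and `join()` does not block forever. Note that an exception would still end that worker's loop, so a real crawler would probably also catch and log the error.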

Article link: https://www.cnblogs.com/dreamer-zhang/p/11905889.html
