Python Web Scraping Notes

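These notes follow Professor Song Tian's web-scraping video course, which covers requests, BeautifulSoup, scrapy, and re. I have finished everything except scrapy. I also reproduced a few small projects, such as scraping rental listings from 58.com (58同城) and searching products on Taobao. The notes below cover three things: basic methods, hands-on projects, and the problems I ran into.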

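1. Basic methods

First is the requests library, a simple and practical HTTP request library for Python. Its main methods are as follows. requests.request() is the base method: it constructs a request, and the other methods are all thin wrappers around it. Of those, get() fetches an HTML page, head() fetches only the response headers, post() and put() submit data, patch() submits a partial update, and delete() submits a delete request.

The focus here is requests.get(), whose signature is requests.get(url, params=None, **kwargs).

Here url is the page URL, params is a dictionary of extra query parameters that is appended to the URL, and **kwargs accepts the remaining 12 access-control parameters (data, json, headers, cookies, auth, files, timeout, proxies, allow_redirects, stream, verify, cert).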
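To make the role of params concrete, here is a minimal sketch that builds a request without sending it, so the encoded URL can be inspected offline. It uses requests.Request(...).prepare(); the Baidu search URL and the keyword 'Python' are simply illustrative values.

```python
import requests

# Build (but do not send) a GET request to see how params is encoded.
req = requests.Request("GET", "http://www.baidu.com/s", params={"wd": "Python"})
prepared = req.prepare()

print(prepared.method)   # the HTTP verb that would be sent
print(prepared.url)      # the params dictionary appended as a query string
```

Sending the same request for real would be requests.get("http://www.baidu.com/s", params={"wd": "Python"}), optionally with further keyword arguments such as timeout or headers.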

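In practice, get() is the method we use most often to fetch page content.

Next is the Response object that a request returns. Its most commonly used attributes are status_code, text, encoding, apparent_encoding, and content.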
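The Response attributes are usually combined into one small fetch helper. The sketch below is one possible shape for it. The function name fetch_text, the 10-second timeout, and returning None on failure are assumptions made for illustration.

```python
import requests

def fetch_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()               # 4xx/5xx status -> requests.HTTPError
        r.encoding = r.apparent_encoding   # use the encoding detected from the body
        return r.text
    except requests.RequestException:      # base class of all requests errors
        return None

# A URL without a scheme is rejected before any network traffic happens.
print(fetch_text("not-a-url"))
```

Catching requests.RequestException instead of using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being swallowed.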

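A few commonly used snippets follow.

(1) Scraping a JD.com product page

import requests
url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()   # raise an HTTPError if the status code is 4xx or 5xx
    r.encoding = r.apparent_encoding  # decode r.text with the encoding detected from the body
    print(r.text[:1000])
except requests.RequestException:
    print("Scraping failed!")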

(2) Scraping Amazon, which requires changing the headers field to imitate a browser

import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # simulated request header
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.status_code)
    print(r.text[:1000])
except requests.RequestException:
    print("Scraping failed")

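(3) Submitting a Baidu search keyword through params

import requests
url = "http://www.baidu.com/s"
try:
    kv = {'wd': 'Python'}
    r = requests.get(url, params=kv)
    print(r.request.url)   # the final URL, with the keyword appended as a query string
    r.raise_for_status()
    print(len(r.text))
    print(r.text[500:5000])
except requests.RequestException:
    print("Scraping failed")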

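(4) Downloading and saving an image

import requests
import os
url = "http://tc.sinaimg.cn/maxwidth.800/tc.service.weibo.com/p3_pstatp_com/6da229b421faf86ca9ba406190b6f06e.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]   # use the last URL segment as the file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        with open(path, 'wb') as f:   # the with block closes the file automatically
            f.write(r.content)        # r.content holds the raw image bytes
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Scraping failed")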

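Next is the BeautifulSoup library, which parses page content.

The constructor is BeautifulSoup(mk, 'html.parser'), where mk is the markup to parse. The available parsers are html.parser, lxml, xml, and html5lib; html.parser is used here.

The basic elements are Tag, Name, Attributes, NavigableString, and Comment. A Tag is accessed as soup.a, its name as tag.name, and its attributes as a dictionary, for example a.attrs['class']. A NavigableString (tag.string) is the non-attribute text inside a tag, and a Comment is the text of an HTML comment.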
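The five element types can be seen on a tiny document. This is a minimal sketch that parses an inline, made-up HTML string with html.parser, so nothing is fetched from the network.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup, chosen only to exercise each element type.
html = ('<p class="title"><a href="http://example.com" class="link">Demo</a>'
        '<b><!--a comment--></b></p>')
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                                # Tag: the first <a> element
print(tag.name)                             # Name of the tag
print(tag.attrs["class"])                   # Attributes: class is multi-valued, so a list
print(tag.attrs["href"])
print(tag.string)                           # NavigableString: the text inside the tag
print(isinstance(soup.b.string, Comment))   # Comment: <b> holds only a comment
```

Because a Comment is also returned by .string, checking its type is the way to tell a comment apart from ordinary text.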

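The tag tree can be traversed in three directions: upward (toward parents), downward (toward children), and sideways (across siblings).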
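The three traversal directions map onto a handful of attributes. The sketch below uses a made-up fragment with no whitespace between tags, so that every sibling is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("p")[1]

parent_name = second.parent.name                            # upward: .parent / .parents
child_names = [child.name for child in soup.div.children]   # downward: .children / .descendants
next_text = second.next_sibling.string                      # sideways: .next_sibling
prev_text = second.previous_sibling.string                  # sideways: .previous_sibling

print(parent_name, child_names, next_text, prev_text)
```

On real pages the whitespace between tags shows up as NavigableString siblings, so sibling traversal often needs a type check before the result is used.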

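In addition, soup.prettify() returns the markup as an indented, easy-to-read string.

Information is usually extracted with find_all. Search by tag name with soup.find_all('a'), by attribute with soup.find_all('p', class_='course') (note the trailing underscore, because class is a Python keyword), and by string with soup.find_all(string='…'). A compiled regular expression also works as a filter: soup.find_all(re.compile('link')) matches tags whose name contains 'link'.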

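    find() searches and returns a single result (a Tag, or None if nothing matches)
    find_parents() searches the ancestor nodes and returns a list
    find_parent() returns a single result from the ancestor nodes
    find_next_siblings() searches the following sibling nodes and returns a list
    find_next_sibling() returns a single result from the following sibling nodes
    find_previous_siblings() searches the preceding sibling nodes and returns a list
    find_previous_sibling() returns a single result from the preceding sibling nodes

    find_all(name, attrs, recursive, string, **kwargs) returns a list holding the search results
    Parameters:
        name: filter on the tag name; a list searches for several tags at once, and find_all(True) returns every tag
        attrs: filter on attribute values; a bare string is matched against the class attribute, e.g. find_all('p', 'course'), while keyword arguments match named attributes, e.g. find_all(id='link1')
        recursive: whether to search all descendants; the default is True, and False searches only the direct children of the current node
        string: filter on the string content between <> and </>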
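The find_all parameters are easiest to compare side by side on one small document. The HTML below is a made-up fragment, and the ids link1 and link2 are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="box">
  <p class="course">Python</p>
  <p class="course">Scrapy</p>
  <a href="http://example.com/1" id="link1">first</a>
  <span><a href="http://example.com/2" id="link2">second</a></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                            # name filter
by_class = soup.find_all("p", class_="course")          # attribute filter (class_)
by_string = soup.find_all(string="Python")              # string filter
by_regex = soup.find_all(id=re.compile("link"))         # regex against an attribute
direct_only = soup.div.find_all("a", recursive=False)   # direct children only
first = soup.find("a")                                  # single Tag instead of a list
missing = soup.find("table")                            # None when nothing matches

print(len(by_name), [p.string for p in by_class], first["id"], missing)
```

Note that find() hands back a Tag or None, so its result should be checked before attributes such as .text are read from it.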

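Finally, the re regular-expression library.

The main quantifiers are * (zero or more), + (one or more), ? (zero or one), {m} (exactly m times), and {m,n} (between m and n times).

Greedy matching means a quantifier consumes as many characters as it can while still letting the overall pattern match; non-greedy (minimal) matching consumes as few as it can. The * and + quantifiers are greedy by default. Appending a ? to them, as in *? or +?, makes them non-greedy.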
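The difference between greedy and non-greedy matching shows up as soon as a pattern can end in more than one place. In this small sketch the sample string is arbitrary, and the only change between the two patterns is the trailing ?.

```python
import re

text = "PYANBNCNDN"

greedy = re.search(r"PY.*N", text).group(0)    # .* runs to the last N
lazy = re.search(r"PY.*?N", text).group(0)     # .*? stops at the first N

print(greedy)   # PYANBNCNDN
print(lazy)     # PYAN
```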

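In the re library, patterns are usually written as raw strings, r'text', so that a backslash reaches the regex engine unchanged. Characters that have a special meaning in a pattern, such as . or *, must be escaped with \ to match literally.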
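A short sketch of both points: a raw string saves doubling every backslash, and escaping a special character changes what the pattern matches. The sample strings are made up.

```python
import re

# The raw string and the doubly escaped ordinary string are the same pattern.
same_pattern = r"\d+\.\d+" == "\\d+\\.\\d+"

prices = re.findall(r"\d+\.\d+", "apple 12.50, pear 3.99, fig 7")
literal_dot = re.findall(r"a\.b", "a.b axb")   # \. matches only a real dot
any_char = re.findall(r"a.b", "a.b axb")       # . matches any character

print(same_pattern, prices, literal_dot, any_char)
```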

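The functions are as follows:

    re.search(pattern, string, flags=0) scans a string for the first location that matches the regular expression and returns a match object
    re.match() matches the regular expression from the beginning of a string and returns a match object; note that it returns None when the beginning does not match
    re.findall() searches a string and returns all matching substrings as a list
    re.split() splits a string at every match of the regular expression and returns a list
    re.finditer() searches a string and returns an iterator of matches, each element being a match object
    re.sub() replaces every substring that matches the regular expression and returns the resulting string

    re.compile(pattern, flags) compiles the string form of a regular expression into a regular-expression object

flags defaults to 0. Three commonly used values are re.I (ignore case), re.M (let ^ match at the start of every line), and re.S (let . match every character, including a newline).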
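The functions and flags above can be tried on one sample string. The six-digit postcode pattern and the sample texts below are illustrative choices.

```python
import re

zip_code = r"[1-9]\d{5}"
text = "BIT100081 TSU100084"

print(re.search(zip_code, text).group(0))                  # first match anywhere
print(re.match(zip_code, text))                            # None: text does not start with digits
print(re.findall(zip_code, text))                          # all matches as a list
print(re.split(zip_code, text))                            # the pieces between matches
print([m.group(0) for m in re.finditer(zip_code, text)])   # match objects, one per hit
print(re.sub(zip_code, ":zipcode", text))                  # every match replaced

multi = "a1\nA2\na3"
print(re.findall(r"^a\d", multi))                  # ^ matches only at the very start
print(re.findall(r"^a\d", multi, re.I | re.M))     # every line, ignoring case
print(re.search(r"1.A", multi))                    # . does not cross a newline by default
print(re.search(r"1.A", multi, re.S) is not None)  # with re.S it does
```

Flags are combined with the | operator, as in re.I | re.M above.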

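The above is the function-style usage. There is also an object-oriented usage: compile the pattern once, then call methods on the compiled object.

pat = re.compile(r'')   # put the regular expression between the quotes
pat.search(text)        # text is the string to search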

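Finally, the attributes and methods of the match object are listed below.

    1. Attributes
        1) string: the text being searched
        2) re: the pattern object (regular expression) used for the match
        3) pos: the position in the text where the search starts
        4) endpos: the position in the text where the search ends
    2. Methods
        1) .group(0): the matched string
        2) .start(): the start position of the match in the original string
        3) .end(): the end position of the match in the original string
        4) .span(): the tuple (.start(), .end())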
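The match-object attributes and methods can be read off a single match. This sketch uses a six-digit postcode pattern and an illustrative string.

```python
import re

m = re.search(r"[1-9]\d{5}", "BIT100081 TSU100084")

print(m.string)            # the text that was searched
print(m.re.pattern)        # the pattern that produced the match
print(m.pos, m.endpos)     # search window: 0 up to len(text)
print(m.group(0))          # the matched substring
print(m.start(), m.end())  # where the match sits in the text
print(m.span())            # the same two positions as a tuple
```

Only the first match is reported here; re.finditer() yields one such object per match.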

2. Hands-on projects

Two examples were chosen: Taobao product search and 58.com rental listings. The reference articles are https://blog.csdn.net/u014135206/article/details/103216129?depth_1-utm_source=distribute.pc_relevant_right.none-task-blog-BlogCommendFromBaidu-8&utm_source=distribute.pc_relevant_right.none-task-blog-BlogCommendFromBaidu-8 and https://cloud.tencent.com/developer/article/1611414

Taobao search

import requests
import re

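def getHTMLText(url):

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
    }
    # To find the cookie: open the browser's developer tools, refresh the page
    # in the Network tab, and copy the Cookie value from the request headers
    usercookies = ''        # paste your own logged-in Taobao cookie string here
    cookies = {}
    for a in usercookies.split(';'):
        if '=' not in a:    # skip empty fragments so an empty cookie string does not crash
            continue
        name, value = a.strip().split('=', 1)
        cookies[name] = value
    print(cookies)

    try:
        r = requests.get(url, headers=headers, cookies=cookies, timeout=60)
        r.raise_for_status()  # raise an exception on an HTTP error status
        print(r.status_code)  # print the status code
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'failed'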


def parsePage(ilt, html):
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            # keep the part after the first colon; eval strips the surrounding quotes
            price = eval(plt[i].split(':', 1)[1])
            title = eval(tlt[i].split(':', 1)[1])
            ilt.append([price, title])
    except Exception:
        print("Failed to parse page")


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))  # print the header row
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '足球'   # search keyword ("football")
    depth = 3        # number of result pages to crawl
    start_url = 'http://s.taobao.com/search?q={}&s='.format(goods)  # URL of the first results page
    infoList = []
    for i in range(depth):  # crawl each page in turn
        try:
            url = start_url + str(44 * i)  # the s parameter advances by 44 per page
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)


main()
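The parsing step can be checked offline against a small sample. The sketch below is a variant of parsePage, not the original method: it uses capture groups and json.loads in place of split and eval, so titles containing a colon survive and no scraped text is evaluated as code. The function name parse_items and the sample fragment are hypothetical.

```python
import json
import re

def parse_items(html):
    """Return [price, title] pairs found in the page source."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    # Capture each title together with its quotes so json.loads can unescape it.
    titles = re.findall(r'"raw_title":("(?:\\.|[^"\\])*")', html)
    return [[p, json.loads(t)] for p, t in zip(prices, titles)]

# Hypothetical fragment imitating the two fields the regexes look for.
sample = ('{"raw_title":"football: size 5","view_price":"59.00"},'
          '{"raw_title":"goal \\"net\\"","view_price":"120.50"}')
items = parse_items(sample)
for price, title in items:
    print(price, title)
```

The result has the same [price, title] shape that printGoodsList() expects.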

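Scraping rental listings from 58.com. This part has a lot of code, so only the important pieces are shown.

1. Decoding the obfuscated font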

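# Imports used by the excerpts in this section
# (headers, the request-header dictionary, is defined elsewhere in the full script)
import base64
import time

import requests
from bs4 import BeautifulSoup
from fontTools.ttLib import TTFont
from lxml import etree


# Fetch the font file and convert it to an XML file
def get_font(page_url, page_num, proxies):
    response = requests.get(url=page_url, headers=headers, proxies=proxies)
    # Extract the base64-encoded font string embedded in the page
    base64_string = response.text.split("base64,")[1].split("'")[0].strip()
    # print(base64_string)
    # Decode the base64 string into binary font data
    bin_data = base64.decodebytes(base64_string.encode())
    # Save it as a font file
    with open('58font.woff', 'wb') as f:
        f.write(bin_data)
    print('Page request ' + str(page_num) + ': font file saved!')
    # Load the font file and convert it to XML
    font = TTFont('58font.woff')
    font.saveXML('58font.xml')
    print('Font file converted to XML!')
    return response.text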


# Match the obfuscated character codes to the real digits
def find_font():
    # Digit represented by each glyph name
    glyph_list = {
        'glyph00001': '0',
        'glyph00002': '1',
        'glyph00003': '2',
        'glyph00004': '3',
        'glyph00005': '4',
        'glyph00006': '5',
        'glyph00007': '6',
        'glyph00008': '7',
        'glyph00009': '8',
        'glyph00010': '9'
    }
    # The ten obfuscated character codes
    unicode_list = ['0x9476', '0x958f', '0x993c', '0x9a4b', '0x9e3a', '0x9ea3', '0x9f64', '0x9f92', '0x9fa4', '0x9fa5']
    num_list = []
    # Query the XML file with XPath
    font_data = etree.parse('./58font.xml')
    for unicode in unicode_list:
        # Look up the glyph name that the cmap assigns to this code
        result = font_data.xpath("//cmap//map[@code='{}']/@name".format(unicode))[0]
        # print(result)
        # If the glyph name is a known key, record the digit it stands for
        if result in glyph_list:
            num_list.append(glyph_list[result])
    print('Digits for all codes found!')
    # print(num_list)
    # Return the list of digits, in the same order as unicode_list
    return num_list


# Replace every obfuscated character in the page with its real digit
def replace_font(num, page_response):
    # Code points, in order: 9476 958F 993C 9A4B 9E3A 9EA3 9F64 9F92 9FA4 9FA5
    result = page_response.replace('鑶', num[0]).replace('閏', num[1]).replace('餼', num[2]).replace('驋', num[3]).replace('鸺', num[4]).replace('麣', num[5]).replace('齤', num[6]).replace('龒', num[7]).replace('龤', num[8]).replace('龥', num[9])
    print('All obfuscated characters replaced!')
    return result
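The idea behind the three functions above is a two-step lookup: obfuscated code point to glyph name, then glyph name to digit. The sketch below reproduces that lookup with plain dictionaries and str.translate, without a font file. The cmap excerpt is hypothetical, since the real mapping comes from the downloaded font.

```python
# Hypothetical cmap excerpt: obfuscated code point -> glyph name.
cmap = {0x9476: "glyph00003", 0x958F: "glyph00001", 0x993C: "glyph00010"}

# Glyph names map to digits in order, as in find_font(): glyph00001 -> '0', ...
glyph_to_digit = {"glyph{:05d}".format(i + 1): str(i) for i in range(10)}

# Combine both steps into one translation table: code point -> real digit.
table = {code: glyph_to_digit[name] for code, name in cmap.items()}

obfuscated = chr(0x9476) + chr(0x958F) + chr(0x993C) + chr(0x993C)
decoded = obfuscated.translate(table)
print(decoded)
```

A single translate() call replaces all characters in one pass, which is an alternative to chaining ten replace() calls.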

2. Scraping the rental listings

# Extract the rental information
def parse_pages(pages):
    num = 0
    soup = BeautifulSoup(pages, 'lxml')
    # Find the li tags that hold the individual listings
    all_house = soup.find_all('li', class_='house-cell')
    for house in all_house:
        # Title
        # title = house.find('a', class_='strongbox').text.strip()
        # print(title)
        # Price
        price = house.find('div', class_='money').text.strip()
        price = str(price)
        print(price)
        # Layout and floor area
        layout = house.find('p', class_='room').text.replace(' ', '')
        layout = str(layout)
        print(layout)
        # Estate name and address
        address = house.find('p', class_='infor').text.replace(' ', '').replace('\n', '')
        address = str(address)
        print(address)
        num += 1
        print('Record ' + str(num) + ' scraped, pausing for 3 seconds!')
        time.sleep(3)

        # encoding must be utf-8 because the scraped text and the platform's default
        # file encoding may differ; a+ appends to the end of the file
        with open('58.txt', 'a+', encoding='utf-8') as f:
            f.write(price + '\t' + layout + '\t' + address + '\n')

3. Because 58.com bans the IP addresses of scrapers, proxy IPs also have to be scraped and rotated.

def getiplists(page_num):  # scrape proxy addresses into a list, from pages 1 to page_num - 1
    proxy_list = []
    for page in range(1, page_num):
        url = "  "+str(page)
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        ips = soup.find_all('tr')
        for x in range(5, len(ips)):
            ip = ips[x]
            tds = ip.find_all("td")  # the td cells of this row
            # .contents lists a tag's child nodes; the NavigableString between the tags is one of them
            ip_temp = 'http://' + tds[1].contents[0] + ":" + tds[2].contents[0]
            proxy_list.append(ip_temp)
    proxy_list = set(proxy_list)  # remove duplicates
    proxy_list = list(proxy_list)
    print('Scraped ' + str(len(proxy_list)) + ' IP addresses')
    return proxy_list

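By rebuilding proxies for each request and passing it to requests.get(), the IP address can be rotated continuously. Here item is one address from the proxy list.

proxies = {
    'http': item,
    'https': item,
}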
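One way to organize the rotation is a generator that walks the proxy list in a loop and yields a ready-made proxies dictionary each time. This is a minimal sketch; the function name proxy_rotation and the two addresses are hypothetical, and real addresses would come from getiplists().

```python
import itertools

def proxy_rotation(proxy_list):
    """Yield a requests-style proxies dict for each address, forever."""
    for item in itertools.cycle(proxy_list):
        yield {"http": item, "https": item}

# Hypothetical addresses standing in for the scraped proxy list.
rotation = proxy_rotation(["http://10.0.0.1:8080", "http://10.0.0.2:3128"])
first = next(rotation)
second = next(rotation)
third = next(rotation)   # wraps around to the first address again
print(first, second, third)
```

Each request would then pass proxies=next(rotation) to requests.get(), moving on to the next entry whenever a request fails or gets blocked.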

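3. Lessons learned

The problems encountered are summarized below:

1. Most sites require a simulated request header, namely user-agent.

2. Taobao requires simulating a logged-in session with cookies; the cookie string can be found in the browser's developer tools.

3. This approach can only scrape static pages. Dynamic pages whose data is filled in by JavaScript require additional techniques.

4. IPs are easily banned while scraping. The workaround is to scrape proxy addresses from a proxy-listing site and keep rotating them by passing the proxies parameter to get().

5. 58.com displays price strings with an obfuscated font, which must be decoded.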

Article link: https://www.cnblogs.com/ExMan/p/12736877.html
