Web Scraping Basics II (Data Parsing)
1. Data Parsing
Overview
1. What is data parsing, and what can it do?
- Concept: extracting a targeted subset from a larger set of data
- Purpose: it is what makes a focused crawler possible
2. The general principle behind data parsing
- Question: where can the data an html page displays be stored?
- between tags (as tag text)
- inside tag attributes
1. Locate the tag
2. Extract its text or its attribute value
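As a minimal illustration (the tag and URL below are made up), a single element can carry data in both places:

# a hypothetical fragment showing the two places parsed data can live
html = '<a href="http://example.com/1.jpg">a funny picture</a>'
# "a funny picture"          -> tag text (between the tags)
# "http://example.com/1.jpg" -> attribute value (href)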
Two common ways to crawl image data
Goal: crawl the image data from http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/
How do we crawl image (binary) data?
Method 1: requests
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
pic_data = requests.get(url=url,headers=headers).content # .content returns the binary response data
with open('1.jpg','wb') as fp:
    fp.write(pic_data)
Method 2: urllib
# Method 2: urllib is essentially an older, lower-level requests
import urllib.request
url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
urllib.request.urlretrieve(url=url,filename='./2.jpg')
('./2.jpg', <http.client.HTTPMessage at 0x27aab4fde80>)
Differences
- What is the difference between the two image-crawling methods?
- Method 1 can spoof the User-Agent; method 2 cannot do so directly, because urlretrieve accepts no headers argument (a workaround is sketched below)
- What is the difference between the page source shown under Response in a packet-capture tool and the source shown in the developer tools' Elements tab?
- Elements: shows the page source as it stands once the page has fully loaded, i.e. all of its data, including dynamically loaded data
- Response: shows only what that single request returned, excluding dynamically loaded data
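A workaround sketch for method 2, reusing the headers dict defined above: urlretrieve itself takes no headers, but it goes through the globally installed opener, whose headers we can replace:

import urllib.request
# install_opener swaps the default opener, so every later urlretrieve
# call goes out with the spoofed User-Agent below
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', headers['User-Agent'])]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(url=url, filename='./2.jpg')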
1. re
Example 1: crawl one page of data
# Implementation: crawls one page of image data
import re
import os
import requests
import urllib.request
url = 'http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/'
page_text = requests.get(url,headers=headers).text # page source
# create a folder for the images
dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
# data parsing: the address of every image
ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
img_src_list = re.findall(ex,page_text,re.S) # re.S lets . match newlines, which multi-line page source requires
for src in img_src_list:
    imgName = src.split('/')[-1]
    imgPath = dirName+'/'+imgName
    urllib.request.urlretrieve(url=src,filename=imgPath)
    print(imgName,'downloaded!')
3334500855.jpg downloaded!
1865826151.jpg downloaded!
2591221721.jpeg downloaded!
249789596.png downloaded!
392cbdb4e25246a094490178eb7497d5.gif downloaded!
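To see why re.S matters rather than take it on faith, a minimal self-contained sketch over a made-up snippet:

import re
snippet = '<article>\n<img src="a.jpg" alt="x">\n</article>'
print(re.findall('<article>.*?src="(.*?)"', snippet))       # [] -- without re.S, . stops at newlines
print(re.findall('<article>.*?src="(.*?)"', snippet, re.S)) # ['a.jpg']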
Example 2: crawl the whole site
# Whole-site crawl: grab the image data from every page
# Implementation
# build one general-purpose URL template; the template itself must not be mutated
url = 'http://duanziwang.com/category/搞笑图/%d/'
for page in range(1,4):
    new_url = url%page
    page_text = requests.get(new_url,headers=headers).text # page source
    # create a folder for the images
    dirName = 'imgLibs'
    if not os.path.exists(dirName):
        os.mkdir(dirName)
    # data parsing: the address of every image
    ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
    img_src_list = re.findall(ex,page_text,re.S) # re.S lets . match newlines in the page source
    for src in img_src_list:
        imgName = src.split('/')[-1]
        imgPath = dirName+'/'+imgName
        urllib.request.urlretrieve(url=src,filename=imgPath)
        print(imgName,'downloaded!')
3334500855.jpg downloaded!
1865826151.jpg downloaded!
2591221721.jpeg downloaded!
249789596.png downloaded!
392cbdb4e25246a094490178eb7497d5.gif downloaded!
2. bs4
Installation
- pip install bs4
- pip install lxml
How it parses
- instantiate a BeautifulSoup object and load the page source to be parsed into it
- call the BeautifulSoup object's methods and attributes to locate tags and extract their text data
Instantiation
- BeautifulSoup(fp,'lxml'): loads the contents of a local file into the object for parsing
- BeautifulSoup(page_text,'lxml'): loads data fetched from the internet into the object for parsing
Common bs4 parsing operations (each one is exercised in the sketch below)
Tag location:
    soup.tagName: locates the first tagName tag that appears; returns a single tag
Attribute-based location: soup.find('tagName',attrName='value'); returns a single tag
    find_all('tagName',attrName='value') returns the plural form (a list)
Selector-based location: select('selector'); also returns a list
    Hierarchy selectors:
        >: one level down
        space: any number of levels down
Extracting text
    string: extracts only the tag's own direct text
    text: extracts all the text inside the tag, descendants included
Extracting attributes
    tag['attrName']
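A minimal self-contained sketch over a made-up fragment, touching each operation above:

from bs4 import BeautifulSoup
html = '''
<div class="song">
    <p class="title">static text <b>bold part</b></p>
    <a href="http://example.com/1" class="item">first</a>
    <a href="http://example.com/2" class="item">second</a>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.p)                        # first <p> that appears
print(soup.find('a',class_='item'))  # first match; class_ avoids clashing with the class keyword
print(soup.find_all('a'))            # every <a>, as a list
print(soup.select('.song > a'))      # CSS selector; > is one level down
print(soup.p.string)                 # None here: string needs exactly one direct text child
print(soup.p.text)                   # 'static text bold part' -- all nested text
print(soup.a['href'])                # attribute extraction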
Example 1: crawl the full text of a novel
# fetch the table-of-contents page
import requests
from bs4 import BeautifulSoup
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(main_url,headers=headers).text
fp = open('./sanguo.txt','a',encoding='utf-8')
# parse out each chapter title + its detail-page url
soup = BeautifulSoup(page_text,'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
for a in a_list:
    title = a.string # chapter title
    detail_url = 'http://www.shicimingju.com'+a['href']
    # fetch the detail page's source
    detail_page_text = requests.get(url=detail_url,headers=headers).text
    # parse out the chapter body
    detail_soup = BeautifulSoup(detail_page_text,'lxml')
    div_tag = detail_soup.find('div',class_="chapter_content")
    content = div_tag.text # chapter body
    fp.write(title+':'+content+'\n')
    print(title,'downloaded!')
fp.close()
第一回·宴桃园豪杰三结义 斩黄巾英雄首立功 downloaded!
第二回·张翼德怒鞭督邮 何国舅谋诛宦竖 downloaded!
第三回·议温明董卓叱丁原 馈金珠李肃说吕布 downloaded!
第四回·废汉帝陈留践位 谋董贼孟德献刀 downloaded!
第五回·发矫诏诸镇应曹公 破关兵三英战吕布 downloaded!
第六回·焚金阙董卓行凶 匿玉玺孙坚背约 downloaded!
第七回·袁绍磐河战公孙 孙坚跨江击刘表 downloaded!
3. xpath
Installation, principle, instantiation
Installation
    pip install lxml
How it parses
    instantiate an etree object and load the data to be parsed into it
    call the etree object's xpath method with different xpath expressions to locate tags and extract their text data
Instantiating the etree object
    etree.parse('filePath'): loads local file data into the etree object
    etree.HTML(page_text): loads data fetched from the internet into the object
Every tag in an html document sits in a tree structure, which is what makes efficient node traversal and lookup (location) possible
The xpath method always returns the plural form (a list)
Tag location (each rule below is exercised in the sketch that follows)
    leading /: the xpath expression must start locating from the root tag
    non-leading /: one level down
    leading //: locate the tag starting from any position (the common case)
    non-leading //: any number of levels down
    //tagName: locates every tagName tag
    attribute-based location: //tagName[@attrName="value"]
    index-based location: //tagName[index], where index starts at 1
    fuzzy matching:
        //div[contains(@class, "ng")]
        //div[starts-with(@class, "ta")]
Extracting text
    /text(): the tag's direct text; the list holds a single element
    //text(): all the text underneath; the list holds multiple elements
Extracting attributes
    /@attrName
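A minimal self-contained sketch over a made-up fragment, exercising the rules above:

from lxml import etree
html = '''
<html><body>
<div class="tang">
    <ul>
        <li><a href="http://example.com/1">first</a></li>
        <li><a href="http://example.com/2">second</a></li>
    </ul>
</div>
</body></html>
'''
tree = etree.HTML(html)
print(tree.xpath('/html/body/div/ul/li/a/text()')) # from the root; each / is one level
print(tree.xpath('//li[2]/a/text()'))              # ['second'] -- indexing starts at 1
print(tree.xpath('//div[@class="tang"]//text()'))  # every text node anywhere under the div
print(tree.xpath('//a/@href'))                     # attribute extraction
print(tree.xpath('//div[contains(@class,"ta")]'))  # fuzzy match on the class value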
Example 1: crawl each live stream's room title, heat, and detail-page url from huya
import requests
from lxml import etree
url = 'https://www.huya.com/g/lol'
page_text = requests.get(url=url,headers=headers).text
# data parsing
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="box-bd"]/ul/li')
for li in li_list:
    # scoped parsing: parse only the content under the current tag
    # in a scoped xpath expression, the leading ./ stands for the tag the xpath method is called on
    title = li.xpath('./a[2]/text()')[0]
    hot = li.xpath('./span/span[2]/i[2]/text()')[0]
    detail_url = li.xpath('./a[1]/@href')[0]
    print(title,hot,detail_url)
【100%胜率上钻】弹幕随机 611.0万 https://www.huya.com/kaerlol
单双排【连胜上大师】强势AD目前19-0 277.0万 https://www.huya.com/baozha
莎莉:睡了三个小时的战神开干了 219.6万 https://www.huya.com/836458
峡谷上2080分冲第一:新版野区是爷爷! 170.9万 https://www.huya.com/gushouyu
【重播】12月6日8:00 直播2019全明星赛 167.5万 https://www.huya.com/s
早上好 151.9万 https://www.huya.com/951231
Example 2: xpath image crawl + fixing garbled text
# URL template
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1,11):
    new_url = url%page # only valid for pages after the first
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    page_text = requests.get(new_url,headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_name = li.xpath('./a/img/@alt')[0]+'.jpg'
        img_name = img_name.encode('iso-8859-1').decode('gbk') # round trip repairs the garbled name
        img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        print(img_name,img_src)
穿白色裙子的美女 看书 4k美女壁纸.jpg http://pic.netbian.com/uploads/allimg/191121/223811-1574347091a133.jpg
居家小清新白色小吊带美女4k壁纸.jpg http://pic.netbian.com/uploads/allimg/190902/152344-1567409024af8c.jpg
白色睡裙美女 侧躺 书 烛光 4k美女壁纸3840x2160.jpg http://pic.netbian.com/uploads/allimg/191129/185356-1575024836972e.jpg
Why the iso-8859-1 round trip works
The site serves gbk-encoded pages but declares no charset, so requests falls back to iso-8859-1 when building .text, which garbles the Chinese. Because iso-8859-1 maps every byte 0-255 one-to-one, re-encoding the garbled string as iso-8859-1 recovers the original raw bytes intact, and decoding those bytes as gbk then yields the correct text.
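A sketch of an alternative fix at the response level, assuming the site really is gbk-encoded: set the encoding before reading .text and the per-string round trip becomes unnecessary:

response = requests.get(new_url,headers=headers)
response.encoding = 'gbk' # tell requests the true charset before it decodes
page_text = response.text # now decodes correctly in one step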
Example 3: crawl the cities and hot cities
# the pipe operator | in an xpath expression
# goal: make a single xpath expression general enough to cover both lists
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
# hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="bottom"]/ul/li/a/text()')
all_cities
4. pyquery
1. Installation
pip install pyquery
2. Importing
from pyquery import PyQuery as pq
3. Overview
pyquery is a jquery-like html-parsing library for python; it is used much like bs4.
4. Usage
4.1 Initialization
from pyquery import PyQuery as pq
doc = pq(html)                          # parse an html string
doc = pq(url="http://news.baidu.com/")  # fetch and parse a web page
doc = pq(filename="./a.html")           # parse a local html file
4.2 Basic CSS selectors
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
print(doc("#wrap .s_from link"))
Result:
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
# '#' selects by id, '.' selects by class, 'link' selects by tag name, and the spaces mean nested anywhere inside
4.3 Finding child elements
from pyquery import PyQuery as pq
html = '''
<div id="wrap">
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
# find child elements
doc = pq(html)
items = doc("#wrap")
print(items)
print("type: %s"%type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)
Result:
<div id="wrap">
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
type: <class 'pyquery.pyquery.PyQuery'>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
As the output shows, the results are PyQuery objects, and both the find and the children method retrieve the inner tags
4.4 Finding the parent element
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link href="http://asda.com">asdadasdad12312</link>
        <link href="http://asda1.com">asdadasdad12312</link>
        <link href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
items = doc(".s_from")
print(items)
# find the parent element
parent_href = items.parent()
print(parent_href)
Result:
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
<div href="wrap">
hello nihao
<ul class="s_from">
asdasd
<link href="http://asda.com">asdadasdad12312</link>
<link href="http://asda1.com">asdadasdad12312</link>
<link href="http://asda2.com">asdadasdad12312</link>
</ul>
</div>
parent retrieves the enclosing tag together with everything it contains; the similar parents method retrieves all ancestor nodes
4.5 Finding sibling elements
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
items = doc("link.active1.a123")
print(items)
# find the sibling elements
siblings_href = items.siblings()
print(siblings_href)
Result:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
As the output shows, siblings returns the other tags at the same level
Conclusion: child, parent, and sibling lookups all return PyQuery objects, so a result can be selected against again (see the chaining sketch below)
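A minimal chaining sketch reusing the html string above: because every step returns a PyQuery object, selections compose:

# each call returns a PyQuery object, so the next call can refine it further
first_href = doc("div").find(".s_from").children("link").eq(0).attr("href")
print(first_href) # http://asda.com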
4.6 Iterating over the results
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it)
Result:
<link class="active1 a123" href="http://asda.com">asdadasdad12312</link>
<link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
4.7 Getting attribute values
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.attr('href'))
    print(it.attr.href)
Result:
http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com
4.8 Getting text
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com">asdadasdad12312</link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.text())
Result:
asdadasdad12312
asdadasdad12312
asdadasdad12312
4.9 Getting inner HTML
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.html())
Result:
<a>asdadasdad12312</a>
asdadasdad12312
asdadasdad12312
5. Common DOM operations
5.1 addClass / removeClass: add or remove a class
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("add: %s"%it.addClass('active1'))
    print("remove: %s"%it.removeClass('active1'))
Result:
add: <link class="active1 a123" href="http://asda.com"><a>asdadasdad12312</a></link>
remove: <link class="a123" href="http://asda.com"><a>asdadasdad12312</a></link>
add: <link class="active2 active1" href="http://asda1.com">asdadasdad12312</link>
remove: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
add: <link class="movie1 active1" href="http://asda2.com">asdadasdad12312</link>
remove: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
Note that a class that is already present is not added a second time
5.2 attr / css: attr gets or sets an attribute; css adds a style property
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("set: %s"%it.attr('class','active'))
    print("style: %s"%it.css('font-size','14px'))
Result:
set: <link class="active" href="http://asda.com"><a>asdadasdad12312</a></link>
style: <link class="active" href="http://asda.com" style="font-size: 14px"><a>asdadasdad12312</a></link>
set: <link class="active" href="http://asda1.com">asdadasdad12312</link>
style: <link class="active" href="http://asda1.com" style="font-size: 14px">asdadasdad12312</link>
set: <link class="active" href="http://asda2.com">asdadasdad12312</link>
style: <link class="active" href="http://asda2.com" style="font-size: 14px">asdadasdad12312</link>
attr and css modify the object in place
5.3 remove: removing a tag
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>asdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("div")
print('text before removal:\n%s'%its.text())
it = its.remove('ul')
print('text after removal:\n%s'%it.text())
Result:
text before removal:
hello nihao
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
text after removal:
hello nihao
For other DOM methods, see the API reference:
http://pyquery.readthedocs.io/en/latest/api.html
6. Pseudo-class selectors
from pyquery import PyQuery as pq
html = '''
<div href="wrap">
    hello nihao
    <ul class="s_from">
        asdasd
        <link class='active1 a123' href="http://asda.com"><a>helloasdadasdad12312</a></link>
        <link class='active2' href="http://asda1.com">asdadasdad12312</link>
        <link class='movie1' href="http://asda2.com">asdadasdad12312</link>
    </ul>
</div>
'''
doc = pq(html)
its = doc("link:first-child")
print('first tag: %s'%its)
its = doc("link:last-child")
print('last tag: %s'%its)
its = doc("link:nth-child(2)")
print('second tag: %s'%its)
its = doc("link:gt(0)") # gt is zero-based
print("tags after index 0: %s"%its)
its = doc("link:nth-child(2n-1)")
print("odd-numbered tags: %s"%its)
its = doc("link:contains('hello')")
print("tags whose text contains hello: %s"%its)
Result:
first tag: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
last tag: <link class="movie1" href="http://asda2.com">asdadasdad12312</link>
second tag: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
tags after index 0: <link class="active2" href="http://asda1.com">asdadasdad12312</link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
odd-numbered tags: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
<link class="movie1" href="http://asda2.com">asdadasdad12312</link>
tags whose text contains hello: <link class="active1 a123" href="http://asda.com"><a>helloasdadasdad12312</a></link>
More CSS selectors are covered in the pyquery documentation linked above.