Python爬虫之PyQuery使用
Python爬虫之PyQuery使用
PyQuery简介
pyquery能够通过选择器精确定位 DOM 树中的目标并进行操作。pyquery相当于jQuery的python实现,可以用于解析HTML网页等。它的语法与jQuery几乎完全相同,对于使用过jQuery的人来说很熟悉,也很好上手。
初始化
有 4 种方法可以进行初始化:
可以通过传入 字符串、lxml、文件 或者 url 来使用PyQuery
- from pyquery import PyQuery as pq
- from lxml import etree
- d = pq("<html></html>")#传入字符串
- d = pq(etree.fromstring("<html></html>"))#传入lxml
- d = pq(url='http://baidu.com/') #传入url
- d = pq(filename=path_to_html_file) #传入文件
基本CSS选择器
- html='''
- <html>
- <body>
- <ul class="mh-col">
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-1"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E6%AD%8C%E6%89%8B%E9%AB%98%E7%A9%BA%E6%8B%8DMV%E5%9D%A0%E4%BA%A1&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 歌手高空拍MV坠亡
- </a>
- <span class="mh-ico-up">
- </span>
- </li>
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-2"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BD%9B%E7%A5%96%E6%9C%B1%E9%BE%99%E5%B9%BF%E9%87%91%E5%A9%9A&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D&fr=hao_360so_history_b" target="_blank">
- 佛祖朱龙广金婚
- </a>
- <span class="mh-ico-down">
- </span>
- </li>
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-3"}' href="http://www.so.com/link?url=http%3A%2F%2Fbaike.so.com%2Fzt%2Fzufangfangpian.html%3Fsrc%3Dreci&q=%E5%91%A8%E6%9D%B0%E4%BC%A6%E7%9A%84%E6%AD%8C&ts=1540365398&t=c6890eee0e669832ca96d5223582d6e" target="_blank">
- 常见租房陷阱
- </a>
- <span class="mh-ico-up">
- </span>
- </li>
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-4"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E9%9D%B3%E4%B8%9C%E5%9B%9E%E5%BA%94%E5%8F%91%E9%94%99%E8%AF%97%E8%AF%8D&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 靳东回应发错诗词
- </a>
- </li>
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-5"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%85%A4%E8%80%81%E6%9D%BF%E4%BB%AC%E7%9A%84%E5%BD%B1%E8%A7%86%E6%B1%9F%E6%B9%96&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 煤老板们的影视江湖
- </a>
- </li>
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-6"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=1024%E7%A8%8B%E5%BA%8F%E5%91%98%E8%8A%82&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 1024程序员节
- </a>
- </li>
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-7"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%BE%8E%E7%9A%84%E5%90%88%E5%B9%B6%E5%B0%8F%E5%A4%A9%E9%B9%85&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 美的合并小天鹅
- </a>
- </li>
- <li class="g-ellipsis">
- <a class="g-a-noline" data-md='{"b":"list","p":"1-8"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BA%AC%E6%98%86%E9%AB%98%E9%80%9F4%E8%BD%A6%E7%9B%B8%E6%92%9E&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 京昆高速4车相撞
- </a>
- </li>
- </ul>
- </body>
- </html>
- '''
from pyquery import PyQuery as pq
doc = pq(html)
# 获取所有a标签
print(doc(‘body .mh-col li a’))
注意:
类名用.
id用#
标签用标签名
另外选择的是具有层级关系,从左到右,不是直接的父子的关系。
运行结果如下:
- <a class="g-a-noline" data-md="{"b":"list","p":"1-1"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E6%AD%8C%E6%89%8B%E9%AB%98%E7%A9%BA%E6%8B%8DMV%E5%9D%A0%E4%BA%A1&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 歌手高空拍MV坠亡
- </a>
- <a class="g-a-noline" data-md="{"b":"list","p":"1-2"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BD%9B%E7%A5%96%E6%9C%B1%E9%BE%99%E5%B9%BF%E9%87%91%E5%A9%9A&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D&fr=hao_360so_history_b" target="_blank">
- 佛祖朱龙广金婚
- </a>
- <a class="g-a-noline" data-md="{"b":"list","p":"1-3"}" href="http://www.so.com/link?url=http%3A%2F%2Fbaike.so.com%2Fzt%2Fzufangfangpian.html%3Fsrc%3Dreci&q=%E5%91%A8%E6%9D%B0%E4%BC%A6%E7%9A%84%E6%AD%8C&ts=1540365398&t=c6890eee0e669832ca96d5223582d6e" target="_blank">
- 常见租房陷阱
- </a>
- <a class="g-a-noline" data-md="{"b":"list","p":"1-4"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E9%9D%B3%E4%B8%9C%E5%9B%9E%E5%BA%94%E5%8F%91%E9%94%99%E8%AF%97%E8%AF%8D&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 靳东回应发错诗词
- </a>
- <a class="g-a-noline" data-md="{"b":"list","p":"1-5"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%85%A4%E8%80%81%E6%9D%BF%E4%BB%AC%E7%9A%84%E5%BD%B1%E8%A7%86%E6%B1%9F%E6%B9%96&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 煤老板们的影视江湖
- </a>
- <a class="g-a-noline" data-md="{"b":"list","p":"1-6"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=1024%E7%A8%8B%E5%BA%8F%E5%91%98%E8%8A%82&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 1024程序员节
- </a>
- <a class="g-a-noline" data-md="{"b":"list","p":"1-7"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%BE%8E%E7%9A%84%E5%90%88%E5%B9%B6%E5%B0%8F%E5%A4%A9%E9%B9%85&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 美的合并小天鹅
- </a>
- <a class="g-a-noline" data-md="{"b":"list","p":"1-8"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BA%AC%E6%98%86%E9%AB%98%E9%80%9F4%E8%BD%A6%E7%9B%B8%E6%92%9E&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 京昆高速4车相撞
- </a>
操作
- html='''
- <html>
- <body>
- <ul class="mh-col">
- <li class="g-ellipsis1">
- <a class="g-a-noline1" data-md='{"b":"list","p":"1-1"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E6%AD%8C%E6%89%8B%E9%AB%98%E7%A9%BA%E6%8B%8DMV%E5%9D%A0%E4%BA%A1&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 歌手高空拍MV坠亡
- </a>
- <span class="mh-ico-up">
- </span>
- </li>
- <li class="g-ellipsis2">
- <a class="g-a-noline2" data-md='{"b":"list","p":"1-2"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BD%9B%E7%A5%96%E6%9C%B1%E9%BE%99%E5%B9%BF%E9%87%91%E5%A9%9A&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D&fr=hao_360so_history_b" target="_blank">
- 佛祖朱龙广金婚
- </a>
- <span class="mh-ico-down">
- </span>
- </li>
- <li class="g-ellipsis3">
- <a class="g-a-noline3" data-md='{"b":"list","p":"1-3"}' href="http://www.so.com/link?url=http%3A%2F%2Fbaike.so.com%2Fzt%2Fzufangfangpian.html%3Fsrc%3Dreci&q=%E5%91%A8%E6%9D%B0%E4%BC%A6%E7%9A%84%E6%AD%8C&ts=1540365398&t=c6890eee0e669832ca96d5223582d6e" target="_blank">
- 常见租房陷阱
- </a>
- <span class="mh-ico-up">
- </span>
- </li>
- <li class="g-ellipsis4">
- <a class="g-a-noline4" data-md='{"b":"list","p":"1-4"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E9%9D%B3%E4%B8%9C%E5%9B%9E%E5%BA%94%E5%8F%91%E9%94%99%E8%AF%97%E8%AF%8D&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 靳东回应发错诗词
- </a>
- </li>
- <li class="g-ellipsis5">
- <a class="g-a-noline5" data-md='{"b":"list","p":"1-5"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%85%A4%E8%80%81%E6%9D%BF%E4%BB%AC%E7%9A%84%E5%BD%B1%E8%A7%86%E6%B1%9F%E6%B9%96&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 煤老板们的影视江湖
- </a>
- </li>
- <li class="g-ellipsis6">
- <a class="g-a-noline6" data-md='{"b":"list","p":"1-6"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=1024%E7%A8%8B%E5%BA%8F%E5%91%98%E8%8A%82&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 1024程序员节
- </a>
- </li>
- <li class="g-ellipsis7">
- <a class="g-a-noline7" data-md='{"b":"list","p":"1-7"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%BE%8E%E7%9A%84%E5%90%88%E5%B9%B6%E5%B0%8F%E5%A4%A9%E9%B9%85&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- 美的合并小天鹅
- </a>
- </li>
- <li class="g-ellipsis8">
- <a class="g-a-noline8" data-md='{"b":"list","p":"1-8"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BA%AC%E6%98%86%E9%AB%98%E9%80%9F4%E8%BD%A6%E7%9B%B8%E6%92%9E&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
- <p>新闻</p>
- 京昆高速4车相撞
- </a>
- </li>
- </ul>
- </body>
- </html>
- '''
- from pyquery import PyQuery as pq
- doc = pq(html)
- items = doc('.mh-col')
#.find()
:查找嵌套元素- alist = items.find('li a')
- print(alist)
#查找所有子元素- alist2 = items.children()
- print(alist2)
- #查找指定的子元素
- alist3 = items.children('.g-ellipsis1')
print(alist2)
#查找父元素
#注意:一个元素只有一个父元素
body = items.parent()
print(body)
#查找祖先元素
content = items.parents()
print(content)
#查找兄弟元素
li = doc('.mh-col .g-ellipsis1')
print(li.siblings())
#遍历 单个元素
#遍历所有的a标签
alist =doc('.mh-col li a').items()
for a in alist: print(a)
获取信息
- 获取属性
- a =doc('.mh-col li .g-a-noline8')
- print(a.attr['href'])
- print(a.attr.href)
- 获取文本
- a =doc('.mh-col li .g-a-noline8')
- print(a.text())
- 获取HTML
- a =doc('.mh-col li .g-a-noline8')
- print(a.html())
简单的DOM操作
- #addClass、removerClass
#修改类名- a =doc('.mh-col li .g-a-noline8')
- print(a)
- a.removeClass('g-a-noline8')
- print(a)
- a.addClass('g-a-noline8')
- print(a)
- #attr、css
#修改属性和样式- a =doc('.mh-col li .g-a-noline8')
- print(a)
- a.attr('name','link')
- print(a)
- a.css('font-size','14px')
- print(a)
- #remove
#删除标签
li = doc('.mh-col .g-ellipsis8')
print(li)
li.find('a').remove()
print(li)
更多的DOM操作:https://pyquery.readthedocs.io/en/latest/api.html