Web Scraping: IP Proxy Pools and the BeautifulSoup Module

IP Proxy Pools: Concept and Usage

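1. Many websites include IP banning among their anti-scraping measures.
    Once a site detects that a single IP has made too many requests within a fixed window (for example, 30 requests in one minute), it takes the host IP address behind those requests and adds it to the site's blacklist.
    When a request arrives, the site first checks whether the requesting IP is on the blacklist; if it is, the request is rejected outright.
    Only if it is not does the request move on to the next stage.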
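
The server-side check described above can be sketched roughly as follows. This is only an illustration of the idea, not how any particular site implements it: the names `allow_request`, `blacklist`, `WINDOW_SECONDS` and `MAX_REQUESTS` are hypothetical, and the threshold is assumed to be "more than 30 requests in 60 seconds".

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # assumed length of the counting window
MAX_REQUESTS = 30     # assumed number of requests allowed per window

blacklist = set()             # IPs that have been banned
recent = defaultdict(deque)   # ip -> timestamps of its recent requests


def allow_request(ip, now):
    """Return True if the request may proceed, False if it is rejected."""
    # Step 1: reject immediately if the IP is already blacklisted
    if ip in blacklist:
        return False
    stamps = recent[ip]
    # Step 2: forget requests that have fallen out of the window
    while stamps and now - stamps[0] >= WINDOW_SECONDS:
        stamps.popleft()
    stamps.append(now)
    # Step 3: ban the IP once it exceeds the limit inside the window
    if len(stamps) > MAX_REQUESTS:
        blacklist.add(ip)
        return False
    return True
```

Calling `allow_request('1.2.3.4', t)` once per second lets the first 30 calls through and rejects every later call from that address, which is why a scraper that always uses the same IP gets locked out.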
    
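IP proxy pools emerged as a response to this kind of IP banning.
    A proxy pool holds many IPs. Each time you visit a website,
    you pick one IP at random from the pool and use it as a disguise.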
    
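Usage in Practice
# Proxy addresses can be obtained from free or paid providers
import requests

# Keys are URL schemes. A dict cannot hold the same key more than once,
# so list one proxy per scheme.
proxies = {
    'http': 'http://123.163.117.55:9999',
    'https': 'http://123.163.117.55:9999',
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)

print(response.status_code)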
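
The snippet above always uses one fixed proxy. Picking a random address from a pool for each request, as described earlier, might look like the sketch below. The pool contents are placeholders rather than working proxies, and `PROXY_POOL`, `pick_proxies` and `fetch` are hypothetical names introduced for illustration.

```python
import random

# Hypothetical pool; these addresses are placeholders, not working proxies
PROXY_POOL = [
    '123.163.117.55:9999',
    '10.0.0.2:8080',
    '10.0.0.3:8080',
]


def pick_proxies(pool):
    """Pick one address at random and build the dict that requests expects."""
    address = random.choice(pool)
    proxy_url = 'http://' + address
    # Use the same proxy for both plain and TLS traffic
    return {'http': proxy_url, 'https': proxy_url}


def fetch(url, pool, timeout=5):
    """Fetch url through a randomly chosen proxy from pool."""
    import requests  # imported lazily so pick_proxies works without requests
    return requests.get(url, proxies=pick_proxies(pool), timeout=timeout)
```

A call such as `fetch('https://www.12306.cn', PROXY_POOL)` then goes out through a different address on most requests. Free proxies fail often, so in practice you would probably wrap the call in a retry that picks another address on error.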

The Beautiful Soup Module

Beautiful Soup can save you hours or even days of work.

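# Install Beautiful Soup
pip install beautifulsoup4  # the trailing 4 is required; do not leave it off

# Parsers
    Four are available; two are commonly used
    html.parser  built in, nothing to install
    lxml         must be installed separately
        pip3 install lxml

# Import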
from bs4 import BeautifulSoup

Basic Usage

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Pass the HTML content to BeautifulSoup to build a soup object
soup = BeautifulSoup(html_doc, 'lxml')  # the parser tolerates malformed HTML

res = soup.prettify()  # re-indent the markup so its structure is easy to read
print(res)
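
To see the fault tolerance mentioned above, here is a small sketch that feeds Beautiful Soup a fragment with unclosed tags. It assumes the built-in `html.parser` so that nothing beyond `beautifulsoup4` needs to be installed; other parsers may repair broken markup differently. The variable name `broken` is illustrative only.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this fragment
broken = '<p>unclosed <b>bold'
soup = BeautifulSoup(broken, 'html.parser')

# The tree is still navigable: the open tags are closed for us
print(soup.b.string)
print(soup.p.get_text())
print(soup.prettify())
```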

Common Operations

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title jason" username="jason">123<b id="bbb" class="boldest">The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

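from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.a)  # find an a tag; only the first match is returned

print(soup.p.name)  # get the tag name

print(soup.p.attrs)  # all attributes of the tag, as a dict
# {'id': 'my p', 'class': ['title', 'jason'], 'username': 'jason'}

print(soup.p.text)  # all text inside the p tag, including text in nested tags

# .string is rarely used
# It returns text only when the tag has a single child.
# Here p holds both "123" and a <b> tag, so this prints None.
print(soup.p.string)

# Nested selection
soup.head.title.string  # descend one level at a time
soup.body.a.string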

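# Child nodes and descendants
soup.p.contents  # list of all direct children of p
soup.p.children  # iterator over the direct children of p
for child in soup.p.children:
    print(child)
soup.p.descendants  # generator over all descendants of p, at any depth

# Parent and ancestors
soup.a.parent  # parent of the a tag
soup.a.parents  # all ancestors of the a tag: parent, grandparent, and so on
for p in soup.a.parents:
    print(p)

# Siblings
soup.a.next_sibling  # the next sibling
for i in soup.a.next_siblings:
    print(i)
soup.a.previous_sibling  # the previous sibling
list(soup.a.next_siblings)  # following siblings; next_siblings is a generator
soup.a.previous_siblings  # preceding siblings, also a generator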
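
Two details of the navigation attributes above tend to surprise people: `.string` returns `None` as soon as a tag has more than one child, and the sibling generators yield text nodes as well as tags. The sketch below illustrates both on a shortened document. It assumes `html.parser`, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p id="mixed">123<b>bold</b></p>'
    '<p id="links"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</p>'
)
soup = BeautifulSoup(html, 'html.parser')

mixed = soup.find(id='mixed')
print(mixed.string)  # None: the tag has two children, "123" and <b>
print(mixed.text)    # 123bold

# next_siblings also yields the strings between the tags (", " and " and ")
all_siblings = list(soup.a.next_siblings)

# Keep only real tags by checking the node type
tag_siblings = [s for s in soup.a.next_siblings if isinstance(s, Tag)]
print([t['id'] for t in tag_siblings])  # ['link2', 'link3']

# find_next_sibling skips the text nodes for you
print(soup.a.find_next_sibling('a')['id'])  # link2
```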

Filters

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
# Five kinds of filter: string, regular expression, list, True, function
# 1. String: a tag name. The result is a list whose elements are the tag objects.
print(soup.find_all('b'))  # [<b id="bbb" class="boldest">The Dormouse's story</b>]

# 2. Regular expression
import re   # always check what data type the result actually is
print(soup.find_all(re.compile('^b')))  # tags whose name starts with b: body and b

# 3. List: Beautiful Soup returns anything that matches any element of the list.
print(soup.find_all(['a', 'b']))  # all <a> tags and <b> tags in the document

# 4. True: matches every tag in the document, but not the string nodes
print(soup.find_all(True))
for tag in soup.find_all(True):
    print(tag.name)

# 5. Function: if no other filter fits, define a function that takes a single
# element as its argument. Return True if the element matches, False otherwise.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))
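
The examples above filter on tag names. The same kinds of filter can also be applied to attribute values by passing them as keyword arguments to `find_all`. The following is a minimal sketch on a trimmed-down copy of the sample document, assuming `html.parser`; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# String: exact attribute value
by_string = soup.find_all(id='link2')

# Regular expression: href containing "lacie" or "tillie"
by_regex = soup.find_all(href=re.compile('lacie|tillie'))

# List: id equal to any of the listed values
by_list = soup.find_all(id=['link1', 'link3'])

# True: any tag that has an id attribute at all
by_true = soup.find_all(id=True)

# Function: receives the attribute value, or None when the attribute is missing
by_func = soup.find_all(class_=lambda c: c is not None and c.startswith('t'))

for name, result in [('string', by_string), ('regex', by_regex),
                     ('list', by_list), ('True', by_true), ('function', by_func)]:
    print(name, [tag.name for tag in result])
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so Beautiful Soup uses `class_` as the keyword argument name.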

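Summary

1. Finding tags is straightforward
    find()      returns the first match, or None if nothing matches
    find_all()  returns a list of all matches
"""
Commonly used arguments
    name        find tags by tag name
    id          find tags by id
    class_      find tags by class
"""

2. Getting the text inside a tag
    tag.text

3. Getting the value of a tag attribute
    the href attribute of an a tag
        a.get('href')
    the src attribute of an img tag
        img.get('src')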
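
The three points in this summary can be put together in one short sketch. The HTML fragment and variable names are made up for illustration, and `html.parser` is assumed.

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<a id="home" class="nav" href="/index.html">Home</a>'
    '<a class="nav" href="/about.html">About</a>'
    '<img src="/logo.png">'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# 1. Find tags by name, id, or class
first_link = soup.find('a')                # first match only
nav_links = soup.find_all(class_='nav')    # list of every match
home = soup.find(id='home')
missing = soup.find('table')               # None when nothing matches

# 2. Text inside a tag
print(home.text)  # Home

# 3. Attribute values; .get returns None if the attribute is absent
print([a.get('href') for a in nav_links])  # ['/index.html', '/about.html']
print(soup.img.get('src'))                 # /logo.png
print(home.get('title'))                   # None
```

Because `find()` returns `None` on a miss, it is usually worth checking the result before reading `.text` or calling `.get()` on it.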

Chinese Documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-parents-find-parent

Link to this article: https://www.cnblogs.com/shof/p/13694225.html
