beautifulSoup基本用法及find选择器
总结来源于官方文档:https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all
示例代码段
html_doc = """
<html>
<head><title>The Dormouse\'s story <!--Hey, buddy. Want to buy a used parser?-->
<a><!--Hey, buddy. Want to buy a used parser?--></a></title>
</head>
<body>
<p class="title">
<b>The Dormouse\'s story</b>
<a><!--Hey, buddy. Want to buy a used parser?--></a>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1 link4">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
1、快速操作:
soup.title == soup.find(\'title\') # <title>The Dormouse\'s story</title> soup.title.name # u\'title\' soup.title.string == soup.title.text == soup.title.get_text() # u\'The Dormouse\'s story\' soup.title.parent.name # u\'head\' soup.p == soup.find(\'p\') # . 点属性,只能获取当前标签下的第一个标签 # <p class="title"><b>The Dormouse\'s story</b></p> soup.p[\'class\'] # u\'title\' soup.a == soup.find(\'a\') # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> soup.find_all(\'a\') # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all([\'a\',\'b\']) # 查找所有的a标签和b标签
soup.find_all(id=["link1","link2"]) # 查找所有id=link1 和id=link2的标签
soup.find(id="link3") # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
2、Beautiful Soup对象有四种类型:
1、BeautifulSoup
2、tag:标签
3、NavigableString : 标签中的文本,可包含注释内容
4、Comment :标签中的注释,纯注释,没有正文内容
标签属性的操做跟字典是一样一样的
html多值属性(xml不适合):
意思为一个属性名称,它是多值的,即包含多个属性值,即使属性中只有一个值也返回值为list,
如:class,rel , rev , accept-charset , headers , accesskey
其它属性为单值属性,即使属性值中有多个空格隔开的值,也是反回一个字符串
soup.a[\'class\'] #[\'sister\'] id_soup = BeautifulSoup(\'<p id="my id"></p>\') id_soup.p[\'id\'] #\'my id\'
3、html中tag内容输出:
string:输出单一子标签文本内容或注释内容(选其一,标签中包含两种内容则输出为None)
strings: 返回所有子孙标签的文本内容的生成器(不包含注释)
stripped_strings:返回所有子孙标签的文本内容的生成器(不包含注释,并且在去掉了strings中的空行和空格)
text:只输出文本内容,可同时输出多个子标签内容
get_text():只输出文本内容,可同时输出多个子标签内容
string:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>" soup = BeautifulSoup(markup, \'html.parser\') comm = soup.b.string print(comm) # Hey, buddy. Want to buy a used parser? print(type(comm)) #<class \'bs4.element.Comment\'>
strings:
head_tag = soup.body for s in head_tag.strings: print(repr(s)) 结果: \'\n\' "The Dormouse\'s story" \'\n\' \'Once upon a time there were three little sisters; and their names were\n \' \'Elsie\' \',\n \' \'Lacie\' \' and\n \' \'Tillie\' \';\n and they lived at the bottom of a well.\n \' \'\n\' \'...\' \'\n\'
stripped_strings:
head_tag = soup.body for s in head_tag.stripped_strings: print(repr(s)) 结果: "The Dormouse\'s story" \'Once upon a time there were three little sisters; and their names were\' \'Elsie\' \',\' \'Lacie\' \'and\' \'Tillie\' \';\n and they lived at the bottom of a well.\' \'...\'
text:
soup = BeautifulSoup(html_doc, \'html.parser\') head_tag = soup.body print(head_tag.text) 结果: The Dormouse\'s story Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...
soup = BeautifulSoup(html_doc, \'html.parser\') head_tag = soup.body print(repr(head_tag.text)) 结果: "\nThe Dormouse\'s story\nOnce upon a time there were three little sisters; and their names were\n Elsie,\n Lacie and\n Tillie;\n and they lived at the bottom of a well.\n \n...\n"
4、返回子节点列表:
.contents: 以列表的方式返回节点下的直接子节点
.children:以生成器的方式反回节点下的直接子节点
soup = BeautifulSoup(html_doc, \'html.parser\') head_tag = soup.head print(head_tag) print(head_tag.contents) print(head_tag.contents[0]) print(head_tag.contents[0].contents) for ch in head_tag.children: print(ch) 结果: <head><title>The Dormouse\'s story</title></head> [<title>The Dormouse\'s story</title>] <title>The Dormouse\'s story</title> ["The Dormouse\'s story"] <title>The Dormouse\'s story</title>
5、返回子孙节点的生成器:
.descendants: 以列表的方式返回标签下的子孙节点
for ch in head_tag.descendants: print(ch) 结果: <title>The Dormouse\'s story</title> The Dormouse\'s story
6、父标签(parent):如果是bs4对象,不管本来是标签还是文本都可以找到其父标签,但是文本对象不能找到父标签
soup = BeautifulSoup(html_doc, \'html.parser\') tag_title = soup.b # b标签 print(tag_title.parent) # b标签的父标签 p print(type(tag_title.string)) # b标签中的文本的类型,文本中有注释时结果为None <class \'bs4.element.NavigableString\'> print(tag_title.string.parent) # b标签中文本的父标签 b print(type(tag_title.text)) # b 标签中的文本类型为str,无bs4属性找到父标签
7、递归父标签(parents):递归得到元素的所有父辈节点
soup = BeautifulSoup(html_doc, \'html.parser\') link = soup.a for parent in link.parents: print(parent.name)
结果:
p
body
html
[document]
8、前后节点查询(不是前后标签哦,文本也是节点之一):previous_sibling,next_sibling
9、以生成器的方式迭代返回所有兄弟节点
for sib in soup.a.next_siblings: print(sib) print("---------") 结果: ------------- , --------- <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> --------- --------- <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> --------- ; and they lived at the bottom of a well. ---------
10、搜索文档树
过滤器:
1、字符串
2、正则表达式
3、列表
4、True
5、方法
html_doc = """<html><head><title>The Dormouse\'s story</title></head> <body> <p class="title"><b>The Dormouse\'s story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were</p> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well. <p class="story">...</p> </body> """ from bs4 import BeautifulSoup import re soup = BeautifulSoup(html_doc, \'html.parser\') soup.find_all("a") # 字符串参数 soup.find_all(re.compile("^b")) # 正则参数 soup.find_all(re.compile("a")) # 正则参数 soup.find_all(re.compile("l$")) # 正则参数 soup.find_all(["a", "b"]) # 标签的列表参数 soup.find_all(True) # 返回所有标签 def has_class_no_id(tag): return tag.has_attr("class") and not tag.has_attr("id") soup.find_all(has_class_no_id) # 方法参数
11、find选择器:
语法 :
# find_all( name , attrs , recursive , text , **kwargs ) # name :要查找的标签名 # attrs: 标签的属性 # recursive: 递归 # text: 查找文本 # **kwargs :其它 键值参数
特殊情况:
data-foo="value",因中横杠不识别的原因,只能写成attrs={"data-foo":"value"},
class="value",因class是关键字,所以要写成class_="value"或attrs={"class":"value"}
from bs4 import BeautifulSoup import re html_doc = """ <html><head><title>The Dormouse\'s story</title></head> <p class="title"><b>The Dormouse\'s story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ # find_all( name , attrs , recursive , text , **kwargs ) # name :要查找的标签名(字符串、正则、方法、True) # attrs: 标签的属性 # recursive: 递归 # text: 查找文本 # **kwargs :其它 键值参数 soup = BeautifulSoup(html_doc, \'html.parser\') print(soup.find_all(\'p\', \'title\')) # p标签且class="title" soup.find_all(\'title\') # 以列表形式返回 所有title标签a soup.find_all(attrs={"class":"sister"}) # 以列表形式返回 所有class属性==sister的标签 soup.find_all(id=\'link2\') # 返回所有id属性==link2的标签 soup.find_all(href=re.compile("elsie")) # 返回所有href属性包含elsie的标签 soup.find_all(id=True) # 返回 所有包含id属性的标签 soup.find_all(id="link1", href=re.compile(\'elsie\')) # id=link1且href包含elsie
关于class的搜索
soup = BeautifulSoup(html_doc, \'html.parser\') css_soup = BeautifulSoup(\'<p class="body strikeout"></p>\', \'html.parser\') css_soup.find_all("p", class_="body") # 多值class,指定其中一个即可 css_soup.find_all("p", class_="strikeout") css_soup.find_all("p", class_="body strikeout") # 精确匹配 # text 参数可以是字符串,列表、方法、True soup.find_all("a", text="Elsie") # text="Elsie"的a标签
12、父节点方法:
find_parents( name , attrs , recursive , text , **kwargs )
find_parent( name , attrs , recursive , text , **kwargs )
html_doc = """<html> <head> <title>The Dormouse\'s story</title> </head> <body> <p class="title"><b>The Dormouse\'s story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were</p> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <p> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and </p> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well. <p class="story">...</p> </body> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, \'html.parser\') a_string = soup.find(text="Lacie") # 文本为Lacie的节点 type(a_string), a_string # <class \'bs4.element.NavigableString\'> Lacie a_parent = a_string.find_parent() # a_string的父节点中的第一个节点 a_parent = a_string.find_parent("p") # a_string的父节点中的第一个p节点 a_parents = a_string.find_parents() # a_string的父节点 a_parents = a_string.find_parents("a") # a_string的父点中所有a节点
13、后面的邻居节点:
find_next_siblings( name , attrs , recursive , text , **kwargs )
find_next_sibling( name , attrs , recursive , text , **kwargs )
html_doc = """<html><head><title>The Dormouse\'s story</title></head> <body> <p class="title"><b>The Dormouse\'s story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were</p> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <b href="http://example.com/elsie" class="sister" id="link1">Elsie</b>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well. <p class="story">...</p> </body> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, \'html.parser\') first_link = soup.a # 第一个a标签 a_sibling = first_link.find_next_sibling() # 后面邻居的第一个 a_sibling = first_link.find_next_sibling("a") # 后面邻居的第一个a a_siblings = first_link.find_next_siblings() # 后面的所有邻居 a_siblings = first_link.find_next_siblings("a") # 后面邻居的所有a邻居
14、前面的邻居节点:
find_previous_siblings( name , attrs , recursive , text , **kwargs )
find_previous_sibling( name , attrs , recursive , text , **kwargs )
15、后面的节点:
find_all_next( name , attrs , recursive , text , **kwargs )
find_next( name , attrs , recursive , text , **kwargs )
html_doc = """<html> <head> <title>The Dormouse\'s story</title> </head> <body> <p class="title"><b>The Dormouse\'s story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were</p> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <p> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and </p> <p> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; </p> and they lived at the bottom of a well. <p class="story">...</p> </body> """ from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, \'html.parser\') a_string = soup.find(text="Lacie") a_next = a_string.find_next() # 后面所有子孙标签的第一个 a_next = a_string.find_next(\'a\') # 后面所有子孙标签的第一个a标签 a_nexts = a_string.find_all_next() # 后面的所有子孙标签 a_nexts = a_string.find_all_next(\'a\') # 后面的所有子孙标签中的所有a标签
16、前面的节点:
find_all_previous( name , attrs , recursive , text , **kwargs )
find_previous( name , attrs , recursive , text , **kwargs )
17、解析部分文档:
如果仅仅因为想要查找文档中的<a>标签而将整片文档进行解析,实在是浪费内存和时间.最快的方法是从一开始就把<a>标签以外的东西都忽略掉. SoupStrainer 类可以定义文档的某段内容,这样搜索文档时就不必先解析整篇文档,只会解析在 SoupStrainer 中定义过的文档. 创建一个 SoupStrainer 对象并作为 parse_only 参数给 BeautifulSoup 的构造方法即可。
SoupStrainer 类参数:name , attrs , recursive , text , **kwargs
html_doc = """<html> <head> <title>The Dormouse\'s story</title> </head> <body> <p class="title"><b>The Dormouse\'s story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; </p> and they lived at the bottom of a well. <p class="story">...</p> </body> """ from bs4 import SoupStrainer a_tags = SoupStrainer(\'a\') # 所有a标签 id_tags = SoupStrainer(id="link2") # id=link2的标签 def is_short_string(string): return len(string) < 10 # string长度小于10,返回True short_string = SoupStrainer(text=is_short_string) # 符合条件的文本 from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, \'html.parser\', parse_only=a_tags).prettify() soup = BeautifulSoup(html_doc, \'html.parser\', parse_only=id_tags).prettify() soup = BeautifulSoup(html_doc, \'html.parser\', parse_only=short_string).prettify()
<div id=”zmdao_post_body” class=”blogpost-body”><p> </p><p> 总结来源于官方文档:https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all</p><p> </p><p>示例代码段</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>html_doc = “””<br><html><br> <head><title>The Dormouse\’s story <!–Hey, buddy. Want to buy a used parser?–><br> <a><!–Hey, buddy. Want to buy a used parser?–></a></title><br> </head><br><body><br> <p class=”title”><br> <b>The Dormouse\’s story</b><br> <a><!–Hey, buddy. Want to buy a used parser?–></a><br> </p><br> <p class=”story”>Once upon a time there were three little sisters; and their names were<br> <a href=”http://example.com/elsie” class=”sister” id=”link1 link4″>Elsie</a>,<br> <a href=”http://example.com/lacie” class=”sister” id=”link2″>Lacie</a> and<br> <a href=”http://example.com/tillie” class=”sister” id=”link3″>Tillie</a>;<br> and they lived at the bottom of a well.<br> </p><br> <p class=”story”>…</p><br>”””</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 1、快速操作:</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>soup.title == soup.find(<span style=”color: #800000″>\'</span><span style=”color: #800000″>title</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)# </span><title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></span><span style=”color: #000000″>soup.title.name# u</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>title</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>
soup.title.</span><span style=”color: #0000ff”>string</span> == soup.title.text ==<span style=”color: #000000″> soup.title.get_text()# u</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>The Dormouse</span><span style=”color: #800000″>\'</span>s story<span style=”color: #800000″>\'</span><span style=”color: #000000″>soup.title.parent.name# u</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>head</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>
soup.p </span>== soup.find(<span style=”color: #800000″>\'</span><span style=”color: #800000″>p</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>) # . 点属性,只能获取当前标签下的第一个标签# </span><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span>><b>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</b></p></span><span style=”color: #000000″>soup.p[</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>class</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>]# u</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>title</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>
soup.a </span>== soup.find(<span style=”color: #800000″>\'</span><span style=”color: #800000″>a</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)# </span><a <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>
soup.find_all(</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>a</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)# [</span><a <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>,# </span><a <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a><span style=”color: #000000″>,# </span><a <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><span style=”color: #000000″>]<br>soup.find_all([\’a\’,\’b\’]) # 查找所有的a标签和b标签<br>soup.find_all(id=[“link1″,”link2″]) # 查找所有id=link1 和id=link2的标签<br>soup.find(id</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)# </span><a <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><br><br><br></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> 2、Beautiful Soup对象有四种类型:</p><p> 1、BeautifulSoup</p><p> 2、tag:标签</p><p> 3、NavigableString : 标签中的文本,可包含注释内容</p><p> 4、Comment :标签中的注释,纯注释,没有正文内容</p><p> </p><p> 标签属性的操做跟字典是一样一样的</p><p> html多值属性(xml不适合):</p><p> 意思为一个属性名称,它是多值的,即包含多个属性值,即使属性中只有一个值也返回值为list,</p><p> 如:class,<tt class=”docutils literal”><span class=”pre”>rel</span></tt> , <tt class=”docutils literal”><span class=”pre”>rev</span></tt> , <tt class=”docutils literal”><span class=”pre”>accept-charset</span></tt> , <tt class=”docutils literal”><span class=”pre”>headers</span></tt> , <tt class=”docutils literal”><span class=”pre”>accesskey</span></tt></p><p> 其它属性为单值属性,即使属性值中有多个空格隔开的值,也是反回一个字符串</p><div class=”zmdao_code”><pre>soup.a[<span style=”color: #800000″>\'</span><span style=”color: #800000″>class</span><span style=”color: #800000″>\'</span>] #[<span style=”color: #800000″>\'</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>]
id_soup </span>= BeautifulSoup(<span style=”color: #800000″>\'</span><span style=”color: #800000″><p id=”my id”></p></span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)id_soup.p[</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>id</span><span style=”color: #800000″>\'</span>] #<span style=”color: #800000″>\'</span><span style=”color: #800000″>my id</span><span style=”color: #800000″>\'</span></pre></div><p> </p><p> 3、html中tag内容输出: </p><p> string:输出单一子标签文本内容或注释内容(选其一,标签中包含两种内容则输出为None)</p><p> strings: 返回所有子孙标签的文本内容的生成器(不包含注释)</p><p> stripped_strings:返回所有子孙标签的文本内容的生成器(不包含注释,并且在去掉了strings中的空行和空格)</p><p> text:只输出文本内容,可同时输出多个子标签内容</p><p> get_text():只输出文本内容,可同时输出多个子标签内容</p><p> string:</p><div class=”zmdao_code”><pre>markup = <span style=”color: #800000″>”</span><span style=”color: #800000″><b><!–Hey, buddy. Want to buy a used parser?–></b></span><span style=”color: #800000″>”</span><span style=”color: #000000″>soup </span>= BeautifulSoup(markup, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)comm </span>= soup.b.<span style=”color: #0000ff”>string</span><span style=”color: #000000″>print(comm) # Hey, buddy. Want to buy a used parser?print(type(comm)) #<class \’bs4.element.Comment\’></span></pre></div><p> strings:</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>head_tag =<span style=”color: #000000″> soup.body</span><span style=”color: #0000ff”>for</span> s <span style=”color: #0000ff”>in</span><span style=”color: #000000″> head_tag.strings: print(repr(s))
结果:</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\n</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>”</span><span style=”color: #800000″>The Dormouse\’s story</span><span style=”color: #800000″>”</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\n</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Once upon a time there were three little sisters; and their names were\n </span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Elsie</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>,\n </span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Lacie</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″> and\n </span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Tillie</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>;\n and they lived at the bottom of a well.\n </span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\n</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>…</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\n</span><span style=”color: #800000″>\'</span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> stripped_strings:</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>head_tag =<span style=”color: #000000″> soup.body</span><span style=”color: #0000ff”>for</span> s <span style=”color: #0000ff”>in</span><span style=”color: #000000″> head_tag.stripped_strings: print(repr(s))
结果:</span><span style=”color: #800000″>”</span><span style=”color: #800000″>The Dormouse\’s story</span><span style=”color: #800000″>”</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Once upon a time there were three little sisters; and their names were</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Elsie</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>,</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Lacie</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>and</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>Tillie</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>;\n and they lived at the bottom of a well.</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>…</span><span style=”color: #800000″>\'</span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> text:</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)head_tag </span>=<span style=”color: #000000″> soup.bodyprint(head_tag.text)
结果:The Dormouse</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</span><span style=”color: #000000″>Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. …</span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)head_tag </span>=<span style=”color: #000000″> soup.bodyprint(repr(head_tag.text))
结果:</span><span style=”color: #800000″>”</span><span style=”color: #800000″>\nThe Dormouse\’s story\nOnce upon a time there were three little sisters; and their names were\n Elsie,\n Lacie and\n Tillie;\n and they lived at the bottom of a well.\n \n…\n</span><span style=”color: #800000″>”</span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> </p><p> 4、返回子节点列表:</p><p> .contents: 以列表的方式返回节点下的直接子节点</p><p> .children:以生成器的方式反回节点下的直接子节点</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)head_tag </span>=<span style=”color: #000000″> soup.headprint(head_tag)print(head_tag.contents)print(head_tag.contents[</span><span style=”color: #800080″>0</span><span style=”color: #000000″>])print(head_tag.contents[</span><span style=”color: #800080″>0</span><span style=”color: #000000″>].contents)
</span><span style=”color: #0000ff”>for</span> ch <span style=”color: #0000ff”>in</span><span style=”color: #000000″> head_tag.children: print(ch)
结果:</span><head><title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></head></span>[<title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title>]</span><title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></span>[<span style=”color: #800000″>”</span><span style=”color: #800000″>The Dormouse\’s story</span><span style=”color: #800000″>”</span><span style=”color: #000000″>]</span><title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 5、返回子孙节点的生成器:</p><p> .descendants: 以列表的方式返回标签下的子孙节点</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre><span style=”color: #0000ff”>for</span> ch <span style=”color: #0000ff”>in</span><span style=”color: #000000″> head_tag.descendants: print(ch)
结果:</span><title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></span>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 6、父标签(parent):如果是bs4对象,不管本来是标签还是文本都可以找到其父标签,但是文本对象不能找到父标签</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)tag_title </span>=<span style=”color: #000000″> soup.b # b标签print(tag_title.parent) # b标签的父标签 pprint(type(tag_title.</span><span style=”color: #0000ff”>string</span>)) # b标签中的文本的类型,文本中有注释时结果为None <<span style=”color: #0000ff”>class</span> <span style=”color: #800000″>\'</span><span style=”color: #800000″>bs4.element.NavigableString</span><span style=”color: #800000″>\'</span>><span style=”color: #000000″>print(tag_title.</span><span style=”color: #0000ff”>string</span><span style=”color: #000000″>.parent) # b标签中文本的父标签 bprint(type(tag_title.text)) # b 标签中的文本类型为str,无bs4属性找到父标签</span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 7、递归父标签(parents):递归得到元素的所有父辈节点</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)link </span>=<span style=”color: #000000″> soup.a</span><span style=”color: #0000ff”>for</span> parent <span style=”color: #0000ff”>in</span><span style=”color: #000000″> link.parents: print(parent.name)<br><br>结果:<br></span></pre><p>p<br>body<br>html<br>[document]</p>
<div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 8、前后节点查询(不是前后标签哦,文本也是节点之一):previous_sibling,next_sibling</p><p><img src=”https://images2017.cnblogs.com/blog/931154/201801/931154-20180124082140694-1377077553.png” alt=””></p><p> </p><p> 9、以生成器的方式迭代返回所有兄弟节点</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre><span style=”color: #0000ff”>for</span> sib <span style=”color: #0000ff”>in</span><span style=”color: #000000″> soup.a.next_siblings: print(sib) print(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>———</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)
结果:</span>————-<span style=”color: #000000″>, </span>———<a <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a>———
———<a <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a>———<span style=”color: #000000″>; and they lived at the bottom of a well. </span>———</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 10、搜索文档树</p><p> 过滤器:</p><p> 1、字符串</p><p> 2、正则表达式</p><p> 3、列表</p><p> 4、True</p><p> 5、方法</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>html_doc = <span style=”color: #800000″>”””</span><span style=”color: #800000″><html><head><title>The Dormouse\’s story</title></head></span><body><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span>><b>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</b></p></span><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>Once upon a time there were three little sisters; and their names were</p><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>,</span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a><span style=”color: #000000″> and</span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><span style=”color: #000000″>;and they lived at the bottom of a well.
</span><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>…</p></body><span style=”color: #800000″>”””</span><span style=”color: #0000ff”>from</span><span style=”color: #000000″> bs4 import BeautifulSoupimport resoup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)soup.find_all(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>a</span><span style=”color: #800000″>”</span><span style=”color: #000000″>) # 字符串参数soup.find_all(re.compile(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>^b</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)) # 正则参数soup.find_all(re.compile(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>a</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)) # 正则参数soup.find_all(re.compile(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>l$</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)) # 正则参数soup.find_all([</span><span style=”color: #800000″>”</span><span style=”color: #800000″>a</span><span style=”color: #800000″>”</span>, <span style=”color: #800000″>”</span><span style=”color: #800000″>b</span><span style=”color: #800000″>”</span><span style=”color: #000000″>]) # 标签的列表参数soup.find_all(True) # 返回所有标签def has_class_no_id(tag): </span><span style=”color: #0000ff”>return</span> tag.has_attr(<span style=”color: #800000″>”</span><span style=”color: #800000″>class</span><span style=”color: #800000″>”</span>) and not tag.has_attr(<span style=”color: #800000″>”</span><span style=”color: #800000″>id</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)soup.find_all(has_class_no_id) # 方法参数</span></pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p><span style=”font-family: 黑体; font-size: 14pt”> 11、find选择器:</span></p><p> 语法 :</p><pre><span> # find_all( name , attrs , recursive , text , **<span>kwargs ) # name :要查找的标签名 # attrs: 标签的属性 # recursive: 递归 # text: 查找文本 # **<span>kwargs :其它 键值参数<br><br> 特殊情况:<br> </span></span></span><span class=”s”>data-foo=”value”,</span><span class=”s”>因中横杠不识别的原因,只能写成</span><span class=”s”>attrs={“data-foo”:”value”},</span></pre><pre><span><span><span> class=”value”,因class是关键字,所以要写成class_=”value”或attrs={“class”:”value”}</span></span></span></pre><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre><span style=”color: #0000ff”>from</span><span style=”color: #000000″> bs4 import BeautifulSoupimport rehtml_doc </span>= <span style=”color: #800000″>”””</span><html><head><title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></head></span>
<p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span>><b>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</b></p></span>
<p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>><span style=”color: #000000″>Once upon a time there were three little sisters; and their names were</span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>,</span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a><span style=”color: #000000″> and</span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><span style=”color: #000000″>;and they lived at the bottom of a well.</span></p>
<p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>…</p><span style=”color: #800000″>”””</span><span style=”color: #000000″># find_all( name , attrs , recursive , text , </span>**<span style=”color: #000000″>kwargs )# name :要查找的标签名(字符串、正则、方法、True)# attrs: 标签的属性# recursive: 递归# text: 查找文本# </span>**<span style=”color: #000000″>kwargs :其它 键值参数soup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)print(soup.find_all(</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>p</span><span style=”color: #800000″>\'</span>, <span style=”color: #800000″>\'</span><span style=”color: #800000″>title</span><span style=”color: #800000″>\'</span>)) # p标签且class=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span><span style=”color: #000000″>soup.find_all(</span><span style=”color: #800000″>\'</span><span style=”color: #800000″>title</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>) # 以列表形式返回 所有title标签asoup.find_all(attrs</span>={<span style=”color: #800000″>”</span><span style=”color: #800000″>class</span><span style=”color: #800000″>”</span>:<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span>}) # 以列表形式返回 所有class属性==<span style=”color: #000000″>sister的标签soup.find_all(id</span>=<span style=”color: #800000″>\'</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>\'</span>) # 返回所有id属性==<span style=”color: #000000″>link2的标签soup.find_all(href</span>=re.compile(<span style=”color: #800000″>”</span><span style=”color: #800000″>elsie</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)) # 返回所有href属性包含elsie的标签soup.find_all(id</span>=<span style=”color: #000000″>True) # 返回 所有包含id属性的标签soup.find_all(id</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>, href=re.compile(<span style=”color: #800000″>\'</span><span style=”color: #800000″>elsie</span><span style=”color: #800000″>\'</span>)) # id=link1且href包含elsie</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p><img src=”https://images2017.cnblogs.com/blog/931154/201801/931154-20180128222706647-1457600468.png” alt=””></p><pre>关于class的搜索</pre><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>soup = BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)css_soup </span>= BeautifulSoup(<span style=”color: #800000″>\'</span><span style=”color: #800000″><p class=”body strikeout”></p></span><span style=”color: #800000″>\'</span>, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)css_soup.find_all(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>p</span><span style=”color: #800000″>”</span>, class_=<span style=”color: #800000″>”</span><span style=”color: #800000″>body</span><span style=”color: #800000″>”</span><span style=”color: #000000″>) # 多值class,指定其中一个即可css_soup.find_all(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>p</span><span style=”color: #800000″>”</span>, class_=<span style=”color: #800000″>”</span><span style=”color: #800000″>strikeout</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)css_soup.find_all(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>p</span><span style=”color: #800000″>”</span>, class_=<span style=”color: #800000″>”</span><span style=”color: #800000″>body strikeout</span><span style=”color: #800000″>”</span><span style=”color: #000000″>) # 精确匹配# text 参数可以是字符串,列表、方法、Truesoup.find_all(</span><span style=”color: #800000″>”</span><span style=”color: #800000″>a</span><span style=”color: #800000″>”</span>, text=<span style=”color: #800000″>”</span><span style=”color: #800000″>Elsie</span><span style=”color: #800000″>”</span>) # text=<span style=”color: #800000″>”</span><span style=”color: #800000″>Elsie</span><span style=”color: #800000″>”</span>的a标签</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 12、父节点方法:</p><p> find_parents( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><p> find_parent( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>html_doc = <span style=”color: #800000″>”””</span><span style=”color: #800000″><html></span> <head> <title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></span> </head><body> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span>><b>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</b></p></span> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>Once upon a time there were three little sisters; and their names were</p> <a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>, </span><p> <a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a><span style=”color: #000000″> and </span></p> <a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><span style=”color: #000000″>; and they lived at the bottom of a well. </span><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>…</p></body><span style=”color: #800000″>”””</span><span style=”color: #0000ff”>from</span><span style=”color: #000000″> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)a_string </span>= soup.find(text=<span style=”color: #800000″>”</span><span style=”color: #800000″>Lacie</span><span style=”color: #800000″>”</span><span style=”color: #000000″>) # 文本为Lacie的节点type(a_string), a_string # </span><<span style=”color: #0000ff”>class</span> <span style=”color: #800000″>\'</span><span style=”color: #800000″>bs4.element.NavigableString</span><span style=”color: #800000″>\'</span>><span style=”color: #000000″> Laciea_parent </span>=<span style=”color: #000000″> a_string.find_parent() # a_string的父节点中的第一个节点a_parent </span>= a_string.find_parent(<span style=”color: #800000″>”</span><span style=”color: #800000″>p</span><span style=”color: #800000″>”</span><span style=”color: #000000″>) # a_string的父节点中的第一个p节点a_parents </span>=<span style=”color: #000000″> a_string.find_parents() # a_string的父节点a_parents </span>= a_string.find_parents(<span style=”color: #800000″>”</span><span style=”color: #800000″>a</span><span style=”color: #800000″>”</span>) # a_string的父点中所有a节点</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 13、后面的邻居节点:</p><p> find_next_siblings( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><p> find_next_sibling( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>html_doc = <span style=”color: #800000″>”””</span><span style=”color: #800000″><html><head><title>The Dormouse\’s story</title></head></span><body> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span>><b>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</b></p></span> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>Once upon a time there were three little sisters; and their names were</p> <a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>, </span><b href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</b><span style=”color: #000000″>, </span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a><span style=”color: #000000″> and </span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><span style=”color: #000000″>; and they lived at the bottom of a well. </span><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>…</p></body><span style=”color: #800000″>”””</span><span style=”color: #0000ff”>from</span><span style=”color: #000000″> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)first_link </span>=<span style=”color: #000000″> soup.a # 第一个a标签a_sibling </span>=<span style=”color: #000000″> first_link.find_next_sibling() # 后面邻居的第一个a_sibling </span>= first_link.find_next_sibling(<span style=”color: #800000″>”</span><span style=”color: #800000″>a</span><span style=”color: #800000″>”</span><span style=”color: #000000″>) # 后面邻居的第一个aa_siblings </span>=<span style=”color: #000000″> first_link.find_next_siblings() # 后面的所有邻居a_siblings </span>= first_link.find_next_siblings(<span style=”color: #800000″>”</span><span style=”color: #800000″>a</span><span style=”color: #800000″>”</span>) # 后面邻居的所有a邻居</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 14、前面的邻居节点:</p><p> find_previous_siblings( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><p> find_previous_sibling( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><p> </p><p> 15、后面的节点:</p><p> find_all_next( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><p> find_next( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>html_doc = <span style=”color: #800000″>”””</span><span style=”color: #800000″><html></span> <head> <title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></span> </head><body> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span>><b>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</b></p></span> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>Once upon a time there were three little sisters; and their names were</p> <a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>, </span><p> <a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a><span style=”color: #000000″> and </span></p> <p> <a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><span style=”color: #000000″>; </span></p><span style=”color: #000000″> and they lived at the bottom of a well. </span><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>…</p></body><span style=”color: #800000″>”””</span><span style=”color: #0000ff”>from</span><span style=”color: #000000″> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>)a_string </span>= soup.find(text=<span style=”color: #800000″>”</span><span style=”color: #800000″>Lacie</span><span style=”color: #800000″>”</span><span style=”color: #000000″>)a_next </span>=<span style=”color: #000000″> a_string.find_next() # 后面所有子孙标签的第一个a_next </span>= a_string.find_next(<span style=”color: #800000″>\'</span><span style=”color: #800000″>a</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>) # 后面所有子孙标签的第一个a标签a_nexts </span>=<span style=”color: #000000″> a_string.find_all_next() # 后面的所有子孙标签a_nexts </span>= a_string.find_all_next(<span style=”color: #800000″>\'</span><span style=”color: #800000″>a</span><span style=”color: #800000″>\'</span>) # 后面的所有子孙标签中的所有a标签</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p><p> 16、前面的节点:</p><p> find_all_previous( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><p> find_previous( <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a> )</p><p> </p><p> 17、解析部分文档:</p><p> 如果仅仅因为想要查找文档中的<a>标签而将整片文档进行解析,实在是浪费内存和时间.最快的方法是从一开始就把<a>标签以外的东西都忽略掉. <tt class=”docutils literal”><span class=”pre”>SoupStrainer</span></tt> 类可以定义文档的某段内容,这样搜索文档时就不必先解析整篇文档,只会解析在 <tt class=”docutils literal”><span class=”pre”>SoupStrainer</span></tt> 中定义过的文档. 创建一个 <tt class=”docutils literal”><span class=”pre”>SoupStrainer</span></tt> 对象并作为 <tt class=”docutils literal”><span class=”pre”>parse_only</span></tt> 参数给 <tt class=”docutils literal”><span class=”pre”>BeautifulSoup</span></tt> 的构造方法即可。</p><p><tt class=”docutils literal”><span class=”pre”> SoupStrainer</span></tt> 类参数:<a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id32″>name</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#css”>attrs</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#recursive”>recursive</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#text”>text</a> , <a class=”reference internal” href=”https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#keyword”>**kwargs</a></p><div class=”zmdao_code”><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div><pre>html_doc = <span style=”color: #800000″>”””</span><span style=”color: #800000″><html></span> <head> <title>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</title></span> </head><body> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>title</span><span style=”color: #800000″>”</span>><b>The Dormouse<span style=”color: #800000″>\'</span><span style=”color: #800000″>s story</b></p></span> <p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>><span style=”color: #000000″>Once upon a time there were three little sisters; and their names were </span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/elsie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link1</span><span style=”color: #800000″>”</span>>Elsie</a><span style=”color: #000000″>, </span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/lacie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>>Lacie</a><span style=”color: #000000″> and </span><a href=<span style=”color: #800000″>”</span><span style=”color: #800000″>http://example.com/tillie</span><span style=”color: #800000″>”</span> <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>sister</span><span style=”color: #800000″>”</span> id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link3</span><span style=”color: #800000″>”</span>>Tillie</a><span style=”color: #000000″>; </span></p><span style=”color: #000000″> and they lived at the bottom of a well. </span><p <span style=”color: #0000ff”>class</span>=<span style=”color: #800000″>”</span><span style=”color: #800000″>story</span><span style=”color: #800000″>”</span>>…</p></body><span style=”color: #800000″>”””</span><span style=”color: #0000ff”>from</span><span style=”color: #000000″> bs4 import SoupStrainera_tags </span>= SoupStrainer(<span style=”color: #800000″>\'</span><span style=”color: #800000″>a</span><span style=”color: #800000″>\'</span><span style=”color: #000000″>) # 所有a标签id_tags </span>= SoupStrainer(id=<span style=”color: #800000″>”</span><span style=”color: #800000″>link2</span><span style=”color: #800000″>”</span>) # id=<span style=”color: #000000″>link2的标签def is_short_string(</span><span style=”color: #0000ff”>string</span><span style=”color: #000000″>): </span><span style=”color: #0000ff”>return</span> len(<span style=”color: #0000ff”>string</span>) < <span style=”color: #800080″>10</span><span style=”color: #000000″> # string长度小于10,返回Trueshort_string </span>= SoupStrainer(text=<span style=”color: #000000″>is_short_string) # 符合条件的文本
</span><span style=”color: #0000ff”>from</span><span style=”color: #000000″> bs4 import BeautifulSoupsoup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span>, parse_only=<span style=”color: #000000″>a_tags).prettify()soup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span>, parse_only=<span style=”color: #000000″>id_tags).prettify()soup </span>= BeautifulSoup(html_doc, <span style=”color: #800000″>\'</span><span style=”color: #800000″>html.parser</span><span style=”color: #800000″>\'</span>, parse_only=short_string).prettify()</pre><div class=”zmdao_code_toolbar”><span class=”zmdao_code_copy”><a href=”javascript:void(0);” onclick=”copyCnblogsCode(this)” title=”复制代码”><img src=”//common.cnblogs.com/images/copycode.gif” alt=”复制代码”></a></span></div></div><p> </p></div>