Beautiful Soup is a third-party Python library for extracting data from HTML or XML documents.
Official site: https://www.crummy.com/software/BeautifulSoup/

Install: pip install beautifulsoup4

 

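soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

First argument: the HTML document, as a string or bytes

Second argument: the HTML parser to use

Third argument: the encoding of the HTML document (it only takes effect when the document is passed as bytes; for a str it is ignored)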
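As a rough illustration of the third argument, the sketch below passes the document as bytes so that from_encoding actually takes effect. The sample markup and variable names are made up for this example, and it uses the built-in html.parser so no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, encoded to bytes as it might arrive from a file or socket
raw = '<p>你好</p>'.encode('utf-8')

# from_encoding tells Beautiful Soup how to decode the bytes
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
text = soup.p.string
print(text)
```

If html_doc is already a str, the third argument can simply be left out.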

 

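Tag selector operations

Note: a tag selector such as soup.a returns only the first matching tag; that is how tag selectors behave.

Selecting elements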

    from bs4 import BeautifulSoup

    html_doc = '''
    <div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
    data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
    '''
    # The 'lxml' parser must be installed separately: pip install lxml
    soup = BeautifulSoup(html_doc, 'lxml')
    # Complete the unclosed HTML and return it in indented HTML form
    print(soup.prettify())
    # Print the first <a> tag
    print(soup.a)
    # Print the first <span> tag
    print(soup.span)

  

Output:

    <html>
    <body>
    <div class="container">
    <a class="logo" href="/pc/home?sign=360_79aabe15">
    </a>
    <nav data-mod="nnav" id="nnav">
    <div class="nnav-wrap">
    <ul class="nnav-items" id="nnav_main">
    <li data-index="0">
    <a class="nnav-item" data-ch="youlike" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank">
    推荐
    <span>
    </span>
    </a>
    </li>
    <li data-index="1">
    <a class="nnav-item" data-ch="good_safe2toera" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank">
    新时代
    <span>
    </span>
    </a>
    </li>
    <li data-index="2">
    <a class="nnav-item" data-ch="fun" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank">
    娱乐
    <span>
    </span>
    </a>
    </li>
    <li data-index="3">
    <a class="nnav-item" href="/pc/home?
    data-index=">
    </a>
    <a class="nnav-item" data-ch="economy" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank">
    财经
    <span>
    </span>
    </a>
    </li>
    </ul>
    </div>
    </nav>
    </div>
    </body>
    </html>
    <a class="logo" href="/pc/home?sign=360_79aabe15"></a>
    <span></span>

  

Getting the name, attributes, and content

    from bs4 import BeautifulSoup

    html_doc = '''
    <div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
    data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
    '''
    soup = BeautifulSoup(html_doc, 'lxml')
    # Print the name of the first <a> tag
    print(soup.a.name)
    # Print the class attribute of the first <a> tag; either form works
    print(soup.a.attrs['class'])
    print(soup.a['class'])
    # Print the content of the first <a> tag (None here, because that tag is empty)
    print(soup.a.string)

  

Output:

    a
    ['logo']
    ['logo']
    None
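The None result is easy to misread, so here is a small sketch of two related cases: .string is also None when a tag has more than one child, while get_text() joins all the text inside the tag, and .get() returns None for a missing attribute instead of raising KeyError. The snippet is a hypothetical one modeled on the navigation links, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the navigation links above
html_doc = '<a class="nnav-item" href="/pc/home?ch=fun">娱乐<span></span></a>'
soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.a

string_value = link.string    # None: the tag has two children (text and <span>)
text_value = link.get_text()  # all text inside the tag, joined together
missing = link.get('title')   # None instead of KeyError for a missing attribute
classes = link.get('class')   # class is multi-valued, so this is a list

print(string_value, text_value, missing, classes)
```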

  

Nested selection

    from bs4 import BeautifulSoup

    html_doc = '''
    <a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike"><span>推荐</span></a>
    '''
    soup = BeautifulSoup(html_doc, 'lxml')
    # Select the first <span> inside the first <a>, then read its text
    print(soup.a.span.string)

  

Output:

    推荐
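Nested selection fails with AttributeError as soon as one link in the chain is missing, because a tag selector returns None when nothing matches. A minimal sketch of guarding against that, using a made-up snippet and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the <a> contains a <span> but no <b>
html_doc = '<a href="/pc/home"><span>推荐</span></a>'
soup = BeautifulSoup(html_doc, 'html.parser')

span = soup.a.span  # found
bold = soup.a.b     # there is no <b> tag, so this is None

# Check for None before going one level deeper
span_text = span.string if span is not None else None
bold_text = bold.string if bold is not None else None
print(span_text, bold_text)
```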

  

 

Child and descendant node operations

Getting all child nodes

    from bs4 import BeautifulSoup

    html = '''
    <div class="bc">
    <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
    <span class="fl" style="padding-top: 6px;">
    <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
    <a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> >
    <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
    </span>
    <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    # Method 1: .contents returns the direct children as a list
    print(soup.div.contents)
    # Method 2: .children returns an iterator over the direct children
    print(soup.div.children)
    for i, child in enumerate(soup.div.children):
        print(i, child)

 

Output:

    ['\n', <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>, '\n', <span class="fl" style="padding-top: 6px;">
    <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> >
    <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
    </span>, '\n', <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>, '\n']
    <list_iterator object at 0x0000000002E498D0>
    0

    1 <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
    2

    3 <span class="fl" style="padding-top: 6px;">
    <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> >
    <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
    </span>
    4

    5 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
    6
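Half of the children listed in that output are just the '\n' text nodes between tags. One way to skip them, sketched here on a shortened, made-up version of the markup and with the built-in html.parser, is to keep only the children that are Tag objects:

```python
from bs4 import BeautifulSoup, Tag

# Shortened, hypothetical version of the markup above
html = '''
<div class="bc">
<span class="fl">first</span>
<a href="http://www.xdf.cn/">second</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keep only element children, dropping the whitespace text nodes between them
tag_children = [child for child in soup.div.children if isinstance(child, Tag)]
names = [child.name for child in tag_children]
print(names)
```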

  

Getting all descendant nodes

    from bs4 import BeautifulSoup

    html = '''
    <div class="bc">
    <span class="fl" style="padding-top: 1px;">
    <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
    <span class="fl" style="padding-top: 6px;">
    <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
    <a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> >
    <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
    <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a> </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # .descendants is a generator over children, grandchildren, and so on
    print(soup.div.descendants)
    for i, child in enumerate(soup.div.descendants):
        print(i, child)

 

Output:

    <generator object descendants at 0x00000000028F5AF0>
    0

    1 <span class="fl" style="padding-top: 1px;">
    <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
    2

    3 <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a>
    4 <img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/>
    5

    6 <span class="fl" style="padding-top: 6px;">
    <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> >
    <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
    7

    8 <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
    9 四级
    10

    11 <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a>
    12 新东方在线
    13 >
    14 <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a>
    15 四级
    16 >
    17 <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a>
    18 英语四级词汇
    19 > 正文
    20

    21 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
    22 <img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/>
    23

  

Parent and ancestor node operations

Getting the parent node and ancestor nodes

    from bs4 import BeautifulSoup

    html = '''
    <div class="bc">
    <span class="fl" style="padding-top: 1px;">
    <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
    <span class="fl" style="padding-top: 6px;">
    <a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
    <a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> >
    <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> >
    <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
    <a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a> </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)   # the direct parent node
    print(soup.a.parents)  # a generator over all ancestor nodes

 

Output:

    <span class="fl" style="padding-top: 1px;">
    <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
    <generator object parents at 0x00000000028C5B48>
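Printing .parents only shows the generator object. To see the ancestors themselves, iterate over it. The sketch below does that on a shortened, made-up version of the markup, using the built-in html.parser (which, unlike lxml, does not add html and body tags):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup above
html = '<div class="bc"><span class="fl"><a href="http://www.koolearn.com/">link</a></span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Walk up the tree from the first <a>; the last ancestor is the soup object itself
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['span', 'div', '[document]']
```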

  

Sibling node operations

Getting sibling nodes

    from bs4 import BeautifulSoup

    html = '''
    <div class="more_box" id="moreBox">
     <h3>360识图</h3>
    <a href="javascript:;" id="btnLoadMore" class="btn_loadmore">加载更多</a>
    <p id="imgTotal" class="img_total">找到相关图片约 2637 张</p>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.next_siblings)      # siblings that come after this node
    print(soup.a.previous_siblings)  # siblings that come before this node

  

Output:

    <generator object next_siblings at 0x0000000002885B48>
    <generator object previous_siblings at 0x0000000002885B48>
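Since both attributes are generators, they have to be iterated to get at the sibling nodes. A minimal sketch on a condensed, hypothetical version of the markup (no whitespace between tags, built-in html.parser):

```python
from bs4 import BeautifulSoup

# Condensed, hypothetical version of the markup above
html = '<div id="moreBox"><h3>360识图</h3><a id="btnLoadMore">加载更多</a><p id="imgTotal">total</p></div>'
soup = BeautifulSoup(html, 'html.parser')

following = [node.name for node in soup.a.next_siblings]      # nodes after <a>
preceding = [node.name for node in soup.a.previous_siblings]  # nodes before <a>
print(following, preceding)  # ['p'] ['h3']
```

With whitespace between the tags, as in the original markup, the results would also contain the text nodes between them.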

  

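The attributes above return generator objects rather than lists. A list and a generator compare as follows:

    l = [x * x for x in range(10)]  # list comprehension
    g = (x * x for x in range(10))  # generator expression
    print(l)
    print(g)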

 

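Output:

    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    <generator object <genexpr> at 0x000000000251C468>

l is a list and g is a generator. The most basic difference when creating them is that a list comprehension uses [ ] while a generator expression uses ( ).

To print the values one at a time, call next() to get the generator's next value.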

 

    g = (x * x for x in range(10))
    for i in range(10):
        print(next(g))

  

Output:

    0
    1
    4
    9
    16
    25
    36
    49
    64
    81
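Calling next() by hand requires knowing how many values there are; one call too many raises StopIteration. A short sketch of the two usual alternatives, looping over the generator directly and passing a default to next():

```python
g = (x * x for x in range(3))

# Iterating directly calls next() behind the scenes and stops at the end
squares = [value for value in g]

# The generator is now exhausted: next(g) would raise StopIteration,
# unless a default value is supplied as the second argument
after_end = next(g, None)
print(squares, after_end)  # [0, 1, 4] None
```

The same applies to the Beautiful Soup generators above, such as .parents and .next_siblings.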

  

 



    # Search the document by tag name, attributes, or text content; returns all matches
    # (the string argument was called text in older versions)
    find_all(name, attrs, recursive, string, **kwargs)


    # Find all <a> nodes
    soup.find_all('a')

    # Find all <a> nodes whose link has the form /view/123.htm
    soup.find_all('a', href='/view/123.htm')
    soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))  # requires: import re

    # Find all <div> nodes with class abc and text python
    soup.find_all('div', class_='abc', string='python')

    Attributes of a found node:
    # Get the tag name of the node
    node.name

    # Get the href attribute of a found <a> node
    node['href']

    # Get the link text of a found <a> node
    node.get_text()


    find(name, attrs, recursive, string, **kwargs)
    Searches by tag name, attributes, or text content in the same way as find_all, but returns only the first match

    find_parents() find_parent()
    find_parents() returns all matching ancestor nodes; find_parent() returns the nearest one (the direct parent when no filter is given)

    find_next_siblings() find_next_sibling()
    find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first one after it

    find_previous_siblings() find_previous_sibling()
    find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the first one before it

    find_all_next() find_next()
    find_all_next() returns all matching nodes after the node; find_next() returns the first one

    find_all_previous() find_previous()
    find_all_previous() returns all matching nodes before the node; find_previous() returns the first one
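To tie a few of these methods together, here is a small runnable sketch. The markup is invented for the example (it reuses the /view/123.htm pattern and the abc class from the list above) and is parsed with the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = '''
<div class="abc">python</div>
<ul>
<li><a href="/view/123.htm">first</a></li>
<li><a href="/view/456.htm">second</a></li>
<li><a href="/about.htm">about</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all with a compiled regular expression as the attribute filter
view_links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
hrefs = [a['href'] for a in view_links]

# find with tag name, class, and text all at once
div = soup.find('div', class_='abc', string='python')

# Navigation-style searches starting from a found node
first_li = soup.find('li')
later_items = first_li.find_next_siblings('li')
parent_name = soup.find('a').find_parent('ul').name

print(hrefs, div.get_text(), len(later_items), parent_name)
```

Note that find() returns None when nothing matches, so real code should check the result before using it.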

  

Test example:

    import bs4
    html_doc='''
    <div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
    data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li><li data-index="5"><a class="nnav-item" href="/pc/home?ch=estate&sign=360_79aabe15" target="_blank" data-ch="estate">房产<span></span></a></li><li data-index="6"><a class="nnav-item" href="/pc/home?ch=car&sign=360_79aabe15" target="_blank" data-ch="car">汽车<span></span></a></li><li data-index="7"><a class="nnav-item" href="/pc/home?ch=sport&sign=360_79aabe15" target="_blank" data-ch="sport">体育<span></span></a></li><li data-index="8"><a class="nnav-item" href="/pc/home?ch=domestic&sign=360_79aabe15" target="_blank" data-ch="domestic">国内
    '''
    # Create the BeautifulSoup object
    soup = bs4.BeautifulSoup(html_doc, 'html.parser')


    # Get all links
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    # Get the link whose href is /pc/home?sign=360_79aabe15
    link_node = soup.find('a', href='/pc/home?sign=360_79aabe15')
    print(link_node.name, link_node['href'], link_node.get_text())

  

Output:

    a /pc/home?sign=360_79aabe15
    a /pc/home?ch=youlike&sign=360_79aabe15 推荐
    a /pc/home?ch=good_safe2toera&sign=360_79aabe15 新时代
    a /pc/home?ch=fun&sign=360_79aabe15 娱乐
    a /pc/home?
    data-index= 财经
    a /pc/home?ch=economy&sign=360_79aabe15 财经
    a /pc/home?ch=estate&sign=360_79aabe15 房产
    a /pc/home?ch=car&sign=360_79aabe15 汽车
    a /pc/home?ch=sport&sign=360_79aabe15 体育
    a /pc/home?ch=domestic&sign=360_79aabe15 国内

    a /pc/home?sign=360_79aabe15

  

 

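Copyright notice: This is an original article by -wenli, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Article link: https://www.cnblogs.com/-wenli/p/9878610.html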