pyquery: a Python library for parsing HTML

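PyQuery is a jQuery-like library for Python, arguably a Python implementation of jQuery. It lets you query and manipulate HTML documents using jQuery syntax. It is easy to use, and parsing is fast because it is built on lxml.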

For example, here is an HTML fragment from the Douban movie page http://movie.douban.com/subject/3530403/:

<div id="info">
    <span><span class='pl'>导演</span>: <a href="/celebrity/1047989/" rel="v:directedBy">汤姆·提克威</a> / <a href="/celebrity/1161012/" rel="v:directedBy">拉娜·沃卓斯基</a> / <a href="/celebrity/1013899/" rel="v:directedBy">安迪·沃卓斯基</a></span><br/>
    <span><span class='pl'>编剧</span>: <a href="/celebrity/1047989/">汤姆·提克威</a> / <a href="/celebrity/1013899/">安迪·沃卓斯基</a> / <a href="/celebrity/1161012/">拉娜·沃卓斯基</a></span><br/>
    <span><span class='pl'>主演</span>: <a href="/celebrity/1054450/" rel="v:starring">汤姆·汉克斯</a> / <a href="/celebrity/1054415/" rel="v:starring">哈莉·贝瑞</a> / <a href="/celebrity/1019049/" rel="v:starring">吉姆·布劳德本特</a> / <a href="/celebrity/1040994/" rel="v:starring">雨果·维文</a> / <a href="/celebrity/1053559/" rel="v:starring">吉姆·斯特吉斯</a> / <a href="/celebrity/1057004/" rel="v:starring">裴斗娜</a> / <a href="/celebrity/1025149/" rel="v:starring">本·卫肖</a> / <a href="/celebrity/1049713/" rel="v:starring">詹姆斯·达西</a> / <a href="/celebrity/1027798/" rel="v:starring">周迅</a> / <a href="/celebrity/1019012/" rel="v:starring">凯斯·大卫</a> / <a href="/celebrity/1201851/" rel="v:starring">大卫·吉雅西</a> / <a href="/celebrity/1054392/" rel="v:starring">苏珊·萨兰登</a> / <a href="/celebrity/1003493/" rel="v:starring">休·格兰特</a></span><br/>
    <span class="pl">类型:</span> <span property="v:genre">剧情</span> / <span property="v:genre">科幻</span> / <span property="v:genre">悬疑</span><br/>
    <span class="pl">官方网站:</span> <a href="http://cloudatlas.warnerbros.com" rel="nofollow" target="_blank">cloudatlas.warnerbros.com</a><br/>
    <span class="pl">制片国家/地区:</span> 德国 / 美国 / 香港 / 新加坡<br/>
    <span class="pl">语言:</span> 英语<br/>
    <span class="pl">上映日期:</span> <span property="v:initialReleaseDate" content="2013-01-31(中国大陆)">2013-01-31(中国大陆)</span> / <span property="v:initialReleaseDate" content="2012-10-26(美国)">2012-10-26(美国)</span><br/>
    <span class="pl">片长:</span> <span property="v:runtime" content="134">134分钟(中国大陆)</span> / 172分钟(美国)<br/>

    <span class="pl">IMDb链接:</span> <a href="http://www.imdb.com/title/tt1371111" target="_blank" rel="nofollow">tt1371111</a><br>

    <span class="pl">官方小站:</span>
    <a href="http://site.douban.com/202494/" target="_blank">电影《云图》</a>
</div>


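from pyquery import PyQuery as pq

# Fetch the page over HTTP and parse it
doc = pq(url='http://movie.douban.com/subject/3530403/')
# Select every element whose class is "pl"
data = doc('.pl')
for i in data:
    # Iteration yields raw lxml elements, so wrap each one in pq() again
    print(pq(i).text())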

Output:

导演
编剧
主演
类型:
官方网站:
制片国家/地区:
语言:
上映日期:
片长:
IMDb链接:
官方小站:

It feels a lot like jQuery, doesn't it?

Usage

You can use the PyQuery class to load an XML or HTML document from a string, an lxml object, a file, or a URL:

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> doc=pq("<html></html>")
>>> doc=pq(etree.fromstring("<html></html>"))
>>> doc=pq(filename=path_to_html_file)
>>> doc=pq(url='http://movie.douban.com/subject/3530403/')

Now you can select elements just as you would in jQuery:

>>> doc('.pl')
[<span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span#rateword.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <p.pl>]

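This selects every element whose class is 'pl'.

Note, however, that iterating over a selection yields plain lxml elements, so each one has to be wrapped in pq() again before PyQuery methods such as text() can be called on it:

for para in doc('.pl'):
    para=pq(para)
    print(para.text())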
导演
编剧
主演
类型:
官方网站:
制片国家/地区:
语言:
上映日期:
片长:
IMDb链接:
官方小站:

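The text returned here is a unicode string; to write it to a file, encode it to a byte string first (for example with .encode('utf-8')).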

You can also use some of the pseudo-classes that jQuery provides but that are not part of standard CSS, such as :first:

>>> doc('.pl:first')
[<span.pl>]
>>> print(doc('.pl:first').text())
导演

Attributes

Getting an HTML element's attributes:

>>> p=pq('<p id="hello" class="hello"></p>')('p')
>>> p.attr('id')
'hello'
>>> p.attr.id
'hello'
>>> p.attr['id']
'hello'

Setting attributes:

>>> p.attr.id='plop'
>>> p.attr.id
'plop'
>>> p.attr['id']='ola'
>>> p.attr.id
'ola'
>>> p.attr(id='hello',class_='hello2')
[<p#hello.hello2>]

Traversing

Filtering:

>>> d=pq('<p id="hello" class="hello"><a/>hello</p><p id="test"><a/>world</p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]
>>> d('p').filter('#test')
[<p#test>]
>>> d('p').filter(lambda i:i==1)
[<p#test>]
>>> d('p').filter(lambda i:i==0)
[<p#hello.hello>]
>>> d('p').filter(lambda i:pq(this).text()=='hello')
[<p#hello.hello>]

Selecting by index:

>>> d('p').eq(0)
[<p#hello.hello>]
>>> d('p').eq(1)
[<p#test>]

Selecting nested elements:

>>> d('p').eq(1).find('a')
[<a>]

Stepping back to the previous selection with end():

>>> d=pq('<p><span><em>Whoah!</em></span></p><p><em> there</em></p>')
>>> d('p').eq(1).find('em')
[<em>]
>>> d('p').eq(1).find('em').end()
[<p>]
>>> d('p').eq(1).find('em').end().text()
'there'
>>> d('p').eq(1).find('em').end().end()
[<p>, <p>]

Download: http://pypi.python.org/pypi/pyquery

Documentation: http://packages.python.org/pyquery/

Selector summary: http://www.cnblogs.com/onlys/articles/jQuery.html