[python] Writing a crawler -- a simple text crawler

This is the first Python crawler I wrote myself. The script is as follows:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import re

import requests

# Table-of-contents page of the novel to download
url = 'http://www.jingcaiyuedu8.com/novel/BaJoa2/list.html'
# Send an HTTP GET request, as a browser would
response = requests.get(url)
# Encoding used to decode the response body
response.encoding = 'utf-8'
# HTML source of the novel's table-of-contents page
html = response.text
# Title of the novel
title = re.findall(r'<h1>(.*?)</h1>', html)[0]
# Extract the chapter list block, then each chapter's (url, title) pair
dl = re.findall(r'<dl class="panel-body panel-chapterlist">.*?</dl>', html, re.S)[0]
chapter_info_list = re.findall(r'href="/novel/BaJoa2(.*?)">(.*?)<', dl)
# Create a file to hold the novel; the with block closes it on exit
with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
    # Loop over the chapters and download each one
    for chapter_url, chapter_title in chapter_info_list:
        # Drop any leading slash in the captured path so the URL has no "//"
        chapter_url = 'http://www.jingcaiyuedu8.com/novel/BaJoa2/%s' % chapter_url.lstrip('/')
        # Download the chapter page
        chapter_response = requests.get(chapter_url)
        chapter_response.encoding = 'utf-8'
        chapter_html = chapter_response.text
        # Extract the chapter text; findall returns a list, so join it into one string
        chapter_content = ''.join(re.findall(r'<br />&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<p>', chapter_html, re.S))
        # Clean up: turn the paragraph separators into newlines
        chapter_content = chapter_content.replace('\r<br />\r<br />&nbsp;&nbsp;&nbsp;&nbsp;', '\n')
        # Write the chapter title and text to the file
        fb.write(chapter_title)
        fb.write('\n')
        fb.write(chapter_content)
        fb.write('\n' * 2)

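1. Approach to writing the crawler:

　　Decide what to download, find the web page, and locate the needed content in that page. Process the data. Save the data.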

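2. Notes on the key points:

　　1) To identify the information you need from the page, open the page and press F12 to open the developer tools.

The Network tab lists many requests. The text we see on the page is all stored in a single HTML file. Click that file to view its Response, which contains the text.

To locate the information you need, press Ctrl+F and search for it in the response, then check which distinctive strings appear immediately before and after it.

To extract hyperlinks, click the arrow icon at the far left of the developer tools and then click the hyperlink on the page. The Elements tab jumps to the markup for that link, and from it you can decide what to extract. For the novel download, the chapter links and chapter names are extracted from the table-of-contents page.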
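As a small illustration of the link-extraction step, the sketch below runs a similar pattern against a made-up fragment of a table-of-contents page. The markup in `sample_html` and the chapter file names are assumptions for illustration and are not copied from the real site.

```python
import re
from urllib.parse import urljoin

# Hypothetical fragment of a table-of-contents page (not the real site's markup)
sample_html = '''
<dl class="panel-body panel-chapterlist">
<dd><a href="/novel/BaJoa2/1.html">Chapter 1</a></dd>
<dd><a href="/novel/BaJoa2/2.html">Chapter 2</a></dd>
</dl>
'''

base_url = 'http://www.jingcaiyuedu8.com/novel/BaJoa2/list.html'

# Each match is a (relative link, chapter name) tuple because the pattern has two groups
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', sample_html)

# urljoin resolves each relative link against the page it was found on
chapters = [(urljoin(base_url, href), name) for href, name in pairs]

for chapter_url, chapter_title in chapters:
    print(chapter_title, chapter_url)
```

Using `urljoin` rather than string formatting avoids having to reason about leading or doubled slashes when building the chapter URL.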

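　　2) Pay attention to the encoding

Always set response.encoding explicitly. The script sets it to utf-8 for this site, while many Chinese pages use the GBK character set and need gbk instead. If the encoding is not set, or does not match the page, the text comes out garbled.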
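The effect of a mismatched encoding can be seen without any network access. The following is a minimal sketch on a two-character sample string (chosen only for illustration): the same text is encoded as GBK and as UTF-8, and then the GBK bytes are decoded with the wrong codec.

```python
# The same text encoded two different ways
text = '小说'
gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')

# Decoding with the matching codec recovers the text
print(gbk_bytes.decode('gbk'))
print(utf8_bytes.decode('utf-8'))

# Decoding GBK bytes as UTF-8 yields the wrong characters;
# errors='replace' keeps decode() from raising on invalid byte sequences
garbled = gbk_bytes.decode('utf-8', errors='replace')
print(garbled)
```

With requests, `response.encoding` decides how `response.text` is decoded, and `response.apparent_encoding` offers a guess based on the content when the right value is not known in advance.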

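　　3) Regular expression matching

In a raw string r'...', Python does not process backslash escapes, so the pattern can be written as it is. Characters that have a special meaning in a regular expression, such as ( and ), still need a backslash ("\") in front of them to be matched literally.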
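A short sketch of the difference between escaped and unescaped parentheses follows. The sample string is made up for illustration.

```python
import re

line = 'Chapter 1 (revised)'

# Unescaped parentheses form a capture group; they do not match "(" or ")"
as_group = re.findall(r'(revised)', line)
# Escaped parentheses match the literal characters
as_literal = re.findall(r'\(revised\)', line)

print(as_group)
print(as_literal)

# re.escape adds the backslashes when the literal text comes from elsewhere
print(re.escape('(revised)'))
```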

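.*? matches any characters, as few as possible (a non-greedy match). Without (), findall returns the whole match, including the delimiters before and after it. With (), it returns only what the parentheses matched.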
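The behaviour described above can be checked on a small made-up string with two headings:

```python
import re

html = '<h1>First</h1><h1>Second</h1>'

# Greedy .* runs to the last </h1>, swallowing both headings in one match
greedy = re.findall(r'<h1>(.*)</h1>', html)
# Non-greedy .*? stops at the first </h1>, giving one match per heading
lazy = re.findall(r'<h1>(.*?)</h1>', html)
# Without a group, findall returns the whole match, delimiters included
whole = re.findall(r'<h1>.*?</h1>', html)

print(greedy)
print(lazy)
print(whole)
```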

The trailing re.S flag makes . match every character, including newlines.
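A minimal sketch of why the flag matters when the target block spans several lines; the `block` string is a made-up example:

```python
import re

block = '<dl>\n<dd>one</dd>\n<dd>two</dd>\n</dl>'

# By default "." does not match "\n", so a pattern spanning lines finds nothing
without_flag = re.findall(r'<dl>.*?</dl>', block)
# With re.S (an alias of re.DOTALL) "." also matches newlines
with_flag = re.findall(r'<dl>.*?</dl>', block, re.S)

print(without_flag)
print(with_flag)
```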

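　　4) Using replace

replace is a string method, while re.findall returns a list, so replace cannot be called on the result directly. The list has to be converted to a string first. str() does this, but it produces the list's printed form (brackets, quotes and escaped characters such as \r); joining the elements with ''.join() gives the plain text.

The re module's re.sub also performs string replacement, so it can be used instead of replace.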
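The sketch below compares the options on a made-up chapter fragment. The sample text and the re.sub pattern are assumptions for illustration; the separator string is the one used in the script.

```python
import re

# Hypothetical chapter fragment; the separator mimics the one the script removes
chapter_html = ('<br />&nbsp;&nbsp;&nbsp;&nbsp;First paragraph.'
                '\r<br />\r<br />&nbsp;&nbsp;&nbsp;&nbsp;Second paragraph.<p>')

matches = re.findall(r'<br />&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<p>', chapter_html, re.S)

# str() on a list gives its printed form: brackets, quotes and an escaped "\r"
as_repr = str(matches)
# join gives the plain text of the matches
as_text = ''.join(matches)

# str.replace with a fixed separator
with_replace = as_text.replace('\r<br />\r<br />&nbsp;&nbsp;&nbsp;&nbsp;', '\n')
# re.sub with a pattern that tolerates any run of <br /> tags and &nbsp; entities
with_sub = re.sub(r'(?:\s*<br />)+(?:&nbsp;)+', '\n', as_text)

print(as_repr)
print(with_replace)
print(with_sub)
```

The re.sub version is more forgiving if the number of line breaks or spaces between paragraphs varies from chapter to chapter.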

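3. To be continued

　　The website used in the script already has anti-crawling protection: when too many connections come from the same IP, it blocks that IP. This post covers only the most basic crawler, and I will keep updating it.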
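One common way to reduce the number of connections per unit of time is to pause between requests. The following is only a sketch: `fetch_politely`, its parameters and the one-second default are hypothetical names and values, and the post does not establish whether a delay is enough to avoid this site's block.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Demonstration with a stand-in fetch function, so no network access is needed
pages = fetch_politely(['a', 'b'], fetch=lambda u: 'page ' + u, delay_seconds=0.1)
print(pages)
```

In the crawler, `fetch` could be a small function that calls `requests.get(url)` and returns `response.text`. Passing `fetch` and `sleep` as parameters keeps the loop easy to try out without touching the real site.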

Link to this post: https://www.cnblogs.com/godwall/p/12011297.html
