Scraping a Web Novel with a Simple Python Crawler

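This post walks through a simple Python crawler that downloads a web novel.

The example is a novel picked at random from the 17K novel site (17k.com), titled 《千万大奖》.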

The script is built around three functions:

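1. get_download_url() collects the URLs of all chapters of the novel.

Inspecting the HTML source of the novel's catalog page, http://www.17k.com/list/2819620.html, shows that the chapter list is a set of <a> tags inside the <dl> element with class Volume. The function extracts their links into the urls list.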

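2. get_contents(target) fetches the title and body text of a given chapter.

Inspecting the page of the first chapter, http://www.17k.com/chapter/2819620/34988369.html, shows that the body text sits in a <div> with class p and the chapter title in an <h1> tag. After converting line breaks and stripping leftover markup, the function returns the title and the body text. Its argument is one of the URLs collected by the previous function.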

3. writer(name, path, text) appends the chapter title and body text to 千万大奖.txt.
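Step 1 can be tried offline. The sketch below is an illustration under stated assumptions, not the post's code: it swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party packages or network access, and it builds absolute URLs with urljoin instead of string concatenation. The HTML fragment, the ChapterLinkParser class, the extract_chapter_urls helper, and the second chapter path are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical fragment shaped like the catalog page described above.
# Only the first chapter path comes from the post; the rest is invented.
SAMPLE_HTML = """
<dl class="Volume">
  <dt><a href="/volume/1.html">Volume 1</a></dt>
  <dd>
    <a href="/chapter/2819620/34988369.html">Chapter 1</a>
    <a href="/chapter/2819620/34988370.html">Chapter 2</a>
  </dd>
</dl>
<a href="/about.html">About</a>
"""


class ChapterLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <dl class="Volume">."""

    def __init__(self):
        super().__init__()
        self.dl_depth = 0  # greater than 0 while inside a matching <dl>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "dl" and (self.dl_depth or "Volume" in classes):
            self.dl_depth += 1
        elif tag == "a" and self.dl_depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "dl" and self.dl_depth:
            self.dl_depth -= 1


def extract_chapter_urls(html, base="http://www.17k.com"):
    parser = ChapterLinkParser()
    parser.feed(html)
    # Skip the first link, mirroring a[1:] in the post's code.
    return [urljoin(base, href) for href in parser.hrefs[1:]]


chapter_urls = extract_chapter_urls(SAMPLE_HTML)
print(chapter_urls)
```

Links outside the Volume block, such as the About link in the sample, are ignored, which is the same filtering the post gets by searching only inside the Volume element.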

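In principle, this simple crawler can fetch any novel on the site by changing the catalog URL, as long as that novel's pages share the same structure.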
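Pointing the crawler at another book then comes down to building a different catalog URL. A minimal sketch, assuming other books follow the same /list/<id>.html and /chapter/<id>/... patterns as the two URLs quoted above (the post only shows one book, so this is an assumption); catalog_url and book_id_from_url are hypothetical helper names:

```python
import re

SERVER = "http://www.17k.com"


def catalog_url(book_id):
    """Build a catalog URL for a book ID, following the pattern of the example URL."""
    return f"{SERVER}/list/{book_id}.html"


def book_id_from_url(url):
    """Pull the numeric book ID out of a catalog or chapter URL, or return None."""
    match = re.search(r"/(?:list|chapter)/(\d+)", url)
    return match.group(1) if match else None


print(catalog_url("2819620"))
print(book_id_from_url("http://www.17k.com/chapter/2819620/34988369.html"))
```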

from bs4 import BeautifulSoup
import requests

# Catalog page of the novel, and the site root used to build absolute chapter URLs
target = 'http://www.17k.com/list/2819620.html'
server = 'http://www.17k.com'
urls = []


def get_download_url():
    """Collect the URL of every chapter on the catalog page into urls."""
    req = requests.get(url=target)
    html = req.text
    div_bf = BeautifulSoup(html, 'lxml')
    # The chapter list sits in <dl class="Volume">
    div = div_bf.find_all('dl', class_='Volume')
    a = div[0].find_all('a')
    # The first <a> is skipped; it is presumably not a chapter link
    for each in a[1:]:
        urls.append(server + each.get('href'))


def get_contents(target):
    """Return (title, text) for the chapter page at target."""
    req = requests.get(url=target)
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    # The chapter title is the <h1> inside <div class="readAreaBox content">
    box = bf.find_all('div', class_='readAreaBox content')
    title = box[0].find_all('h1')
    title = str(title[0]).replace('<h1>', '').replace('</h1>', '')
    title = title.replace(' ', '').replace('\n', '')
    # The chapter body is in <div class="p">; turn <br/> into line breaks
    texts = bf.find_all('div', class_='p')
    texts = str(texts).replace('<br/>', '\n')
    # Cut everything from the site's footer line onward
    texts = texts[:texts.index('本书首发来自17K小说网，第一时间看正版内容！')]
    texts = texts.replace(' 　　', '')
    # Strip the leading '[<div class="p">' left over from str() of the result list
    return title, texts[len('[<div class="p">'):]


def writer(name, path, text):
    """Append the chapter title and text to the file at path."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.write(text)
        f.write('\n')


get_download_url()
for i, url in enumerate(urls, start=1):
    title, content = get_contents(url)
    writer(title, '千万大奖.txt', content)
    # Print progress as a whole-number percentage
    print(str(int(i / len(urls) * 100)) + '%')
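The string handling inside get_contents can be checked without touching the site. The sketch below isolates it as a pure function. clean_chapter_text and the sample string are hypothetical, the indentation is assumed to be one space followed by two ideographic spaces (U+3000), and str.find replaces str.index so that a page without the footer line does not raise ValueError.

```python
FOOTER = "本书首发来自17K小说网，第一时间看正版内容！"
PREFIX = '[<div class="p">'
INDENT = " \u3000\u3000"  # assumed: one space plus two ideographic spaces


def clean_chapter_text(raw):
    """Apply the cleanup steps of get_contents to str(find_all(...)) output."""
    text = raw.replace("<br/>", "\n")
    cut = text.find(FOOTER)  # -1 when the footer line is absent
    if cut != -1:
        text = text[:cut]
    text = text.replace(INDENT, "")
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return text


# Made-up input shaped like str() of a one-element find_all result.
sample = (
    PREFIX + INDENT + "First line<br/>"
    + INDENT + "Second line<br/>" + FOOTER + "</div>]"
)
cleaned = clean_chapter_text(sample)
print(cleaned)
```

Checking for the prefix before slicing it off keeps the function from silently eating the first characters of the text if the page markup ever differs.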

Original post: https://www.cnblogs.com/babihuang/p/9084044.html
