[Python]: Scraping job postings from a recruitment site with a crawler script
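Method:
1. The results for one job category are spread over several pages, so first collect the link of each results page (s_url in the code);
2. On each page_x results page, collect the links to the 15 individual job postings (list_url in the code);
3. Open each posting link and extract the fields of interest, for example title, content, and salary;
4. Save the extracted information to a CSV text file.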
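Step 4 can be sketched with Python's standard csv module. This is a minimal sketch rather than the author's code: the helper name save_rows and the (title, content, salary) row layout are assumptions made for illustration.

```python
import csv


def save_rows(path, rows):
    """Append (title, content, salary) rows to a CSV file (hypothetical helper)."""
    # newline="" lets the csv module manage line endings itself, so fields
    # that contain line breaks are quoted instead of splitting the row.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(rows)
```

For example, calling save_rows("jobs.csv", [(title, content, salary)]) once per posting would append one row per posting; the file name here is only a placeholder.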
Code:
from lxml import etree
import requests
import time

# URL of the listing page to crawl
url = "https://www.lagou.com/zhaopin/Java/?labelWords=label"
# Set a request header that imitates a real browser; this avoids some anti-crawler checks
head = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}

res = requests.get(url, headers=head).content.decode("utf-8")
re = etree.HTML(res)
# Get the pagination links on this page
s_url = re.xpath("//div[@class='pager_container']/a[position()>2 and position()<7]/@href")
print('s_url=', s_url)

# Loop over page1, page2, and so on
for x in s_url:
    res = requests.get(x, headers=head).content.decode("utf-8")
    re = etree.HTML(res)
    print('x==', x)
    # Get the links of the job postings on the current page.
    # XPath positions start at 1, so position()<=15 selects all 15 postings
    # (the earlier "position()>=0 and position()<15" selected only 14).
    list_url = re.xpath("//div[@class='s_position_list ']/ul/li[position()<=15]/div/div[1]/div/a/@href")
    print('list_url=', list_url)

    # Loop over each posting and extract its title, content, and salary
    for y in list_url:
        r01 = requests.get(y, headers=head).content.decode("utf-8")
        html01 = etree.HTML(r01)
        print('y==', y)
        title = html01.xpath("string(//div[@class='job-name'])")
        print('title===', title)
        content = html01.xpath("string(//div[@class='job-detail'])")
        print('content===', content)
        salary = html01.xpath("string(/html/body/div[5]/div/div[1]/dd/h3/span[1])")
        print('salary===', salary)
        # Sleep so the site is less likely to identify the crawler; a random interval is better
        time.sleep(5)
        # Save the scraped information
        with open("cn-blog.csv", "a+", encoding="utf-8") as file:
            file.write(title + "\n")
            file.write(content + "\n")
            file.write(salary + "\n")
            file.write("*" * 50 + "\n")
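The comment in the script notes that a random sleep is preferable to a fixed 5 seconds. A minimal sketch of that idea using only the standard library follows; the helper name polite_sleep and the 3 to 7 second bounds are assumptions, not values from the original script.

```python
import random
import time


def polite_sleep(low=3.0, high=7.0, sleep=time.sleep):
    """Pause for a random duration between low and high seconds.

    The sleep argument is injectable so the helper can be exercised
    without actually waiting; by default it calls time.sleep.
    """
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

In the script, the call time.sleep(5) could be swapped for polite_sleep() so that requests are not spaced at a perfectly regular interval.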
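Summary:
1. Set the head (request header) and a sleep interval so the site is less likely to identify the crawler (the site will still block some requests, but most of the data can be fetched);
2. To select a range of sibling elements under the same parent with XPath, use a position predicate of the form [position()>x and position()<y].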
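The position predicate from point 2 can be tried on a small, made-up HTML fragment. This is a sketch under stated assumptions: the markup and the helper name hrefs_between are invented for illustration and are not taken from the real site. Note that XPath positions start at 1 and that both comparisons are strict.

```python
from lxml import etree

# Made-up fragment: five links under one container (not the real site's markup)
html = """
<div class="pager_container">
  <a href="/p1">1</a>
  <a href="/p2">2</a>
  <a href="/p3">3</a>
  <a href="/p4">4</a>
  <a href="/p5">5</a>
</div>
"""

tree = etree.HTML(html)


def hrefs_between(tree, x, y):
    """Return hrefs of <a> children whose 1-based position is strictly between x and y."""
    query = (
        "//div[@class='pager_container']"
        "/a[position()>%d and position()<%d]/@href" % (x, y)
    )
    return [str(h) for h in tree.xpath(query)]


print(hrefs_between(tree, 1, 5))  # ['/p2', '/p3', '/p4']
```

Because positions are 1-based, a lower bound such as position()>=0 never excludes anything, and an upper bound of position()<15 stops at the 14th element.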