Python Crawler Study Notes: Getting Around Douban's Anti-Crawler Measures

rockwall 2021-09-01

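Once you start testing a crawler, you will find that your IP keeps getting banned, most likely because it sends too many requests per unit of time. The simplest fix is to lower the request rate, but what if you don't want to do that? After some searching, the simplest alternative turned out to be rotating through proxy IPs. I found a few approaches and some free proxy IPs online, tried them, and it worked. The proxy list I used is http://www.xicidaili.com/nn/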

The code for getting the proxies is as follows:

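import requests
from bs4 import BeautifulSoup

# Browser-like request headers (same user-agent as in the test script)
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
httpResults, httpsResults = [], []  # collect every proxy instead of overwriting one variable

for page in range(1, 5):  # pages 1 to 4 of the proxy list
    IPurl = 'http://www.xicidaili.com/nn/%s' % page
    rIP = requests.get(IPurl, headers=headers)
    IPContent = rIP.text
    soupIP = BeautifulSoup(IPContent, "html5lib")
    trs = soupIP.find_all('tr')
    for tr in trs[1:]:  # skip the table header row
        tds = tr.find_all('td')
        # Column positions depend on the layout of the proxy table
        ip = tds[2].text.strip()
        port = tds[3].text.strip()
        protocol = tds[6].text.strip()
        if protocol == 'HTTP':
            httpResults.append('http://' + ip + ':' + port)
        elif protocol == 'HTTPS':
            httpsResults.append('https://' + ip + ':' + port)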
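
The free list contains duplicates and entries that are neither HTTP nor HTTPS, so it may help to clean the rows before using them. Below is a minimal sketch of that step, separate from the scraping itself. The helper name `group_proxies` and the sample rows (which use documentation-only addresses) are my own assumptions, and the URL format simply mirrors the strings built in the loop above.

```python
from collections import defaultdict


def group_proxies(rows):
    """Group (ip, port, protocol) tuples into {"http": [...], "https": [...]}."""
    pools = defaultdict(list)
    for ip, port, protocol in rows:
        scheme = protocol.strip().lower()
        if scheme not in ("http", "https"):
            continue  # skip SOCKS or unknown entries
        if not port.strip().isdigit():
            continue  # skip rows with a malformed port
        url = "%s://%s:%s" % (scheme, ip.strip(), port.strip())
        if url not in pools[scheme]:  # drop duplicates, keep first-seen order
            pools[scheme].append(url)
    return dict(pools)


# Hypothetical sample rows, not real proxies
sample_rows = [
    ("192.0.2.10", "8080", "HTTP"),
    ("192.0.2.11", "3128", "HTTPS"),
    ("192.0.2.10", "8080", "HTTP"),      # duplicate of the first row
    ("192.0.2.12", "1080", "socks4/5"),  # unsupported protocol
]
pools = group_proxies(sample_rows)
print(pools)
```

In the scraping loop, the `ip`, `port` and `protocol` values read from each row could be appended to a list of tuples and passed to this helper once all pages are fetched.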

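Requests lets you pass proxies directly on each request, so I built the results in the same format that the proxies argument uses. The requests documentation gives the following format for adding proxies: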

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
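
To actually rotate, a different proxy has to be picked for each request. Here is a rough sketch of one way to do that, assuming the scraped proxies are already held in two lists. The function name `choose_proxies`, the `rng` parameter and the sample pools (documentation-only addresses) are assumptions of mine, not part of requests.

```python
import random


def choose_proxies(http_pool, https_pool, rng=random):
    """Pick one proxy per scheme and return a dict in the format requests expects."""
    proxies = {}
    if http_pool:
        proxies["http"] = rng.choice(http_pool)
    if https_pool:
        proxies["https"] = rng.choice(https_pool)
    return proxies


# Hypothetical pools, not real proxies
http_pool = ["http://192.0.2.10:8080", "http://192.0.2.11:3128"]
https_pool = ["http://192.0.2.20:8123"]

# A seeded generator makes the choice repeatable while testing
proxies = choose_proxies(http_pool, https_pool, rng=random.Random(0))
print(proxies)
```

Calling `choose_proxies(http_pool, https_pool)` before every request and passing the result as `proxies=` to `requests.get` spreads the requests across the pool. An empty pool simply leaves that scheme out of the dict, so those requests go out without a proxy.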

To test, you can use http://www.ip.cn to check which IP address the request appears to come from. The code is as follows:

import re
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
}
proxies = {
    "http": "http://122.193.14.102:80",
    "https": "http://120.203.18.33:8123",
}
r = requests.get('http://www.ip.cn', headers=headers, proxies=proxies)
content = r.text
# The page shows the visitor's IP between <code> and </code>
ip = re.search(r'code.(.*?)..code', content)
if ip:
    print(ip.group(1))
else:
    print('No IP found in the response')
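
The check above can also be wrapped in small helpers that run on saved HTML, which makes it possible to try the regex without a live proxy. This is only a sketch: the names `extract_ip` and `proxy_in_use` and the sample HTML are made up for illustration, and it assumes the page wraps the address in `<code>` tags, as the regex in the script implies. A proxy that forwards through another address would report a different IP, so a mismatch does not always mean the proxy was ignored.

```python
import re
from urllib.parse import urlsplit

CODE_TAG = re.compile(r"<code>(.*?)</code>")


def extract_ip(html):
    """Return the text of the first <code>...</code> element, or None."""
    match = CODE_TAG.search(html)
    return match.group(1) if match else None


def proxy_in_use(html, proxy_url):
    """True if the IP reported by the page equals the proxy's host."""
    return extract_ip(html) == urlsplit(proxy_url).hostname


# Hypothetical response body with a documentation-only address
sample = "<p>Your IP: <code>192.0.2.10</code></p>"
print(extract_ip(sample))
print(proxy_in_use(sample, "http://192.0.2.10:8080"))
```

In the script, `content` would take the place of `sample`, and `proxies["http"]` would be the proxy URL to compare against.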