Recap: I recently built a simple network-disk search program in PHP and hooked it up to the WeChat official-account platform under the name 网盘小说. A user sends a keyword to the account and it replies with the matching network-disk download links. It is a simple feature, much like the many network-disk search sites out there; both the crawler and the search program are written in PHP, and the full-text and word-segmentation search uses the open-source software xunsearch.

Real-world live example: 搜盘子, a network-disk movie resource site.

In the previous post ([PHP] 网盘搜索引擎-采集爬取百度网盘分享文件实现网盘搜索) I focused on how to collect a large number of Baidu network-disk users; this post covers how to fetch a given user's share list. The principle is the same: find the Baidu API that returns the share list, then loop over it.

 

Finding the share-list API

Open any user's share page and click the pagination links at the bottom; in the browser's network panel you can see the request being issued. That request is the API for fetching the share list.

The full request URL looks like this: https://pan.baidu.com/pcloud/feed/getsharelist?t=1493892795526&category=0&auth_type=1&request_location=share_home&start=60&limit=60&query_uk=4162539356&channel=chunlei&clienttype=0&web=1&logid=MTQ5Mzg5Mjc5NTUyNzAuOTEwNDc2NTU1NTgyMTM1OQ==&bdstoken=bc329b0677cad94231e973953a09b46f

 

Calling the API to fetch data

Request this API with PHP's cURL and see whether it returns data. Testing showed the response was {"errno":2,"request_id":1775381927}, with no data. That is because Baidu checks the Referer header; after setting Referer to http://www.baidu.com, the data came back. The parameters can also be simplified to https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=60&limit=60&query_uk=4162539356

The test code:

    <?php
    /*
     * Fetch the share list
     */
    class TextsSpider{
        /**
         * Send an HTTP request
         */
        public function sendRequest($url, $data = null, $header = null){
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
            curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, FALSE);
            if (!empty($data)){
                curl_setopt($curl, CURLOPT_POST, 1);
                curl_setopt($curl, CURLOPT_POSTFIELDS, $data);
            }
            if (!empty($header)){
                curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
            }
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
            $output = curl_exec($curl);
            curl_close($curl);
            return $output;
        }
    }
    $textsSpider = new TextsSpider();
    $header = array(
        'Referer:http://www.baidu.com'
    );
    $str = $textsSpider->sendRequest("https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=60&limit=60&query_uk=4162539356", null, $header);
    echo $str;

The share list comes back as JSON like this:

    {
        "errno": 0,
        "request_id": 1985680203,
        "total_count": 1025,
        "records": [
            {
                "feed_type": "share",
                "category": 6,
                "public": "1",
                "shareid": "98963537",
                "data_id": "1799945104803474515",
                "title": "《通灵少女》2017.同步台视(完结待删)",
                "third": 0,
                "clienttype": 0,
                "filecount": 1,
                "uk": 4162539356,
                "username": "a20****3762",
                "feed_time": 1493626027308,
                "desc": "",
                "avatar_url": "https://ss0.bdstatic.com/7Ls0a8Sm1A5BphGlnYG/sys/portrait/item/01f8831f.jpg",
                "dir_cnt": 1,
                "filelist": [
                    {
                        "server_filename": "《通灵少女》2017.同步台视(完结待删)",
                        "category": 6,
                        "isdir": 1,
                        "size": 1024,
                        "fs_id": 98994643773159,
                        "path": "%2F%E3%80%8A%E9%80%9A%E7%81%B5%E5%B0%91%E5%A5%B3%E3%80%8B2017.%E5%90%8C%E6%AD%A5%E5%8F%B0%E8%A7%86%EF%BC%88%E5%AE%8C%E7%BB%93%E5%BE%85%E5%88%A0%EF%BC%89",
                        "md5": "0",
                        "sign": "86de8a14f72e6e3798d525c689c0e4575b1a7728",
                        "time_stamp": 1493895381
                    }
                ],
                "source_uid": "528742401",
                "source_id": "98963537",
                "shorturl": "1pKPCF0J",
                "vCnt": 356,
                "dCnt": 29,
                "tCnt": 184
            },
            {
                "source_uid": "528742401",
                "source_id": "152434783",
                "shorturl": "1qYdhFkC",
                "vCnt": 1022,
                "dCnt": 29,
                "tCnt": 345
            }
        ]
    }

As with the previous post, a comprehensive search site could keep every useful field. I'm building the simplest possible version, so I only keep the title and the shareid.

The download page URL for each shared file has this form: http://pan.baidu.com/share/link?shareid={$shareId}&uk={$uk}. The user number (uk) and the share id are all you need to build the download URL, as the sketch below shows.
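As a minimal sketch of that step (the parseShares() helper name is mine, not from the original code, and it assumes exactly the JSON structure shown above), the records can be decoded into (title, url) pairs like this:

    <?php
    /*
     * Minimal sketch: decode the getsharelist JSON and keep only the
     * title plus a download URL built from shareid and uk.
     * parseShares() is an illustrative name, not part of the original code.
     */
    function parseShares($json){
        $shares = array();
        $arr = json_decode($json, true);
        // errno 0 means success; anything else carries no records
        if (empty($arr) || $arr['errno'] !== 0 || empty($arr['records'])) {
            return $shares;
        }
        foreach ($arr['records'] as $record) {
            $shares[] = array(
                'title' => $record['title'],
                'url'   => "http://pan.baidu.com/share/link?shareid={$record['shareid']}&uk={$record['uk']}"
            );
        }
        return $shares;
    }

Calling parseShares($str) on the response from the earlier test code would yield rows ready for the texts table described below.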

 

Generating the paginated API URLs

Assume a user has shared at most 30,000 items; at 60 per page that is 500 pages, and the URLs can be generated like this:

    <?php
    /*
     * Fetch the share list
     */
    class TextsSpider{
        private $pages = 500; // number of pages
        private $start = 60;  // items per page
        /**
         * Generate the paginated API URLs
         */
        public function makeUrl($rootUk){
            $urls = array();
            for ($i = 0; $i < $this->pages; $i++) {
                $start = $this->start * $i;
                $url = "https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start={$start}&limit={$this->start}&query_uk={$rootUk}";
                $urls[] = $url;
            }
            return $urls;
        }
    }
    $textsSpider = new TextsSpider();
    $urls = $textsSpider->makeUrl(4162539356);
    print_r($urls);

The generated page URLs look like this (first ten shown):

    Array
    (
        [0] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=0&limit=60&query_uk=4162539356
        [1] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=60&limit=60&query_uk=4162539356
        [2] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=120&limit=60&query_uk=4162539356
        [3] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=180&limit=60&query_uk=4162539356
        [4] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=240&limit=60&query_uk=4162539356
        [5] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=300&limit=60&query_uk=4162539356
        [6] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=360&limit=60&query_uk=4162539356
        [7] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=420&limit=60&query_uk=4162539356
        [8] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=480&limit=60&query_uk=4162539356
        [9] => https://pan.baidu.com/pcloud/feed/getsharelist?&auth_type=1&request_location=share_home&start=540&limit=60&query_uk=4162539356
        ...
    )

 

Data table structure

    CREATE TABLE `texts` (
      `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `title` varchar(255) NOT NULL DEFAULT '',
      `url` varchar(255) NOT NULL DEFAULT '',
      `time` int(10) unsigned NOT NULL DEFAULT '0',
      PRIMARY KEY (`id`),
      KEY `title` (`title`(250))
    ) ENGINE=MyISAM

 

When looping over the requests, be sure to wait a while between each one, to avoid being blocked; a sketch of the full crawl loop follows.
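Putting the pieces together, here is a hedged sketch of the full crawl for one user. It assumes sendRequest() and makeUrl() live on the same TextsSpider class (as the two listings above suggest); everything else is illustrative glue code, not the original implementation:

    <?php
    /*
     * Sketch of the full share crawl for one user, assuming TextsSpider
     * combines the sendRequest() and makeUrl() methods shown above.
     */
    $db = new PDO("mysql:host=localhost;dbname=pan", "root", "root");
    $db->query('set names utf8');

    $spider = new TextsSpider();
    $header = array('Referer:http://www.baidu.com');
    $insert = $db->prepare("insert into texts (title, url, time) values (?, ?, ?)");

    foreach ($spider->makeUrl(4162539356) as $url) {
        // Random 7-11 second pause between requests to avoid being blocked
        sleep(rand(7, 11));
        $arr = json_decode($spider->sendRequest($url, null, $header), true);
        // Stop once a page returns an error or no more records
        if (empty($arr) || $arr['errno'] !== 0 || empty($arr['records'])) {
            break;
        }
        foreach ($arr['records'] as $record) {
            $downloadUrl = "http://pan.baidu.com/share/link?shareid={$record['shareid']}&uk={$record['uk']}";
            $insert->execute(array($record['title'], $downloadUrl, time()));
        }
    }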

The next post will cover xunsearch word segmentation and full-text search, along with the complete code for this part.

For a demo, follow the WeChat official account 网盘小说.

The complete code from the previous post, which loops through uk values and stores them in the database:

    <?php
    /*
     * Fetch followed users (subscriptions)
     */
    class UkSpider{
        private $pages;      // number of pages
        private $start = 24; // items per page
        private $db = null;  // database handle
        public function __construct($pages = 100){
            $this->pages = $pages;
            $this->db = new PDO("mysql:host=localhost;dbname=pan", "root", "root");
            $this->db->query('set names utf8');
        }
        /**
         * Generate the paginated API URLs
         */
        public function makeUrl($rootUk){
            $urls = array();
            for ($i = 0; $i <= $this->pages; $i++) {
                $start = $this->start * $i;
                $url = "https://pan.baidu.com/pcloud/friend/getfollowlist?query_uk={$rootUk}&limit={$this->start}&start={$start}";
                $urls[] = $url;
            }
            return $urls;
        }
        /**
         * Fetch the followed user ids from one page URL
         */
        public function getFollowsByUrl($url){
            $result = $this->sendRequest($url);
            $arr = json_decode($result, true);
            if (empty($arr) || !isset($arr['follow_list'])) {
                return;
            }
            $ret = array();
            foreach ($arr['follow_list'] as $fan) {
                $ret[] = $fan['follow_uk'];
            }
            return $ret;
        }
        /**
         * Send an HTTP request
         */
        public function sendRequest($url, $data = null, $header = null){
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
            curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, FALSE);
            if (!empty($data)){
                curl_setopt($curl, CURLOPT_POST, 1);
                curl_setopt($curl, CURLOPT_POSTFIELDS, $data);
            }
            if (!empty($header)){
                curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
            }
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
            $output = curl_exec($curl);
            curl_close($curl);
            return $output;
        }
        /*
         * Store the fetched uks in the database
         */
        public function addUks($uks){
            foreach ($uks as $uk) {
                $this->db->prepare("insert into uks (uk) values (?)")->execute(array($uk));
            }
        }
        /*
         * Fetch all of one user's follows and store them
         */
        public function sleepGetByUk($uk){
            $urls = $this->makeUrl($uk);
            //$this->updateUkFollow($uk);
            // Loop over the page URLs
            foreach ($urls as $url) {
                echo "loading:" . $url . "\r\n";
                // Sleep a random 7 to 11 seconds
                $second = rand(7, 11);
                echo "sleep...{$second}s\r\n";
                sleep($second);
                // Make the request
                $followList = $this->getFollowsByUrl($url);
                // Stop requesting once there is no more data
                if (empty($followList)) {
                    break;
                }
                $this->addUks($followList);
            }
        }
        /* Fetch uks with get_follow=0 from the database */
        public function getUksFromDb(){
            $sth = $this->db->prepare("select * from uks where get_follow=0");
            $sth->execute();
            $uks = $sth->fetchAll(PDO::FETCH_ASSOC);
            $result = array();
            foreach ($uks as $key => $uk) {
                $result[] = $uk['uk'];
            }
            return $result;
        }
        /* Mark a uk whose follow list has already been fetched */
        public function updateUkFollow($uk){
            $this->db->prepare("UPDATE uks SET get_follow=1 where uk=?")->execute(array($uk));
        }
    }
    $ukSpider = new UkSpider();
    $uks = $ukSpider->getUksFromDb();
    foreach ($uks as $uk) {
        $ukSpider->sleepGetByUk($uk);
    }

 

Copyright notice: this is an original article by taoshihan, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://www.cnblogs.com/taoshihan/p/6808575.html