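Approach to crawling WeChat official account articles

Use Fiddler to intercept the request that the WeChat desktop client sends for an official account's history article list

Construct the same request in code to obtain the links to the article detail pages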

 

Fiddler download link: https://www.telerik.com/download/fiddler

 

Java scraping code:

    package com.mybatis.plus.utils;

    import cn.hutool.core.lang.Console;
    import cn.hutool.http.HttpUtil;
    import com.alibaba.fastjson.JSONArray;
    import com.alibaba.fastjson.JSONObject;
    import org.apache.commons.lang3.StringUtils;

    public class test1 {
        public static void main(String[] args) {
            String biz = "MzIxOTc0MzQzMg==";
            String uin = "MjYyMjc0NzA0Mw%3D%3D";
            // These values come from my own WeChat account; only key changes dynamically
            String key = "af4c4474d22b7e510ffb9a836217934142013a224ea9f909f2762f52f45f5062868454601519f489fc119061433823dc131c7ab44c519fd0bba40b284935ff53a721808cb1eb9e04027e28eb2ce97e327ca2dac3a97cbb9313f032850d66335fc3b0223c26ebb15a26f99879aa9643745069960c3bb252c253419feefe9e86e8";
            int offset = 0;
            int count = 10;
            int countFlag = 1;
            while (true) {
                // Request one page of the history article list
                String response = HttpUtil.get("https://mp.weixin.qq.com/mp/profile_ext?" +
                        "action=getmsg&" +
                        "__biz=" + biz +
                        "&f=json&offset=" + offset +
                        "&count=" + count +
                        "&uin=" + uin +
                        "&key=" + key);
                offset = offset + count;
                // Parse the response body
                JSONObject jsonObject = JSONObject.parseObject(response);
                String general_msg_list = jsonObject.getString("general_msg_list");
                if (StringUtils.isEmpty(general_msg_list)) {
                    Console.log("==== Access key has expired, stopping the crawl ====");
                    break;
                }
                // general_msg_list is itself a JSON string, so it is parsed a second time
                JSONArray list = JSONObject.parseObject(general_msg_list).getJSONArray("list");
                if (list == null || list.isEmpty()) {
                    Console.log("==== This official account has no more articles, stopping the crawl ====");
                    break;
                }
                for (int i = 0; i < list.size(); i++) {
                    JSONObject app_msg_ext_info = list.getJSONObject(i).getJSONObject("app_msg_ext_info");
                    if (app_msg_ext_info == null) {
                        // Skip entries that carry no article info to avoid a NullPointerException
                        continue;
                    }
                    String title = app_msg_ext_info.getString("title");
                    String content_url = app_msg_ext_info.getString("content_url");
                    Console.log("Fetched item #" + countFlag++);
                    Console.log("Article title: " + title);
                    Console.log("Article url: " + content_url);
                    // MyUtil is the author's own helper class; its source is not included in this post
                    MyUtil.saveAsFileWriter("Article title: " + title + "\n" + "Article url: " + content_url + "\n");
                }
            }
        }
    }
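
The listing calls MyUtil.saveAsFileWriter, but the post does not include that helper. Below is a minimal sketch of what such a helper might look like, assuming it simply appends each record to a text file. The output file name articles.txt and the two-argument overload are assumptions made for illustration; they are not taken from the original code.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the author's MyUtil helper (not shown in the post).
final class MyUtil {
    // Assumed output location; change as needed.
    private static final String OUTPUT_FILE = "articles.txt";

    // Same call shape as in the listing: append content to the default file.
    static void saveAsFileWriter(String content) {
        saveAsFileWriter(OUTPUT_FILE, content);
    }

    // Overload with an explicit path, added here so the sketch is easy to try out.
    static void saveAsFileWriter(String path, String content) {
        // The second constructor argument (true) opens the file in append mode,
        // so records from earlier pages are kept.
        try (FileWriter writer = new FileWriter(path, true)) {
            writer.write(content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Opening the file in append mode on every call is simple but slow for large crawls; keeping one writer open for the whole run would be a reasonable alternative.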


Scraping result: (screenshot)

 

Drawback of this crawler: the key expires quickly, so a fresh key has to be copied from the request captured in Fiddler each time.
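
Since the key has to be refreshed by hand, it can help to paste the whole captured URL into a small parser instead of copying each parameter separately. The sketch below assumes the request captured in Fiddler carries __biz, uin and key as query parameters, as the request built in the code above does. The class name CapturedUrlParams and the sample URL are hypothetical. Values are kept exactly as captured (not URL-decoded) so they can be concatenated back into the request unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits a captured URL into its raw query parameters.
final class CapturedUrlParams {
    static Map<String, String> parseQuery(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        String query = url.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop the fragment, if any
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            // Split on the first '=' only, because values such as __biz end in "=="
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        // Placeholder URL; paste the one copied from Fiddler here
        String captured = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg"
                + "&__biz=EXAMPLE_BIZ==&uin=EXAMPLE_UIN%3D%3D&key=EXAMPLE_KEY";
        Map<String, String> params = parseQuery(captured);
        System.out.println("biz = " + params.get("__biz"));
        System.out.println("uin = " + params.get("uin"));
        System.out.println("key = " + params.get("key"));
    }
}
```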


 

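Copyright notice: this is an original article by guanxiaohe, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://www.cnblogs.com/guanxiaohe/p/14330235.html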