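Project: LeetCodeCrawler

Overview

Most of us practice on LeetCode to some degree and keep the solutions on GitHub, and I am no exception: my solutions live in my Algorithm repository. Re-editing README.md by hand after every solved problem is tedious, so I wrote a tool, LeetCodeCrawler. It crawls LeetCode problem statements and your accepted (AC) submissions, and can generate a matching README.md to tidy up the README of your LeetCode repository.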

Usage

Download LeetCodeCrawler.jar to your machine.

Create a config.json file like the one below (you can simply edit the config.json in the repo). It must sit in the same directory as LeetCodeCrawler.jar:

{
    "username": "leetcode@leetcode",
    "password": "leetcode",
    "language": ["cpp", "java"],
    "outputDir": "."
}
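  • username and password are your LeetCode account name and password
  • language lists the programming languages you use on LeetCode. Multiple values are allowed; the accepted values are listed below and must be written exactly as shown: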

    • cpp
    • java
    • c
    • csharp
    • javascript
    • python
    • python3
    • ruby
    • swift
    • golang
    • scala
    • kotlin
  • outputDir is the directory where the source files should be saved; the default is ., the current directory

Run java -jar LeetCodeCrawler.jar
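
Because the language values must match the list above exactly, a quick check before running the tool can catch typos. The sketch below is not part of LeetCodeCrawler; the class and method names are hypothetical, and it only compares the configured values against the list given in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: checks the "language" list from config.json. */
class LanguageCheck {
    // The accepted values, copied from the list above.
    static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
            "cpp", "java", "c", "csharp", "javascript", "python",
            "python3", "ruby", "swift", "golang", "scala", "kotlin"));

    /** Returns the configured entries that are not in the accepted list, in input order. */
    static List<String> unsupported(List<String> configured) {
        List<String> bad = new ArrayList<>();
        for (String lang : configured) {
            if (!SUPPORTED.contains(lang)) {
                bad.add(lang);
            }
        }
        return bad;
    }
}
```

For example, LanguageCheck.unsupported(Arrays.asList("cpp", "go")) reports "go", since the expected spelling is golang.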

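Result

How the crawler works

Relevant APIs

The data we want is obtained in two ways: (1) RESTful APIs and (2) GraphQL.
The following endpoints proved useful while crawling:

  • Information about all problems: https://leetcode.com/api/problems/all/. The data looks roughly like this: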
{
    "user_name": "",
    "num_solved": 0,
    "num_total": 949,
    "ac_easy": 0,
    "ac_medium": 0,
    "ac_hard": 0,
    "stat_status_pairs": [
    {
        "stat":
        {
            "question_id": 993,
            "question__article__live": true,
            "question__article__slug": "tallest-billboard",
            "question__title": "Tallest Billboard",
            "question__title_slug": "tallest-billboard",
            "question__hide": false,
            "total_acs": 1361,
            "total_submitted": 4295,
            "frontend_question_id": 956,
            "is_new_question": false
        },
        "status": null,
        "difficulty":
        {
            "level": 3
        },
        "paid_only": false,
        "is_favor": false,
        "frequency": 0,
        "progress": 0
    },
    ... (remaining entries omitted)
    ],
    "frequency_high": 0,
    "frequency_mid": 0,
    "category_slug": "all"
}
  • Submissions for a given problem: https://leetcode.com/api/submissions/two-sum/?offset=0&limit=10&lastkey=. The submission list may span more than one page, so paging logic is needed. The data looks roughly like this:
{
    "submissions_dump": [
    {
        "id": xxx,
        "lang": "java",
        "time": "2 weeks, 5 days",
        "timestamp": 154****320,
        "status_display": "Accepted",
        "runtime": "4 ms",
        "url": "/submissions/detail/19****359/",
        "is_pending": "Not Pending",
        "title": ""
    },
    ... (remaining entries omitted)
    ],
    "has_next": true,
    "last_key": "xxx"
}
  • GraphQL: https://leetcode.com/graphql. Send a query request to this URL to get the data we want
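
The paging needed for the submissions endpoint can be sketched as a loop driven by has_next and last_key. This is a minimal illustration, not the tool's actual code: the URL template is an assumption modeled on the example URL above, and Fetcher is a hypothetical stand-in for the HTTP call and JSON parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch of paging through the submissions endpoint; all names are illustrative. */
class SubmissionPager {
    // Assumed URL template, modeled on the example URL in the list above.
    static final String URL_FORMAT =
            "https://leetcode.com/api/submissions/%s/?offset=%d&limit=%d&lastkey=%s";

    /** One page of results: the items plus the two paging fields from the JSON. */
    static class Page {
        final List<String> items;
        final boolean hasNext;
        final String lastKey;

        Page(List<String> items, boolean hasNext, String lastKey) {
            this.items = items;
            this.hasNext = hasNext;
            this.lastKey = lastKey;
        }
    }

    /** Hypothetical stand-in for "GET the URL and parse the JSON". */
    interface Fetcher {
        Page fetch(String url);
    }

    static String pageUrl(String titleSlug, int offset, int limit, String lastKey) {
        return String.format(Locale.ROOT, URL_FORMAT, titleSlug, offset, limit, lastKey);
    }

    /** Collects the items of every page, following has_next and last_key. */
    static List<String> fetchAll(String titleSlug, int limit, Fetcher fetcher) {
        List<String> all = new ArrayList<>();
        int offset = 0;
        String lastKey = "";
        boolean hasNext = true;
        while (hasNext) {
            Page page = fetcher.fetch(pageUrl(titleSlug, offset, limit, lastKey));
            all.addAll(page.items);
            hasNext = page.hasNext;
            lastKey = page.lastKey == null ? "" : page.lastKey;
            offset += limit; // advance by one page, not by a multiple of the page index
        }
        return all;
    }
}
```

Keeping the HTTP call behind an interface makes the loop easy to exercise with canned pages before pointing it at the real endpoint.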

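Simulating login

I wrote an earlier post on this, "Simulating LeetCode Login with OkHttp", which covers the details; here is a short summary. Capturing the login traffic shows that the login form is submitted as multipart/form-data.

So all we need to do is build a request whose Content-Type is multipart/form-data and attach the Cookie values returned when the login page is first opened; that completes the simulated login.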


/**
  * Simulates logging in to LeetCode. For an analysis of the login flow, see https://www.cnblogs.com/ZhaoxiCheung/p/9302510.html
  */
public boolean doLogin() throws IOException {
    boolean success;
    Connection.Response response = Jsoup.connect(URL.LOGIN)
                                   .method(Connection.Method.GET)
                                   .execute();

    csrftoken = response.cookie("csrftoken");
    __cfduid = response.cookie("__cfduid");

    OkHttpClient client = new OkHttpClient.Builder()
                      .followRedirects(false)
                      .followSslRedirects(false)
                      .cookieJar(new MyCookieJar())
                      .connectTimeout(30, TimeUnit.SECONDS)
                      .readTimeout(30, TimeUnit.SECONDS)
                      .writeTimeout(30, TimeUnit.SECONDS)
                      .build();

    String form_data = "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"csrfmiddlewaretoken\"" + "\r\n\r\n"
                       + csrftoken + "\r\n"
                       + "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"login\"" + "\r\n\r\n"
                       + usrname + "\r\n"
                       + "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"password\"" + "\r\n\r\n"
                       + passwd + "\r\n"
                       + "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"next\"" + "\r\n\r\n"
                       + "/problems" + "\r\n"
                       + "--" + boundary + "--";

    RequestBody requestBody = RequestBody.create(MULTIPART, form_data);

    Request request = new Request.Builder()
                    .addHeader("Content-Type", "multipart/form-data; boundary=" + boundary)
                    .addHeader("Connection", "keep-alive")
                    .addHeader("Accept", "*/*")
                    .addHeader("Origin", "https://leetcode.com")
                    .addHeader("Referer", URL.LOGIN)
                    .addHeader("Cookie", "__cfduid=" + __cfduid + ";" + "csrftoken=" + csrftoken)
                    .post(requestBody)
                    .url(URL.LOGIN)
                    .build();

    Response loginResponse = client.newCall(request).execute();

    if (Main.isDebug)   out.println(loginResponse.message());

    Headers headers = loginResponse.headers();
    List<String>cookies = headers.values("Set-Cookie");
    for (String cookie : cookies) {
        int found = cookie.indexOf("LEETCODE_SESSION");
        if (found > -1) {
            if (Main.isDebug)   out.println(cookie);
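            // The value runs from just after "LEETCODE_SESSION=" to the next ';' (or to the end of the header)
            int start = found + "LEETCODE_SESSION".length() + 1;
            int last = cookie.indexOf(";", start);
            LEETCODE_SESSION = last > -1 ? cookie.substring(start, last) : cookie.substring(start);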
            if (Main.isDebug)   out.println(LEETCODE_SESSION);
        }
    }


    if (LEETCODE_SESSION != null) {
        success = true;
        out.println("Login Successfully");
    } else {
        success = false;
        out.println("Login Unsuccessfully");
    }
    loginResponse.close();

    return success;
}
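
The hand-concatenated form_data string in doLogin follows a regular pattern: one part per field, then a closing boundary. As a sketch only, the same body could be generated from an ordered map of fields; the class and method names here are hypothetical, and the field names are the ones used in doLogin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: builds the same multipart/form-data text that doLogin concatenates by hand. */
class MultipartSketch {
    /** Emits one part per field in iteration order, followed by the closing boundary. */
    static String buildBody(String boundary, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            sb.append("--").append(boundary).append("\r\n")
              .append("Content-Disposition: form-data; name=\"")
              .append(field.getKey()).append("\"")
              .append("\r\n\r\n")
              .append(field.getValue()).append("\r\n");
        }
        sb.append("--").append(boundary).append("--");
        return sb.toString();
    }

    /** The four login fields in the order doLogin sends them. */
    static Map<String, String> loginFields(String csrfToken, String username, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("csrfmiddlewaretoken", csrfToken);
        fields.put("login", username);
        fields.put("password", password);
        fields.put("next", "/problems");
        return fields;
    }
}
```

A LinkedHashMap is used so the parts come out in insertion order. OkHttp also provides MultipartBody.Builder, which may be a simpler way to produce such a body; whether the server accepts its output unchanged has not been checked here.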

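Fetching data with GraphQL

Not all data is available through the RESTful APIs; LeetCode serves some of it, such as a problem's description, through GraphQL. I also wrote an earlier post on this, "Crawling LeetCode Problems: How to Send a GraphQL Query to Get Data", which goes into more detail. The main point here is how to find out what query to send. In Chrome, press F12 to open DevTools and switch to the Network tab; under Request Payload in a request's Headers there is a query field, which is the key piece of information for the GraphQL query we need to build.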
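
As a small illustration, the query text for a problem's content can be built from its title slug. The query shape is the one used in getProblemDescription later in this post; the class name and the slug check are assumptions added for the sketch (the check assumes slugs contain only lowercase letters, digits, and hyphens, which keeps quotes out of the query string).

```java
import java.util.regex.Pattern;

/** Sketch: builds the GraphQL query text for a problem's content. */
class GraphqlQuerySketch {
    // Assumption: title slugs consist of lowercase letters, digits and hyphens only.
    private static final Pattern SLUG = Pattern.compile("[a-z0-9-]+");

    /** Returns the query text, rejecting slugs that could break out of the string literal. */
    static String questionContentQuery(String titleSlug) {
        if (titleSlug == null || !SLUG.matcher(titleSlug).matches()) {
            throw new IllegalArgumentException("unexpected title slug: " + titleSlug);
        }
        return "query{question(titleSlug:\"" + titleSlug + "\") {content}}";
    }
}
```

Validating the slug before concatenating it is a cheap guard; an alternative would be to send the slug as a GraphQL variable instead of embedding it in the query text.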

Miscellaneous

Getting a problem's description

public String getProblemDescription(String problemTitle) throws IOException {
    String problemDescriptionString = "";
    String postBody = "query{question(titleSlug:\"" + problemTitle + "\") {content}}\n";
    RequestBody requestBody = RequestBody.create(MediaType.parse("application/graphql; charset=utf-8"), postBody);
    Headers headers = new Headers.Builder()
                .add("Content-Type", "application/graphql")
                .add("Referer", "https://leetcode.com/problems/" + problemTitle)
                .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                .add("x-csrftoken", Login.csrftoken)
                .build();

    Response graphqlResponse = okHttpHelper.post(URL.GRAPHQL, requestBody, headers);

    if (graphqlResponse != null) {
        ProblemContentBean problemContentBean = okHttpHelper.fromJson(graphqlResponse.body().string(), ProblemContentBean.class);
        problemDescriptionString = problemContentBean.getData().getQuestion().getContent();

        graphqlResponse.close();
    } else {
        // TODO: report the error
    }
    return problemDescriptionString;
}

Getting the code of a submission for a given problem and language

public String getSubmissionCode(String submissionUrl) throws IOException {
    String url = URL.LEETCODE + submissionUrl;
    if (Main.isDebug)   out.println(url);
    String codeString = null;

    Headers headers = new Headers.Builder()
            .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
            .build();

    Response response = okHttpHelper.get(url, headers);

    if (response != null) {
        String htmlString = response.body().string();

        Document document = Jsoup.parse(htmlString);
        Elements elements = document.getElementsByTag("script");
        for (Element element : elements) {
            int indexStart = element.toString().indexOf("submissionCode: '");
            if (indexStart > -1) {
                int indexTo = element.toString().indexOf("editCodeUrl");
                codeString = element.toString().substring(indexStart + ("submissionCode: '").length(), indexTo - 5);
                break;
            }
        }

        response.close();
    } else {
        // TODO: handle the error
    }

    codeString = encode(codeString);

    return codeString;
}
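
getSubmissionCode cuts the code out of a script tag using fixed offsets and then passes it to encode, which is not shown in this post. The sketch below is one possible way to do both steps, under the assumption that the page embeds the code as a single-quoted JavaScript string in which special characters appear as \uXXXX escapes. The class and method names are hypothetical, and this is not necessarily what encode does.

```java
/** Sketch: pulls the submissionCode literal out of a script and decodes its escapes. */
class SubmissionCodeSketch {
    private static final String START = "submissionCode: '";

    /** Returns the still-escaped text between START and the next unescaped quote, or null. */
    static String extractLiteral(String script) {
        int from = script.indexOf(START);
        if (from < 0) {
            return null;
        }
        from += START.length();
        for (int i = from; i < script.length(); i++) {
            char c = script.charAt(i);
            if (c == '\\') {
                i++; // skip the character that follows a backslash
            } else if (c == '\'') {
                return script.substring(from, i);
            }
        }
        return null; // the literal was never closed
    }

    /** Decodes backslash-u sequences (backslash, 'u', four hex digits); other text is copied as-is. */
    static String unescapeUnicode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length() && isHex(s, i + 2)) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'from' are all hexadecimal digits.
    private static boolean isHex(String s, int from) {
        for (int i = from; i < from + 4; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Scanning for the closing quote avoids depending on the position of editCodeUrl, so the extraction would keep working if the fields after submissionCode were reordered.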

Getting the submitted code for the languages specified in the config file

public synchronized Map<String, String> getSubmissions(String problemTitle, ResultBean resultBean) throws IOException {
    if (Main.isDebug)   out.println("pre problemTitle = " + problemTitle);
    // Holds the submitted code for each language
    Map<String, String> submissionMap = new HashMap<>();
    int offset = 0;
    int limit = 10;
    boolean hasNext = true;
    String lastKey = "";

    List<String> languageList = Config.getSingleton().getLanguageList();
    // Languages whose code is already saved locally
    List<String> savedLanguageList = resultBean != null ? resultBean.getLanguage() : new ArrayList<>(0);

    // Tracks whether the code for each language has been fetched yet
    Map<String, Boolean>languageMap = new HashMap<>();
    for (int i = 0; i < languageList.size(); i++) {
        boolean hasExist = false;
        // The lists are small, so a brute-force search is fine
        for (int j = 0; j < savedLanguageList.size(); j++) {
            if (languageList.get(i).equals(savedLanguageList.get(j))) {
                hasExist = true;
                break;
            }
        }
        if (!hasExist)  languageMap.put(languageList.get(i), false);
    }

    // The code for every requested language of this problem is already saved locally
    if (languageMap.size() == 0)    return submissionMap;

    while(hasNext) {
        String submissionsUrl = String.format(URL.SUBMISSIONS_FORMAT, problemTitle, offset, limit, lastKey);

        Headers headers = new Headers.Builder()
                    .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                    .build();

        Response response = okHttpHelper.get(submissionsUrl, headers);

        if (response != null) {
            String responseData = response.body().string();

            SubmissionBean submissionBean = okHttpHelper.fromJson(responseData, SubmissionBean.class);
            List<SubmissionBean.SubmissionsDumpBean> submissionsDumpList = submissionBean.getSubmissions_dump();

            if (submissionsDumpList == null) {
                if (Main.isDebug) {
                    out.println("submissionsUrl = " + submissionsUrl);
                    out.println("problemTitle = " + problemTitle);
                    out.println("responseData = " + responseData);
                    out.println("status message = " + response.message());
                    out.println("message code = " + response.code());
                }

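                // Close the response before retrying the same page so the connection is not leaked
                response.close();
                continue;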
            }

            for (int i = 0; i < submissionsDumpList.size(); i++) {
                SubmissionBean.SubmissionsDumpBean submission = submissionsDumpList.get(i);
                String language = submission.getLang();
                if (languageMap.containsKey(language) && languageMap.get(language) == false && submission.getStatus_display().equals("Accepted")) {
                    submissionMap.put(language, getSubmissionCode(submission.getUrl()));
                    languageMap.put(language, true);
                }
            }

            // Paging logic: advance the offset by one page
            hasNext = submissionBean.isHas_next();
            offset += limit;
            lastKey = submissionBean.getLast_key();

            response.close();
        } else {
            // TODO: handle the error
        }
    }

    return submissionMap;
}
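
The bookkeeping in getSubmissions (which languages are still missing, and which Accepted submission to take for each) can also be expressed with set operations. The following is a sketch under that reading of the method, not the tool's code; Submission is a hypothetical stand-in for SubmissionBean.SubmissionsDumpBean.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: picks the first Accepted submission per language that is still missing. */
class AcceptedPicker {
    /** Hypothetical stand-in for one entry of submissions_dump. */
    static class Submission {
        final String lang;
        final String status;
        final String url;

        Submission(String lang, String status, String url) {
            this.lang = lang;
            this.status = status;
            this.url = url;
        }
    }

    /** Languages requested in the config that are not saved locally yet. */
    static Set<String> pending(List<String> wanted, List<String> saved) {
        Set<String> result = new LinkedHashSet<>(wanted);
        result.removeAll(saved);
        return result;
    }

    /**
     * Maps each pending language to the URL of its first Accepted submission
     * in list order, removing that language from the pending set.
     */
    static Map<String, String> pickAccepted(List<Submission> page, Set<String> pending) {
        Map<String, String> picked = new LinkedHashMap<>();
        for (Submission s : page) {
            if ("Accepted".equals(s.status) && pending.remove(s.lang)) {
                picked.put(s.lang, s.url);
            }
        }
        return picked;
    }
}
```

Because found languages are removed from the pending set, a paging loop built on this could stop as soon as the set is empty instead of walking every remaining page.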

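More detailed code is available on GitHub: LeetCodeCrawler

Copyright notice: This is an original article by ZhaoxiCheung, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.
Link to this article: https://www.cnblogs.com/ZhaoxiCheung/p/10123926.html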