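Crawling LeetCode Data to Generate a README and Polish Your GitHub Repository
Project: LeetCodeCrawler
Overview
Most of us practice on LeetCode to some extent and then push the solutions to GitHub. I do the same; my solutions live in my Algorithm repository. Re-editing README.md by hand after every solved problem is tedious, so I wrote a tool for it: LeetCodeCrawler. It crawls LeetCode problem statements and your accepted (AC) submissions, and it can generate the corresponding README.md file to polish the README of your LeetCode repository.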
Usage
Download LeetCodeCrawler.jar to your machine.
Create a config.json file like the one below (you can edit the config.json in the repo directly). config.json must be placed in the same directory as LeetCodeCrawler.jar:
{
    "username": "leetcode@leetcode",
    "password": "leetcode",
    "language": ["cpp", "java"],
    "outputDir": "."
}
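- username and password are your LeetCode account name and password.
- language lists the programming languages you use on LeetCode. Multiple values are allowed. The accepted values are listed below and must be spelled exactly as shown:
  - cpp
  - java
  - c
  - csharp
  - javascript
  - python
  - python3
  - ruby
  - swift
  - golang
  - scala
  - kotlin
- outputDir is the directory where the source files should be stored. The default is ., that is, the current directory.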
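Because the language values have to match the list above exactly, a small check before crawling can catch typos early. The following is only a minimal sketch and is not taken from LeetCodeCrawler; the class name LanguageCheck and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of LeetCodeCrawler.
final class LanguageCheck {
    // The language identifiers listed above.
    static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
            "cpp", "java", "c", "csharp", "javascript", "python",
            "python3", "ruby", "swift", "golang", "scala", "kotlin"));

    // Returns the configured entries that are not in the supported list,
    // in the order they were configured. The comparison is case-sensitive.
    static List<String> unsupported(List<String> configured) {
        List<String> bad = new ArrayList<>();
        for (String lang : configured) {
            if (!SUPPORTED.contains(lang)) {
                bad.add(lang);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // "Java" has the wrong case and "go" should be "golang".
        System.out.println(unsupported(Arrays.asList("cpp", "Java", "go"))); // prints [Java, go]
    }
}
```

An empty result means every configured language is one of the accepted values.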
Run java -jar LeetCodeCrawler.jar
Result
Crawling and parsing
Relevant APIs
The data we want is obtained in two main ways:
1. RESTful API
2. GraphQL
The following APIs proved useful during crawling:
- Information about all problems: https://leetcode.com/api/problems/all/ . The data looks roughly like this:
{
    "user_name": "",
    "num_solved": 0,
    "num_total": 949,
    "ac_easy": 0,
    "ac_medium": 0,
    "ac_hard": 0,
    "stat_status_pairs": [
        {
            "stat": {
                "question_id": 993,
                "question__article__live": true,
                "question__article__slug": "tallest-billboard",
                "question__title": "Tallest Billboard",
                "question__title_slug": "tallest-billboard",
                "question__hide": false,
                "total_acs": 1361,
                "total_submitted": 4295,
                "frontend_question_id": 956,
                "is_new_question": false
            },
            "status": null,
            "difficulty": {
                "level": 3
            },
            "paid_only": false,
            "is_favor": false,
            "frequency": 0,
            "progress": 0
        },
        ... (omitted)
    ],
    "frequency_high": 0,
    "frequency_mid": 0,
    "category_slug": "all"
}
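- Information about the submissions for a given problem: https://leetcode.com/api/submissions/two-sum/?offset=0&limit=10&lastkey= . The list of submissions may span more than one page, so paging logic is needed. The data looks roughly like this: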
{
    "submissions_dump": [
        {
            "id": xxx,
            "lang": "java",
            "time": "2 weeks, 5 days",
            "timestamp": 154****320,
            "status_display": "Accepted",
            "runtime": "4 ms",
            "url": "/submissions/detail/19****359/",
            "is_pending": "Not Pending",
            "title": ""
        },
        ... (omitted)
    ],
    "has_next": true,
    "last_key": "xxx"
}
- GraphQL: https://leetcode.com/graphql . Send a query request to this URL to get the data we want.
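To show how the fields returned by the first API can feed the README, here is a minimal sketch that turns one stat_status_pairs entry into a Markdown table row. It is not taken from LeetCodeCrawler: the class ReadmeRow, the column layout, and the mapping of difficulty level 1/2/3 to Easy/Medium/Hard are assumptions made for illustration.

```java
// Hypothetical helper for illustration; not part of LeetCodeCrawler.
final class ReadmeRow {
    // Assumed mapping of difficulty.level to a label.
    static String difficultyLabel(int level) {
        switch (level) {
            case 1: return "Easy";
            case 2: return "Medium";
            case 3: return "Hard";
            default: return "Unknown";
        }
    }

    // Builds one Markdown table row from frontend_question_id,
    // question__title, question__title_slug and difficulty.level.
    static String row(int frontendId, String title, String slug, int level) {
        return String.format("| %d | [%s](https://leetcode.com/problems/%s/) | %s |",
                frontendId, title, slug, difficultyLabel(level));
    }

    public static void main(String[] args) {
        // Values taken from the sample response above.
        System.out.println(row(956, "Tallest Billboard", "tallest-billboard", 3));
        // prints | 956 | [Tallest Billboard](https://leetcode.com/problems/tallest-billboard/) | Hard |
    }
}
```

Note that the row uses frontend_question_id (956), the number shown on the website, rather than question_id (993).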
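Simulating login
I previously wrote a blog post on how to simulate the login, "Simulating LeetCode Login with OkHttp", which has more details; here is a brief summary. According to the packet capture, we only need to create a request whose Content-Type is multipart/form-data and attach the Cookie values returned when the login page is first opened to complete the simulated login.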
// Requires: java.io.IOException, java.util.List, java.util.concurrent.TimeUnit,
// org.jsoup.Connection, org.jsoup.Jsoup, and okhttp3.{Headers, OkHttpClient, Request, RequestBody, Response}.
// usrname, passwd, boundary, MULTIPART, csrftoken, __cfduid and LEETCODE_SESSION are fields declared elsewhere in the project.
/**
 * Simulates logging in to LeetCode. For an analysis of the login flow, see:
 * https://www.cnblogs.com/ZhaoxiCheung/p/9302510.html
 */
public boolean doLogin() throws IOException {
    boolean success;
    // Open the login page first to obtain the initial cookies.
    Connection.Response response = Jsoup.connect(URL.LOGIN)
            .method(Connection.Method.GET)
            .execute();
    csrftoken = response.cookie("csrftoken");
    __cfduid = response.cookie("__cfduid");
    OkHttpClient client = new OkHttpClient.Builder()
            .followRedirects(false)
            .followSslRedirects(false)
            .cookieJar(new MyCookieJar())
            .connectTimeout(30, TimeUnit.SECONDS)
            .readTimeout(30, TimeUnit.SECONDS)
            .writeTimeout(30, TimeUnit.SECONDS)
            .build();
    // Assemble the multipart/form-data body by hand: one part per form field.
    String form_data = "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"csrfmiddlewaretoken\"" + "\r\n\r\n"
            + csrftoken + "\r\n"
            + "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"login\"" + "\r\n\r\n"
            + usrname + "\r\n"
            + "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"password\"" + "\r\n\r\n"
            + passwd + "\r\n"
            + "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"next\"" + "\r\n\r\n"
            + "/problems" + "\r\n"
            + "--" + boundary + "--";
    RequestBody requestBody = RequestBody.create(MULTIPART, form_data);
    Request request = new Request.Builder()
            .addHeader("Content-Type", "multipart/form-data; boundary=" + boundary)
            .addHeader("Connection", "keep-alive")
            .addHeader("Accept", "*/*")
            .addHeader("Origin", "https://leetcode.com")
            .addHeader("Referer", URL.LOGIN)
            .addHeader("Cookie", "__cfduid=" + __cfduid + ";" + "csrftoken=" + csrftoken)
            .post(requestBody)
            .url(URL.LOGIN)
            .build();
    Response loginResponse = client.newCall(request).execute();
    if (Main.isDebug) out.println(loginResponse.message());
    Headers headers = loginResponse.headers();
    List<String> cookies = headers.values("Set-Cookie");
    for (String cookie : cookies) {
        // Each Set-Cookie value has the form "NAME=value; attribute; ...".
        if (cookie.startsWith("LEETCODE_SESSION=")) {
            if (Main.isDebug) out.println(cookie);
            int last = cookie.indexOf(';');
            if (last == -1) last = cookie.length();
            LEETCODE_SESSION = cookie.substring("LEETCODE_SESSION=".length(), last);
            if (Main.isDebug) out.println(LEETCODE_SESSION);
        }
    }
    // The login succeeded if the server issued a session cookie.
    if (LEETCODE_SESSION != null) {
        success = true;
        out.println("Login succeeded");
    } else {
        success = false;
        out.println("Login failed");
    }
    loginResponse.close();
    return success;
}
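The body assembled by hand in doLogin follows the multipart/form-data layout: every field starts with a line of -- plus the boundary, and the body ends with the boundary followed by another --. As a sketch of how that string building could be factored out, the helper below builds the same layout from a map of fields. FormDataBuilder is a hypothetical name, and the boundary and field values in main are placeholders, not real credentials.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of LeetCodeCrawler.
final class FormDataBuilder {
    // Builds a multipart/form-data body with one text part per map entry,
    // in the map's iteration order.
    static String build(String boundary, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            sb.append("--").append(boundary).append("\r\n")
              .append("Content-Disposition: form-data; name=\"")
              .append(field.getKey()).append("\"\r\n\r\n")
              .append(field.getValue()).append("\r\n");
        }
        // The closing delimiter is the boundary wrapped in "--" on both sides.
        sb.append("--").append(boundary).append("--");
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("csrfmiddlewaretoken", "token-placeholder");
        fields.put("login", "user-placeholder");
        fields.put("password", "password-placeholder");
        fields.put("next", "/problems");
        System.out.println(build("----ExampleBoundary", fields));
    }
}
```

OkHttp's own MultipartBody.Builder (setType(MultipartBody.FORM) plus addFormDataPart) can produce an equivalent body and may be a simpler option than building the string manually.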
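Fetching data with GraphQL
Not all data is available through the RESTful API; LeetCode serves some of it, such as a problem's description, through GraphQL. I also wrote an earlier article about using GraphQL to fetch LeetCode data, "Crawling LeetCode Problems: How to Send a GraphQL Query to Get Data", which has more details. Here I mainly explain how to find out which query to send. In Chrome, press F12 to open the developer tools and look at a request in the Network panel: under Request Payload in the Headers tab there is a query field, which is the key piece of information for constructing our GraphQL query.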
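Once you have the query text from Request Payload, it has to be wrapped in a request body. The sketch below builds a JSON payload with a query field and a variables field, the usual shape for GraphQL over HTTP. It is an illustration, not the code used by LeetCodeCrawler: the class GraphqlPayload is hypothetical, passing titleSlug as a variable is my own choice here, and such a body would be sent with Content-Type application/json, whereas the code later in this post posts the raw query as application/graphql.

```java
// Hypothetical helper for illustration; not part of LeetCodeCrawler.
final class GraphqlPayload {
    // Escapes a string so it can be embedded in a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the numeric escape form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the JSON body that asks for the content of one question.
    static String questionContentPayload(String titleSlug) {
        String query = "query($titleSlug: String!) { question(titleSlug: $titleSlug) { content } }";
        return "{\"query\":\"" + jsonEscape(query) + "\","
                + "\"variables\":{\"titleSlug\":\"" + jsonEscape(titleSlug) + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(questionContentPayload("two-sum"));
    }
}
```

Passing the slug as a variable keeps the query text constant and avoids splicing user-provided text into the query string.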
Other
Fetching a problem's description
// Requires: java.io.IOException and okhttp3.{Headers, MediaType, RequestBody, Response}.
// okHttpHelper, URL, Login and ProblemContentBean are defined elsewhere in the project.
public String getProblemDescription(String problemTitle) throws IOException {
    String problemDescriptionString = "";
    // GraphQL query asking for the content (description) of the problem with this title slug.
    String postBody = "query{question(titleSlug:\"" + problemTitle + "\") {content}}\n";
    RequestBody requestBody = RequestBody.create(MediaType.parse("application/graphql; charset=utf-8"), postBody);
    Headers headers = new Headers.Builder()
            .add("Content-Type", "application/graphql")
            .add("Referer", "https://leetcode.com/problems/" + problemTitle)
            .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
            .add("x-csrftoken", Login.csrftoken)
            .build();
    Response graphqlResponse = okHttpHelper.post(URL.GRAPHQL, requestBody, headers);
    if (graphqlResponse != null) {
        ProblemContentBean problemContentBean = okHttpHelper.fromJson(graphqlResponse.body().string(), ProblemContentBean.class);
        problemDescriptionString = problemContentBean.getData().getQuestion().getContent();
        graphqlResponse.close();
    } else {
        // TODO: report the error
    }
    return problemDescriptionString;
}
Fetching the code submitted for a problem in a given language
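// Requires: java.io.IOException, okhttp3.{Headers, Response}, org.jsoup.Jsoup,
// org.jsoup.nodes.{Document, Element} and org.jsoup.select.Elements.
public String getSubmissionCode(String submissionUrl) throws IOException {
    String url = URL.LEETCODE + submissionUrl;
    if (Main.isDebug) out.println(url);
    String codeString = null;
    Headers headers = new Headers.Builder()
            .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
            .build();
    Response response = okHttpHelper.get(url, headers);
    if (response != null) {
        String htmlString = response.body().string();
        Document document = Jsoup.parse(htmlString);
        // The submitted code is embedded in one of the page's script tags.
        Elements elements = document.getElementsByTag("script");
        String marker = "submissionCode: '";
        for (Element element : elements) {
            String script = element.toString();
            int indexStart = script.indexOf(marker);
            if (indexStart > -1) {
                // The code ends shortly before the editCodeUrl field; the offset of 5 drops the
                // separator between the two (presumably the closing quote, comma and whitespace).
                int indexTo = script.indexOf("editCodeUrl");
                if (indexTo - 5 >= indexStart + marker.length()) {
                    codeString = script.substring(indexStart + marker.length(), indexTo - 5);
                }
                break;
            }
        }
        response.close();
    } else {
        // TODO: handle the error
    }
    // encode is a helper defined elsewhere in the project; skip it when no code was found.
    return codeString == null ? null : encode(codeString);
}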
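The encode helper called at the end of getSubmissionCode is not shown in this post. Since submissionCode is read out of a JavaScript string literal, characters in it may appear as backslash-u escapes (a backslash, the letter u, and four hex digits). The sketch below shows one way such escapes could be turned back into characters. It is an assumption about the kind of processing needed, not the project's actual encode implementation, and the class name UnicodeUnescape is hypothetical.

```java
// Hypothetical helper for illustration; not the project's encode method.
final class UnicodeUnescape {
    // True if the four characters starting at index start are all hex digits.
    private static boolean isHex4(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Replaces every well-formed backslash-u escape with the character it encodes
    // and copies everything else through unchanged.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean isEscape = c == '\\'
                    && i + 6 <= s.length()
                    && s.charAt(i + 1) == 'u'
                    && isHex4(s, i + 2);
            if (isEscape) {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes are Java source escapes for a single backslash.
        System.out.println(decode("int a \\u003D 1\\u003B")); // prints int a = 1;
    }
}
```

Incomplete or malformed escapes are left as they are rather than raising an error.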
Fetching the code submitted for a problem in the languages specified in the config file
// Requires: java.io.IOException, java.util.{ArrayList, HashMap, List, Map} and okhttp3.{Headers, Response}.
public synchronized Map<String, String> getSubmissions(String problemTitle, ResultBean resultBean) throws IOException {
    if (Main.isDebug) out.println("pre problemTitle = " + problemTitle);
    // Maps each language to its submitted code.
    Map<String, String> submissionMap = new HashMap<>();
    int offset = 0;
    int limit = 10;
    boolean hasNext = true;
    String lastKey = "";
    List<String> languageList = Config.getSingleton().getLanguageList();
    // Languages whose code is already stored locally.
    List<String> savedLanguageList = resultBean != null ? resultBean.getLanguage() : new ArrayList<>(0);
    // Tracks, for each language still needed, whether its code has been fetched.
    Map<String, Boolean> languageMap = new HashMap<>();
    for (String language : languageList) {
        // The lists are small, so a linear search is fine.
        if (!savedLanguageList.contains(language)) languageMap.put(language, false);
    }
    // The code for every requested language is already stored locally.
    if (languageMap.isEmpty()) return submissionMap;
    while (hasNext) {
        String submissionsUrl = String.format(URL.SUBMISSIONS_FORMAT, problemTitle, offset, limit, lastKey);
        Headers headers = new Headers.Builder()
                .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                .build();
        Response response = okHttpHelper.get(submissionsUrl, headers);
        if (response != null) {
            String responseData = response.body().string();
            SubmissionBean submissionBean = okHttpHelper.fromJson(responseData, SubmissionBean.class);
            List<SubmissionBean.SubmissionsDumpBean> submissionsDumpList = submissionBean.getSubmissions_dump();
            if (submissionsDumpList == null) {
                if (Main.isDebug) {
                    out.println("submissionsUrl = " + submissionsUrl);
                    out.println("problemTitle = " + problemTitle);
                    out.println("responseData = " + responseData);
                    out.println("status message = " + response.message());
                    out.println("message code = " + response.code());
                }
                // Release the connection, then retry the same page.
                response.close();
                continue;
            }
            for (SubmissionBean.SubmissionsDumpBean submission : submissionsDumpList) {
                String language = submission.getLang();
                // Keep only the first accepted submission encountered for each language.
                if (languageMap.containsKey(language) && !languageMap.get(language) && "Accepted".equals(submission.getStatus_display())) {
                    submissionMap.put(language, getSubmissionCode(submission.getUrl()));
                    languageMap.put(language, true);
                }
            }
            // Paging: advance to the next page.
            hasNext = submissionBean.isHas_next();
            offset += limit;
            lastKey = submissionBean.getLast_key();
            response.close();
        } else {
            // TODO: handle a failed request (as written, the loop retries the same page)
        }
    }
    return submissionMap;
}
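The paging logic in getSubmissions can be looked at separately from the HTTP details. The sketch below mirrors it with an in-memory stand-in for the submissions API: each page carries its items plus the has_next and last_key values, and the offset advances by limit after every page. The names SubmissionPaging, Page and PageFetcher are hypothetical, and advancing the offset by limit is my reading of the offset and limit URL parameters rather than documented API behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of LeetCodeCrawler.
final class SubmissionPaging {
    // One page of results; mirrors submissions_dump, has_next and last_key.
    static final class Page {
        final List<String> items;
        final boolean hasNext;
        final String lastKey;

        Page(List<String> items, boolean hasNext, String lastKey) {
            this.items = items;
            this.hasNext = hasNext;
            this.lastKey = lastKey;
        }
    }

    // Stand-in for the HTTP call that fetches one page.
    interface PageFetcher {
        Page fetch(int offset, int limit, String lastKey);
    }

    // Collects the items of all pages until the fetcher reports no next page.
    static List<String> fetchAll(PageFetcher fetcher, int limit) {
        List<String> all = new ArrayList<>();
        int offset = 0;
        String lastKey = "";
        boolean hasNext = true;
        while (hasNext) {
            Page page = fetcher.fetch(offset, limit, lastKey);
            if (page == null) {
                break; // stop on a failed fetch instead of retrying forever
            }
            all.addAll(page.items);
            hasNext = page.hasNext;
            lastKey = page.lastKey;
            offset += limit;
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        // In-memory fetcher that serves slices of the list above.
        PageFetcher fake = (offset, limit, lastKey) -> {
            int end = Math.min(offset + limit, data.size());
            return new Page(data.subList(offset, end), end < data.size(), "key" + end);
        };
        System.out.println(fetchAll(fake, 2)); // prints [s1, s2, s3, s4, s5]
    }
}
```

Unlike getSubmissions, this sketch stops when a page cannot be fetched; a real crawler might prefer a bounded number of retries.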
More detailed code is available on GitHub: LeetCodeCrawler