Android Crawler (Part 1): Implementing a Web Crawler with OkHttp + Jsoup
Over the past few days I put together a simple crawler demo on Android.
The scraped data is displayed in a RecyclerView; this first post covers the scraping part.
The site I used for testing is 什么值得买 (smzdm.com).
The goal is to scrape the featured articles on the home page, mainly each article's title, image, and summary.
The project needs the Jsoup and OkHttp libraries. I downloaded the jar files and added them to the project by hand,
but you can also declare the dependencies directly in the Gradle build file:
implementation 'org.jsoup:jsoup:1.11.3'
implementation 'com.squareup.okhttp3:okhttp:3.4.1'
With the dependencies in place, we can start writing the crawler.
The entity class, Article.java:
/*
 * @Author: Swallow
 * @Date: 2019/3/21
 * Holds the data scraped for a single article
 */
public class Article {
    private String title;
    private String author;
    private String imgUrl;
    private String context;
    private String articleUrl;
    private String date;
    private String from;

    // Several fields are not filled in yet, so the constructor takes only the four we actually scrape
    public Article(String title, String author, String imgUrl, String context) {
        this.title = title;
        this.author = author;
        this.imgUrl = imgUrl;
        this.context = context;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getImgUrl() {
        return imgUrl;
    }

    public void setImgUrl(String imgUrl) {
        this.imgUrl = imgUrl;
    }

    public String getContext() {
        return context;
    }

    public void setContext(String context) {
        this.context = context;
    }

    public String getArticleUrl() {
        return articleUrl;
    }

    public void setArticleUrl(String articleUrl) {
        this.articleUrl = articleUrl;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getFrom() {
        return from;
    }

    public void setFrom(String from) {
        this.from = from;
    }

    @Override
    public String toString() {
        return "Article{" +
                "title='" + title + '\'' +
                ", author='" + author + '\'' +
                ", imgUrl='" + imgUrl + '\'' +
                ", context='" + context + '\'' +
                ", articleUrl='" + articleUrl + '\'' +
                ", date='" + date + '\'' +
                ", from='" + from + '\'' +
                '}';
    }
}
Requesting the page with OkHttp:
import java.io.IOException;

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

/*
 * @Author: Swallow
 * @Date: 2019/3/7
 */
public class OkHttpUtils {
    // Performs a synchronous GET and returns the page HTML, or null on failure
    public static String OkGetArt(String url) {
        String html = null;
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder()
                .url(url)
                .build();
        // try-with-resources closes the response (and its body) automatically
        try (Response response = client.newCall(request).execute()) {
            html = response.body().string();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }
}
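One design note: OkGetArt creates a new OkHttpClient on every call. OkHttp recommends sharing a single client instance so its connection pool and threads are reused across requests; a minimal sketch of that variant:

// Shared across the whole app so OkHttp can reuse connections and threads
private static final OkHttpClient CLIENT = new OkHttpClient();

public static String OkGetArt(String url) {
    Request request = new Request.Builder().url(url).build();
    try (Response response = CLIENT.newCall(request).execute()) {
        return response.body().string();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}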
The data-extraction class
This is where Jsoup comes in: it parses the HTML of the page we fetched so the data can be pulled out of specific tags.
To find the right tags, open the site in a browser and view the page source; the structure shows which elements hold the data we want (a hand-written fragment matching these selectors appears after the class below).
import java.util.ArrayList;

import android.util.Log;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * @Author: Swallow
 * @Date: 2019/3/21
 */
public class GetData {
    /**
     * Scrapes the featured articles from the smzdm home page
     * @param html the raw HTML of the home page
     * @return ArrayList<Article> articles
     */
    public static ArrayList<Article> spiderArticle(String html) {
        ArrayList<Article> articles = new ArrayList<>();
        Document document = Jsoup.parse(html);
        // Each featured article is an <li> inside the feed-list-hits <ul>
        Elements elements = document
                .select("ul[class=feed-list-hits]")
                .select("li[class=feed-row-wide J_feed_za ]");
        for (Element element : elements) {
            String title = element
                    .select("h5[class=feed-block-title]")
                    .text();
            String author = element
                    .select("div[class=feed-block-info]")
                    .select("span")
                    .text();
            String imgurl = element
                    .select("div[class=z-feed-img]")
                    .select("a")
                    .select("img")
                    .attr("src");
            String context = element
                    .select("div[class=feed-block-descripe]")
                    .text();
            String url = element
                    .select("div[class=feed-block z-hor-feed ]")
                    .select("a")
                    .attr("href");
            Article article = new Article(title, author, imgurl, context);
            // The link is scraped too, so store it even though the constructor skips it
            article.setArticleUrl(url);
            articles.add(article);
            Log.e("DATA>>", article.toString());
        }
        return articles;
    }
}
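A quick way to sanity-check the selectors without hitting the network is to feed spiderArticle a hand-written fragment shaped like the smzdm markup. The fragment below is illustrative only; the real page's markup may differ and can change at any time, and this has to run on a device or emulator since spiderArticle logs through android.util.Log:

// Minimal hand-written HTML matching the selectors above (illustrative only)
String sample =
        "<ul class=\"feed-list-hits\">" +
        "<li class=\"feed-row-wide J_feed_za \">" +
        "<div class=\"z-feed-img\"><a href=\"#\"><img src=\"cover.jpg\"></a></div>" +
        "<h5 class=\"feed-block-title\">Sample title</h5>" +
        "<div class=\"feed-block-descripe\">Sample summary</div>" +
        "<div class=\"feed-block-info\"><span>Sample author</span></div>" +
        "</li>" +
        "</ul>";
ArrayList<Article> result = GetData.spiderArticle(sample);
// Expect one Article whose title, author, image URL, and summary match the sample values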
With both classes in place, calling the two methods is all it takes.
One thing to note: Android does not allow network requests on the main (UI) thread, so the call has to run on a worker thread; the app also needs the INTERNET permission (android.permission.INTERNET) declared in AndroidManifest.xml:
final String url = "https://www.smzdm.com/";
new Thread() {
    @Override
    public void run() {
        String html = OkHttpUtils.OkGetArt(url);
        ArrayList<Article> articles = GetData.spiderArticle(html);
        // Post the result to the handler so the UI thread can update the view
        Message message = handler.obtainMessage();
        message.what = 1;
        message.obj = articles;
        handler.sendMessage(message);
    }
}.start();
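For completeness, here is a minimal sketch of the matching Handler on the UI side. The field name handler is the one assumed by the code above; what you do with the list (feeding it to the RecyclerView adapter) is the topic of the next post:

// Created on the main thread; receives the list scraped by the worker thread
private final Handler handler = new Handler(Looper.getMainLooper()) {
    @Override
    public void handleMessage(Message msg) {
        if (msg.what == 1) {
            @SuppressWarnings("unchecked")
            ArrayList<Article> articles = (ArrayList<Article>) msg.obj;
            // Update the UI here, e.g. hand the list to the RecyclerView adapter
        }
    }
};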