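Today's goal: crawl the CVPR 2018 papers, analyze them to find the most frequently mentioned keywords, display the result as a wordCloud chart, and make each keyword clickable so that it shows the matching papers and their links.

The task breaks down into three steps:

① Crawl the title, abstract, keywords, and paper link of each CVPR 2018 paper

② Render the crawled information as a wordCloud chart

③ Add a click event that shows the papers and links for the selected keyword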

 

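1. Implementing the crawler

The paper pages do not list any keywords, so each title is split into words and those words are stored as the keywords, separated by commas.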


import requests
from bs4 import BeautifulSoup
import pymysql

# Request headers
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
url = 'http://openaccess.thecvf.com/CVPR2018.py'
r = requests.get(url, headers=headers)
content = r.content.decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
dts = soup.find_all('dt', class_='ptitle')
hts = 'http://openaccess.thecvf.com/'

# Crawl the data
alllist = []
for i in range(len(dts)):
    print('Paper ' + str(i))
    title = dts[i].a.text.strip()
    href = hts + dts[i].a['href']
    r = requests.get(href, headers=headers)
    content = r.content.decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    divabstract = soup.find(name='div', attrs={"id": "abstract"})
    abstract = divabstract.text.strip()
    alllink = soup.select('a')
    # Assumes the PDF link is the fifth <a> on the detail page;
    # [6:] drops the first six characters of its relative href
    link = hts + alllink[4]['href'][6:]
    # Use the words of the title as comma-separated keywords
    keywords = ','.join(title.split(' '))
    value = (title, abstract, link, keywords)
    alllist.append(value)
print(alllist)
tuplist = tuple(alllist)

# Save the data
db = pymysql.connect(host="localhost", user="root", password="fengge666", database="yiqing", charset='utf8')
cursor = db.cursor()
sql_cvpr = "INSERT INTO cvpr values (%s,%s,%s,%s)"
try:
    cursor.executemany(sql_cvpr, tuplist)
    db.commit()
except Exception as e:
    print('Insert failed, rolling back:', e)
    db.rollback()
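Splitting on single spaces keeps punctuation attached to words, so "Vision:" and "Vision" would later be counted as different keywords, and a title word ending in a comma would produce an empty keyword. The following is a minimal sketch of a slightly stricter split; the helper name `title_to_keywords` is hypothetical and not part of the crawler above.

```python
import string


def title_to_keywords(title: str) -> str:
    """Turn a paper title into a comma-separated keyword string."""
    words = []
    for raw in title.split():
        # Strip leading/trailing punctuation such as ":" or ",";
        # punctuation inside a word (e.g. a hyphen) is kept.
        word = raw.strip(string.punctuation)
        if word:
            words.append(word)
    return ",".join(words)


print(title_to_keywords("Deep Learning for 3D Vision: A Survey"))
```

In the crawler this would replace the `','.join(title.split(' '))` line, if that stricter behavior is wanted.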

 

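2. Displaying the data as a wordCloud

First, include the packages needed to render the word cloud:

<script src='https://cdn.bootcss.com/echarts/3.7.0/echarts.simple.js'></script>
<script src='js/echarts-wordcloud.js'></script>
<script src='js/echarts-wordcloud.min.js'></script>

The page then loads the JSON data from the back end asynchronously and displays it.

The data collected in step 1 has not been analyzed yet, so the DAO layer does the analysis: it counts how often each keyword occurs and sorts the counts in descending order (a Map is the most convenient structure for this).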

public Map<String,Integer> getallmax()
    {
        String sql="select * from cvpr";
        // keyword -> number of occurrences
        Map<String, Integer> map=new HashMap<String, Integer>();
        Connection con=null;
        Statement state=null;
        ResultSet rs=null;
        con=DBUtil.getConn();
        try {
            state=con.createStatement();
            rs=state.executeQuery(sql);
            while(rs.next())
            {
                String keywords=rs.getString("keywords");
                String[] split = keywords.split(",");
                for(int i=0;i<split.length;i++)
                {
                    // Count 1 for the first occurrence, then add 1 each time
                    map.merge(split[i], 1, Integer::sum);
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            DBUtil.close(rs, state, con);
        }
        // Sort by count in descending order; LinkedHashMap preserves that order
        // (requires static imports of Map.Entry.comparingByValue and Collectors.toMap)
        Map<String, Integer> sorted = map
                .entrySet()
                .stream()
                .sorted(Collections.reverseOrder(comparingByValue()))
                .collect(
                        toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e2,
                                LinkedHashMap::new));
        return sorted;
    }
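For comparison, the same count-and-sort step can be sketched in Python with `collections.Counter`. This is only an illustration of the idea, not the DAO code: the function name `count_keywords` and the sample rows are made up, and each row is assumed to be one comma-separated `keywords` value from the `cvpr` table.

```python
from collections import Counter


def count_keywords(keyword_rows):
    """Count keywords across rows and return (word, count) pairs, highest count first."""
    counts = Counter()
    for row in keyword_rows:
        # Skip empty strings that appear when a row contains ",,"
        counts.update(word for word in row.split(",") if word)
    return counts.most_common()


rows = ["Deep,Learning,for,Vision", "Learning,to,See", "Deep,Networks"]
for word, count in count_keywords(rows):
    print(word, count)
```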

 

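In the servlet layer the data still needs some filtering: prepositions, "a", and similar words should be removed, otherwise they would distort the keyword analysis. The top 30 remaining keywords are then sent to the front end for display.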

aracterEncoding("utf-8");
        Map<String, Integer>sortMap=dao.getallmax();
        // Words to ignore; keywords are compared in lower case
        Set<String> stopWords=new HashSet<String>(Arrays.asList("for","and","with","of","in","from","a","to","the","by"));
        JSONArray json =new JSONArray();
        int k=0;
        for (Map.Entry<String, Integer> entry : sortMap.entrySet()) 
        {
            if(stopWords.contains(entry.getKey().toLowerCase()))
                continue;
            JSONObject ob=new JSONObject();
            ob.put("name", entry.getKey());
            ob.put("value", entry.getValue());
            json.add(ob);
            k++;
            // Keep only the top 30 keywords
            if(k==30)
                break;
        }
        System.out.println(json.toString());
        response.getWriter().write(json.toString());
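The filtering step can be sketched in Python as follows. It takes an already sorted list of `(word, count)` pairs, skips stop words regardless of case, and produces the `name`/`value` objects that the word cloud expects. The names `STOP_WORDS` and `top_keywords` and the sample counts are assumptions made for this example.

```python
import json

# Stop words taken from the servlet filter, stored in lower case
STOP_WORDS = {"for", "and", "with", "of", "in", "from", "a", "to", "the", "by"}


def top_keywords(sorted_counts, limit=30):
    """Return up to `limit` {"name", "value"} dicts, skipping stop words."""
    result = []
    for word, count in sorted_counts:
        if len(result) >= limit:
            break
        if word.lower() in STOP_WORDS:
            continue
        result.append({"name": word, "value": count})
    return result


counts = [("for", 120), ("Learning", 90), ("A", 40), ("Deep", 35), ("Video", 20)]
print(json.dumps(top_keywords(counts, limit=2)))
```

Comparing in lower case means that "With" and "with" are both dropped without having to list each spelling separately.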

 

 

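3. Adding a click event that shows the papers and links for a keyword

// Register the click handler
var ecConfig = echarts.config;
myChart.on('click', eConsole);

The handler is a function: it sends the clicked keyword to the back end, which runs a fuzzy query to find the matching paper titles and links and returns them to the page.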

        function eConsole(param) {
            if (typeof param.seriesIndex == 'undefined') {
                return;
            }
            if (param.type == 'click') {
                var word=param.name;
                var htmltext="<table class='table table-striped' style='text-align:center'><caption style='text-align:center'>Paper titles and links</caption>";
                $.post(
                        'findkeytitle',
                        {'word':word},
                        function(result)
                        {
                            var json=JSON.parse(result);
                            for(var i=0;i<json.length;i++)
                            {
                                htmltext+="<tr><td><a target='_blank' href='"+json[i].Link+"'>"+json[i].Title+"</a></td></tr>";
                            }
                            htmltext+="</table>";
                            $("#show").html(htmltext);
                        }
                )
            }
       }
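The back-end side of `findkeytitle` is not shown in this post. As a rough sketch of what the fuzzy query could look like, the Python example below runs a `LIKE` query and returns the `Title` and `Link` fields that the handler reads. Everything here is an assumption for illustration: it uses an in-memory SQLite table instead of the MySQL table, the column names `title` and `link` are guessed, and the rows are made up.

```python
import json
import sqlite3


def find_papers_by_keyword(conn, word):
    """Return a JSON array of {"Title", "Link"} for titles containing `word`."""
    # Escape LIKE wildcards so the keyword is matched literally
    escaped = word.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT title, link FROM cvpr WHERE title LIKE ? ESCAPE '\\'",
        ("%" + escaped + "%",),
    ).fetchall()
    return json.dumps([{"Title": title, "Link": link} for title, link in rows])


# In-memory stand-in for the cvpr table, filled with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cvpr (title TEXT, abstract TEXT, link TEXT, keywords TEXT)")
conn.executemany(
    "INSERT INTO cvpr VALUES (?, ?, ?, ?)",
    [
        ("Example Paper on Depth Estimation", "...",
         "http://example.com/depth.pdf", "Example,Paper,on,Depth,Estimation"),
        ("Another Example on Segmentation", "...",
         "http://example.com/seg.pdf", "Another,Example,on,Segmentation"),
    ],
)
print(find_papers_by_keyword(conn, "Depth"))
```

Passing the keyword as a bound parameter, rather than concatenating it into the SQL string, keeps the query safe from SQL injection.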
 

 

 

Result:

 

 

Front-end page code:

<html>
    <head>
        <meta charset="utf-8">
        <link href="css/bootstrap.min.css" rel="stylesheet">
        <!-- jQuery (all Bootstrap JavaScript plugins depend on jQuery, so it must be loaded first) -->
        <script src="js/jquery-1.11.3.min.js"></script>
        <!-- Load all Bootstrap JavaScript plugins. You can also load individual plugins as needed. -->
        <script src="js/bootstrap.js"></script>
        <script src='https://cdn.bootcss.com/echarts/3.7.0/echarts.simple.js'></script>
        <script src='js/echarts-wordcloud.js'></script>
        <script src='js/echarts-wordcloud.min.js'></script>
    </head>
    <body>
        <style>
            body{
                background-color: black;
            }
            #main {
                width: 70%;
                height: 100%;
                margin: 0;
                float:right;
                background: black;
            }
            #show{
                overflow-x: auto;
                overflow-y: auto;
                width: 30%;
                height: 100%;
                float:left;
                margin-top:100px;
                padding-top:100px;
                background: pink;
            }
        </style>
        <div id='show'></div>
        <div id='main'></div>
    <script>
        $(function(){
            echartsCloud();
        });
        // Click handler
        function eConsole(param) {
            if (typeof param.seriesIndex == 'undefined') {
                return;
            }
            if (param.type == 'click') {
                var word=param.name;
                var htmltext="<table class='table table-striped' style='text-align:center'><caption style='text-align:center'>Paper titles and links</caption>";
                $.post(
                        'findkeytitle',
                        {'word':word},
                        function(result)
                        {
                            var json=JSON.parse(result);
                            for(var i=0;i<json.length;i++)
                            {
                                htmltext+="<tr><td><a target='_blank' href='"+json[i].Link+"'>"+json[i].Title+"</a></td></tr>";
                            }
                            htmltext+="</table>";
                            $("#show").html(htmltext);
                        }
                )
            }
       }
        function echartsCloud(){
            $.ajax({
                 url:"getmax",
                 type:"POST",
                 dataType:"JSON",
                 async:true,
                 success:function(data)
                 {
                     // Copy the name/value pairs returned by the servlet
                     var mydata = [];
                     for(var i=0;i<data.length;i++)
                     {
                         var d = {};
                         d["name"] = data[i].name;
                         d["value"] = data[i].value;
                         mydata.push(d);
                     }
                     var myChart = echarts.init(document.getElementById('main'));
                     // Register the click handler
                     var ecConfig = echarts.config;
                     myChart.on('click', eConsole);

                     myChart.setOption({
                         title: {
                             text: ''
                         },
                         tooltip: {},
                         series: [{
                             type : 'wordCloud',  // word cloud series
                                 shape:'smooth',  // smooth
                                 gridSize : 8, // grid size
                                 size : ['50%','50%'],
                                 //sizeRange : [ 50, 100 ],
                                 rotationRange : [-45, 0, 45, 90], // rotation range
                                 textStyle : {
                                     normal : {
                                         fontFamily:'微软雅黑',
                                         // Random color for each word
                                         color: function() {
                                             return 'rgb(' +
                                                 Math.round(Math.random() * 255) +
                                                 ', ' + Math.round(Math.random() * 255) +
                                                 ', ' + Math.round(Math.random() * 255) + ')'
                                         }
                                     },
                                     emphasis : {
                                         shadowBlur : 5,  // shadow blur
                                         shadowColor : '#333'  // shadow color
                                     }
                                 },
                                 left: 'center',
                                 top: 'center',
                                 right: null,
                                 bottom: null,
                                 width:'100%',
                                 height:'100%',
                                 data:mydata
                         }]
                     });
                 }
             });
        }
    </script>
    </body>
</html>

 


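Copyright notice: this is an original article by dazhi151, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://www.cnblogs.com/dazhi151/p/13040613.html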