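Crawling public Weibo user information to find out the real ages of the 逼粉 who listen to 李逼
I. What it does
We use a crawler to fetch the posts under #我们的叁叁肆#, then crawl each poster's profile page to get age, region, gender and similar fields, analyze the data, and present it visually.
Note: the profile information mentioned here is public Weibo information only and contains nothing private. No individual's personal information appears anywhere in this article. The data is used purely for learning and analysis; nobody may use this tutorial for commercial purposes, and anyone who does bears the consequences themselves!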
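II. Technical plan
Let us break the work into steps and name the tools for each:
1. Crawl the posts under #我们的叁叁肆#
2. For each post, crawl the basic information of its author
3. Save the information to a CSV file
4. Analyze the age and gender distribution of the users
5. Analyze which regions the 逼粉 come from
6. Use a word cloud to analyze the content of the chart-boosting (打榜) posts
For crawling we can use the requests library, for saving the CSV file the built-in csv module, and for visual analysis this time I want to introduce a really handy library, pyecharts. With the tools chosen, we can start building!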
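III. Crawling the super-topic posts
1. Find the URL that loads the super-topic data
Open the #我们的叁叁肆超话# page in Google Chrome, open the developer tools, switch to mobile device mode, filter the requests down to the asynchronous (XHR) ones, inspect the format of the returned data, and locate where the post content lives.
2. Simulate the request in code
With the link in hand we can simulate the request. We again use the familiar requests library; a few lines are enough to fetch the posts.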
import requests


def spider_topic():
    '''
    Crawl a Sina Weibo topic
    :return:
    '''
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    # Request headers copied from the browser request in mobile mode
    kv = {'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
          'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
          'Accept': 'application/json, text/plain, */*',
          'MWeibo-Pwa': '1',
          'Sec-Fetch-Mode': 'cors',
          'X-Requested-With': 'XMLHttpRequest',
          'X-XSRF-TOKEN': '4dd422'}
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        print(r.text)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    spider_topic()
Once we understand the structure of the data Weibo returns, we can extract the content and id of each post.
import json
import re

import requests


def spider_topic():
    '''
    Crawl a Sina Weibo topic
    :return:
    '''
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    kv = {'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
          'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
          'Accept': 'application/json, text/plain, */*',
          'MWeibo-Pwa': '1',
          'Sec-Fetch-Mode': 'cors',
          'X-Requested-With': 'XMLHttpRequest',
          'X-XSRF-TOKEN': '4dd422'}
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        # Parse the data
        r_json = json.loads(r.text)
        cards = r_json['data']['cards']
        # The cards of the first request include the topic header; later requests contain posts only
        card_group = cards[2]['card_group'] if len(cards) > 1 else cards[0]['card_group']
        for card in card_group:
            mblog = card['mblog']
            # Strip the HTML tags and keep the content
            sina_text = re.compile(r'<[^>]+>', re.S).sub(" ", mblog['text'])
            # Remove the useless leading topic name
            sina_text = sina_text.replace("我们的叁叁肆超话", '').strip()
            print(sina_text)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    spider_topic()
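The tag-stripping step is easy to try on its own, without touching the network. Below is a minimal sketch that wraps the same regular expression in a helper. The name `clean_weibo_text` and the sample string are made up for illustration; they are not part of the original script.

```python
import re

# Matches any HTML tag, e.g. <a href="...">, </a>, <br />
TAG_PATTERN = re.compile(r'<[^>]+>', re.S)


def clean_weibo_text(html_text, topic_name='我们的叁叁肆超话'):
    """Replace HTML tags with spaces, drop the topic name, trim whitespace."""
    text = TAG_PATTERN.sub(' ', html_text)
    return text.replace(topic_name, '').strip()


# Hypothetical sample shaped like the 'text' field of a post
sample = '<a href="/p/index"><span class="surl-text">我们的叁叁肆超话</span></a> 期待巡演<br />'
print(clean_weibo_text(sample))
```

Tags are replaced with a space rather than removed so that words separated only by a tag do not run together.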
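4. Crawling posts in bulk
Now that we can extract posts from a single response, we can crawl them in bulk. How? By paging, of course. And how do we page?
Tip for finding the paging parameter: compare the URLs of the first and second requests and look for the parameter that differs. A text comparison tool I can recommend for this is Beyond Compare.
Comparing the two request URLs shows that the second one carries one extra parameter, since_id, and this since_id is simply the id of a post.
Weibo's paging mechanism: paging is by time. Every post has an id that can serve as since_id, and the newer the post, the larger the id. Passing since_id in a request loads the posts in the topic whose ids are smaller than it. We then take the smallest id in that response, pass it as the next since_id, and repeat. That is how paging works.
Knowing this, we can define our paging strategy: use the smallest since_id among the posts returned by the previous request as the parameter of the next request. This amounts to crawling the data page by page in reverse chronological order.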
import json
import re

import requests

# Smallest post id seen so far; sent as since_id in the next request
min_since_id = None


def spider_topic():
    '''
    Crawl a Sina Weibo topic
    :return:
    '''
    global min_since_id
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_feed&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    kv = {'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
          'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
          'Accept': 'application/json, text/plain, */*',
          'MWeibo-Pwa': '1',
          'Sec-Fetch-Mode': 'cors',
          'X-Requested-With': 'XMLHttpRequest',
          'X-XSRF-TOKEN': '4dd422'}
    if min_since_id:
        url = url + '&since_id=' + str(min_since_id)
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        # Parse the data
        r_json = json.loads(r.text)
        cards = r_json['data']['cards']
        # The cards of the first request include the topic header; later requests contain posts only
        card_group = cards[2]['card_group'] if len(cards) > 1 else cards[0]['card_group']
        for card in card_group:
            mblog = card['mblog']
            r_since_id = mblog['id']
            # Strip the HTML tags and keep the content
            sina_text = re.compile(r'<[^>]+>', re.S).sub(" ", mblog['text'])
            # Remove the useless leading topic name
            sina_text = sina_text.replace("我们的叁叁肆超话", '').strip()
            print(sina_text)
            # Append the post to a text file
            with open('sansansi.txt', 'a+', encoding='utf-8') as f:
                f.write(sina_text + '\n')
            # Keep the smallest since_id for the next request.
            # Compare as integers: comparing the id strings directly is only
            # correct while all ids have the same length.
            if not min_since_id or int(r_since_id) < int(min_since_id):
                min_since_id = r_since_id
    except Exception as e:
        print(e)


if __name__ == '__main__':
    for i in range(1000):
        spider_topic()
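The paging loop can be rehearsed offline before pointing it at Weibo. The sketch below replaces the HTTP call with a stand-in: `fetch_page`, `FAKE_IDS` and `PAGE_SIZE` are all invented for the demonstration, and the only thing taken from the article is the rule "pass the smallest id as the next since_id".

```python
# Fake post ids, newest first; real ids come from mblog['id']
FAKE_IDS = ['109', '108', '107', '106', '105', '104', '103']
PAGE_SIZE = 3


def fetch_page(since_id=None):
    """Stand-in for the HTTP request: up to PAGE_SIZE ids smaller than since_id."""
    candidates = [i for i in FAKE_IDS if since_id is None or int(i) < int(since_id)]
    return candidates[:PAGE_SIZE]


def crawl_all():
    """Walk backwards in time by feeding the smallest id into the next request."""
    seen, since_id = [], None
    while True:
        page = fetch_page(since_id)
        if not page:
            break  # nothing older is left
        seen.extend(page)
        # Compare numerically, not as strings
        since_id = min(page, key=int)
    return seen


print(crawl_all())
```

Stopping when a page comes back empty is one reasonable alternative to a fixed `range(1000)` loop, since the crawl then ends as soon as the topic runs out of older posts.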
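IV. Crawling user information
With bulk post crawling done, we can start crawling user information!
First, note that the link to a user's basic-information page is https://weibo.cn/USER_ID/info, where USER_ID is the user's id.
So once we have a user's id we can fetch that user's public basic information!
1. Getting the user id
Looking back at the post data format we analyzed earlier, the user id we need is already in there!
So we can pick up the user id at the same time as we extract the post content!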
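To make the shape of that data concrete, here is a small sketch that pulls (post id, user id) pairs out of a response. The JSON below is a hypothetical, heavily trimmed sample that keeps only the keys used in this article (`data`, `cards`, `card_group`, `mblog`, `user`, `id`); real responses carry many more fields, and `extract_user_ids` is an illustrative name.

```python
import json

# Hypothetical, heavily trimmed response; only the keys used in this article are kept
SAMPLE_RESPONSE = '''
{"data": {"cards": [{"card_group": [
    {"mblog": {"id": "1001", "text": "first post", "user": {"id": 11}}},
    {"card_type": "other"},
    {"mblog": {"id": "1002", "text": "second post", "user": {"id": 22}}}
]}]}}
'''


def extract_user_ids(response_text):
    """Return (post id, user id) pairs, skipping cards that carry no post."""
    cards = json.loads(response_text)['data']['cards']
    pairs = []
    for card in cards[0]['card_group']:
        mblog = card.get('mblog')
        if mblog is None:
            continue  # this card is not a post
        pairs.append((mblog['id'], mblog['user']['id']))
    return pairs


print(extract_user_ids(SAMPLE_RESPONSE))
```

Using `card.get('mblog')` means a single card without a post is skipped, instead of an exception discarding the rest of the page.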
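2. Simulating login
Once we have the user id, requesting https://weibo.cn/USER_ID/info is all it takes to get the public information. Viewing another user's profile requires being logged in, though, so we first simulate a login in code!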
import json

import requests

# Smallest since_id of each request, used by the next request (Sina's paging mechanism)
min_since_id = ''
# Create a Session object to store cookies
s = requests.session()


def login_sina():
    """
    Log in to Sina Weibo
    :return: 1 if the login request went through, 0 if it failed
    """
    # Login URL
    login_url = 'https://passport.weibo.cn/sso/login'
    # Request headers
    headers = {'user-agent': 'Mozilla/5.0',
               'Referer': 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=https%3A%2F%2Fm.weibo.cn%2F'}
    # User name and password (replace the placeholders with your own)
    data = {'username': 'your username',
            'password': 'your password',
            'savestate': 1,
            'entry': 'mweibo',
            'mainpageflag': 1}
    try:
        r = s.post(login_url, headers=headers, data=data)
        r.raise_for_status()
    except Exception:
        print('Login request failed')
        return 0
    # Print the result of the request
    print(json.loads(r.text)['msg'])
    return 1
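For login we use a requests.Session() object. It stores cookies automatically and sends them along with later requests!
3. Crawling public user information
With the user id in hand and the session logged in, we can start crawling public user information!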
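import re


def spider_user_info(uid) -> list:
    '''
    Crawl a user's information (login required) and return the parsed basic fields
    :param uid: user id
    :return: ['user name', 'gender', 'region', 'birthday']
    '''
    # s is the logged-in Session created in the login step
    user_info_url = 'https://weibo.cn/%s/info' % uid
    kv = {
        'user-agent': 'Mozilla/5.0'
    }
    try:
        r = s.get(url=user_info_url, headers=kv)
        r.raise_for_status()
        # Extract the basic-information block with a regular expression
        basic_info_html = re.findall('<div class="tip">基本信息</div><div class="c">(.*?)</div>', r.text)
        # Extract: user name, gender, region, birthday
        basic_infos = get_basic_info_list(basic_info_html)
        return basic_infos
    except Exception as e:
        print(e)
        # Return an empty list so the caller can detect the incomplete record
        return []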
Of the public information we only need the user name, gender, region and birthday, so we have to extract these fields!
def get_basic_info_list(basic_info_html) -> list:
    '''
    Parse the HTML and extract the fields we need
    :param basic_info_html: list of matches returned by re.findall
    :return: ['user name', 'gender', 'region', 'birthday']
    '''
    basic_infos = []
    basic_info_kvs = basic_info_html[0].split('<br/>')
    for basic_info_kv in basic_info_kvs:
        if basic_info_kv.startswith('昵称'):
            basic_infos.append(basic_info_kv.split(':')[1])
        elif basic_info_kv.startswith('性别'):
            basic_infos.append(basic_info_kv.split(':')[1])
        elif basic_info_kv.startswith('地区'):
            area = basic_info_kv.split(':')[1]
            # If the region is "其他" (other) or "海外" (overseas), store an empty string
            if '其他' in area or '海外' in area:
                basic_infos.append('')
                continue
            # Keep only the province, e.g. "江苏 南京" becomes "江苏"
            if ' ' in area:
                area = area.split(' ')[0]
            basic_infos.append(area)
        elif basic_info_kv.startswith('生日'):
            birthday = basic_info_kv.split(':')[1]
            # Only birthdays that include a year are usable
            if birthday.startswith('19') or birthday.startswith('20'):
                # Keep the first three digits of the year, i.e. the decade
                basic_infos.append(birthday[:3])
            else:
                basic_infos.append("")
        else:
            pass
    # Some users have no birthday; append an empty string in that case
    if len(basic_infos) < 4:
        basic_infos.append("")
    return basic_infos
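The parsing rules are easier to check against a fixed string than against a live page. The sketch below is a dictionary-based variant of the same idea. `SAMPLE` is a hypothetical fragment shaped like the inside of the "基本信息" block, `parse_basic_info` is an illustrative name, and the `:` separator is assumed to be the same one the code above splits on.

```python
# Hypothetical fragment shaped like the inside of the "基本信息" block
SAMPLE = '昵称:demo_user<br/>性别:男<br/>地区:江苏 南京<br/>生日:1995-06-01<br/>简介:hello'


def parse_basic_info(fragment):
    """Map each 'label:value' pair to a dict, then pick four fields in a fixed order."""
    fields = {}
    for item in fragment.split('<br/>'):
        label, sep, value = item.partition(':')
        if sep:
            fields[label] = value
    region = fields.get('地区', '')
    if '其他' in region or '海外' in region:
        region = ''
    birthday = fields.get('生日', '')
    # Keep the decade only when the birthday starts with a plausible year
    decade = birthday[:3] if birthday.startswith(('19', '20')) else ''
    return [fields.get('昵称', ''), fields.get('性别', ''), region.split(' ')[0], decade]


print(parse_basic_info(SAMPLE))
```

Looking fields up by label keeps every value in its own column even when a user has left one of the earlier fields blank, which appending in page order cannot guarantee.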
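Do not crawl user information too frequently, or requests will start failing (response status code 418). In my experience your IP does not get banned, though. Many big companies are actually reluctant to ban IPs because it is too easy to hit innocent users: one ban may take out a whole residential complex or more!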
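One possible way to react to a 418 is to wait and retry with growing pauses. To be clear, exponential backoff is a suggestion here, not something the original script does; the script simply sleeps a random 3 to 6 seconds. In the sketch, `fetch` is any zero-argument callable standing in for a real HTTP request, and the delays are arbitrary.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch() until it stops returning 418, waiting longer after each failure.

    fetch returns a (status_code, body) tuple; it stands in for a real request.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 418:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 5s, 10s, 20s, ...
    return status, body


# Offline demonstration: a fake endpoint that fails twice, then succeeds.
# Passing waits.append as sleep records the delays instead of really waiting.
responses = iter([(418, ''), (418, ''), (200, 'ok')])
waits = []
print(fetch_with_backoff(lambda: next(responses), sleep=waits.append), waits)
```

Making `sleep` a parameter keeps the helper testable without actually pausing.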
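V. Saving to a CSV file
We now have both the posts and the user information, so let us save the data to make the later analysis easier!
Until now we always saved plain txt files, because there was only one field to store. This time there are several (post content, user name, region, age, gender and so on), so we choose the CSV (Comma Separated Values) format!
We build a list, put the fields into it in order, and write it to the CSV file as one row!
Running the script then prints its progress as it goes.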
import csv
import json
import os
import random
import re
import time

import requests

# Smallest since_id of each request, used by the next request (Sina's paging mechanism)
min_since_id = ''
# Create a Session object to store cookies
s = requests.session()
# File that stores the Sina topic data
CSV_FILE_PATH = 'sina_topic.csv'


def login_sina():
    """
    Log in to Sina Weibo
    :return: 1 if the login request went through, 0 if it failed
    """
    # Login URL
    login_url = 'https://passport.weibo.cn/sso/login'
    # Request headers
    headers = {'user-agent': 'Mozilla/5.0',
               'Referer': 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=https%3A%2F%2Fm.weibo.cn%2F'}
    # User name and password (replace the placeholders with your own)
    data = {'username': 'your username',
            'password': 'your password',
            'savestate': 1,
            'entry': 'mweibo',
            'mainpageflag': 1}
    try:
        r = s.post(login_url, headers=headers, data=data)
        r.raise_for_status()
    except Exception:
        print('Login request failed')
        return 0
    # Print the result of the request
    print(json.loads(r.text)['msg'])
    return 1


def spider_topic():
    '''
    Crawl a Sina Weibo topic
    :return:
    '''
    global min_since_id
    url = 'https://m.weibo.cn/api/container/getIndex?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_feed&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main'
    kv = {'Referer': 'https://m.weibo.cn/p/index?containerid=100808143f5419c464669aa3ec977bbeb21eeb_-_soul&luicode=10000011&lfid=100808143f5419c464669aa3ec977bbeb21eeb_-_main',
          'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36',
          'Accept': 'application/json, text/plain, */*',
          'MWeibo-Pwa': '1',
          'Sec-Fetch-Mode': 'cors',
          'X-Requested-With': 'XMLHttpRequest',
          'X-XSRF-TOKEN': '4dd422'}
    if min_since_id:
        url = url + '&since_id=' + str(min_since_id)
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        # Parse the data
        r_json = json.loads(r.text)
        cards = r_json['data']['cards']
        # The cards of the first request include the topic header; later requests contain posts only
        card_group = cards[2]['card_group'] if len(cards) > 1 else cards[0]['card_group']
        for card in card_group:
            # List that collects the fields of one row before it is written to the CSV file
            sina_columns = []
            mblog = card['mblog']
            # Parse the user information
            user = mblog['user']
            # Crawl the user information; Weibo has anti-crawling measures and returns 418 if we go too fast
            try:
                basic_infos = spider_user_info(user['id'])
                sina_columns.append(user['id'])  # user id first
                sina_columns.extend(basic_infos)  # then the user's basic information
            except Exception as e:
                print(e)
                continue
            # Parse the post content
            r_since_id = mblog['id']
            # Strip the HTML tags and keep the content
            sina_text = re.compile(r'<[^>]+>', re.S).sub(" ", mblog['text'])
            # Remove the useless leading topic name
            sina_text = sina_text.replace("我们的叁叁肆超话", '').strip()
            # Add the post id and content to the list
            sina_columns.append(r_since_id)
            sina_columns.append(sina_text)
            # Check that the row is complete
            # Format of sina_columns: ['user id', 'user name', 'gender', 'region', 'birthday', 'post id', 'post content']
            if len(sina_columns) < 7:
                print('----- the previous record is incomplete -----')
                continue
            # Save the data
            save_columns_to_csv(sina_columns)
            # Keep the smallest since_id for the next request (compared as integers)
            if not min_since_id or int(r_since_id) < int(min_since_id):
                min_since_id = r_since_id
            # Pause between users to avoid the 418 responses
            time.sleep(random.randint(3, 6))
    except Exception as e:
        print(e)


def spider_user_info(uid) -> list:
    '''
    Crawl a user's information (login required) and return the parsed basic fields
    :param uid: user id
    :return: ['user name', 'gender', 'region', 'birthday']
    '''
    user_info_url = 'https://weibo.cn/%s/info' % uid
    kv = {
        'user-agent': 'Mozilla/5.0'
    }
    try:
        r = s.get(url=user_info_url, headers=kv)
        r.raise_for_status()
        # Extract the basic-information block with a regular expression
        basic_info_html = re.findall('<div class="tip">基本信息</div><div class="c">(.*?)</div>', r.text)
        # Extract: user name, gender, region, birthday
        basic_infos = get_basic_info_list(basic_info_html)
        return basic_infos
    except Exception as e:
        print(e)
        # Return an empty list so the caller sees an incomplete record
        return []


def get_basic_info_list(basic_info_html) -> list:
    '''
    Parse the HTML and extract the fields we need
    :param basic_info_html: list of matches returned by re.findall
    :return: ['user name', 'gender', 'region', 'birthday']
    '''
    basic_infos = []
    basic_info_kvs = basic_info_html[0].split('<br/>')
    for basic_info_kv in basic_info_kvs:
        if basic_info_kv.startswith('昵称'):
            basic_infos.append(basic_info_kv.split(':')[1])
        elif basic_info_kv.startswith('性别'):
            basic_infos.append(basic_info_kv.split(':')[1])
        elif basic_info_kv.startswith('地区'):
            area = basic_info_kv.split(':')[1]
            # If the region is "其他" (other) or "海外" (overseas), store an empty string
            if '其他' in area or '海外' in area:
                basic_infos.append('')
                continue
            # Keep only the province
            if ' ' in area:
                area = area.split(' ')[0]
            basic_infos.append(area)
        elif basic_info_kv.startswith('生日'):
            birthday = basic_info_kv.split(':')[1]
            # Only birthdays that include a year are usable
            if birthday.startswith('19') or birthday.startswith('20'):
                # Keep the first three digits of the year, i.e. the decade
                basic_infos.append(birthday[:3])
            else:
                basic_infos.append("")
        else:
            pass
    # Some users have no birthday; append an empty string in that case
    if len(basic_infos) < 4:
        basic_infos.append("")
    return basic_infos


def save_columns_to_csv(columns, encoding='utf-8'):
    # newline='' keeps the csv module from writing blank lines between rows on Windows
    with open(CSV_FILE_PATH, 'a', encoding=encoding, newline='') as f:
        writer = csv.writer(f)
        writer.writerow(columns)


def path_spider_topic():
    # Log in first; do not crawl if the login fails
    if not login_sina():
        return
    # Clear the previous data before writing
    if os.path.exists(CSV_FILE_PATH):
        os.remove(CSV_FILE_PATH)
    # Crawl in bulk
    for i in range(25):
        print('Page %d' % (i + 1))
        spider_topic()


if __name__ == '__main__':
    path_spider_topic()
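Take a look at the generated CSV file. Note that opening it in WPS or Excel may show garbled text: we wrote the file as UTF-8 without a byte-order mark, and these programs often assume a legacy encoding such as GBK for such files. An ordinary text editor opens it fine, and so does PyCharm!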
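If you do want to open the file in a spreadsheet program, one common workaround is to write it with Python's `utf-8-sig` codec, which puts a byte-order mark at the start of the file so the encoding can be detected. This is a minimal sketch, not part of the original script: `save_rows_for_excel` is an illustrative helper, and the row is made-up sample data in the article's column order.

```python
import csv
import os
import tempfile


def save_rows_for_excel(path, rows):
    """Write rows as UTF-8 with a BOM so spreadsheet programs can detect the encoding."""
    # newline='' stops the csv module from emitting blank lines on Windows
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        csv.writer(f).writerows(rows)


demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
save_rows_for_excel(demo_path, [['1', 'demo_user', '男', '江苏', '199', '1001', '期待巡演']])
with open(demo_path, 'rb') as f:
    print(f.read()[:3])  # the three BOM bytes
```

When reading such a file back in Python, open it with `encoding='utf-8-sig'` as well, otherwise the mark ends up glued to the first field. Also note that the mark belongs only at the very start of a file, so this fits a file written in one go better than one built up by repeated appends.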
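VI. Data analysis
With the data saved we can start analyzing it. First, which data do we want to analyze?
- Turn the gender data into a pie chart: simple and intuitive
- Turn the age data into a bar chart, which makes comparison easy
- Turn the region data into a heat map of China to see where fans are most active
- Finally, turn the post content into a word cloud to see at a glance what everyone is talking about
1. Reading a column of the CSV file
Our saved data consists of many rows in the format 'user id', 'user name', 'gender', 'region', 'birthday', 'post id', 'post content'. For the analysis we need one specific column at a time, for example the gender column, so we wrap that in a function that reads a given column!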
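import collections
import csv

# Same file the crawler writes to
CSV_FILE_PATH = 'sina_topic.csv'


def read_csv_to_dict(index) -> dict:
    """
    Read one column of the CSV data
    Row format: 'user id', 'user name', 'gender', 'region', 'birthday', 'post id', 'post content'
    :param index: column to read, starting from 0
    :return: dict with the column value as key and its number of occurrences as value
    """
    with open(CSV_FILE_PATH, 'r', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        # Skip blank or short rows so that indexing cannot fail
        column = [columns[index] for columns in reader if len(columns) > index]
        dic = collections.Counter(column)
        # Remove the empty string
        if '' in dic:
            dic.pop('')
        print(dic)
        return dic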
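The counting step can be tried without a file on disk by feeding `csv.reader` an in-memory string. The rows below are invented sample data in the article's column order, and `count_column` is an illustrative stand-in for `read_csv_to_dict`.

```python
import collections
import csv
import io

# Hypothetical rows in the article's column order
SAMPLE_CSV = (
    '1,user_a,男,江苏,199,1001,text a\n'
    '2,user_b,女,北京,199,1002,text b\n'
    '3,user_c,男,,198,1003,text c\n'
)


def count_column(csv_text, index):
    """Count the values in one column, ignoring short rows and empty strings."""
    reader = csv.reader(io.StringIO(csv_text))
    counts = collections.Counter(row[index] for row in reader if len(row) > index)
    counts.pop('', None)  # drop the empty-string bucket if present
    return counts


print(count_column(SAMPLE_CSV, 2))  # gender column
print(count_column(SAMPLE_CSV, 3))  # region column
```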
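2. The pyecharts visualization library
Before we analyze anything, there is one important thing to do: choose a suitable visualization library! As everyone knows, Python has a great many of them. So far we have been using matplotlib for word clouds, and matplotlib is very convenient for simple plots. Today, however, we need a nationwide distribution map, so after comparing the options 猪哥 picked pyecharts, a library developed in China. The reasons: it is free and open source, well documented, rich in chart types, and the code is concise. In one word: a joy to use!
- Website: https://pyecharts.org/#/
- Source: https://github.com/pyecharts/pyecharts
- Install: pip install pyecharts
3. Gender analysis
With the library chosen, let us put it to use!
Additional note: if the generated CSV file has blank gaps between records, remove them before running the analysis.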
from pyecharts import options as opts
from pyecharts.charts import Pie


def analysis_gender():
    """
    Analyze gender
    :return:
    """
    # Read the gender column
    dic = read_csv_to_dict(2)
    # Build a two-dimensional list of [gender, count] pairs
    gender_count_list = [list(z) for z in zip(dic.keys(), dic.values())]
    print(gender_count_list)
    pie = (
        Pie()
        .add("", gender_count_list)
        .set_colors(["red", "blue"])
        # Chart title: "gender analysis"
        .set_global_opts(title_opts=opts.TitleOpts(title="性别分析"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    pie.render('gender.html')
Why is the output HTML? Because the chart is interactive: you can click to choose what is displayed, which is very user friendly! Running the function produces a gender.html file that you can simply open in a browser!
[gender.html: pie chart of the gender distribution. 女 (female): 111, 男 (male): 160]
The chart shows that female 逼粉 are somewhat fewer than male 逼粉.
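To put a number on "somewhat fewer", the two counts from the chart data (111 and 160) can be turned into percentages. `shares` is just an illustrative helper.

```python
def shares(counts):
    """Convert raw counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


# Counts taken from the gender chart data
print(shares({'女': 111, '男': 160}))
```

So of the 271 users with a known gender, roughly 41% are female and 59% are male.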
4. Age analysis
This is the part everyone cares about most: how old are the 逼粉?
from pyecharts import options as opts
from pyecharts.charts import Bar, Pie


def analysis_age():
    """
    Analyze age
    :return:
    """
    # Column 4 holds the first three digits of the birth year, e.g. '199'
    dic = read_csv_to_dict(4)
    # Build the bar chart, sorted by birth-year prefix
    sorted_dic = {}
    for key in sorted(dic):
        sorted_dic[key] = dic[key]
    print(sorted_dic)
    bar = (
        Bar()
        .add_xaxis(list(sorted_dic.keys()))
        # Series name: "age analysis of 李逼's listeners"
        .add_yaxis("李逼听众年龄分析", list(sorted_dic.values()))
        .set_global_opts(
            yaxis_opts=opts.AxisOpts(name="数量"),  # "count"
            xaxis_opts=opts.AxisOpts(name="年龄"),  # "age"
        )
    )
    bar.render('age-bar.html')
    # Build the pie chart
    age_count_list = [list(z) for z in zip(dic.keys(), dic.values())]
    pie = (
        Pie()
        .add("", age_count_list)
        .set_global_opts(title_opts=opts.TitleOpts(title="李逼听众年龄分析"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    pie.render('age-pie.html')
[age-pie.html: pie chart of users by birth-year prefix. 199: 124, 200: 7, 198: 13, 201: 4, 190: 3, 197: 2]
[age-bar.html: bar chart of the same data, x axis "年龄" (age) with values 190, 197, 198, 199, 200, 201, y axis "数量" (count) with values 3, 2, 13, 124, 7, 4]
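The keys in these charts are the first three digits of the birth year, so '199' stands for people born in the 1990s. A small sketch can make the numbers easier to read by attaching a decade label and a share to each count. The helper names are illustrative; the counts are the ones in the chart data.

```python
def decade_label(prefix):
    """Turn a three-digit birth-year prefix such as '199' into '1990s'."""
    return prefix + '0s'


def summarize_decades(counts):
    """Return (label, count, share in %) tuples sorted by decade."""
    total = sum(counts.values())
    return [
        (decade_label(prefix), count, round(100 * count / total, 1))
        for prefix, count in sorted(counts.items())
    ]


# Counts taken from the age chart data
AGE_COUNTS = {'199': 124, '200': 7, '198': 13, '201': 4, '190': 3, '197': 2}
for row in summarize_decades(AGE_COUNTS):
    print(row)
```

By this count, 124 of the 153 users with a usable birthday, about 81%, were born in the 1990s. The handful of '190' entries would mean birth years of 1900 to 1909, which are most likely placeholder birthdays rather than real ones.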
5. Region analysis
from pyecharts import options as opts
from pyecharts.charts import Map


def analysis_area():
    """
    Analyze region
    :return:
    """
    dic = read_csv_to_dict(3)
    area_count_list = [list(z) for z in zip(dic.keys(), dic.values())]
    print(area_count_list)
    # Named area_map to avoid shadowing the built-in map()
    area_map = (
        Map()
        # Series name: "region analysis of 李逼's listeners"
        .add("李逼听众地区分析", area_count_list, "china")
        .set_global_opts(
            visualmap_opts=opts.VisualMapOpts(max_=200),
        )
    )
    area_map.render('area.html')
[area.html: map of China shaded by the number of users per region. 江苏 38, 北京 27, 山东 26, 浙江 21, 陕西 19, 云南 16, 河南 14, 湖南 11, 辽宁 7, 上海 6, 四川 6, 湖北 6, 重庆 6, 甘肃 5, 江西 4, 内蒙古 4, 广东 4, 河北 3, 山西 3, 贵州 2, 安徽 2, 广西 2, 福建 2, 天津 1]
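One detail worth a look is the colour scale. The visual map runs from 0 to 200 while the largest region count in the data is 38, so every province lands in the lowest part of the scale. A possible tweak, offered only as a suggestion, is to derive the upper bound from the data. `visual_max` is a hypothetical helper; its result would be passed as `max_` to `opts.VisualMapOpts`.

```python
def visual_max(counts, step=10):
    """Round the largest count up to the next multiple of step."""
    largest = max(counts.values())
    return ((largest + step - 1) // step) * step


# A few of the region counts from the map data
REGION_COUNTS = {'江苏': 38, '北京': 27, '山东': 26, '浙江': 21, '天津': 1}
print(visual_max(REGION_COUNTS))
```

With the sample counts this gives 40, which spreads the colours across the regions that actually appear.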
6. Content analysis
import jieba.analyse
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType


def analysis_sina_content():
    """
    Analyze the post content
    :return:
    """
    # Read the post content column
    dic = read_csv_to_dict(6)
    # Data cleaning: drop meaningless words.
    # STOP_WORDS_FILE_PATH is the path to a stop-word file; its value is not shown in this article
    jieba.analyse.set_stop_words(STOP_WORDS_FILE_PATH)
    # Extract the top 50 keywords with their weights
    words_count_list = jieba.analyse.textrank(' '.join(dic.keys()), topK=50, withWeight=True)
    print(words_count_list)
    # Build the word cloud
    word_cloud = (
        WordCloud()
        .add("", words_count_list, word_size_range=[20, 100], shape=SymbolType.DIAMOND)
        # Chart title: "content analysis of the 叁叁肆 super-topic posts"
        .set_global_opts(title_opts=opts.TitleOpts(title="叁叁肆超话微博内容分析"))
    )
    word_cloud.render('word_cloud.html')
[word_cloud.html: diamond-shaped word cloud of the 50 highest-weighted keywords. The top ten with rounded weights: 巡演 1.00, 喜欢 0.92, 南京 0.88, 乐队 0.76, 没有 0.64, 现场 0.56, 歌词 0.49, 备忘录 0.47, 云南 0.44, 分享 0.43]
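To see what the stop-word step buys us, here is a deliberately simplified sketch that only uses the standard library. It is plain frequency counting on tokens that are assumed to be segmented already; it is not the TextRank algorithm that `jieba.analyse.textrank` implements, and the token list and stop words are invented.

```python
import collections


def top_words(tokens, stop_words, top_k=3):
    """Count tokens that are not stop words and return the top_k most common."""
    kept = [t for t in tokens if t not in stop_words and len(t) > 1]
    return collections.Counter(kept).most_common(top_k)


# Hypothetical, already-segmented tokens; the article uses jieba for segmentation
TOKENS = ['巡演', '的', '南京', '巡演', '了', '喜欢', '南京', '巡演', '还有']
STOP_WORDS = {'的', '了', '还有'}
print(top_words(TOKENS, STOP_WORDS))
```

Words such as 没有 and 还有 made it into the real word cloud, so extending the stop-word file is probably the cheapest way to make that chart more informative.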