In [2]:
%load_ext sql
In [3]:
%%sql 
postgresql://postgres:postgres@localhost/hw8
Out[3]:
'Connected: postgres@hw8'
In [4]:
%%sql 
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'utf-8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;
Done.
Done.
Done.
Done.
Done.
Done.
Out[4]:
[]
In [7]:
%%sql 

drop table if exists weibo;
CREATE TABLE weibo (
    id bigint not null primary key,
	weibo_id text,
    daytime int,
    name text,
    content text,
	day int,
	month int,
	sentiments float,
	keyword0 text,
	keyword1 text,
	keyword2 text,
	keyword3 text
);


copy weibo from  'D:\\works\\GIS\\data-analyse\\dataSets\\nCoV_total_p2.csv' CSV HEADER;
 * postgresql://postgres:***@localhost/hw8
Done.
Done.
1010000 rows affected.
Out[7]:
[]

微博舆情分析与数据可视化

数据集基本情况

数据集为依据与“新冠肺炎”相关的230个主题关键词进行随机数据采集的2020年1月1日—2020年2月20日期间共计100万条微博数据

数据集部分内容展示:

其中每列的含义:

  • id: 微博序号,主键
  • weibo_id: 微博id
  • daytime: 微博发布时间, 从0000-2359
  • name:发布人账号
  • content:微博中文内容
  • day:日期
  • month:月份
  • sentiments:使用snowNLP分析得出的情感数值,范围为-0.5 ~ 0.5,大于0为正面情感,小于0为负面情感;
  • keyword0-4:使用jieba分词提取的主题关键词,取前四位;
In [8]:
%sql select * from weibo limit 5;
 * postgresql://postgres:***@localhost/hw8
5 rows affected.
Out[8]:
id weibo_id daytime name content day month sentiments keyword0 keyword1 keyword2 keyword3
0 4456068992182160 2338 -精緻的豬豬女戰士- #你好2020#新年第一天元气满满的早起出门买早饭结果高估了自己抗冻能力回家成功冻发烧(大概是想告诉我2020要量力而行)然鹅这并不影响后续计划一出门立马生龙活虎新年和新??更配哦??看了误杀吃了大餐就让新的一年一直这样美滋滋下去吧??? 1 1 -0.439334027065442 2020 新年 出门 然鹅
1 4456424178427250 2309 liujunyi88 大宝又感冒鼻塞咳嗽了,还有发烧。队友加班几天不回。感觉自己的情绪在家已然是随时引爆的状态。情绪一上来,容易对孩子说出自己都想不到的话来……2020年,真的要学会控制情绪,管理好家人健康。这是今年最大的目标。? 2 1 0.4999868041643475 控制情绪 情绪 2020 大宝
2 4456797466940200 2353 ablsa 还要去输两天液,这天也太容易感冒发烧了,一定要多喝热水啊? 3 1 -0.4940870356172161 发烧 热水 感冒 这天
3 4456791021108920 2327 喵吃鱼干Lynn 我太难了别人怎么发烧都没事就我一检查甲型流感? 3 1 -0.20655523503009487 我太难 发烧 没事 甲型
4 4457086404997440 1901 我的发小今年必脱单 果然是要病一场的喽回来第三天开始感冒今儿还发烧了喉咙眼睛都难受的一匹怎么样能不经意让我的毕设导师看到这条微博并给我放一天假呢? 4 1 0.1392079495868671 毕设 微博 不经意 导师

微博数量随时间变化

In [16]:
x = %sql select month, day, count(*) from weibo group by day,month order by month,day;
%matplotlib inline
count = len(x)
print(count)
x.bar()
 * postgresql://postgres:***@localhost/hw8
50 rows affected.
50
Out[16]:
<BarContainer object of 50 artists>

该数据集共包含了从1.1至2.19共50天的数据,从图中也可以看出:

  • 在1.1至1.20日前后,与疫情相关的微博数量约为10000条/天;
  • 在1.22前后有一个明显的高峰,对应于新冠肺炎的舆论开始在互联网上发酵的阶段;
  • 此后至2.19,有关新冠肺炎的微博数量约为25000条-30000条/天,并在缓慢增多

查看是否有微博用户的多篇文章在数据集中;

In [21]:
%%sql
select weibo_id,count(*) c from weibo group by weibo_id order by c desc limit 5;
 * postgresql://postgres:***@localhost/hw8
5 rows affected.
Out[21]:
weibo_id c
4464409348685640 3
4457086085339600 3
4455877316918170 2
4455826020501410 2
4455748258222440 2

该数据集中对于大多数用户只采集了一条或两条微博,因此无法对单一用户的多条微博进行分析;

微博关键词分析

统计所有微博中主题关键词出现的数量,按数量从高到低排序:

对处理过的content进行分词;

In [26]:
%%sql
select k0.keyword, k0.c+k1.c+k2.c+k3.c as c from 
(select keyword0 keyword,count(*) c from weibo group by keyword0) k0,
(select keyword1 keyword,count(*) c from weibo group by keyword1) k1,
(select keyword2 keyword,count(*) c from weibo group by keyword2) k2,
(select keyword3 keyword,count(*) c from weibo group by keyword3) k3
where k0.keyword = k1.keyword and k2.keyword = k1.keyword and k2.keyword = k3.keyword and k1.keyword != '##'
order by c desc limit 10;
10 rows affected.
Out[26]:
keyword c
冠状病毒 59028
疫情 51060
肺炎 49530
加油 41712
武汉 40002
微博 29262
口罩 28808
新冠 24207
2020 21340
网页 20538
In [48]:
query = """
select k0.keyword, k0.c+k1.c+k2.c+k3.c as c from 
(select keyword0 keyword,count(*) c from weibo group by keyword0) k0,
(select keyword1 keyword,count(*) c from weibo group by keyword1) k1,
(select keyword2 keyword,count(*) c from weibo group by keyword2) k2,
(select keyword3 keyword,count(*) c from weibo group by keyword3) k3
where k0.keyword = k1.keyword and k2.keyword = k1.keyword and k2.keyword = k3.keyword and k1.keyword != '##'
order by c desc limit 15;
"""
result = %sql $query
result.bar()
15 rows affected.
Out[48]:
<BarContainer object of 15 artists>

对所有主题关键词,取词频数为前100位生成词云图;

In [10]:
query = """
select k0.keyword, k0.c+k1.c+k2.c+k3.c as c from 
(select keyword0 keyword,count(*) c from weibo group by keyword0) k0,
(select keyword1 keyword,count(*) c from weibo group by keyword1) k1,
(select keyword2 keyword,count(*) c from weibo group by keyword2) k2,
(select keyword3 keyword,count(*) c from weibo group by keyword3) k3
where k0.keyword = k1.keyword and k2.keyword = k1.keyword and k2.keyword = k3.keyword and k1.keyword != '##'
order by c desc limit 100;
"""
result = %sql $query

from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType
from pyecharts.globals import CurrentConfig, NotebookType

CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_NOTEBOOK


words = list(result)
c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.ROUND_RECT)
        .set_global_opts(title_opts=opts.TitleOpts(title='全国新型冠状病毒疫情词云图'))
    )
c.render_notebook()
100 rows affected.
Out[10]:

对微博数据进行分析,统计所有词频;

In [92]:
import jieba
from collections import Counter

total_data = %sql select content from weibo;
cut_words = ""
all_words = ""
i = 0
for line in total_data:
    line = str(line)
    i=i+1
    seg_list = jieba.cut(line,cut_all=False)
    cut_words = (" ".join(seg_list))
    all_words += cut_words
    if i%10000 == 0:
        print(str(i)+' ',end='')
all_words = all_words.split()

c = Counter()
for x in all_words:
    if len(x)>1 and x != '\r\n':
        c[x] += 1
1010000 rows affected.
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000 110000 120000 130000 140000 150000 160000 170000 180000 190000 200000 210000 220000 230000 240000 250000 260000 270000 280000 290000 300000 310000 320000 330000 340000 350000 360000 370000 380000 390000 400000 410000 420000 430000 440000 450000 460000 470000 480000 490000 500000 510000 520000 530000 540000 550000 560000 570000 580000 590000 600000 610000 620000 630000 640000 650000 660000 670000 680000 690000 700000 710000 720000 730000 740000 750000 760000 770000 780000 790000 800000 810000 820000 830000 840000 850000 860000 870000 880000 890000 900000 910000 920000 930000 940000 950000 960000 970000 980000 990000 1000000 1010000 
In [102]:
words = []
for (k,v) in c.most_common(5000):
    # print(k, v)
    words.append((k,v))
words = words[1:]
In [2]:
import pandas as pd
words = pd.read_csv('词频前5000.csv')
In [96]:
all_word = (" ".join(all_words))
print(words)
[('疫情', 264740), ('武汉', 229667), ('肺炎', 224960), ('展开', 216745), ('全文', 216276), ('新型', 184562), ('冠状病毒', 137011), ('我们', 129709), ('加油', 123246), ('##', 117351), ('口罩', 115913), ('感染', 112794), ('医院', 107979), ('确诊', 106765), ('中国', 101060), ('病例', 100246), ('病毒', 99095), ('患者', 94574), ('没有', 90569), ('防控', 89573), ('自己', 81245), ('一个', 81225), ('微博', 75482), ('视频', 73078), ('可以', 69695), ('这个', 69277), ('新冠', 64804), ('工作', 64773), ('链接', 64081), ('一起', 60989), ('网页', 60956), ('大家', 60679), ('湖北', 58731), ('真的', 57658), ('医生', 57655), ('就是', 57163), ('隔离', 53838), ('希望', 53808), ('2020', 53516), ('今天', 53186), ('现在', 52990), ('什么', 51651), ('致敬', 47289), ('不是', 45965), ('国家', 45122), ('行动', 43875), ('他们', 43840), ('全国', 43834), ('一线', 43686)]
In [98]:
import jieba.analyse
# TF-IDF
keywords = jieba.analyse.extract_tags(all_word,
                                      topK=50,
                                      withWeight=True,
                                      allowPOS=('a','e','n','nr','ns', 'v')) #词性 形容词 叹词 名词 动词
[('疫情', 0.11912294859704146), ('肺炎', 0.11055183921684136), ('全文', 0.09896922233781136), ('武汉', 0.08653435069252709), ('展开', 0.07282289241826259), ('加油', 0.060396253662755174), ('口罩', 0.057737899296916), ('确诊', 0.047917340409658435), ('新冠', 0.045060153220399156), ('病例', 0.043112362196566556), ('感染', 0.04278686859643067), ('病毒', 0.04090026875872963), ('医院', 0.03522556557504731), ('患者', 0.033405320917091134), ('视频', 0.03276271976484176), ('网页', 0.032327948806561826), ('链接', 0.031009521739956396), ('致敬', 0.025251712758224455), ('隔离', 0.02448653098084746), ('医护人员', 0.022201996864772915), ('湖北', 0.02183352867944201), ('医生', 0.020275206438533633), ('打卡', 0.01906030403713307), ('卫健委', 0.01811146889874087), ('出院', 0.017497216222247255), ('中国', 0.01702356334989813), ('大家', 0.016448027627049253), ('治愈', 0.016390183033353274), ('没有', 0.015687250330511104), ('希望', 0.015585062800449967), ('医用', 0.01423829082197198), ('发热', 0.013859673942509093), ('钟南山', 0.013779294200016197), ('转发', 0.013740735889898976), ('健康', 0.013190900738973856), ('治疗', 0.0128036104960136), ('肖战', 0.012562349511614255), ('新闻', 0.012479091615982256), ('防护', 0.012414449293612637), ('新增', 0.012277666352539497), ('酒精', 0.012173722385749347), ('物资', 0.01212093900823893), ('复工', 0.011909280661426484), ('朱一龙', 0.01159714066113227), ('全国', 0.011574160533591114), ('社区', 0.011327061602028836), ('症状', 0.011320896900746403), ('医疗队', 0.01117094804171221), ('上班', 0.011100770212804955), ('平安', 0.010739102817894648)]

生成TF-IDF统计图

In [11]:
keywords = [('疫情', 0.11912294859704146), ('肺炎', 0.11055183921684136), ('全文', 0.09896922233781136), ('武汉', 0.08653435069252709), ('展开', 0.07282289241826259), ('加油', 0.060396253662755174), ('口罩', 0.057737899296916), ('确诊', 0.047917340409658435), ('新冠', 0.045060153220399156), ('病例', 0.043112362196566556), ('感染', 0.04278686859643067), ('病毒', 0.04090026875872963), ('医院', 0.03522556557504731), ('患者', 0.033405320917091134), ('视频', 0.03276271976484176), ('网页', 0.032327948806561826), ('链接', 0.031009521739956396), ('致敬', 0.025251712758224455), ('隔离', 0.02448653098084746), ('医护人员', 0.022201996864772915), ('湖北', 0.02183352867944201), ('医生', 0.020275206438533633), ('打卡', 0.01906030403713307), ('卫健委', 0.01811146889874087), ('出院', 0.017497216222247255), ('中国', 0.01702356334989813), ('大家', 0.016448027627049253), ('治愈', 0.016390183033353274), ('没有', 0.015687250330511104), ('希望', 0.015585062800449967), ('医用', 0.01423829082197198), ('发热', 0.013859673942509093), ('钟南山', 0.013779294200016197), ('转发', 0.013740735889898976), ('健康', 0.013190900738973856), ('治疗', 0.0128036104960136), ('肖战', 0.012562349511614255), ('新闻', 0.012479091615982256), ('防护', 0.012414449293612637), ('新增', 0.012277666352539497), ('酒精', 0.012173722385749347), ('物资', 0.01212093900823893), ('复工', 0.011909280661426484), ('朱一龙', 0.01159714066113227), ('全国', 0.011574160533591114), ('社区', 0.011327061602028836), ('症状', 0.011320896900746403), ('医疗队', 0.01117094804171221), ('上班', 0.011100770212804955), ('平安', 0.010739102817894648)]
In [12]:
import pandas as pd
import matplotlib.pyplot as plt
ss = pd.DataFrame(keywords,columns = ['词语','重要性'])     
# print(ss)
import numpy as np
from matplotlib.font_manager import FontProperties
plt.figure(figsize=(10,6))
plt.title('TF-IDF Ranking')
fig = plt.axes()
plt.barh(range(len(ss.重要性[:25][::-1])),ss.重要性[:25][::-1])
fig.set_yticks(np.arange(len(ss.重要性[:25][::-1])))
font = FontProperties(fname=r'c:\windows\fonts\simsun.ttc')
fig.set_yticklabels(ss.词语[:25][::-1],fontproperties=font)
fig.set_xlabel('Importance')
d:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:9: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  if __name__ == '__main__':
Out[12]:
Text(0.5, 0, 'Importance')

查询每日疫情主题关键词数量的前10位

In [15]:
results = []
for d in range(1,32):
    m = 1
    query = """
    with weibo_d as(
        select * from weibo where day = %d and month = %d
    )
    select k0.keyword, k0.c+k1.c+k2.c+k3.c as c from 
    (select keyword0 keyword,count(*) c from weibo_d group by keyword0) k0,
    (select keyword1 keyword,count(*) c from weibo_d group by keyword1) k1,
    (select keyword2 keyword,count(*) c from weibo_d group by keyword2) k2,
    (select keyword3 keyword,count(*) c from weibo_d group by keyword3) k3
    where k0.keyword = k1.keyword and k2.keyword = k1.keyword and k2.keyword = k3.keyword and k1.keyword != '##'
    order by c desc limit 100;
    """%(d,m)
    result = %sql $query
    results.append(result)
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.
100 rows affected.

生成每日词云图(可在展板中显示,这里先展示第一天的数据):

In [16]:
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType
from pyecharts.globals import CurrentConfig, NotebookType

CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_NOTEBOOK


words = list(results[0])
c = (
        WordCloud()
        .add("", words, word_size_range=[20, 100], shape=SymbolType.ROUND_RECT)
        .set_global_opts(title_opts=opts.TitleOpts(title='全国新型冠状病毒疫情词云图'))
    )
c.render_notebook()
Out[16]:

对于部分关键词进行随时间变化的频数分析

In [64]:
query = """
select month||'-'||day as time,count(*) from weibo where keyword0 = '疫情' or keyword1 = '疫情' or keyword2 = '疫情' or keyword3 = '疫情' group by month,day;
"""
result = %sql $query
result.bar()
50 rows affected.
Out[64]:
<BarContainer object of 50 artists>

“疫情”这样的词语相对具有官方口径,是在1月20日左右随着官方承认武汉病毒存在人传人的现象的情况下开始被提及,并随之成为高频词;

In [63]:
query = """
select month||'-'||day as time,count(*) from weibo where keyword0 = '武汉' or keyword1 = '武汉' or keyword2 = '武汉' or keyword3 = '武汉' group by month,day;
"""
result = %sql $query
result.bar()
50 rows affected.
Out[63]:
<BarContainer object of 50 artists>

“武汉”在开始阶段被少量提及,在开始封城的时候达到高峰;随着疫情扩散到全国,频次缓慢下降。

微博情感分析

统计总体情感平均值随日期的变化:

In [44]:
x = %sql select month, day, avg(sentiments) from weibo group by day,month order by month,day;
%matplotlib inline
x.bar()
50 rows affected.
Out[44]:
<BarContainer object of 50 artists>

总体而言,疫情相关的微博情感倾向还是偏正面或中性;在1.19前,情感平均值较高,约为0.2;在1.19,疫情开始在全国扩散开来并且钟南山宣布人传人之后,情感平均值下降到了0,1附近;在2.18、2.19时,可能由于疫情国内接近拐点,情感分析数值又有显著提高;

统计情感分析数值在每个0.1区间中的数量

In [7]:
x = %sql select round(sentiments::numeric,1),count(*) from weibo group by round(sentiments::numeric,1) order by round(sentiments::numeric,1);
%matplotlib inline
x.bar()
11 rows affected.
Out[7]:
<BarContainer object of 11 artists>

大部分情感分析数值集中在0.5和-0.5的区间,呈两头凸起中间凹陷型,表明存在舆论的两极分化;

根据主题词数量排名前15的主题词,查看其平均情感分析结果:

In [49]:
query = """
select key, avg(sentiments) from 
(select k0.keyword as key, k0.c+k1.c+k2.c+k3.c as c from 
(select keyword0 keyword,count(*) c from weibo group by keyword0) k0,
(select keyword1 keyword,count(*) c from weibo group by keyword1) k1,
(select keyword2 keyword,count(*) c from weibo group by keyword2) k2,
(select keyword3 keyword,count(*) c from weibo group by keyword3) k3
where k0.keyword = k1.keyword and k2.keyword = k1.keyword and k2.keyword = k3.keyword and k1.keyword != '##'
order by c desc limit 15) top, weibo
where weibo.keyword0 = key or weibo.keyword1 = key or weibo.keyword2 = key or weibo.keyword3 = key
group by key,c order by c desc;
"""
result = %sql $query
import matplotlib as plt
plt.rcParams['font.sans-serif']=['SimHei'] #显示中文标签
plt.rcParams['axes.unicode_minus']=False
result.bar()
15 rows affected.
Out[49]:
<BarContainer object of 15 artists>
In [70]:
query = """
select month||'-'||day as time,avg(sentiments) from weibo where keyword0 = '肺炎' or keyword1 = '肺炎' or keyword2 = '肺炎' or keyword3 = '肺炎' group by month,day;
"""
result = %sql $query
result.bar()
49 rows affected.
Out[70]:
<BarContainer object of 49 artists>
  • 在疫情早期阶段,对于肺炎一词的情感分析波动较大,可能原因是官方的反复辟谣,以及民间的小道消息传递;
  • 在后期情感分析数值较为稳定,呈微弱的正面倾向,也体现了众志成城抗击疫情主要舆论导向;

对于李文亮医生的情感数值追踪分析:

查看提及李文亮医生的微博频次

In [86]:
query = """
select month||'-'||day as time,count(*) from weibo where keyword0 = '李文亮' or keyword1 = '李文亮' or keyword2 = '李文亮' or keyword3 = '李文亮' group by month,day;
"""
result = %sql $query
result.bar()
27 rows affected.
Out[86]:
<BarContainer object of 27 artists>

对相关微博根据日期进行情感分析

In [90]:
query = """
select month||'-'||day as time,avg(sentiments) from weibo where keyword0 = '李文亮' or keyword1 = '李文亮' or keyword2 = '李文亮' or keyword3 = '李文亮' group by month,day;
"""
result = %sql $query
result.bar()
27 rows affected.
Out[90]:
<BarContainer object of 27 artists>
  • 在开始阶段多为负面情感,可能由于官方宣布“新冠”为谣言;
  • 随着疫情的爆发,正面评价开始出现,并在李文亮医生去世之后达到高峰;

对双黄连可防治新冠的谣言情感追踪分析;

查看提及双黄连的微博频次

In [9]:
query = """
select month||'-'||day as time,count(*) from weibo where keyword0 = '双黄连' or keyword1 = '双黄连' or keyword2 = '双黄连' or keyword3 = '双黄连' group by month,day;
"""
result = %sql $query
result.bar()
49 rows affected.
Out[9]:
<BarContainer object of 49 artists>

随着双黄连的谣言产生,相关微博数量达到高峰;随后成逐步下降趋势。

In [7]:
query = """
select month||'-'||day as time,avg(sentiments) from weibo where keyword0 = '双黄连' or keyword1 = '双黄连' or keyword2 = '双黄连' or keyword3 = '双黄连' group by month,day having month = 2 or (month = 1 and day >28);
"""
%matplotlib inline
result = %sql $query
result.bar()
21 rows affected.
Out[7]:
<BarContainer object of 21 artists>
  • 总体来看,成正面评论和负面评论交替的趋势;
  • 在谣言出现的当天(1.31,微博情感分析成负面倾向;表明了大多数网民还是相对受教育的群体,对双黄连的效果此怀疑态度;

有待完善

  • 可以对该数据集分为两个阶段进行分析:
    • 1.20号前对于疫情的关注还未上升到主要阶段,但已经出现征兆;
    • 1.20里后,抗击疫情成为主要舆论;
  • 可以做一些主题关联分析;
  • 部分分析还未完善,也未和疫情数据进行关联;
In [ ]: