NLP Text Similarity

Part 1: Similarity

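Definition: similarity measures how alike two items are. The smaller the similarity value, the larger the distance between them.
Most common: cosine similarity, the cosine of the angle between two vectors. A cosine of 1 means the angle is 0, so the vectors point in the same direction and are as similar as possible.
Why not Euclidean distance:
Cosine similarity ranges from -1 to 1, whereas Euclidean distance ranges from 0 to infinity. How large counts as "large"? There is no natural scale to judge by.
Cosine similarity measures whether two vectors agree in direction across their dimensions. It focuses on the relative pattern across dimensions rather than on differences in magnitude, whereas Euclidean distance measures exactly those differences in magnitude.
Example: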

# Ratings given by two users to 5 movies:
a=[10,0,0,0,10]
b=[9,3,4,2,8]

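a and b are similar in practice: both users like the first and fifth movies and differ only in how extreme their ratings are. Measured by Euclidean distance, a and b are fairly far apart. This illustrates the statement above: cosine similarity measures agreement in direction across dimensions, focusing on the pattern across dimensions rather than on differences in magnitude.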
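To make the comparison concrete, the following sketch computes both measures for a and b using only the standard library. The helper names `euclidean` and `cosine` are illustrative and do not come from the original post.

```python
import math

# Ratings from the example above
a = [10, 0, 0, 0, 10]
b = [9, 3, 4, 2, 8]

def euclidean(x, y):
    # Straight-line distance: sensitive to differences in magnitude
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    # Cosine of the angle between x and y: sensitive to direction only
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(round(euclidean(a, b), 2))  # 5.83
print(round(cosine(a, b), 2))     # 0.91
```

Multiplying every rating in b by a positive constant leaves the cosine unchanged but changes the Euclidean distance, which is the sense in which cosine similarity ignores magnitude.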

Part 2 (hands-on): build term-frequency vectors for two sentences and compute their cosine similarity

1. Word segmentation (using the jieba library)

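## 1. Word segmentation
import jieba

# s1 and s2 are the two sentences (strings) to compare
s1_seg = [x for x in jieba.cut(s1) if x.strip() != '']
s2_seg = [x for x in jieba.cut(s2) if x.strip() != '']
print(s1_seg)
print(s2_seg)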

2. Remove stop words and build the vocabulary

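# 2. Takes the token lists to compare,
# removes stop words,
# and returns the vocabulary (a list of unique words)
def get_word_dict(*l):
    # Load the stop words from a text file (one word per line)
    word_dict = []
    stop_list = set()
    with open('stop_word/中文停用词表.txt', 'r', encoding='utf-8') as f:
        for word in f:
            stop_list.add(word.strip())
    # Keep every token that is not a stop word
    for t in l:
        for v in t:
            if v not in stop_list:
                word_dict.append(v)
    # De-duplicate; sorting keeps the word order stable across runs
    word_dict = sorted(set(word_dict))
    return word_dict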

3. Build each sentence's term-frequency vector from the vocabulary

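def get_word_vec(l, word_dict):
    # word_dict is the vocabulary list (renamed from `dict`, which shadows the built-in)
    # Map each word to its position once instead of calling list.index per token
    index_of = {w: i for i, w in enumerate(word_dict)}
    vec = [0] * len(word_dict)
    for v in l:
        if v in index_of:
            vec[index_of[v]] += 1
    return vec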

4. Compute the cosine

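def cosVector(x, y):
    if len(x) != len(y):
        raise ValueError('invalid input: x and y are not in the same space')
    result1 = 0.0
    result2 = 0.0
    result3 = 0.0
    for i in range(len(x)):
        result1 += x[i] * y[i]   # sum(X*Y)
        result2 += x[i] ** 2     # sum(X*X)
        result3 += y[i] ** 2     # sum(Y*Y)
    if result2 == 0 or result3 == 0:
        # The cosine is undefined for an all-zero vector; return 0.0 instead of dividing by zero
        return 0.0
    return result1 / ((result2 * result3) ** 0.5)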
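Steps 1 to 4 can be chained as in the sketch below. To stay self-contained it skips jieba and the stop-word file: the token lists and the small stop-word set are made-up placeholders, and the helper names are illustrative rather than the functions defined above.

```python
import math
from collections import Counter

# Hypothetical, already-segmented sentences (stand-ins for jieba.cut output)
s1_seg = ['我', '喜欢', '看', '电影']
s2_seg = ['我', '喜欢', '看', '书']
# Hypothetical stop-word set (stands in for the stop-word file)
stop_words = {'我'}

def build_vocab(*token_lists):
    # Unique non-stop words, sorted so the vector layout is stable
    return sorted({t for tokens in token_lists for t in tokens if t not in stop_words})

def to_vector(tokens, vocab):
    # Term-frequency vector aligned with vocab
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(x, y):
    # Returns 0.0 when either vector is all zeros
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm else 0.0

vocab = build_vocab(s1_seg, s2_seg)
v1 = to_vector(s1_seg, vocab)
v2 = to_vector(s2_seg, vocab)
print(vocab)
print(v1, v2)
print(cosine(v1, v2))
```

The two sentences share two of their three non-stop words, so the cosine comes out at 2/3.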

Part 3: TF-IDF

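TF (term frequency) captures how often a word occurs in a document. However, a word may be frequent simply because it appears in most documents, in which case it should not get a high weight. IDF (inverse document frequency) reduces the influence of such words.
Sorting the words by TF * IDF then yields the article's keywords.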
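The ranking idea can be sketched on a toy corpus. The four tokenized documents are invented for illustration. TF is taken as the count divided by the document's largest count, and IDF as log(N / (df + 1)); other TF and IDF variants are also common.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus, stop words already removed
docs = [
    ['python', 'code', 'python', 'data'],
    ['data', 'model', 'code'],
    ['data', 'cooking', 'recipe'],
    ['data', 'travel', 'code'],
]

def tf(tokens):
    # Count of each word divided by the largest count in the document
    counts = Counter(tokens)
    top = max(counts.values())
    return {w: c / top for w, c in counts.items()}

def idf(corpus):
    # df = number of documents that contain the word (each document counted once)
    n_docs = len(corpus)
    df = Counter(w for tokens in corpus for w in set(tokens))
    return {w: math.log(n_docs / (d + 1)) for w, d in df.items()}

def keywords(tokens, idf_map):
    # Words of one document sorted by TF * IDF, highest first
    tf_map = tf(tokens)
    scores = {w: tf_map[w] * idf_map[w] for w in tf_map}
    return sorted(scores, key=scores.get, reverse=True)

idf_map = idf(docs)
print(keywords(docs[0], idf_map))  # ['python', 'code', 'data']
```

In this corpus 'data' appears in every document and therefore ranks last, even though it is as frequent in the first document as 'code'.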
For example:

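# Compute the frequency of every word in an article (after removing stop words)
# 1. Takes raw text (source_type == 0) or a file path (otherwise); returns the segmented words with stop words removed
import jieba

def convert_to_list_with_not_stop_word(source, source_type):
    not_stop_words = []
    # The data can come from different sources
    if source_type == 0:
        # source is the text itself
        text = source
    else:
        # source is the path of a UTF-8 text file; join its stripped lines
        text = ""
        with open(source, 'r', encoding='utf-8') as f:
            for line in f:
                text += line.strip()
    # Raw segmentation result, with whitespace-only tokens dropped
    str_seg = [x for x in jieba.cut(text) if x.strip() != '']
    # Load the stop words; read_stop_word is a helper that is not shown in this post
    stop_words = read_stop_word('stop_word/中文停用词表.txt')
    for v in str_seg:
        if v not in stop_words:
            not_stop_words.append(v)
    return not_stop_words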
# 2. Count word occurrences in the stop-word-free list
def list_to_word_cnt_dict(l):
    t = {}
    for v in l:
        if t.get(v) is None:
            t[v] = 0
        t[v] += 1
    return t
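As an aside, the standard library's `collections.Counter` performs the same kind of word count. A minimal sketch with a made-up token list:

```python
from collections import Counter

# Hypothetical token list (stop words already removed)
tokens = ['nlp', 'text', 'nlp', 'similarity']

# Counter is a dict subclass; dict() turns it into a plain word -> count mapping
word_cnt = dict(Counter(tokens))
print(word_cnt)  # {'nlp': 2, 'text': 1, 'similarity': 1}
```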

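Two common ways to compute term frequency: (1) the word's count divided by the total number of words in the article; (2) the word's count divided by the count of the most frequent word in the article.

The second is used here (its values are spread more evenly; the first gives mostly very small values, although they do sum to 1):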

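def dict_to_TF_dict(word_cnt):
    # word_cnt maps word -> count (renamed from `dict`, which shadows the built-in)
    TF_dict = {}
    if not word_cnt:
        return TF_dict  # an empty article has no term frequencies
    # Sort the (word, count) pairs by count, descending
    items = sorted(word_cnt.items(), key=lambda x: x[1], reverse=True)
    max_cnt = items[0][1]
    for word, cnt in items:
        TF_dict[word] = cnt / max_cnt
    return TF_dict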
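The difference between the two term-frequency normalizations can be seen on a small count dictionary. The counts are made up and the function names are illustrative.

```python
# Hypothetical word counts for one document
word_cnt = {'nlp': 4, 'text': 2, 'vector': 1, 'cosine': 1}

def tf_by_total(counts):
    # Variant 1: count / total number of words; the values sum to 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def tf_by_max(counts):
    # Variant 2: count / largest count; the most frequent word gets 1.0
    top = max(counts.values())
    return {w: c / top for w, c in counts.items()}

print(tf_by_total(word_cnt))  # {'nlp': 0.5, 'text': 0.25, 'vector': 0.125, 'cosine': 0.125}
print(tf_by_max(word_cnt))    # {'nlp': 1.0, 'text': 0.5, 'vector': 0.25, 'cosine': 0.25}
```

Both variants rank the words identically; they differ only in scale, which matters once the values are multiplied by IDF and compared across documents of different lengths.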

import math
import os

# Takes the set of distinct words in one article
def set_to_IDF(res_no_repeat):
    IDF_dict = {}
    filenames = os.listdir('allfiles')
    for filename in filenames:
        no_stop_list = convert_to_list_with_not_stop_word('allfiles/' + filename, 1)
        # Count each word at most once per file: document frequency, not total occurrences
        for v in set(no_stop_list):
            if v in res_no_repeat:
                IDF_dict[v] = IDF_dict.get(v, 0) + 1
    for v in IDF_dict:
        # log(N / (df + 1)), where N is the number of files in 'allfiles'
        IDF_dict[v] = math.log(len(filenames) / (IDF_dict[v] + 1))
    return IDF_dict
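The formula used in set_to_IDF, log(N / (df + 1)), can also be examined in isolation. In this sketch N and df are made-up numbers and `idf` is an illustrative helper.

```python
import math

def idf(n_docs, doc_freq):
    # Same formula as in set_to_IDF: log(N / (df + 1))
    return math.log(n_docs / (doc_freq + 1))

n_docs = 10  # hypothetical corpus size
for doc_freq in (0, 4, 9, 10):
    # Print the document frequency next to its IDF value
    print(doc_freq, round(idf(n_docs, doc_freq), 3))
```

The value shrinks as a word appears in more documents. With the +1 in the denominator, a word that appears in every document ends up with a slightly negative IDF.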

Result: