Building an Inverted Index in Python by Brute Force
2018-12-20 09:42
water_chen
Required libraries: jieba, json
json ships with Python's standard library; jieba just needs a `pip install jieba` on the command line.
This post tackles the inverted index head-on, by brute force, which may be slightly uncomfortable to read; proceed at your own discretion.
The code has three parts: word segmentation, building the forward index, and building the inverted index.
Required files: a corpus and a stopword list (just search online for a stopword list).
[Corpus screenshot omitted.] I used news titles I crawled myself, including titles from NetEase, Toutiao, and Ifeng, plus a small number of WeChat article titles. Corpus preprocessing: just put one sentence per line (add a newline after each sentence).
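Before diving into the full pipeline, here is a minimal self-contained sketch of the idea itself: an inverted index maps each word to the documents containing it, with per-document counts. The toy English tokens stand in for jieba output and are not from the post's actual corpus:

```python
from collections import defaultdict

# Toy "documents", already segmented into tokens (standing in for jieba output)
docs = [
    ["news", "title", "about", "python"],
    ["python", "inverted", "index", "news"],
    ["another", "title"],
]

# Inverted index: word -> {doc_id: term frequency in that doc}
index = defaultdict(dict)
for doc_id, tokens in enumerate(docs):
    for token in tokens:
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

print(len(index["news"]))  # document frequency of "news": 2
print(index["python"])     # {0: 1, 1: 1}
```

The files the post writes below (test_cws.txt, zhengxiang.txt, daopai.txt) are essentially a disk-backed version of this dictionary, built in stages.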
Segmentation code:
```python
import jieba

# Load the stopword list, one word per line
stopwords = []
with open('stopwords', 'r', encoding='utf-8') as f:
    for i in f:
        stopwords.append(i.strip())

filename = 'test.txt'       # raw corpus, one sentence per line
filename1 = 'test_cws.txt'  # segmented output

# Write out the segmented corpus
def write_cws():
    num = 0  # file id; just a running counter here, replace it with your own ids if you have them
    writing = open(filename1, 'a+', encoding='utf-8')
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            content = line.strip().replace(' ', '')
            seg = jieba.cut(content)
            test = ''
            for i in seg:
                if i not in stopwords:
                    test += i + ' '
            # two spaces separate the id from the tokens; single spaces separate the tokens themselves
            writing.write(str(num) + '  ' + test + '\n')
            num += 1
    writing.close()
```
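For illustration, here is a hypothetical assembly of one test_cws.txt output line, with a made-up token list standing in for a real jieba.cut() result so it runs without jieba installed:

```python
# Stand-in for jieba output on one sentence (tokens are made up for illustration)
stopwords = {"的", "了"}
tokens = ["今天", "的", "新闻", "标题"]

# Drop stopwords, then prefix the doc id; two spaces separate id from tokens
kept = [t for t in tokens if t not in stopwords]
line = str(0) + "  " + " ".join(kept)
print(line)  # 0  今天 新闻 标题
```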
Forward index code:
```python
filename2 = 'zhengxiang.txt'

def zhengxiang():
    all_words = []
    all = {}  # word -> list of ids of the docs containing it
    file2 = open(filename2, 'a+', encoding='utf-8')
    with open(filename1, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            num, content = line.split('  ')  # id, then the tokens
            words = content.split(' ')
            for word in words:
                if word not in all_words:
                    all_words.append(word)
                    all[word] = [num]
                elif num not in all[word]:
                    all[word].append(num)
    # one line per word: "word id1,id2,..."
    for word, nums in all.items():
        file2.write(word + ' ')
        file2.write(','.join(nums))
        file2.write('\n')
    file2.close()
```
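A line of zhengxiang.txt can then be read back like this (the word and ids below are made up for illustration, not from the real corpus):

```python
# One forward-index line: "word id1,id2,..." (hypothetical values)
line = "python 0,3,7"
word, nums = line.split(' ')
doc_ids = nums.split(',')
print(word, doc_ids)   # python ['0', '3', '7']
print(len(doc_ids))    # 3, the word's document frequency
```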
Inverted index code:
```python
import json

# Inverted index
filename3 = 'daopai.txt'

def daopai():
    with open(filename2, 'r', encoding='utf-8') as f:
        for line in f:
            try:  # my data had a few bad lines; with a clean corpus like the one above this shouldn't trigger
                word_dict = {}  # one-word dict, convenient to dump as JSON
                word_list = []  # postings for this word
                syc = []        # the word plus its document frequency: +1 per file it appears in, however many times
                Aword = line.strip()  # Aword means "all_word"
                word = Aword.split(' ')[0]
                print(word)  # progress output
                nums = Aword.split(' ')[1]
                count = len(nums.split(','))  # number of documents containing the word
                syc.append(word + ' ' + str(count))
                word_list.append(syc)
                with open(filename1, 'r', encoding='utf-8') as r:
                    for line1 in r:
                        num, content = line1.strip().split('  ')
                        words = content.split(' ')
                        if word in words:  # does the word occur in this sentence?
                            acount = 0  # occurrences of the word in this line
                            for aword in words:
                                if word == aword:
                                    acount += 1
                            # doc id plus the word's count in that doc
                            word_list.append([num, acount])
                word_dict[word] = word_list
                # append one JSON object per word, separated by commas
                with open(filename3, 'a', encoding='utf-8') as out:  # renamed from f to avoid shadowing the outer file
                    json.dump(word_dict, out, ensure_ascii=False)
                    out.write(',\n')
            except Exception as e:
                print(line)
                print(e)
```
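Because daopai.txt holds one JSON object per line followed by a trailing comma, the file as a whole is not valid JSON; each line has to be parsed individually after stripping that comma. A sketch with made-up values in the format the code above writes:

```python
import json

# One hypothetical line of daopai.txt: the word with its document frequency,
# then (doc id, term count) pairs, followed by the trailing comma the code writes
raw = '{"python": [["python 3"], ["0", 1], ["3", 2], ["7", 1]]},'

entry = json.loads(raw.rstrip().rstrip(','))
postings = entry["python"]
print(postings[0])   # ['python 3']  -> the word and its document frequency
print(postings[1:])  # [['0', 1], ['3', 2], ['7', 1]] -> (doc id, count) pairs
```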
The pipeline runs segmentation on the raw corpus, then builds the forward index from the segmented file, then builds the inverted index from the forward index, so please run the three functions in that order.
If this helped at all, please give it a like. Thanks!!!