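Natural Language Processing (NLP)
How Apple's voice assistant Siri works:
- Listen
- Understand
- Think
- Compose a response
- Answer
The technology behind each of these steps:
- Speech recognition
- Natural language processing: semantic analysis
- Logical analysis: combining the business scenario with the context
- Natural language processing: generating natural-language text from the analysis result
- Speech synthesis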
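A typical natural language processing workflow:
First tokenize the training text (with stemming or lemmatization) and count word frequencies. The term frequency-inverse document frequency (TF-IDF) algorithm then gives each word's contribution to the meaning of a sample, and a supervised classification model is built on those contributions. Finally, test samples are passed to the model to obtain their semantic category.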
Natural Language Toolkit: NLTK
nltk.download() downloads the datasets
jieba: a Chinese word segmentation library
Tokenization
Tokenization APIs:

    import nltk.tokenize as tk

    # Split the sample into sentences; sent_list: list of sentences
    sent_list = tk.sent_tokenize(text)
    # Split the sample into words; word_list: list of words
    word_list = tk.word_tokenize(text)
    # Split the sample into words and punctuation; punctTokenizer: tokenizer object
    punctTokenizer = tk.WordPunctTokenizer()
    word_list = punctTokenizer.tokenize(text)
Example:

    import nltk.tokenize as tk

    doc = "Are you curious about tokenization? " \
          "Let's see how it works! " \
          "We need to analyze a couple of sentences " \
          "with punctuations to see it in action."
    print(doc)

    tokens = tk.sent_tokenize(doc)  # sentence tokenization
    for i, token in enumerate(tokens):
        print("%2d" % (i + 1), token)
    # 1 Are you curious about tokenization?
    # 2 Let's see how it works!
    # 3 We need to analyze a couple of sentences with punctuations to see it in action.

    tokens = tk.word_tokenize(doc)  # word tokenization
    for i, token in enumerate(tokens):
        print("%2d" % (i + 1), token)
    # 1 Are
    # 2 you
    # 3 curious
    # 4 about
    # ...
    # 28 action
    # 29 .

    tokenizer = tk.WordPunctTokenizer()  # word tokenization that also splits at punctuation
    tokens = tokenizer.tokenize(doc)
    for i, token in enumerate(tokens):
        print("%2d" % (i + 1), token)
    # 1 Are
    # 2 you
    # 3 curious
    # ...
    # 27 it
    # 28 in
    # 29 action
    # 30 .
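To make the difference between the 29-token and 30-token results concrete, here is a minimal standard-library sketch of regex-based tokenization. The pattern `\w+|[^\w\s]+` mirrors the behaviour of WordPunctTokenizer seen above (an apostrophe becomes its own token); `naive_sent_tokenize` is a hypothetical, much cruder stand-in for `sent_tokenize` and is only an illustration.

```python
import re

# Words are runs of word characters; punctuation is grouped into runs of
# characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)


def naive_sent_tokenize(text):
    """Rough sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


doc = "Are you curious about tokenization? Let's see how it works!"
print(naive_sent_tokenize(doc))
# ['Are you curious about tokenization?', "Let's see how it works!"]
print(word_punct_tokenize("Let's see how it works!"))
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```

The naive sentence splitter breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained sentence tokenizer in practice.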
Stemming
A word's inflected form (its part of speech and tense) has little effect on semantic analysis, so words are first reduced to their stems.
Stemming APIs:

    import nltk.stem.porter as pt
    import nltk.stem.lancaster as lc
    import nltk.stem.snowball as sb

    # Pick one of the three stemmers
    stemmer = pt.PorterStemmer()             # Porter stemmer: relatively lenient
    stemmer = lc.LancasterStemmer()          # Lancaster stemmer: relatively strict
    stemmer = sb.SnowballStemmer('english')  # Snowball stemmer: in between
    r = stemmer.stem('playing')              # stem of the word "playing"
Example:

    import nltk.stem.porter as pt
    import nltk.stem.lancaster as lc
    import nltk.stem.snowball as sb

    words = ['table', 'probably', 'wolves', 'playing', 'is', 'dog',
             'the', 'beaches', 'grounded', 'dreamt', 'envision']
    pt_stemmer = pt.PorterStemmer()             # Porter stemmer: relatively lenient
    lc_stemmer = lc.LancasterStemmer()          # Lancaster stemmer: relatively strict
    sb_stemmer = sb.SnowballStemmer('english')  # Snowball stemmer: in between
    for word in words:
        pt_stem = pt_stemmer.stem(word)
        lc_stem = lc_stemmer.stem(word)
        sb_stem = sb_stemmer.stem(word)
        print('%8s %8s %8s %8s' % (word, pt_stem, lc_stem, sb_stem))
    # table tabl tabl tabl
    # probably probabl prob probabl
    # wolves wolv wolv wolv
    # playing play play play
    # is is is is
    # dog dog dog dog
    # the the the the
    # beaches beach beach beach
    # grounded ground ground ground
    # dreamt dreamt dreamt dreamt
    # envision envis envid envis
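The idea behind these stemmers is rule-based suffix stripping. The following is a deliberately tiny sketch of that idea, not the Porter, Lancaster or Snowball algorithm; the suffix list and the minimum stem length are assumptions chosen only for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "beaches", "grounded", "wolves", "is", "dog"]:
    print(w, "->", naive_stem(w))
# playing -> play
# beaches -> beach
# grounded -> ground
# wolves -> wolv
# is -> is
# dog -> dog
```

As with the real stemmers, the result ("wolv") need not be a dictionary word; it only has to be the same for all inflected forms.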
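Lemmatization
Lemmatization serves a similar purpose to stemming but is better suited to manual post-processing: many stems are not valid words and are harder for a person to read. Lemmatization restores plural nouns to their singular form and inflected verb forms (such as participles) to their base form.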
Lemmatization APIs:

    import nltk.stem as ns

    # Create the lemmatizer object
    lemmatizer = ns.WordNetLemmatizer()
    n_lemma = lemmatizer.lemmatize(word, pos='n')  # lemmatize word as a noun
    v_lemma = lemmatizer.lemmatize(word, pos='v')  # lemmatize word as a verb
Example:

    import nltk.stem as ns

    words = ['table', 'probably', 'wolves', 'playing', 'is', 'dog',
             'the', 'beaches', 'grounded', 'dreamt', 'envision']
    lemmatizer = ns.WordNetLemmatizer()
    for word in words:
        n_lemma = lemmatizer.lemmatize(word, pos='n')  # lemmatize as a noun
        v_lemma = lemmatizer.lemmatize(word, pos='v')  # lemmatize as a verb
        print('%8s %8s %8s' % (word, n_lemma, v_lemma))
    # table table table
    # probably probably probably
    # wolves wolf wolves
    # playing playing play
    # is is be
    # dog dog dog
    # the the the
    # beaches beach beach
    # grounded grounded ground
    # dreamt dreamt dream
    # envision envision envision
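Unlike stemming, lemmatization is essentially a dictionary lookup keyed by the word and its part of speech, which is why the same word can give different results for `pos='n'` and `pos='v'`. A minimal sketch of that lookup follows; the `LEMMAS` table is a hypothetical miniature lexicon, whereas WordNetLemmatizer consults the WordNet database.

```python
# Hypothetical miniature lexicon: (inflected form, part of speech) -> lemma.
LEMMAS = {
    ("wolves", "n"): "wolf",
    ("beaches", "n"): "beach",
    ("playing", "v"): "play",
    ("is", "v"): "be",
    ("dreamt", "v"): "dream",
}


def naive_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); unknown combinations are returned unchanged."""
    return LEMMAS.get((word.lower(), pos), word)


print(naive_lemmatize("wolves", pos="n"))  # wolf
print(naive_lemmatize("wolves", pos="v"))  # wolves
print(naive_lemmatize("is", pos="v"))      # be
```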
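Bag-of-Words Model
The meaning of a sentence depends to a large extent on how often each word occurs in it. The bag-of-words model treats every sentence as one sample and describes it with feature names and feature values:
- Feature names: all words that can occur in the sentences
- Feature values: the number of times each word occurs in the sentence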
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden
the | brown | dog | is | running | black | in | room | forbidden |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
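The table above can be reproduced by hand with the standard library. This sketch assumes a simple lowercase word tokenizer and orders the vocabulary by first appearance so that the columns match the table.

```python
import re
from collections import Counter

sentences = [
    "The brown dog is running",
    "The black dog is in the black room",
    "Running in the room is forbidden",
]


def tokenize(sentence):
    """Lowercase the sentence and split it into words."""
    return re.findall(r"\w+", sentence.lower())


# Feature names: every distinct word, in order of first appearance.
vocabulary = []
for s in sentences:
    for w in tokenize(s):
        if w not in vocabulary:
            vocabulary.append(w)

# Feature values: how often each vocabulary word occurs in each sentence.
bow = []
for s in sentences:
    counts = Counter(tokenize(s))
    bow.append([counts[w] for w in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
# [1, 1, 1, 1, 1, 0, 0, 0, 0]
# [2, 0, 1, 1, 0, 2, 1, 1, 0]
# [1, 0, 0, 1, 1, 0, 1, 1, 1]
```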
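Bag-of-words APIs:

    import sklearn.feature_extraction.text as ft

    cv = ft.CountVectorizer()           # build the bag-of-words model
    bow = cv.fit_transform(sentences)   # fit the model
    print(bow.toarray())                # number of occurrences of each word
    # All feature names (get_feature_names() in scikit-learn versions before 1.0)
    words = cv.get_feature_names_out()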
Example:

    import nltk.tokenize as tk
    import sklearn.feature_extraction.text as ft

    doc = 'The brown dog is running. ' \
          'The black dog is in the black room. ' \
          'Running in the room is forbidden.'
    # Split doc into sentences
    sents = tk.sent_tokenize(doc)
    cv = ft.CountVectorizer()      # build the bag-of-words model
    bow = cv.fit_transform(sents)  # fit the bag-of-words model
    # All feature names (get_feature_names() in scikit-learn versions before 1.0)
    print(cv.get_feature_names_out().tolist())
    # ['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
    print(bow.toarray())
    # [[0 1 1 0 0 1 0 1 1]
    #  [2 0 1 0 1 1 1 0 2]
    #  [0 0 0 1 1 1 1 1 1]]
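Term Frequency (TF)
$$\text{TF}=\frac{\text{number of times the word occurs in the sentence}}{\text{total number of words in the sentence}}$$
Term frequency: how often a word occurs in a sentence relative to the length of that sentence. Compared with the raw count, term frequency is a more objective measure of a word's contribution to the meaning of a sentence: the higher the term frequency, the larger the contribution. Normalizing each row of the bag-of-words matrix by its sum (L1 norm) yields the term frequencies.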
Example: normalizing the bag-of-words matrix

    import nltk.tokenize as tk
    import sklearn.feature_extraction.text as ft
    import sklearn.preprocessing as sp

    doc = 'The brown dog is running. The black dog is in the black room. ' \
          'Running in the room is forbidden.'
    sentences = tk.sent_tokenize(doc)  # split into sentences
    cv = ft.CountVectorizer()
    bow = cv.fit_transform(sentences)
    print(bow.toarray())  # word counts
    # Feature names (get_feature_names() in scikit-learn versions before 1.0)
    words = cv.get_feature_names_out()
    print(words)
    tf = sp.normalize(bow, norm='l1')  # the result is a sparse matrix
    print(tf.toarray())  # term frequencies
    # [[0. 0.2 0.2 0. 0. 0.2 0. 0.2 0.2]
    #  [0.25 0. 0.125 0. 0.125 0.125 0.125 0. 0.25 ]
    #  [0. 0. 0. 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]]
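L1 normalization is nothing more than dividing each count by the row total. A small sketch for the second sentence, with the count row copied from the bag-of-words matrix above (`term_frequencies` is a helper name introduced here for illustration):

```python
def term_frequencies(counts):
    """Divide each count by the total number of words in the sentence."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Counts for "The black dog is in the black room" over the vocabulary
# ['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
row = [2, 0, 1, 0, 1, 1, 1, 0, 2]
tf = term_frequencies(row)
print(tf)
# [0.25, 0.0, 0.125, 0.0, 0.125, 0.125, 0.125, 0.0, 0.25]
```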
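Document Frequency (DF)
$$\text{DF}=\frac{\text{number of documents containing the word}}{\text{total number of documents}}$$
The lower the DF, the more the word contributes to the meaning.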
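Inverse Document Frequency (IDF)
$$\text{IDF}=\log\frac{\text{total number of documents}}{\text{number of documents containing the word}+1}$$
The logarithm dampens the ratio, and the +1 in the denominator avoids division by zero. The higher the IDF, the more the word contributes to the meaning.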
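A small sketch of both quantities on the three example sentences, assuming the common logarithmic form IDF = log(N / (n + 1)), where N is the number of documents and n the number of documents containing the word. The helper names are introduced here for illustration only.

```python
import math
import re

documents = [
    "The brown dog is running.",
    "The black dog is in the black room.",
    "Running in the room is forbidden.",
]

# The set of distinct lowercase words in each document.
doc_tokens = [set(re.findall(r"\w+", d.lower())) for d in documents]
n_docs = len(documents)


def containing(word):
    """Number of documents that contain the word."""
    return sum(1 for tokens in doc_tokens if word in tokens)


def document_frequency(word):
    return containing(word) / n_docs


def inverse_document_frequency(word):
    # Assumed form: log(N / (n + 1)); libraries often use smoothed variants.
    return math.log(n_docs / (containing(word) + 1))


for w in ["brown", "dog", "the"]:
    print(w, document_frequency(w), round(inverse_document_frequency(w), 3))
```

With this form a word that occurs in every document ("the") gets a slightly negative IDF, which matches the intuition that it carries almost no distinguishing information.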
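Term Frequency-Inverse Document Frequency (TF-IDF)
Each element of the term-frequency matrix is multiplied by the inverse document frequency of the corresponding word. The larger the result, the more that word contributes to the meaning of the sample. A learning model is then built on these contributions.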
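APIs for obtaining the TF-IDF matrix:

    import sklearn.feature_extraction.text as ft

    # Build the bag-of-words model
    cv = ft.CountVectorizer()
    bow = cv.fit_transform(sentences).toarray()
    # Create the TF-IDF transformer and apply it
    tt = ft.TfidfTransformer()
    tfidf = tt.fit_transform(bow).toarray()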
Example: obtaining the TF-IDF matrix

    import nltk.tokenize as tk
    import sklearn.feature_extraction.text as ft
    import numpy as np

    doc = 'The brown dog is running. ' \
          'The black dog is in the black room. ' \
          'Running in the room is forbidden.'
    # Split doc into sentences
    sents = tk.sent_tokenize(doc)
    # Build the bag-of-words model
    cv = ft.CountVectorizer()
    bow = cv.fit_transform(sents)
    # TF-IDF
    tt = ft.TfidfTransformer()     # create the TF-IDF transformer
    tfidf = tt.fit_transform(bow)  # fit and transform
    print(np.round(tfidf.toarray(), 2))  # rounded to two decimal places
    # [[0. 0.59 0.45 0. 0. 0.35 0. 0.45 0.35]
    #  [0.73 0. 0.28 0. 0.28 0.22 0.28 0. 0.43]
    #  [0. 0. 0. 0.54 0.41 0.32 0.41 0.41 0.32]]
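The printed values do not come from the plain TF and IDF definitions. With its default settings, TfidfTransformer multiplies the raw counts by a smoothed IDF, ln((1 + N) / (1 + n)) + 1, and then scales every row to unit L2 length. The following standard-library sketch reproduces the matrix under exactly those assumptions; the regex imitates CountVectorizer's default of lowercase tokens with at least two characters.

```python
import math
import re
from collections import Counter

sentences = [
    "The brown dog is running.",
    "The black dog is in the black room.",
    "Running in the room is forbidden.",
]


def tokenize(text):
    """Lowercase tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(s)) for s in sentences]
vocab = sorted(set().union(*docs))  # alphabetical feature names
n = len(docs)

# Smoothed inverse document frequency: ln((1 + N) / (1 + n_t)) + 1
df = {w: sum(1 for d in docs if w in d) for w in vocab}
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

tfidf = []
for d in docs:
    row = [d[w] * idf[w] for w in vocab]       # raw count times IDF
    norm = math.sqrt(sum(x * x for x in row))  # L2 length of the row
    tfidf.append([x / norm if norm else 0.0 for x in row])

print(vocab)
for row in tfidf:
    print([round(x, 2) for x in row])
# [0.0, 0.59, 0.45, 0.0, 0.0, 0.35, 0.0, 0.45, 0.35]
# [0.73, 0.0, 0.28, 0.0, 0.28, 0.22, 0.28, 0.0, 0.43]
# [0.0, 0.0, 0.0, 0.54, 0.41, 0.32, 0.41, 0.41, 0.32]
```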
Text Classification (Topic Recognition)
Train a topic classifier on a given text dataset, then check it against a hand-written test set.

    import numpy as np
    import sklearn.datasets as sd
    import sklearn.feature_extraction.text as ft
    import sklearn.naive_bayes as nb

    train = sd.load_files('../machine_learning_date/20news', encoding='latin1',
                          shuffle=True, random_state=7)
    # train.data: 2968 samples, each one an e-mail document
    print(np.array(train.data).shape)    # (2968,)
    # train.target: 2968 labels, the category of each document
    print(np.array(train.target).shape)  # (2968,)
    print(train.target_names)
    # ['misc.forsale', 'rec.motorcycles', 'rec.sport.baseball', 'sci.crypt', 'sci.space']

    cv = ft.CountVectorizer()           # bag-of-words model
    tt = ft.TfidfTransformer()          # TF-IDF transformer
    bow = cv.fit_transform(train.data)  # fit the bag-of-words model
    tfidf = tt.fit_transform(bow)       # fit the TF-IDF transformer
    print(tfidf.shape)                  # (2968, 40605)

    model = nb.MultinomialNB()          # create the naive Bayes model
    model.fit(tfidf, train.target)      # train the naive Bayes model

    # Test with a hand-written test set
    test_data = [
        'The curveballs of right handed pitchers tend to curve to the left',
        'Caesar cipher is an ancient form of encryption',
        'This two-wheeler is really good on slippery roads']
    # Test data must go through the same transformations as the training data
    bow = cv.transform(test_data)
    tfidf = tt.transform(bow)
    pred_y = model.predict(tfidf)
    for sent, index in zip(test_data, pred_y):
        print(sent, '->', train.target_names[index])
    # The curveballs of right handed pitchers tend to curve to the left -> rec.sport.baseball
    # Caesar cipher is an ancient form of encryption -> sci.crypt
    # This two-wheeler is really good on slippery roads -> rec.motorcycles
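To show what the naive Bayes step does with word features, here is a compact multinomial naive Bayes sketch in pure Python. It is a simplification, not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, uses add-one smoothing, and is trained on a hypothetical four-sentence toy corpus.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


class TinyMultinomialNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log P(label) + sum of log P(word | label)
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in tokenize(text):
                if w not in self.vocab:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus, for illustration only.
train_texts = [
    "the pitcher threw a fast curveball",
    "the batter hit a home run",
    "the cipher encrypts the secret message",
    "a public key protects the encryption",
]
train_labels = ["baseball", "baseball", "crypt", "crypt"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("an ancient cipher for encryption"))  # crypt
print(model.predict("the pitcher hit the batter"))        # baseball
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.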
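Gender Identification
Train a classifier provided by NLTK on the male and female English names in the names corpus, then use it to predict the gender of new names.
NLTK corpus and classification APIs:

    import nltk.corpus as nc
    import nltk.classify as cf

    # Read male.txt from the names folder of the corpus as a list of words
    male_names = nc.names.words('male.txt')
    '''
    train_data is no longer a sample matrix; NLTK expects the following format:
    [ ({'age': 15, 'score1': 95, 'score2': 95}, 'good'),
      ({'age': 15, 'score1': 45, 'score2': 55}, 'bad') ]
    '''
    # Train a naive Bayes classifier on the training data
    model = cf.NaiveBayesClassifier.train(train_data)
    # Compute the accuracy on the test data (same format as the training data)
    ac = cf.accuracy(model, test_data)
    # Classify one specific sample
    feature = {'age': 15, 'score1': 95, 'score2': 95}
    label = model.classify(feature)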
Example:

    import random
    import nltk.corpus as nc
    import nltk.classify as cf

    male_names = nc.names.words('male.txt')
    female_names = nc.names.words('female.txt')
    data = []
    for male_name in male_names:
        feature = {'feature': male_name[-2:].lower()}  # last two letters of the name
        data.append((feature, 'male'))
    for female_name in female_names:
        feature = {'feature': female_name[-2:].lower()}
        data.append((feature, 'female'))
    random.seed(7)
    random.shuffle(data)
    train_data = data[:int(len(data) / 2)]  # first half of the dataset: training data
    test_data = data[int(len(data) / 2):]   # second half of the dataset: test data
    model = cf.NaiveBayesClassifier.train(train_data)  # naive Bayes classifier
    ac = cf.accuracy(model, test_data)

    names, genders = ['Leonardo', 'Amy', 'Sam', 'Tom', 'Katherine', 'Taylor', 'Susanne'], []
    for name in names:
        feature = {'feature': name[-2:].lower()}
        gender = model.classify(feature)
        genders.append(gender)
    for name, gender in zip(names, genders):
        print(name, '->', gender)
    # Leonardo -> male
    # Amy -> female
    # Sam -> male
    # Tom -> male
    # Katherine -> female
    # Taylor -> male
    # Susanne -> female
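The data preparation in this example can be isolated into two small steps: turning a name into a feature dictionary and splitting the shuffled data in half. A standard-library sketch follows; the labelled names are a hypothetical toy list, and `suffix_feature` is a helper name introduced here.

```python
import random


def suffix_feature(name, n=2):
    """Feature dictionary holding the last n letters of the name, lowercased."""
    return {"feature": name[-n:].lower()}


# Hypothetical toy data standing in for the names corpus.
labelled = [
    ("Leonardo", "male"), ("Tom", "male"), ("Sam", "male"),
    ("Amy", "female"), ("Katherine", "female"), ("Susanne", "female"),
]
data = [(suffix_feature(name), gender) for name, gender in labelled]

# A dedicated Random instance keeps the shuffle reproducible without
# touching the global random state.
random.Random(7).shuffle(data)
half = len(data) // 2
train_data, test_data = data[:half], data[half:]

print(suffix_feature("Leonardo"))  # {'feature': 'do'}
print(len(train_data), len(test_data))  # 3 3
```

Shuffling before the split matters here because the original list is ordered by gender; without it, one half would contain mostly male names.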
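NLTK Classifier
NLTK provides a naive Bayes classifier that is convenient for classification problems in natural language processing. It is trained directly on feature dictionaries, so no bag-of-words or TF-IDF matrix has to be assembled by hand; once trained, it predicts the category of new samples. Usage:

    import nltk.classify as cf
    import nltk.classify.util as cu

    '''
    train_data is no longer a sample matrix; NLTK expects the following format:
    [ ({'How': 1, 'are': 1, 'you': 1}, 'ask'),
      ({'fine': 1, 'Thanks': 2}, 'answer') ]
    '''
    # Train a naive Bayes classifier on the training data
    model = cf.NaiveBayesClassifier.train(train_data)
    ac = cu.accuracy(model, test_data)
    print(ac)
    # classify() takes the feature dictionary of a single sample, not the whole test set
    pred = model.classify({'How': 1, 'are': 1, 'you': 1})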
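The `(feature dictionary, label)` pairs in that format can be produced from raw sentences with a word counter. A minimal sketch, assuming whitespace tokenization; `to_labeled_featureset` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter


def to_labeled_featureset(sentence, label):
    """Turn a sentence into a ({word: count}, label) pair."""
    counts = Counter(sentence.split())
    return (dict(counts), label)


train_data = [
    to_labeled_featureset("How are you", "ask"),
    to_labeled_featureset("fine Thanks Thanks", "answer"),
]
print(train_data)
# [({'How': 1, 'are': 1, 'you': 1}, 'ask'), ({'fine': 1, 'Thanks': 2}, 'answer')]
```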
Sentiment Analysis
Analyze the movie_reviews corpus: train on positive and negative reviews to build a sentiment classifier.

    import nltk.corpus as nc
    import nltk.classify as cf
    import nltk.classify.util as cu

    # Collect all positive samples
    # pdata: [({word: True, ...}, 'POSITIVE'), ...]
    pdata = []
    # IDs of the files in the pos folder
    fileids = nc.movie_reviews.fileids('pos')
    # print(len(fileids))
    # Store the words of every positive review in pdata
    for fileid in fileids:
        sample = {}
        # words: the tokenized words of the current document
        words = nc.movie_reviews.words(fileid)
        for word in words:
            sample[word] = True
        pdata.append((sample, 'POSITIVE'))
    # Collect all negative samples in ndata
    ndata = []
    fileids = nc.movie_reviews.fileids('neg')
    for fileid in fileids:
        sample = {}
        words = nc.movie_reviews.words(fileid)
        for word in words:
            sample[word] = True
        ndata.append((sample, 'NEGATIVE'))

    # Split into training and test sets (80% for training)
    pnumb, nnumb = int(0.8 * len(pdata)), int(0.8 * len(ndata))
    train_data = pdata[:pnumb] + ndata[:nnumb]
    test_data = pdata[pnumb:] + ndata[nnumb:]
    # Train a naive Bayes classifier on the training data
    model = cf.NaiveBayesClassifier.train(train_data)
    ac = cu.accuracy(model, test_data)
    print(ac)

    # Simulate a business scenario
    reviews = [
        'It is an amazing movie.',
        'This is a dull movie. I would never recommend it to anyone.',
        'The cinematography is pretty great in this movie.',
        'The direction was terrible and the story was all over the place.']
    for review in reviews:
        sample = {}
        # Note: split() keeps case and leaves punctuation attached (e.g. 'movie.')
        words = review.split()
        for word in words:
            sample[word] = True
        pcls = model.classify(sample)
        print(review, '->', pcls)
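The features used here are word-presence flags: each distinct token maps to True. New reviews should be tokenized the same way as the training text, otherwise tokens such as 'movie.' never match a training feature. A hedged sketch of a feature extractor that lowercases and separates punctuation, on the assumption that the corpus tokens are lowercase with punctuation as separate tokens; `presence_features` is a name introduced here.

```python
import re


def presence_features(text):
    """Map every distinct lowercase word or punctuation mark to True."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return {token: True for token in tokens}


print(presence_features("It is an amazing movie."))
# {'it': True, 'is': True, 'an': True, 'amazing': True, 'movie': True, '.': True}
```

The resulting dictionary can be passed to `model.classify(...)` in place of the `sample` built with `review.split()`.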
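Topic Extraction
After tokenization, word cleaning and stemming, a topic model can extract the core topic words of a text and thereby identify its topic. This is unsupervised learning. The gensim module provides common tools for topic extraction; the code in this section uses its LDA (latent Dirichlet allocation) model on bag-of-words counts.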
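Topic extraction APIs:

    import gensim.models.ldamodel as gm
    import gensim.corpora as gc

    # lines_tokens: one list of tokens per document
    lines_tokens = [['hello', 'world', ...], ...]
    # Store every word in lines_tokens in a gensim Dictionary, which assigns each word an integer id
    dic = gc.Dictionary(lines_tokens)
    # Build the bag-of-words corpus from the dictionary, one entry per document
    bow = [dic.doc2bow(line_tokens) for line_tokens in lines_tokens]
    # Build the LDA model
    # bow: bag-of-words corpus
    # num_topics: number of topics
    # id2word: dictionary
    # passes: number of training passes over the corpus
    model = gm.LdaModel(bow, num_topics=n_topics, id2word=dic, passes=25)
    # For each topic, output the 4 words that contribute most to it
    topics = model.print_topics(num_topics=n_topics, num_words=4)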
Example:

    import nltk.tokenize as tk
    import nltk.corpus as nc
    import nltk.stem.snowball as sb
    import gensim.models.ldamodel as gm
    import gensim.corpora as gc

    doc = []
    with open('../machine_learning_date/topic.txt', 'r') as f:
        for line in f.readlines():
            doc.append(line[:-1])  # drop the trailing newline
    tokenizer = tk.WordPunctTokenizer()
    stopwords = nc.stopwords.words('english')
    signs = [',', '.', '!']
    stemmer = sb.SnowballStemmer('english')
    lines_tokens = []
    for line in doc:
        tokens = tokenizer.tokenize(line.lower())
        line_tokens = []
        for token in tokens:
            if token not in stopwords and token not in signs:
                token = stemmer.stem(token)
                line_tokens.append(token)
        lines_tokens.append(line_tokens)
    # Store every word in lines_tokens in a gensim Dictionary, which assigns each word an integer id
    dic = gc.Dictionary(lines_tokens)
    # Go through every line and build the bag-of-words list
    bow = []
    for line_tokens in lines_tokens:
        row = dic.doc2bow(line_tokens)
        bow.append(row)
    n_topics = 2
    # Build the LDA model from the corpus, the number of topics, the dictionary
    # and the number of training passes
    model = gm.LdaModel(bow, num_topics=n_topics, id2word=dic, passes=25)
    # For each topic, output the 4 words that contribute most to it
    topics = model.print_topics(num_topics=n_topics, num_words=4)
    for label, words in topics:
        print(label, '->', words)
    # 0 -> 0.022*"cryptographi" + 0.022*"use" + 0.022*"need" + 0.013*"cryptograph"
    # 1 -> 0.046*"spaghetti" + 0.021*"made" + 0.021*"italian" + 0.015*"19th"
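The corpus passed to the LDA model is a list of sparse bag-of-words rows, each a list of (word id, count) pairs. The sketch below imitates that preprocessing with the standard library only. The stop-word set and the two input lines are hypothetical, stemming is omitted, and ids are assigned by first appearance, which may differ from the numbering gensim chooses.

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "of", "and", "in", "to"}  # tiny hypothetical list


def preprocess(line):
    """Lowercase, keep word tokens only, drop stop words."""
    tokens = re.findall(r"\w+", line.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_dictionary(lines_tokens):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for tokens in lines_tokens:
        for t in tokens:
            token2id.setdefault(t, len(token2id))
    return token2id


def doc2bow(tokens, token2id):
    """Sparse bag of words: sorted (token id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


lines = ["The cipher hides the message", "Spaghetti and more spaghetti"]
lines_tokens = [preprocess(line) for line in lines]
token2id = build_dictionary(lines_tokens)
bow = [doc2bow(tokens, token2id) for tokens in lines_tokens]

print(token2id)
# {'cipher': 0, 'hides': 1, 'message': 2, 'spaghetti': 3, 'more': 4}
print(bow)
# [[(0, 1), (1, 1), (2, 1)], [(3, 2), (4, 1)]]
```

Only non-zero counts are stored, which keeps the corpus small when the vocabulary is large and each document uses a small part of it.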