Counting the palindromic words in an English text with Python is straightforward, but a few details need care.

1. Requirements:

Given a plain English text, compute the proportion of palindromic words in it and print those words. The text data is as follows:

This is Everyday Grammar. I am Madam Lucija
And I am Kaveh. Why the title, Lucija?
Well, it is a special word. Madam?
Yeah, maybe I should spell it for you forward or backward?
I am lost. The word Madam is a Palindrome.
I just learned about them the other day and I am having a lot of fun!
Palindrome, huh? Let me try!
But first, we need to explain what a Palindrome is.
That is easy! Palindromes are words, phrases or numbers that read the same back and forward, like DAD.
So, Palindromes can be serious or just silly.
Yup, like, A nut for a jar of tuna.
Or, Borrow or Rob. Probably borrow!
And if you are hungry, you can always have a Taco cat?
That is gross. What about this one?
A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal panama!
That is a real tongue twister. But I prefer Italy. Amore Roma!
So how do we make palindromes?
One, read words backwards and see if they make sense.
Two, try to make palindromes where even the spacing between words is consistent. Like, NotATon.
And three, you can always check the internet for great palindromes!
And that is Everyday Grammar.

Note:

  • Words are case-sensitive: the uppercase and lowercase forms of the same word count as different words.
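
To make the consequence of this rule concrete, here is a minimal sketch. The helper name `is_palindrome` is hypothetical (it does not appear in the code of section 3), and it uses slice reversal rather than the character-by-character loop shown later. Under a case-sensitive comparison, a capitalized word such as "Madam" does not count as a palindrome, while "DAD" does.

```python
def is_palindrome(word):
    """Return True if word reads the same in both directions (case-sensitive)."""
    return word == word[::-1]


print(is_palindrome("Madam"))          # False: 'M' and 'm' are different characters
print(is_palindrome("madam"))          # True
print(is_palindrome("DAD"))            # True
print(is_palindrome("Madam".lower()))  # True, but only after folding case
```

Folding case with `lower()` would change the result for words like "Madam", so it is deliberately not applied in this task.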

2. Analysis

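The approach is simple. The basic steps are:

  • Step 1: Read the text data and strip the newline characters.
  • Step 2: Remove the punctuation from the text produced by step 1. A regular expression keeps only the words, which has the effect of dropping the punctuation. The words from every line are then collected into a single list.
  • Step 3: Build a word-frequency dictionary from the preprocessed text. A text can contain many repeated words, so it is enough to test each distinct word (each key of the dictionary) once.
  • Step 4: Iterate over the keys of the dictionary and check whether each one is a palindrome; see the code in section 3 for the implementation.
  • Step 5: Use the palindromic words found to compute the proportion of palindromic words in the text.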
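
The steps above can be sketched on an in-memory string before reading a real file. This is only an illustration under stated assumptions: the names `tokenize` and `palindrome_stats` and the sample sentence are made up for this example, and the palindrome test uses slice reversal instead of the loop in section 3.

```python
import re
from collections import Counter


def tokenize(text):
    # Steps 1-2: \w+ matches runs of word characters only, so newlines
    # and punctuation are skipped (note that \w also matches digits and "_").
    return re.findall(r'\b\w+\b', text)


def palindrome_stats(tokens):
    # Step 3: word-frequency dictionary, one key per distinct word.
    counts = Counter(tokens)
    # Step 4: case-sensitive palindrome test on the distinct words only.
    palindromes = [w for w in counts if w == w[::-1]]
    # Step 5: weight every palindrome by its number of occurrences.
    hits = sum(counts[w] for w in palindromes)
    ratio = hits / len(tokens) if tokens else 0.0
    return palindromes, ratio


sample = "Dad said DAD did a level best."  # hypothetical sample sentence
words, ratio = palindrome_stats(tokenize(sample))
print(words)                   # ['DAD', 'did', 'a', 'level']
print("{:.3f}".format(ratio))  # 0.571
```

"Dad" is rejected because 'D' and 'd' differ, so 4 of the 7 tokens are palindromes.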

3. Code

import re
from collections import Counter


# Preprocess the text; returns a flat list of words such as ['This', 'is', 'Everyday']
def process(path):
    token = []
    with open(path, 'r', encoding='utf-8') as f:
        for row_text in f:
            # \w+ keeps only word characters, which drops punctuation and the trailing newline
            token.extend(re.findall(r'\b\w+\b', row_text))
    return token


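# Count the palindromic words
def palindrome(processed_text):
    if not processed_text:  # avoid dividing by zero on an empty text
        print("No words found.")
        return
    c = Counter(processed_text)  # word-frequency dictionary
    palindrome_word = []  # distinct palindromic words
    not_palindrome_word = []  # distinct non-palindromic words
    # Iterate over the distinct words
    for word in c.keys():
        flag = True
        i, j = 0, len(word)-1
        # Compare characters from both ends, moving inward (case-sensitive)
        while i < j:
            if word[i] != word[j]:
                not_palindrome_word.append(word)  # not a palindrome
                flag = False
                break
            i += 1
            j -= 1
        if flag:
            palindrome_word.append(word)  # a palindrome
    print("Palindromic words:")
    print(palindrome_word)
    print("Non-palindromic words:")
    print(not_palindrome_word)
    # Proportion of palindromic words, weighted by how often each one occurs
    total_palindrome_word = 0
    for word in palindrome_word:
        total_palindrome_word += c[word]
    print("Proportion of palindromic words: {:.3f}".format(total_palindrome_word / len(processed_text)))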


def main():
    text_path = 'test.txt'
    processed_text = process(text_path)
    palindrome(processed_text)


if __name__ == '__main__':
    main()
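
As a quick sanity check, the two-pointer loop can be pulled out into a function of its own and compared against slice reversal. This is a sketch, not part of the original program: the function name `is_palindrome_two_pointer` and the word list are assumptions chosen for illustration, with a few words taken from the sample text.

```python
def is_palindrome_two_pointer(word):
    """Case-sensitive palindrome test that walks inward from both ends."""
    i, j = 0, len(word) - 1
    while i < j:
        if word[i] != word[j]:
            return False  # first mismatch settles it
        i += 1
        j -= 1
    return True  # also reached for empty and single-character strings


# Hypothetical test words; several come from the sample text.
for w in ["DAD", "Madam", "NotATon", "level", "a", "I"]:
    # Both formulations should always agree.
    assert is_palindrome_two_pointer(w) == (w == w[::-1])
    print(w, is_palindrome_two_pointer(w))
```

Single-letter words such as "a" and "I" pass the test trivially, so they are counted as palindromes and raise the reported proportion.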


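Copyright notice: This is an original article by victorxiao, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Article link: https://www.cnblogs.com/victorxiao/p/12884047.html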