New Year Special -- Learn Probabilistic Data Structures in 10 Minutes
You use hashmap, tree, set, vector, queue/stack/heap, linked list, and graph every day, so you may feel data structures have nothing new to offer. With the New Year upon us, 卿哥 is sharing a higher-level family of structures built for big data: probabilistic data structures, which some people also file under approximation algorithms or online algorithms. Whether you are doing system design, chatting with your boss and coworkers, or interviewing for a job, these will make people's eyes light up. Today I will teach you five moves, affectionately the "Five Strokes of the Dog-Beating Staff": HyperLogLog for set cardinality, Bloom filter for set membership, MinHash for set similarity, count-min sketch for frequency estimation, and t-digest for stream statistics. Digest these five moves and you are ready to roam the jianghu and cross swords at Mount Hua.
Set cardinality — HyperLogLog
Set cardinality means counting distinct elements, and it is a very common need: how many unique IPs visited my site over some period? These are counting-unique problems. The obvious approach is to build a hashset, throw everything in, and report its size at the end. But what if the data is so big the hashset no longer fits in memory? Then you add machines with memcached or redis, and spill to disk when memory runs out. In short, brute force.
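For reference, the exact-but-memory-hungry baseline looks like this (a minimal sketch; ip_stream is a made-up sample standing in for your access log):

import itertools

# Hypothetical stand-in for a stream of visitor IPs from an access log.
ip_stream = itertools.chain(["1.2.3.4", "5.6.7.8"] * 1000, ["9.9.9.9"])

# Exact distinct count: memory grows with the number of distinct items.
unique_ips = set()
for ip in ip_stream:
    unique_ips.add(ip)
print("unique visitors: %d" % len(unique_ips))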
At this point you can flip the approach and trade a little accuracy for a lot of memory. That is HyperLogLog! The example below uses a 1% error bound to produce an approximate count; it is extremely fast and uses remarkably little memory. It is based on the implementation at https://github.com/svpcom/hyperloglog:
#!/usr/bin/python
import re

jabber_text = """
`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand;
Long time the manxome foe he sought-
So rested he by the Tumtum tree
And stood awhile in thought.

And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.

`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"""

packer_text = """
My answers are inadequate
To those demanding day and date
And ever set a tiny shock
Through strangers asking what's o'clock;
Whose days are spent in whittling rhyme-
What's time to her, or she to Time?
"""

def clean_words(text):
    return filter(lambda x: len(x) > 0, re.sub("[^A-Za-z]", " ", text).split(" "))

jabber_words = clean_words(jabber_text.lower())
#print jabber_words

packer_words = clean_words(packer_text.lower())
#print packer_words

jabber_uniq = sorted(set(jabber_words))
#print jabber_uniq

import hyperloglog

hll = hyperloglog.HyperLogLog(0.01)
for word in jabber_words:
    hll.add(word)

print "prob count %d, true count %d" % (len(hll), len(jabber_uniq))
print "observed error rate %0.2f" % (abs(len(hll) - len(jabber_uniq)) / float(len(jabber_uniq)))
Output:
prob count 90, true count 91
observed error rate 0.01
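Why does this work? The core trick, shown below in a toy single-register sketch of my own (not the real algorithm), is: hash every element and remember the longest run of leading zero bits you have seen. A run of length r is roughly a 1-in-2^(r+1) event, so the distinct count is on the order of 2^(r+1). Real HyperLogLog keeps many such registers and combines them with a harmonic mean to tame the variance:

import hashlib

def rough_cardinality(items, bits=32):
    # Toy one-register estimator: track the longest run of leading zero
    # bits over all hashed items, then guess 2^(run+1) distinct items.
    max_zeros = 0
    for x in items:
        h = int(hashlib.md5(x.encode('utf8')).hexdigest(), 16) & ((1 << bits) - 1)
        b = bin(h)[2:].zfill(bits)
        zeros = len(b) - len(b.lstrip('0'))  # leading zero bits
        max_zeros = max(max_zeros, zeros)
    return 2 ** (max_zeros + 1)

print("toy estimate %d vs true %d" % (rough_cardinality(set(jabber_words)), len(set(jabber_words))))

A single register is very noisy (its answer is always a power of two), which is exactly why the real algorithm averages over many registers.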
Set membership — Bloom Filter
Bloom filter is the one structure here you should absolutely master; if you read my "Let's Talk Cassandra" post it will sound familiar. On Cassandra's read path, if the memtable misses, the bloom filter is consulted next: if it says the key is absent, the read stops right there, because the key truly is absent; if it says the key is present, Cassandra goes on to check the key cache, etc. That "no means no" guarantee is the killer property. The catch is that "yes" only means "maybe", but the false-positive rate is tunable, here to 0.001. Below is an example using https://github.com/jaybaird/python-bloomfilter (I will not repeat the code from above):
from pybloom import BloomFilter

bf = BloomFilter(capacity=1000, error_rate=0.001)
for word in packer_words:
    bf.add(word)

intersect = set([])
for word in jabber_words:
    if word in bf:
        intersect.add(word)

print intersect
Output:
set(['and', 'in', 'o', 'to', 'through', 'time', 'my', 'day'])
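To see where "no means no" comes from, here is a hand-rolled toy Bloom filter (my own illustration, not pybloom's internals): on insert, k hash functions each set one bit in an m-bit array, and a lookup reports "maybe" only when all k bits are set. An inserted element always finds its own bits set, so there are no false negatives; unrelated elements occasionally collide on all k bits, which is the tunable false-positive rate.

import hashlib

class ToyBloom(object):
    def __init__(self, m=8192, k=3):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions from one digest by slicing it (toy approach).
        d = hashlib.md5(item.encode('utf8')).hexdigest()
        return [int(d[i * 8:(i + 1) * 8], 16) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # All k bits set -> "maybe there"; any bit clear -> definitely not.
        return all(self.bits[p] for p in self._positions(item))

tb = ToyBloom()
tb.add("jabberwock")
print("jabberwock" in tb)    # True: no false negatives for inserted items
print("bandersnatch" in tb)  # almost certainly False at this fill level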
Set Similarity — MinHash
MinHash answers: given two documents, how similar are they? That sounds handy for detecting plagiarism and the like. Let's look at it through the implementation at https://github.com/ekzhu/datasketch:
from datasketch import MinHash

def mh_digest(data):
    m = MinHash(num_perm=512)  # number of permutations
    for d in data:
        m.update(d.encode('utf8'))
    return m

m1 = mh_digest(set(jabber_words))
m2 = mh_digest(set(packer_words))
print "Jaccard similarity %f" % m1.jaccard(m2), "estimated"

s1 = set(jabber_words)
s2 = set(packer_words)
actual_jaccard = float(len(s1.intersection(s2))) / float(len(s1.union(s2)))
print "Jaccard similarity %f" % actual_jaccard, "actual"
Output:
Jaccard similarity 0.060547 estimated
Jaccard similarity 0.069565 actual
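Under the hood the trick is elegant: for a random hash function h, the probability that min h(A) equals min h(B) is exactly the Jaccard similarity of A and B, so the fraction of matching minimums over many independent hash functions estimates it. A toy sketch of the idea (my own code, not the datasketch internals), reusing the word sets from above:

import hashlib

def min_hash(s, seed):
    # Seeded hash simulates one random permutation; keep only the minimum.
    return min(int(hashlib.md5(("%d:%s" % (seed, x)).encode('utf8')).hexdigest(), 16)
               for x in s)

def jaccard_estimate(a, b, num_perm=512):
    # Fraction of hash functions whose minimums agree across the two sets.
    matches = sum(1 for seed in range(num_perm)
                  if min_hash(a, seed) == min_hash(b, seed))
    return float(matches) / num_perm

print("toy estimate %f" % jaccard_estimate(set(jabber_words), set(packer_words)))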
Frequency Summaries — count-min sketch
Frequency-estimation algorithms typically show up in leaderboards, i.e., tracking the currently most frequent items.
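The count-min sketch itself is a small d x w grid of counters with one hash function per row: an increment bumps one counter in each row, and a query takes the minimum over the rows. Collisions can only inflate counters, so the estimate never undercounts and with high probability overcounts only slightly. A minimal hand-rolled sketch of the idea (my own toy code, not any particular library):

import hashlib

class ToyCountMin(object):
    def __init__(self, w=1000, d=4):
        self.w, self.d = w, d  # d rows of w counters
        self.table = [[0] * w for _ in range(d)]

    def _pos(self, row, item):
        digest = hashlib.md5(("%d:%s" % (row, item)).encode('utf8')).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._pos(row, item)] += count

    def query(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is an upper-bound estimate of the true frequency.
        return min(self.table[row][self._pos(row, item)] for row in range(self.d))

cms = ToyCountMin()
for word in jabber_words:
    cms.add(word)
print("estimated count of 'the': %d" % cms.query("the"))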