Unsupervised Word Extraction in Python: Fast, Accurate Word Segmentation for SEO

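For SEO, fast and accurate word segmentation is a great help when extracting tags for aggregation and when linking related content.
Many current segmentation tools are based on unigram segmentation and need a dictionary to support them.
This post draws on the first chapter of the Google 黑板报 (Google China Blog) to show how a statistical model can be used for segmentation instead.
The method considers three dimensions: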
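Cohesion: how far the probability of two characters appearing consecutively departs from what independence would predict. For example, suppose "上" occurs with probability 1×10^-5 and "床" with probability 1×10^-10. If the two characters had low cohesion, the probability of "上床" should be close to 1×10^-15, but in fact it is around 1×10^-11, far higher than the product of the individual probabilities. So we can treat "上床" as a word.
Left-neighbor entropy: the information content of the character immediately to the left of a candidate word. For example, "巴掌" (a slap) is essentially only used in "打巴掌", "一巴掌" and "拍巴掌", so its left-neighbor entropy is low. By contrast, "过去" (to go over) can be preceded by many characters, as in "走过去", "跑过去", "爬过去", "打过去", "混过去", "睡过去", "死过去", "飞过去" and so on, so its entropy is very high.
Right-neighbor entropy: the information content of the character immediately to the right of a candidate word, defined in the same way.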
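The two measures above can be sketched in a few lines. This is only an illustration, not the demo's code: the function names `cohesion_ratio` and `neighbor_entropy` are hypothetical, the probabilities are the ones quoted in the text, and the neighbor lists assume each example phrase occurs exactly once.

```python
import math
from collections import Counter


def cohesion_ratio(p_joint, p_parts):
    """Observed probability of a string divided by the product of its parts."""
    return p_joint / math.prod(p_parts)


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a list of neighboring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Figures from the text: P(上) = 1e-5, P(床) = 1e-10, P(上床) = 1e-11.
ratio = cohesion_ratio(1e-11, [1e-5, 1e-10])

# Left neighbors taken from the examples in the text.
# Assumption: each phrase occurs once, so the distribution is uniform.
few = neighbor_entropy(list("打一拍"))             # left neighbors of 巴掌
many = neighbor_entropy(list("走跑爬打混睡死飞"))   # left neighbors of 过去

print(ratio, few, many)
```

Under these assumptions the cohesion ratio comes out at about 10^4, and the entropies are ln 3 and ln 8 respectively, so "过去" has the more varied left context.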
Below is a demo implemented in Python (reposted from: /forum.php?mod=viewthread&tid=20)
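#!/bin/sh
python ./splitstr.py > substr.freq
python ./cntfreq.py > word.freq
python ./findwords.py > result
# Sort by the second (tab-separated) column, numerically, in descending order.
sort -t "$(printf '\t')" -r -n -k 2 result > result.sort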
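For readers without a Unix `sort`, the last step of the pipeline can be approximated in Python. This is a sketch only: `sort_by_freq` is a hypothetical helper, and it assumes each line has the form word, tab, frequency, as printed by findwords.py.

```python
def sort_by_freq(lines):
    """Sort 'word<TAB>freq' lines by the numeric second column, descending."""
    rows = []
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) >= 2:
            rows.append((parts[0], float(parts[1])))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Three sample lines in the assumed format.
sample = ["北京\t0.000150\n", "视频\t0.000237\n", "轴承\t0.000184\n"]
top = sort_by_freq(sample)
for word, freq in top:
    print('%s\t%f' % (word, freq))
```

To use it on the real output, pass an open file object for `result` instead of the sample list.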
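splitstr.py cuts out every substring of up to 10 characters, computes its frequency, left-neighbor entropy and right-neighbor entropy, and prints the substrings that occur 10 or more times: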
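import math
from collections import Counter


def compute_entropy(word_list):
    # Shannon entropy (natural log) of the items in word_list.
    wdict = Counter(word_list)
    tot_cnt = len(word_list)
    ent = 0.0
    for k, v in wdict.items():
        p = 1.0 * v / tot_cnt
        ent -= p * math.log(p)
    return ent


def count_substr_freq():
    str_freq = {}
    str_left_word = {}
    str_right_word = {}
    tot_cnt = 0
    with open('./video.corpus', encoding='utf-8') as fp:
        for line in fp:
            st = line.strip('\n')
            l = len(st)
            for i in range(l):
                # NOTE: the body of this loop is truncated in the source. It is
                # reconstructed from the description above (substrings of at
                # most 10 characters, '^' as the start-of-line neighbor).
                # The end-of-line sentinel '$' is an assumption.
                for j in range(i + 1, min(i + 10, l) + 1):
                    w = st[i:j]
                    if i > 0:
                        left_word = st[i - 1]
                    else:
                        left_word = '^'
                    if j < l:
                        right_word = st[j]
                    else:
                        right_word = '$'
                    str_freq[w] = str_freq.get(w, 0) + 1
                    str_left_word.setdefault(w, []).append(left_word)
                    str_right_word.setdefault(w, []).append(right_word)
                    tot_cnt += 1
    for k, v in str_freq.items():
        if v >= 10:
            left_ent = compute_entropy(str_left_word[k])
            right_ent = compute_entropy(str_right_word[k])
            print('%s\t%f\t%f\t%f' % (k, v * 1.0 / tot_cnt, left_ent, right_ent))


if __name__ == '__main__':
    count_substr_freq()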
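To see what the enumeration step produces, here is a small in-memory sketch run on a three-character string. The generator name `substrings_with_neighbors`, the `max_len` parameter and the `'$'` end marker are assumptions made for this illustration; only the `'^'` start marker comes from the demo.

```python
def substrings_with_neighbors(text, max_len=10):
    """Yield (substring, left_char, right_char) for substrings up to max_len."""
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            left = text[i - 1] if i > 0 else '^'   # '^' marks start of line
            right = text[j] if j < n else '$'      # '$' marks end of line (assumed)
            yield text[i:j], left, right


# A short title, limited to substrings of at most 2 characters.
triples = list(substrings_with_neighbors("看电影", max_len=2))
for substr, left, right in triples:
    print(substr, left, right)
```

Collecting the left and right characters per substring over a whole corpus gives the lists whose entropy is then computed.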
cntfreq.py computes the frequency of each character:
def count_freq():
    word_freq = {}
    tot_cnt = 0.0
    with open('./substr.freq', encoding='utf-8') as fp:
        for line in fp:
            items = line.split('\t')
            if len(items) < 2:
                continue
            st = items[0]
            freq = float(items[1])
            # Credit the substring's frequency to each of its characters.
            for w in st:
                if w not in word_freq:
                    word_freq[w] = 0.0
                word_freq[w] += freq
                tot_cnt += freq
    # Normalize so that the character frequencies sum to 1.
    for x, y in word_freq.items():
        if x:
            freq = y * 1.0 / tot_cnt
            print('%s\t%f' % (x, freq))


if __name__ == '__main__':
    count_freq()
findwords.py prints the strings that have high cohesion and whose left-neighbor and right-neighbor entropies are both fairly high:
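def load_dict(filename):
    # Load "character<TAB>frequency" lines into a dict.
    char_freq = {}
    with open(filename, encoding='utf-8') as fp:
        for line in fp:
            item = line.strip('\n').split('\t')
            if len(item) == 2:
                char_freq[item[0]] = float(item[1])
    return char_freq


def compute_prob(s, char_freq):
    # Probability of s if its characters occurred independently.
    p = 1.0
    for w in s:
        if w in char_freq:
            p *= char_freq[w]
    return p


def is_ascii(s):
    return all(ord(c) < 128 for c in s)


def find_compact_substr(char_freq):
    with open('./substr.freq', encoding='utf-8') as fp:
        for line in fp:
            items = line.strip('\n').split('\t')
            # NOTE: the source is truncated between "len(items)" and "5.0".
            # The field parsing and the cohesion ratio below are reconstructed
            # from the format written by splitstr.py and are an assumption.
            if len(items) < 4:
                continue
            substr = items[0]
            freq = float(items[1])
            left_ent = float(items[2])
            right_ent = float(items[3])
            p = compute_prob(substr, char_freq)
            if p <= 0.0:
                continue  # avoid dividing by a frequency rounded down to zero
            freq_ratio = freq / p
            if (freq_ratio > 5.0 and left_ent > 2.5 and right_ent > 2.5
                    and len(substr) >= 2 and not is_ascii(substr)):
                print('%s\t%f' % (substr, freq))


if __name__ == '__main__':
    char_freq = load_dict('./word.freq')
    find_compact_substr(char_freq)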
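The filtering rule can also be tried in memory, without the intermediate files. This is a sketch under assumptions: `is_candidate` is a hypothetical helper, the thresholds 5.0 and 2.5 are the ones visible in the demo, reading 5.0 as a bound on observed frequency divided by the product of character frequencies is an interpretation, and the numbers in `char_freq` are made up.

```python
def is_candidate(substr, freq, left_ent, right_ent, char_freq,
                 min_ratio=5.0, min_ent=2.5):
    """Return True if substr passes the cohesion and entropy thresholds."""
    if len(substr) < 2 or substr.isascii():
        return False
    independent = 1.0
    for ch in substr:
        independent *= char_freq.get(ch, 1.0)  # unknown characters are skipped
    if independent <= 0.0:
        return False
    return (freq / independent > min_ratio
            and left_ent > min_ent and right_ent > min_ent)


# Made-up character frequencies, for illustration only.
char_freq = {"电": 0.01, "影": 0.02, "的": 0.05}

# Cohesion ratio 0.002 / (0.01 * 0.02) = 10, both entropies above 2.5.
accepted = is_candidate("电影", 0.002, 3.0, 3.1, char_freq)
# Cohesion ratio 0.0006 / (0.05 * 0.01) = 1.2, left entropy too low.
rejected = is_candidate("的电", 0.0006, 0.5, 3.0, char_freq)
print(accepted, rejected)
```

Raising `min_ratio` or `min_ent` makes the extraction stricter, which trades recall for fewer fragments that merely co-occur often.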
Run on the titles of 30,000 videos, the 50 most frequent extracted words are:
视频	0.000237
轴承	0.000184
北京	0.000150
中国	0.000134
高清	0.000109
搞笑	0.000101
新闻	0.000100
上海	0.000100
美女	0.000092
演唱	0.000085
音乐	0.000082
——	0.000082
第二	0.000080
少女	0.000078
最新	0.000074
广场	0.000070
世界	0.000070
现场	0.000066
娱乐	0.000066
大学	0.000064
公司	0.000064
舞蹈	0.000063
电视	0.000063
教学	0.000060
我们	0.000060
国语	0.000059
经典	0.000056
字幕	0.000055
宣传	0.000053
钢管	0.000051
游戏	0.000050
电影	0.000049
演唱会	0.000046
日本	0.000045
小学	0.000045
快乐	0.000044
超级	0.000043
第三	0.000042
宝宝	0.000042
学生	0.000042
广告	0.000041
培训	0.000041
视频	0.000040
美国	0.000040
爱情	0.000039
老师	0.000038
动画	0.000038
教程	0.000037
广州	0.000037
学院	0.000035
