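GitHub - charlesXu86/char_featurizer: A feature extraction tool for Chinese characters. It extracts pronunciation (initial, final, tone), glyph (components, radical), and four-corner code features, which can also be fed to a model as tensors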

char_featurizer

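char_featurizer is a feature extraction tool for Chinese characters. It extracts a character's pronunciation (initial, final, and tone), glyph structure (components and radical), and four-corner code. It can also convert these features into tensors for use as model input features. The project optimizes and consolidates the character extraction tool written by 安德森.

char_featurizer currently supports:

1. Glyph feature extraction

2. Pronunciation feature extraction

3. Four-corner code extraction

4. Tensor conversion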

II. Installation and Usage

1. Installation

pip install char_featurizer

2. Usage

1. Character feature extraction

from char_featurizer import Featurizer

featurizer = Featurizer()

data = '明天去你家玩'

result = featurizer.featurize(data)
print(result)

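Result:
([['m'], ['t'], ['q'], ['n'], ['j'], ['w']],      # initials
[['ing'], ['ian'], ['u'], ['i'], ['ia'], ['an']], # finals
[['2'], ['1'], ['4'], ['3'], ['1'], ['2']],       # tones
('6', '1', '4', '2', '3', '1'),                   # four-corner code, digit 1
('7', '0', '0', '7', '0', '1'),                   # four-corner code, digit 2
('0', '8', '7', '2', '2', '1'),                   # four-corner code, digit 3
('2', '0', '3', '9', '3', '1'),                   # four-corner code, digit 4
('0', '4', '2', '2', '2', '2'))                   # four-corner code, digit 5
The last five tuples hold the four-corner digits. Reading the same position across these five tuples gives the four-corner code of the corresponding character, e.g. 明 -> 67020, 天 -> 10804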

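Note: characters and four-corner codes are not in one-to-one correspondence. One four-corner code can map to several characters, but each character has exactly one four-corner code.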
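The column-wise layout of the result can be easy to misread, so here is a minimal sketch that regroups it per character. It works on the output copied from the example above rather than calling the library, and the helper names `four_corner_codes` and `pinyin_parts` are hypothetical, not part of char_featurizer. It also assumes each inner list (such as `['m']`) may hold more than one reading and simply takes the first.

```python
# Output copied from the README example for '明天去你家玩'.
result = (
    [['m'], ['t'], ['q'], ['n'], ['j'], ['w']],          # initials
    [['ing'], ['ian'], ['u'], ['i'], ['ia'], ['an']],    # finals
    [['2'], ['1'], ['4'], ['3'], ['1'], ['2']],          # tones
    ('6', '1', '4', '2', '3', '1'),
    ('7', '0', '0', '7', '0', '1'),
    ('0', '8', '7', '2', '2', '1'),
    ('2', '0', '3', '9', '3', '1'),
    ('0', '4', '2', '2', '2', '2'),
)


def four_corner_codes(result):
    """Hypothetical helper: join the i-th digit of the last five tuples into one code per character."""
    digit_columns = result[3:]
    return [''.join(digits) for digits in zip(*digit_columns)]


def pinyin_parts(result):
    """Hypothetical helper: return (initial, final, tone) per character, using the first reading only."""
    initials, finals, tones = result[:3]
    return [(i[0], f[0], t[0]) for i, f, t in zip(initials, finals, tones)]


text = '明天去你家玩'
codes = four_corner_codes(result)
parts = pinyin_parts(result)

for ch, code, part in zip(text, codes, parts):
    print(ch, code, part)
```

Transposing with `zip(*...)` keeps the sketch independent of the sentence length, since every tuple has one entry per input character.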

2. Using the features as model input
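The README does not show the library's own tensor-conversion API, so the following is only a rough sketch of one way the extracted features could be turned into integer ids for a model. Everything here is an assumption for illustration: the helpers `build_vocab` and `encode_char` are hypothetical, the feature values are copied from the example output above, and id 0 is reserved for unknown symbols by this sketch's own convention.

```python
def build_vocab(values):
    """Map each distinct symbol to an integer id, starting at 1 (0 is reserved for unknown)."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


# Per-character features for '明天去你家玩', copied from the example output above.
initials = ['m', 't', 'q', 'n', 'j', 'w']
finals = ['ing', 'ian', 'u', 'i', 'ia', 'an']
tones = ['2', '1', '4', '3', '1', '2']
codes = ['67020', '10804', '40732', '27292', '30232', '11112']

initial_vocab = build_vocab(initials)
final_vocab = build_vocab(finals)


def encode_char(initial, final, tone, code):
    """Encode one character as [initial id, final id, tone, five four-corner digits]."""
    return (
        [initial_vocab.get(initial, 0), final_vocab.get(final, 0), int(tone)]
        + [int(digit) for digit in code]
    )


# A 6 x 8 nested list of ints, one row per character.
features = [encode_char(*row) for row in zip(initials, finals, tones, codes)]
print(features)
```

The resulting nested list of integers has a fixed width of 8 per character, so it could be passed to a tensor constructor in whichever framework the model uses. In practice the vocabularies would be built once from the full training data rather than from a single sentence.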

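3. Related resources

1. Online lookup tool for Chinese four-corner codes

III. Update News

2020.5.4 V1 completed

IV. TO DO LIST

1. Character similarity computation (pronunciation similarity, glyph similarity)

2. Support for tf2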
