44 Indexing Office documents with Elasticsearch
Jinxin Chen edited this page Dec 11, 2019
·
1 revision
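Elasticsearch can ingest many types of data through Logstash, but Office documents need some extra work before they can be indexed.
There are three ways to handle Office documents:
1. Ingest Attachment Plugin
2. FsCrawler
3. Write your own code that calls an SDK to parse the files
Option 3 is the most flexible but also the most work. Option 2 is the simplest but the most restricted, since it only supports file systems. Option 1 is an official plugin that balances flexibility and convenience, so it is a reasonable compromise and the approach described on this page.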
The plugin can be installed directly with the following command:
sudo bin/elasticsearch-plugin install ingest-attachment
With a Docker image (Dockerfile):
ARG ELK_VERSION=6.2.2
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:$ELK_VERSION
ARG ELK_VERSION
RUN ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v$ELK_VERSION/elasticsearch-analysis-ik-$ELK_VERSION.zip && \
    ./bin/elasticsearch-plugin install ingest-attachment
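The Ingest Attachment Plugin parses binary files through an ingest pipeline. The pipeline below configures three processors, which in turn: parse each binary file into a string; concatenate the parsed strings into the Content field; and remove the original base64-encoded binary data and the temporary parsed strings: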
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "Files",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.file",
            "field": "_ingest._value.data",
            "indexed_chars": 20971520
          }
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          // Start from an empty string if the document has no Content yet
          if (ctx.Content == null) {
            ctx.Content = "";
          }
          for (item in ctx.Files) {
            ctx.Content = ctx.Content + item.file.content;
          }
        """
      }
    },
    {
      "foreach": {
        "field": "Files",
        "processor": {
          "remove": {
            "field": ["_ingest._value.data", "_ingest._value.file.content"]
          }
        }
      }
    }
  ]
}
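Before indexing real files, the pipeline can be tried out with the ingest simulate API. The following is a minimal Python sketch and not part of the original setup: it only builds the request body for `POST _ingest/pipeline/attachment/_simulate`. The helper name `build_simulate_body` and the sample values are illustrative assumptions. Content starts as an empty string so that the script processor appends to a defined value.

```python
import base64
import json


def build_simulate_body(raw_files, title="sample"):
    """Build a body for POST _ingest/pipeline/attachment/_simulate.

    raw_files is a list of bytes objects. Each one becomes an entry in
    Files with its base64-encoded payload in the data field, which is
    the shape the pipeline above reads.
    """
    files = [{"data": base64.b64encode(raw).decode("ascii")} for raw in raw_files]
    source = {"Title": title, "Content": "", "Files": files}
    return {"docs": [{"_source": source}]}


if __name__ == "__main__":
    body = build_simulate_body([b"hello office document"])
    # Paste the printed JSON into the Kibana console below the POST line.
    print(json.dumps(body, indent=2))
```

The response of the simulate call shows each document as it would look after all three processors have run, which makes it easy to check that Content was filled and the data fields were removed.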
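The index template below deserves attention for one setting on the Content field: "term_vector": "with_positions_offsets". For fields longer than 10000 characters, using the highlight feature of a query without term_vector on the field produces a warning in 6.x and an outright error in 7.x. Reference: the "Offsets Strategy" section of the Elasticsearch highlighting documentation.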
PUT /_template/template_1
{
  "index_patterns" : ["*"],
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "Content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "term_vector": "with_positions_offsets",
          "analyzer": "my_custom_analyzer"
        },
        "Title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "my_custom_analyzer"
        },
        "Description": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}
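To show where the term_vector setting matters, here is a hedged Python sketch that builds a search body with highlighting on the Content field. The function name `build_highlight_query` and the `fragment_size` value are assumptions made for illustration. The `fvh` highlighter type is the fast vector highlighter, which relies on the term vectors stored by "with_positions_offsets".

```python
import json


def build_highlight_query(text, field="Content", fragment_size=150):
    """Build a search body that matches `text` and highlights hits in `field`."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "fields": {
                # "fvh" requires term_vector = with_positions_offsets on the field.
                field: {"type": "fvh", "fragment_size": fragment_size}
            }
        },
        # Leave the large extracted text out of the returned _source.
        "_source": {"excludes": [field]},
    }


if __name__ == "__main__":
    # Send the printed JSON as the body of a _search request.
    print(json.dumps(build_highlight_query("report"), indent=2))
```

Excluding Content from `_source` keeps responses small, since the highlighted fragments are usually all that a result page needs from a long document.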
Below is a C# example that reads a file as base64 and places it in the Files array:
Chilkat.FileAccess fac = new Chilkat.FileAccess();
string strBase64 = fac.ReadBinaryToEncoded(file, "base64");
fac.FileClose();
itemObj.Files = new dynamic[] { new { data = strBase64 } };
The data finally sent to ELK might look like this:
{
  "Files": [
    {
      "data": "……"
    },
    {
      "data": "……"
    }
  ]
}
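The same payload can be assembled in other languages. The following Python sketch is an assumption-laden illustration rather than the original client code: `build_document` and `index_url` are hypothetical helpers, the `doc` type name is taken from the index template on this page, and the base URL and index name are placeholders. The `pipeline` query parameter tells Elasticsearch to run the document through the attachment pipeline at index time.

```python
import base64
import json
from pathlib import Path


def build_document(paths, **fields):
    """Return a document whose Files array holds each file as base64 text."""
    files = []
    for path in paths:
        raw = Path(path).read_bytes()
        files.append({"data": base64.b64encode(raw).decode("ascii")})
    doc = dict(fields)
    doc["Files"] = files
    return doc


def index_url(base_url, index, doc_type="doc", pipeline="attachment"):
    """URL for POSTing a document through the ingest pipeline."""
    return f"{base_url.rstrip('/')}/{index}/{doc_type}?pipeline={pipeline}"


if __name__ == "__main__":
    import sys

    # Usage: python build_doc.py file1.docx file2.pdf
    doc = build_document(sys.argv[1:], Title="example")
    print(index_url("http://localhost:9200", "index"))
    print(json.dumps(doc)[:200])
```

Sending the JSON body with an HTTP POST to the printed URL indexes the document. Large files produce large request bodies, so the HTTP client and the cluster's request size limits may need checking.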
You can view the results in Kibana with the following request:
GET index/_search?q=*