
44 Indexing office documents in Elasticsearch


Elasticsearch can ingest many kinds of data through Logstash, but office documents need some extra handling before they can be processed.

Choosing an approach

Office documents can be handled in any of the following ways:

  1. Ingest Attachment Plugin
  2. FsCrawler
  3. Write your own code that calls a parsing SDK

Option 3 is the most flexible but also the most work. Option 2 is the simplest but the most restrictive, since it only supports file systems. Option 1 is an official plugin that balances flexibility with convenience; it is the middle-ground approach this article covers.

Installing the plugin

It can be installed directly with the following command:

sudo bin/elasticsearch-plugin install ingest-attachment

Or baked into a Docker image (Dockerfile):

ARG ELK_VERSION=6.2.2
FROM docker.elastic.co/elasticsearch/elasticsearch-oss:$ELK_VERSION

ARG ELK_VERSION
RUN ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v$ELK_VERSION/elasticsearch-analysis-ik-$ELK_VERSION.zip && \
    ./bin/elasticsearch-plugin install ingest-attachment
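
The image can then be built as usual; es-attachment below is just a hypothetical tag, and the build argument overrides the default version when needed:

docker build --build-arg ELK_VERSION=6.2.2 -t es-attachment .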

Configuring the plugin and the ELK mapping

The Ingest Attachment Plugin parses binary documents through an ingest pipeline. The configuration below defines three processors which, in order: parse each binary document into a string; concatenate the parsed strings into the Content field; and remove both the original base64-encoded binary data and the temporary parsed strings:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "Files",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.file",
            "field": "_ingest._value.data",
            "indexed_chars": 20971520
          }
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          for (item in ctx.Files) {
            ctx.Content = ctx.Content + item.file.content
          }
        """
      }
    },
    {
      "foreach": {
        "field": "Files",
        "processor": {
          "remove": {
            "field": ["_ingest._value.data", "_ingest._value.file.content"]
          }
        }
      }
    }
  ]
}
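
The pipeline can be tried out before any application code is written by using the _simulate API. A minimal sketch; the base64 string is a small RTF snippet ("Lorem ipsum dolor sit amet"), and Content starts as an empty string because the script processor appends onto it:

POST _ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_source": {
        "Content": "",
        "Files": [
          { "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=" }
        ]
      }
    }
  ]
}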

The following configures an index template. Pay special attention to the Content field's setting "term_vector": "with_positions_offsets". For fields longer than 10000 characters, using the query highlight feature without a term_vector on the field produces a warning on 6.x and a hard error on 7.x; see the Offsets Strategy section of the Elasticsearch highlighting documentation.

PUT /_template/template_1
{
  "index_patterns" : ["*"],
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",
          "tokenizer": "ik_max_word",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "Content": { 
          "type": "text",
          "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
          },
          "term_vector": "with_positions_offsets", 
          "analyzer": "my_custom_analyzer"
        },
        "Title": { 
          "type": "text",
          "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
          },
          "analyzer": "my_custom_analyzer"
        },
        "Description": { 
          "type": "text",
          "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
          },
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}
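
Because Content stores term vectors with positions and offsets, highlight queries on it can use the fast vector highlighter. A minimal sketch, assuming an index named my-index that picked up this template:

GET my-index/_search
{
  "query": {
    "match": { "Content": "lorem" }
  },
  "highlight": {
    "fields": {
      "Content": { "type": "fvh" }
    }
  }
}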

Writing a program to base64-encode documents and send them to ELK

Below is a C# example:

// Read the file's raw bytes and base64-encode them with Chilkat
Chilkat.FileAccess fac = new Chilkat.FileAccess();
string strBase64 = fac.ReadBinaryToEncoded(file, "base64");
fac.FileClose();

// Wrap the encoded data in the shape the pipeline's foreach processor expects
itemObj.Files = new dynamic[] { new { data = strBase64 } };
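
If you would rather avoid the Chilkat dependency, the .NET base class library can produce the same encoding. A minimal sketch, where path and itemObj stand in for your own file path and document object:

using System;
using System.IO;

// Read the file's bytes and base64-encode them with the standard library
string strBase64 = Convert.ToBase64String(File.ReadAllBytes(path));
itemObj.Files = new dynamic[] { new { data = strBase64 } };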

The data finally sent to ELK might look like this:

{
    "Files": [
        {
            "data": "……"
        },
        {
            "data": "……"
        }
    ]
}
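
Note that the pipeline only runs when it is named on the index request via the pipeline query parameter. A minimal sketch, assuming an index called my-index and the 6.x doc mapping type from the template above; Content starts empty because the script processor appends to it:

PUT my-index/doc/1?pipeline=attachment
{
  "Title": "example",
  "Content": "",
  "Files": [
    { "data": "……" }
  ]
}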

Checking the indexed results

You can inspect the results in Kibana with the following request:

GET index/_search?q=*

