
46 Elasticsearch: Handling HTML Tags and Encoded Content


When indexing content, you will sometimes run into fields whose values are the raw HTML code stored by a rich-text editor; these values need some processing.

Approaches

  1. Use Elasticsearch's html_strip character filter
  2. Use Logstash's mutate filter to strip the HTML tags

Approach 1 does the stripping during Elasticsearch's analysis phase, so the original HTML is still stored in _source. A search for terms such as div will no longer match the document, but the highlight feature is affected: highlighted fragments still carry the tags from the original HTML content.

PUT /_template/html_trip_content
{
  "index_patterns" : ["*"],
  "settings": {
    "analysis": {
      "analyzer": {
        "trip_content_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": { 
          "type": "text",
          "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
          },
          "analyzer": "trip_content_analyzer"
        }
      }
    }
  }
}
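
As a quick sanity check (a sketch only; the sample text below is made up), the _analyze API can be called with the same analyzer components to confirm that html_strip removes the tags and decodes HTML entities before tokenization:

POST /_analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "filter": ["lowercase", "asciifolding"],
  "text": "<p>Caf&eacute; <b>menu</b></p>"
}

With this input the returned tokens should be cafe and menu: html_strip drops the <p> and <b> tags and decodes &eacute;, and asciifolding then folds the accented character.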

Approach 2 removes the HTML tags directly in the Logstash pipeline, but HTML-encoded text (entities such as &amp;) needs additional handling.

input {
  http_poller {
    urls => {
      http1 => ""
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "* * * * * UTC"}
    codec => "json"
    # A hash of request metadata info (timing, response headers, etc.) will be sent here
    metadata_target => "http_poller_metadata"
    id => "my_plugin_id_1"
  }
}
filter {
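  # Strip anything that looks like an HTML tag (non-greedy match).
  # This does not decode HTML entities such as &amp;; see the note after this config.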
  mutate {
    gsub => [
      "content", "<.*?>", ""
    ]
  }
}
output {
  elasticsearch {
    index => "http"
    document_id => ""
    hosts => ["elasticsearch:9200"]
  }
}
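
For the HTML-encoded text mentioned above, one possible extra step (a sketch, assuming the same content field as in the pipeline above) is a ruby filter that decodes entities with Ruby's standard CGI.unescapeHTML:

filter {
  # Decode HTML entities (&amp;, &lt;, &#39;, ...) in the content field.
  # CGI.unescapeHTML comes from Ruby's standard library.
  ruby {
    code => '
      require "cgi"
      value = event.get("content")
      event.set("content", CGI.unescapeHTML(value)) if value.is_a?(String)
    '
  }
}

Filters run in the order they appear in the config, so placing this ruby filter before the mutate filter means entity-encoded tags (for example &lt;b&gt;) are decoded first and then removed by the gsub; placing it after the mutate filter would leave them as literal <b> text in the indexed value.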

