46 Elasticsearch: handling HTML tags and encoded content
When indexing content, you will run into fields whose values are raw HTML stored by a rich text editor, and these need some processing. There are two options:
- Use Elasticsearch's html_strip character filter
- Strip the HTML tags with Logstash's mutate filter
Option 1 does its processing during Elasticsearch's analyze stage, but by then the original HTML has already been stored in _source. So although a search for div no longer matches the document, the highlight feature is still affected: highlighted fragments carry the tags of the original HTML content.
PUT /_template/html_trip_content
{
  "index_patterns": ["*"],
  "settings": {
    "analysis": {
      "analyzer": {
        "trip_content_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "trip_content_analyzer"
        }
      }
    }
  }
}
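To check that the character filter behaves as expected, one quick way (a sketch; the index name test_html and the sample text are only for illustration) is to run the _analyze API against an index created while this template is active:

POST /test_html/_analyze
{
  "analyzer": "trip_content_analyzer",
  "text": "<div>I&apos;m so <b>happy</b>!</div>"
}

The expected tokens are roughly i'm, so and happy: html_strip removes the div and b tags and also decodes HTML entities such as &apos; before the standard tokenizer runs.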
Option 2 removes the HTML tags directly in the Logstash pipeline, but HTML-encoded text (entities such as &amp;) still needs additional handling, sketched after the pipeline below.
input {
  http_poller {
    urls => {
      http1 => ""
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "* * * * * UTC" }
    codec => "json"
    # A hash of request metadata info (timing, response headers, etc.) will be sent here
    metadata_target => "http_poller_metadata"
    id => "my_plugin_id_1"
  }
}

filter {
  mutate {
    gsub => [
      "content", "<.*?>", ""
    ]
  }
}

output {
  elasticsearch {
    index => "http"
    document_id => ""
    hosts => ["elasticsearch:9200"]
  }
}
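For the extra handling of HTML-encoded text, one possible sketch (not part of the original pipeline) is to decode the entities with a ruby filter using Ruby's CGI.unescapeHTML. The field name content matches the pipeline above, and whether this should run before or after the mutate gsub depends on whether the tags themselves arrive entity-encoded (e.g. as &lt;div&gt;):

filter {
  # Hypothetical addition: decode &amp;, &lt;, &gt;, &quot; and numeric references in content
  ruby {
    code => '
      require "cgi"
      value = event.get("content")
      event.set("content", CGI.unescapeHTML(value)) if value.is_a?(String)
    '
  }
}

CGI.unescapeHTML only covers the common named entities and numeric character references; an entity such as &nbsp; would still need its own gsub rule (for example, replacing it with a space).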