
_id could become non-unique within an index when using _routing fields #3346

Closed

icedfish opened this issue Jul 17, 2013 · 9 comments

@icedfish
Bug Description

Usually, the _id field is understood to be globally unique within an index, right?
But I found it becomes non-unique when a doc's routing field is modified to another value and the doc is reindexed to ES. Then there will be two live docs with the same _id in different shards of the same index.

It seems the delete operation is broadcast to all shards, but the index operation is not.

Since it's hard to monitor whether the routing field has been modified, the only thing I can do is issue a delete operation before each index operation, and I really don't like it >_<
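The behaviour described above follows from how Elasticsearch picks a shard: it hashes the routing value (which defaults to _id, but here follows the `tag` field) modulo the number of primary shards, and _id uniqueness is only enforced within a shard. A minimal Python sketch of that mechanism, using a toy weighted-sum hash as a stand-in for Elasticsearch's real hash function:

```python
# Toy illustration of shard routing; the weighted-sum hash below is a
# stand-in for Elasticsearch's actual hash algorithm, not the real thing.
from collections import defaultdict

NUM_SHARDS = 10  # matches the 10 shards in the search response further down

def pick_shard(routing_value, num_shards=NUM_SHARDS):
    # shard = hash(routing) % number_of_primary_shards
    h = sum((i + 1) * ord(c) for i, c in enumerate(routing_value))
    return h % num_shards

# Same _id "123", but routing follows the "tag" field:
shards = defaultdict(dict)  # shard number -> {_id: doc}
for tag in ("good", "bad"):
    shards[pick_shard(tag)]["123"] = {"tag": tag}

# Each shard only ever sees one doc with _id "123", so the per-shard
# uniqueness check passes, yet the index now holds two live copies:
total_docs = sum(len(docs) for docs in shards.values())
print(total_docs)  # → 2
```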

How to reproduce the bug

Tested under v0.90.0

[1] Create an index

curl -XPUT 'http://localhost:9200/user' -d '
{
    "mappings": {
        "User": {
            "store": "no",
            "_id": {
                    "type": "string",
                    "index": "not_analyzed",
                    "store": "yes"
            },
            "_type": {
                "enabled": true
            },
            "_routing": {
                "path": "tag",
                "required": true
            },
            "properties": {
                "tag": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
'

[2] Input Data

curl -XPOST 'http://localhost:9200/user/User/123' -d '{"tag" : "good"}'

{"ok":true,"_index":"user","_type":"User","_id":"123","_version":1}

curl -XPOST 'http://localhost:9200/user/User/123' -d '{"tag" : "bad"}'

{"ok":true,"_index":"user","_type":"User","_id":"123","_version":1}

[3] Search

curl -XPOST 'http://localhost:9200/user/User/_search' -d '{
  "query": {
    "term": {
      "_id": "123"
    }
  },
  "facets": {
    "tag": {
      "terms": {
        "field": "tag"
      }
    }
  }
}'

Result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.30685282,
        "hits": [
            {
                "_index": "user",
                "_type": "User",
                "_id": "123",
                "_score": 0.30685282,
                "_source": {
                    "tag": "bad"
                }
            },
            {
                "_index": "user",
                "_type": "User",
                "_id": "123",
                "_score": 0.30685282,
                "_source": {
                    "tag": "good"
                }
            }
        ]
    },
    "facets": {
        "tag": {
            "_type": "terms",
            "missing": 0,
            "total": 2,
            "other": 0,
            "terms": [
                {
                    "term": "good",
                    "count": 1
                },
                {
                    "term": "bad",
                    "count": 1
                }
            ]
        }
    }
}
@kimchy
Member

kimchy commented Jul 18, 2013

Elasticsearch does not guarantee the uniqueness of an ID when using custom routing. Uniqueness is maintained at the shard level. The cost of trying to make sure the ID is globally unique would be too expensive across the cluster.

@fcrosfly

_< support

@icedfish
Author

@kimchy I know it's hard to maintain uniqueness globally, but could you add a note about this issue to the doc page for the routing field? I totally didn't expect this problem when I chose to use routing several months ago >><<

I'm trying to find a way to detect the ids that are already non-unique and fix them in a batch job.
But I find it hard to find them...

I try to find the duplicate by term facet count like below:

{
  "query": {
    "match_all": {}
  },
  "facets": {
    "count": {
      "terms": {
        "field": "_id",
        "size": 10,
        "order": "count"
      }
    }
  }
}

But it only works when I test the query on a single-node cluster; it doesn't work on my production cluster with more nodes, where all the counts return 1.
Is it because the index contains too many ids? (about 10M)

And when I specify a duplicate id in a term query, it gives the right count number:

{
  "query": {
    "term": {
      "_id": "276052286"
    }
  },
.....
}

Could you give me some suggestions? Thanks.

ghost assigned javanna Jul 22, 2013
@javanna
Member

javanna commented Jul 22, 2013

@icedfish I added an "id uniqueness" section to the doc page for the _routing field.

@javanna javanna closed this as completed Jul 22, 2013
@icedfish
Author

Hi @javanna, could you tell me a possible way to find the non-unique ids? Or is there no way to find them?

@spinscale
Contributor

Untested, but if you have indexed the id, you could run a terms facet on it (or, if you have not indexed it, you could maybe use a script_field inside your terms facet; not a hundred percent sure, but maybe worth a try).

@imotov
Contributor

imotov commented Jul 23, 2013

A terms facet is the simplest way to go, but for it to work you will have to fit all ids into memory. The reason is that terms facets are calculated on each shard first and then reduced on a single node. But the nature of the problem is that within a shard all ids are unique. So each shard would have to send its complete set of ids to the reducer, where they would be de-duplicated. To make a long story short, for the terms facet to work, the size parameter on the facet would have to be set to the total number of ids, and such a facet might be too big for your nodes.
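The shard-then-reduce behaviour described above can be sketched in a few lines of Python: each shard counts ids locally (where every count is 1), returns only its top `size` entries, and the coordinating node sums the partial results, so a duplicate only surfaces if every shard that holds it happens to include it in its top slice. A simplified model, not actual Elasticsearch code:

```python
from collections import Counter

def terms_facet(shard_id_lists, size):
    # Each shard counts terms locally and returns only its top `size`
    # entries; the coordinating node then sums the partial results.
    reduced = Counter()
    for ids in shard_id_lists:
        local = Counter(ids)
        for term, count in local.most_common(size):
            reduced[term] += count
    return reduced

# Two shards; ids are unique within each shard, "123" exists on both:
shard_a = ["a1", "a2", "a3", "123"]
shard_b = ["b1", "b2", "b3", "123"]

# With a small size, "123" is cut from each shard's top slice and the
# duplicate is never reported; only size >= ids-per-shard reveals it:
print(terms_facet([shard_a, shard_b], size=2))        # "123" missing
print(terms_facet([shard_a, shard_b], size=4)["123"])  # → 2
```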

A more lightweight solution would be to extract all ids, sort them, and then find the repeated ones. You can write a script that uses scan/scroll to retrieve all ids in your index, or you can use the handy es2unix tool, which already has this functionality. After es2unix is installed, you can find duplicate ids by simply running:

es ids user User | sort | uniq -d 
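For reference, the `sort | uniq -d` step could also be done in a few lines of Python once the ids have been exported (a sketch, assuming the id list fits in memory; the export itself would still come from scan/scroll or es2unix as described above):

```python
import itertools

def duplicate_ids(ids):
    # Equivalent of `sort | uniq -d`: sort the id stream, then keep any
    # id whose sorted run is longer than one.
    return [i for i, run in itertools.groupby(sorted(ids))
            if sum(1 for _ in run) > 1]

print(duplicate_ids(["456", "123", "789", "123"]))  # → ['123']
```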

@icedfish
Author

Hi @imotov, thanks for your help!
I tried es2unix, but I think there may be something wrong with the scroll function (I checked the source code; the es2unix ids function uses scroll).
I got many more ids than the total the index should really have; I even found an id that appears 79519 times...

I'll do some more research to locate the issue. It could become my second issue for the ES project : )

@imotov
Contributor

imotov commented Jul 25, 2013

@icedfish if you have ruby and tire installed, you can try running the following script instead of es2unix.

require 'rubygems'
require 'tire'

# Scan through the whole "user" index and print each document's type and id.
s = Tire.scan('user')
s.each_document do |document|
  print document._type, " ", document.id, "\n"
end

martijnvg added a commit that referenced this issue Apr 25, 2018