
_id could become non-unique within an index when using _routing fields #3346

Closed

icedfish opened this issue Jul 17, 2013 · 9 comments

@icedfish
Bug Description

Usually, the _id field is understood to be globally unique within an index, right?
But I found it becomes non-unique when a doc's routing field is modified to another value and the doc is reindexed to ES. Then there will be two live docs with the same _id in different shards of the same index.

It seems the delete operation is broadcast to all shards, but the index operation is not.

Since it's hard to monitor whether the routing field has been modified, the only thing I can do is issue a delete operation before each index operation, and I really don't like it >_<
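The behaviour described above follows from how Elasticsearch picks a shard: it hashes the routing value (which defaults to _id, but here follows the `tag` field) modulo the number of primary shards, and _id uniqueness is only enforced within a shard. A minimal Python sketch of that mechanism, using a toy weighted-sum hash as a stand-in for Elasticsearch's real hash function:

```python
# Toy illustration of shard routing; the weighted-sum hash below is a
# stand-in for Elasticsearch's actual hash algorithm, not the real thing.
from collections import defaultdict

NUM_SHARDS = 10  # matches the 10 shards in the search response further down

def pick_shard(routing_value, num_shards=NUM_SHARDS):
    # shard = hash(routing) % number_of_primary_shards
    h = sum((i + 1) * ord(c) for i, c in enumerate(routing_value))
    return h % num_shards

# Same _id "123", but routing follows the "tag" field:
shards = defaultdict(dict)  # shard number -> {_id: doc}
for tag in ("good", "bad"):
    shards[pick_shard(tag)]["123"] = {"tag": tag}

# Each shard only ever sees one doc with _id "123", so the per-shard
# uniqueness check passes, yet the index now holds two live copies:
total_docs = sum(len(docs) for docs in shards.values())
print(total_docs)  # → 2
```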

How to reproduce the bug

Tested under v0.90.0

[1] Create an index

curl -XPUT 'http://localhost:9200/user' -d '
{
    "mappings": {
        "User": {
            "store": "no",
            "_id": {
                    "type": "string",
                    "index": "not_analyzed",
                    "store": "yes"
            },
            "_type": {
                "enabled": true
            },
            "_routing": {
                "path": "tag",
                "required": true
            },
            "properties": {
                "tag": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
'

[2] Input Data

curl -XPOST 'http://localhost:9200/user/User/123' -d '{"tag" : "good"}'

{"ok":true,"_index":"user","_type":"User","_id":"123","_version":1}

curl -XPOST 'http://localhost:9200/user/User/123' -d '{"tag" : "bad"}'

{"ok":true,"_index":"user","_type":"User","_id":"123","_version":1}

[3] Search

curl -XPOST 'http://localhost:9200/user/User/_search' -d '{
  "query": {
    "term": {
      "_id": "123"
    }
  },
  "facets": {
    "tag": {
      "terms": {
        "field": "tag"
      }
    }
  }
}'

Result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.30685282,
        "hits": [
            {
                "_index": "user",
                "_type": "User",
                "_id": "123",
                "_score": 0.30685282,
                "_source": {
                    "tag": "bad"
                }
            },
            {
                "_index": "user",
                "_type": "User",
                "_id": "123",
                "_score": 0.30685282,
                "_source": {
                    "tag": "good"
                }
            }
        ]
    },
    "facets": {
        "tag": {
            "_type": "terms",
            "missing": 0,
            "total": 2,
            "other": 0,
            "terms": [
                {
                    "term": "good",
                    "count": 1
                },
                {
                    "term": "bad",
                    "count": 1
                }
            ]
        }
    }
}
@kimchy
Member

kimchy commented Jul 18, 2013

Elasticsearch does not guarantee the uniqueness of an ID when using custom routing. Uniqueness is maintained at the shard level. The cost of trying to make sure the ID is globally unique would be too expensive across the cluster.

@fcrosfly

_< support

@icedfish
Author

@kimchy I know it's hard to maintain uniqueness globally, but could you add a note about this issue to the doc page for the routing field? I totally didn't expect this problem when I chose to use routing several months ago >><<

I'm trying to find a way to detect the ids that are already non-unique and fix them in a batch job.
But I find it hard to find them...

I try to find the duplicate by term facet count like below:

{
  "query": {
    "match_all": {}
  },
  "facets": {
    "count": {
      "terms": {
        "field": "_id",
        "size": 10,
        "order": "count"
      }
    }
  }
}

But it only works when I test the query on a single-node cluster; it doesn't work on my production cluster with more nodes, where all the counts return 1.
Is it because the index contains too many ids? (about 10M)

And when I specify a duplicate id in a term query, it gives the right count number:

{
  "query": {
    "term": {
      "_id": "276052286"
    }
  },
.....
}

Could you give me some suggestions? Thanks.

ghost assigned javanna Jul 22, 2013
@javanna
Member

javanna commented Jul 22, 2013

@icedfish I added an "id uniqueness" section to the doc page for the _routing field.

@javanna javanna closed this as completed Jul 22, 2013
@icedfish
Author

Hi @javanna, could you tell me a possible way to find the non-unique ids? Or is there no way to find them?

@spinscale
Contributor

Untested, but if you have indexed the id, you could run a terms facet on it (or, if you have not indexed it, you could maybe use a script_field inside your terms facet; not a hundred percent sure, but maybe worth a try).

@imotov
Contributor

imotov commented Jul 23, 2013

A terms facet is the simplest way to go, but for it to work you will have to fit all ids into memory. The reason is that terms facets are calculated on each shard first and then reduced on a single node. But the nature of the problem is that within a shard all ids are unique. So each shard would have to send its complete set of ids to the reducer, where they would be de-duplicated. To make a long story short, for the terms facet to work, the size parameter on the facet would have to be set to the total number of ids, and such a facet might be too big for your nodes.
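The shard-then-reduce behaviour described above can be sketched in a few lines of Python: each shard counts ids locally (where every count is 1), returns only its top `size` entries, and the coordinating node sums the partial results, so a duplicate only surfaces if every shard that holds it happens to include it in its top slice. A simplified model, not actual Elasticsearch code:

```python
from collections import Counter

def terms_facet(shard_id_lists, size):
    # Each shard counts terms locally and returns only its top `size`
    # entries; the coordinating node then sums the partial results.
    reduced = Counter()
    for ids in shard_id_lists:
        local = Counter(ids)
        for term, count in local.most_common(size):
            reduced[term] += count
    return reduced

# Two shards; ids are unique within each shard, "123" exists on both:
shard_a = ["a1", "a2", "a3", "123"]
shard_b = ["b1", "b2", "b3", "123"]

# With a small size, "123" is cut from each shard's top slice and the
# duplicate is never reported; only size >= ids-per-shard reveals it:
print(terms_facet([shard_a, shard_b], size=2))        # "123" missing
print(terms_facet([shard_a, shard_b], size=4)["123"])  # → 2
```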

A more lightweight solution would be to extract all ids, sort them, and then find the repeated ones. You can write a script that uses scan/scroll to retrieve all ids in your index, or you can use the handy es2unix tool, which already has this functionality. After es2unix is installed, you can find duplicate ids by simply running:

es ids user User | sort | uniq -d 
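For reference, the `sort | uniq -d` step could also be done in a few lines of Python once the ids have been exported (a sketch, assuming the id list fits in memory; the export itself would still come from scan/scroll or es2unix as described above):

```python
import itertools

def duplicate_ids(ids):
    # Equivalent of `sort | uniq -d`: sort the id stream, then keep any
    # id whose sorted run is longer than one.
    return [i for i, run in itertools.groupby(sorted(ids))
            if sum(1 for _ in run) > 1]

print(duplicate_ids(["456", "123", "789", "123"]))  # → ['123']
```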

@icedfish
Author

Hi @imotov, thanks for your help!
I tried es2unix, but I think there may be something wrong with the scroll function (I checked the source code; the es2unix ids function uses scroll).
I got many more ids than the total the index should really have; I even found an id that appears 79519 times...

I'll do some more research to locate the issue. It could become my second issue for the ES project : )

@imotov
Contributor

imotov commented Jul 25, 2013

@icedfish if you have ruby and tire installed, you can try running the following script instead of es2unix.

require 'rubygems'
require 'tire'

# Scan through the whole "user" index and print each document's type and id.
s = Tire.scan('user')
s.each_document do |document|
  print document._type, " ", document.id, "\n"
end

martijnvg added a commit that referenced this issue Apr 25, 2018