
Compute an ngram field for all admin data #347

Merged · 1 commit · Mar 18, 2019

Conversation

missinglink (Member)

This PR is a replacement for #345; it uses the multi-fields feature of Elasticsearch to achieve the same thing while using less disk space.

See the linked PR for the original issue notes and comments.

Closes #345
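
For context, a minimal sketch of what a multi-field mapping with an ngram sub-field looks like in Elasticsearch; the field layout and the my_ngram_analyzer name are illustrative placeholders, not necessarily what the Pelias schema actually uses:

"parent": {
  "properties": {
    "locality": {
      "type": "text",
      "fields": {
        "ngram": {
          "type": "text",
          "analyzer": "my_ngram_analyzer"
        }
      }
    }
  }
}

With multi-fields the same source value is simply indexed a second time under the ngram analyzer, so nothing extra is stored in _source; only the additional inverted index is added, which is presumably where the disk savings over #345 come from.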

@orangejulius (Member)

This looks really promising. It looks like the disk usage is the same now, which is awesome.

I wonder if the cost of what I imagine is a fairly complex inverted index is actually in memory usage. What's the memory usage of an Elasticsearch instance running this branch, vs master?
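
One quick way to compare heap usage between an instance running this branch and one running master is the standard _cat/nodes API (nothing specific to this PR; localhost:9200 is assumed):

curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'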

@missinglink (Member, Author) commented Mar 7, 2019

snapshot | total objects | disk usage (as reported by S3)
planet-2019.02.22-022621-47f92030-aa82-4807-b5bb-3ab4d4c644bf | 3361 | 313.4 GB
planet-admin-ngrams-2019.03.06-173155-04770bca-928f-44d9-8919-b548c2e46b65 | 4040 | 379.7 GB
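
(That works out to roughly a 21% increase in disk usage: 379.7 / 313.4 ≈ 1.21, in line with the ~20% figure discussed below.)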

@orangejulius (Member) commented Mar 13, 2019

I tested this full planet build out a little bit. I'd like to do a bit more testing but here are my findings so far:

  • Queries against the parent.*.ngram fields work as expected. For example, this query returns results within the city of Poughkeepsie, New York:
curl localhost:9200/pelias/_search -H 'Content-Type: application/json' -d '{
  "query": {
    "match": {
      "parent.locality.ngram": "poughkeep"
    }
  }
}'
  • Acceptance test results are identical, which is to be expected
  • Elasticsearch memory usage is possibly higher. I'd estimate the JVM heap usage of a brand-new cluster has gone up from around 500MB to 1GB. Heap usage increases as queries are sent to the cluster, so it might be worth sending some significant traffic through one of these builds to see the effect
  • The increased disk usage means we realistically have to allocate about 420GB of disk per set of replicas, whereas 350GB was sufficient before (index sizes can also be checked directly from Elasticsearch; see the snippet after this list)
  • I didn't get a chance to do any query throughput/latency testing
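
As a point of reference for the disk numbers above, per-index disk usage can also be read straight from Elasticsearch rather than from the S3 snapshot sizes, via the standard _cat/indices API (the pelias index name matches the query example above):

curl 'localhost:9200/_cat/indices/pelias?v&h=index,docs.count,store.size,pri.store.size'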

@missinglink (Member, Author)

I spent a day investigating the memory implications of adding this additional field.
It's a fairly difficult thing to test thoroughly, but I'm 99% sure that this will not cause any significant memory problems.

I loaded a 'control' snapshot with a full planet build and also the 'admin-ngrams' snapshot to compare memory.
Without any load, using a command such as watch -d $'curl -s \'localhost:9200/_nodes/stats?pretty\' | jq \'.nodes["Vg4uF6RoSA-F0DKDvfkRFQ"].jvm.mem\'' it's possible to watch the JVM heap go up and down.

What I noticed is that the heap naturally goes through size increases followed by GC purges. These cycles are perfectly normal for a garbage-collected runtime, so even on a cold-started ES server you'll see the heap sit at around ~6%, climb to ~11%, then fall back down and repeat (I was allocating ~10GB of RAM to the heap in my tests).
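
For anyone reproducing this, a slightly simpler variant of the command above pulls just the heap percentage for every node, with no hard-coded node ID (same single-node test setup assumed):

watch -d "curl -s 'localhost:9200/_nodes/stats/jvm' | jq '.nodes[].jvm.mem.heap_used_percent'"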

When I executed the acceptance tests (no changes to the API to use the new ngram fields, so they should never be touched), I noticed that JVM usage climbed slightly, now hitting a peak of around ~12% instead of ~11% and falling to around ~8% instead of ~6% after a GC cycle.

I then changed the pelias/api branch to use queries which exercised the new field. This had very little impact on the heap, almost unnoticeable; it may have increased the peaks to ~13%, but nothing worth worrying about.

I'm going to go ahead and merge this. The only significant impact is the additional ~20% disk space used; we'd now suggest allocating a minimum of 450GB for the full planet build.
