This repository was archived by the owner on Sep 30, 2024. It is now read-only.

k8s.sgdev.org: symbols pod evicted #5305

Closed
ggilmore opened this issue Aug 19, 2019 · 11 comments
Labels: bug, deploy-sourcegraph, deployment, ops & tools & dev

Comments

@ggilmore
Contributor

ggilmore commented Aug 19, 2019

A symbols pod was recently evicted from k8s.sgdev.org because it was using too much ephemeral storage. Note that we aren't currently mounting a real SSD for the symbols pod to use, and we might want to add one in the future.

However, has the symbols service cache size grown due to all of the recent changes? Is this something that we need to inform our customers about?

cc @sourcegraph/core-services


kubectl describe pod symbols-9b56c4b46-tkmnp output:

Name:           symbols-9b56c4b46-tkmnp
Namespace:      default
Priority:       0
Node:           gke-dogfood-full-k8s-dogfood-full-k8s-fe742003-jzv9/
Start Time:     Thu, 15 Aug 2019 00:21:24 -0700
Labels:         app=symbols
                pod-template-hash=561270602
Annotations:    <none>
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: ephemeral-storage. Container symbols was using 11836Ki, which exceeds its request of 0.
IP:
Controlled By:  ReplicaSet/symbols-9b56c4b46
Containers:
  symbols:
    Image:       index.docker.io/sourcegraph/symbols:3.7.0-rc.1@sha256:9d17b4c95b24996c3184ffc6970e6aa1385cc9d45cf629cd407c796ab9e41735
    Ports:       3184/TCP, 6060/TCP
    Host Ports:  0/TCP, 0/TCP
    Limits:
      cpu:     2
      memory:  2G
    Requests:
      cpu:      500m
      memory:   500M
    Liveness:   http-get http://:http/healthz delay=60s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/healthz delay=0s timeout=5s period=5s #success=1 #failure=3
    Environment:
      SYMBOLS_CACHE_SIZE_MB:  100000
      POD_NAME:               symbols-9b56c4b46-tkmnp (v1:metadata.name)
      CACHE_DIR:              /mnt/cache/$(POD_NAME)
    Mounts:
      /mnt/cache from cache-ssd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-2vkvv (ro)
Volumes:
  cache-ssd:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-2vkvv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-2vkvv
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
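
The eviction message above compares the container's ephemeral-storage usage against a request of 0, so any usage at all counts as exceeding the request once the node is under pressure. The following is a minimal sketch of that arithmetic, not the kubelet's actual logic. The helper names are hypothetical, it handles only integer byte quantities (not CPU values such as `500m`), and it assumes `SYMBOLS_CACHE_SIZE_MB` means units of 10**6 bytes.

```python
# Multipliers for the binary and decimal suffixes used in Kubernetes byte quantities.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '11836Ki' or '2G' to bytes (integers only)."""
    # Check two-letter suffixes first so 'Ki' is not mistaken for a bare number.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)


def exceeds_request(usage: str, request: str) -> bool:
    """True when usage is strictly greater than the request."""
    return parse_quantity(usage) > parse_quantity(request)


# Values taken from the `kubectl describe` output above.
usage_bytes = parse_quantity("11836Ki")
# Assumption: the cache size setting is in units of 10**6 bytes.
cache_cap_bytes = 100000 * 1000**2

print(exceeds_request("11836Ki", "0"))
print(usage_bytes < cache_cap_bytes)
```

Under these assumptions the pod was using roughly 12 MB, far below the configured cache cap, which suggests the missing ephemeral-storage request mattered more here than the cache size itself.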
@ggilmore ggilmore added the bug, ops & tools & dev, deployment, deploy-sourcegraph labels Aug 19, 2019
@ggilmore ggilmore added this to the 3.7 milestone Aug 19, 2019
@ggilmore
Contributor Author

cc @sourcegraph/distribution

@keegancsmith
Member

The symbols pod should have the same setup as our searcher pod. So it should have the hostpath to the SSD. If it doesn't, that is a bug.

It's likely that during our testing of global symbol search we hit the symbols service a lot more in cases where a repo wasn't indexed. This likely led to much more stress than usual on the service.

@attfarhan attfarhan modified the milestones: 3.7, Backlog Aug 20, 2019
@keegancsmith keegancsmith modified the milestones: Backlog, 3.9 Sep 25, 2019
@keegancsmith
Member

@kzh think you will get this in this week? Taking into account time for review/etc so we can land it by Monday.

@kzh
Contributor

kzh commented Oct 11, 2019

Proposal: How does disabling global unindexed symbol search or limiting max unindexed repos sound? This would significantly reduce the risk of overburdening the symbols pod. This would mean in order for a repository to be included in a global symbol search, it would have to be indexed first. IMO, this is an okay tradeoff considering the unindexed alternative could potentially harm the pod anyways.

@keegancsmith
Member

This gives a poor experience for a new instance / new repo. Can we instead just limit the number of repos we send to a replica in a single search request?
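
One way to read the suggestion above is to split the repo list into fixed-size batches so that no single request to a replica carries more than a set number of repos. This is only a rough sketch of that idea. The function name `batch_repos` and the `max_per_request` parameter are hypothetical and do not come from the Sourcegraph codebase.

```python
from typing import Iterator, List, Sequence


def batch_repos(repos: Sequence[str], max_per_request: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `repos`, each with at most `max_per_request` items."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    for start in range(0, len(repos), max_per_request):
        yield list(repos[start:start + max_per_request])


# Example: five unindexed repos, at most two per request to a symbols replica.
batches = list(batch_repos(["a", "b", "c", "d", "e"], 2))
print(batches)
```

Each batch would then be sent as its own request, which bounds how much a single request can force a replica to fetch and cache at once.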

@tsenart
Contributor

tsenart commented Oct 14, 2019

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.9 release is scheduled for tomorrow at 10:00 CEST.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@kzh
Contributor

kzh commented Oct 14, 2019

To address this issue, maybe adding a persistent volume claim to the deployment would be more appropriate. There appears to be a somewhat similar config flag that constrains the number of repositories concurrently searched for symbols (~20).
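
A concurrency cap of the kind mentioned above can be sketched with a bounded worker pool. This is an illustration only, not how Sourcegraph implements its flag. The names `search_repos` and `search_one` are hypothetical, and the default of 20 simply mirrors the approximate figure quoted in the comment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def search_repos(
    repos: Sequence[str],
    search_one: Callable[[str], str],
    max_concurrent: int = 20,
) -> List[str]:
    """Run `search_one` over every repo with at most `max_concurrent` in flight."""
    # The pool size is the concurrency limit; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(search_one, repos))


# Example with a stand-in search function.
results = search_repos(["repo-a", "repo-b"], lambda repo: repo.upper(), max_concurrent=2)
print(results)
```

Note that a limit like this bounds how many repos are searched simultaneously, but not how much data accumulates in the cache directory over time.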

@tsenart tsenart modified the milestones: 3.9, 3.10 Oct 15, 2019
@keegancsmith
Member

> To address this issue, maybe adding a persistent volume claim to the deployment would be more appropriate. There appears to be a somewhat similar config flag that constrains the number of repositories concurrently searched for symbols (~20).

You can't do PVCs on deployments. This just requires updating the configuration so we use the local SSD: https://github.com/sourcegraph/sourcegraph/issues/5305#issuecomment-522889078 Compare the searcher deployment with the symbols deployment. If they are the same in deploy-sourcegraph, this may be an issue specific to the k8s.sgdev.org repo, which then needs updating so it matches how searcher works there.
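
The comparison suggested above can be sketched as a check on which volume type backs the cache mount in each pod spec. The symbols spec below mirrors the `cache-ssd` EmptyDir volume from the `kubectl describe` output. The searcher spec and its `hostPath` path are assumptions made up for illustration, not the real manifest, and `cache_volume_kind` is a hypothetical helper.

```python
def cache_volume_kind(pod_spec: dict, volume_name: str) -> str:
    """Return the volume source key (e.g. 'emptyDir', 'hostPath') for a named volume."""
    for volume in pod_spec.get("volumes", []):
        if volume.get("name") == volume_name:
            # A volume entry has a name plus one key naming its source type.
            kinds = [key for key in volume if key != "name"]
            return kinds[0] if kinds else "unknown"
    return "missing"


# Matches the describe output: an EmptyDir with no medium or size limit.
symbols_spec = {"volumes": [{"name": "cache-ssd", "emptyDir": {}}]}
# Assumption: a searcher spec that mounts a local SSD through a hypothetical path.
searcher_spec = {"volumes": [{"name": "cache-ssd", "hostPath": {"path": "/mnt/disks/ssd0"}}]}

symbols_kind = cache_volume_kind(symbols_spec, "cache-ssd")
searcher_kind = cache_volume_kind(searcher_spec, "cache-ssd")
print(symbols_kind, searcher_kind, symbols_kind == searcher_kind)
```

A mismatch like this one would point at the configuration difference described in the comment rather than at a change in the symbols service itself.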

@beyang
Member

beyang commented Nov 13, 2019

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.10 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@kzh kzh modified the milestones: 3.10, 3.11 Nov 14, 2019
@tsenart
Contributor

tsenart commented Nov 14, 2019

@kzh: Are you planning to work on this in 3.11? If you don't know, this should be backlogged.

@beyang
Member

beyang commented Dec 14, 2019

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.11 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@tsenart tsenart modified the milestones: 3.11, Backlog Dec 15, 2019
@tsenart tsenart closed this as completed Jan 14, 2020

6 participants