[Custom DC] Various setup improvements #2349

Merged (5 commits) on Mar 2, 2023

@Fructokinase Fructokinase commented Mar 1, 2023

Cluster

  • Migrate cluster management from a shell script to a Terraform module. This lets Terraform also delete the GKE cluster, which is one step closer to E2E testing.
  • Allow zonal clusters instead of enforcing regional clusters, which should cut resource usage by roughly 2/3. The location determines whether a zonal or a regional cluster is created, e.g. "us-central1" -> regional, "us-central1-a" -> zonal. The difference is a 99.95% SLA (regional) vs a 99.5% SLA (zonal). I think zonal makes sense for most POC/dev projects.
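
As a rough sketch of the location rule above (not the PR's actual code), the shell function below classifies a location string by its shape. The name `cluster_kind` is hypothetical. The regional pattern is the same one used in this PR's shell check, written with `grep -E` so it also runs under plain `sh`; the zonal pattern is an assumption that a zone is a region plus a one-letter suffix.

```shell
# Hypothetical helper (not from the PR): classify a GKE location string.
cluster_kind() {
  # Region: exactly two dash-separated parts, e.g. "us-central1".
  if printf '%s\n' "$1" | grep -Eq '^[a-z]+-[a-z0-9]+$'; then
    echo "regional"
  # Zone (assumed shape): a region plus a one-letter suffix, e.g. "us-central1-a".
  elif printf '%s\n' "$1" | grep -Eq '^[a-z]+-[a-z0-9]+-[a-z]$'; then
    echo "zonal"
  else
    echo "unknown"
  fi
}

cluster_kind "us-central1"    # prints: regional
cluster_kind "us-central1-a"  # prints: zonal
```

Anything that matches neither shape is reported as "unknown" rather than guessed, so a typo in the location surfaces early.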

Helm chart

  • Allow the GCR repo project to be specified. This is a follow-up to @shifucun's support for building images in the custom DC project.
  • Move Helm out of Terraform management. One issue over the past few weeks was that it was easy to get into a circular situation: the state needs to be read at the beginning (reading which resources a cluster has requires the kube config), but the kube config has not been fetched yet because that happens after the state is read. It may be cleaner to separate GCP-level resources from k8s-level resources.
  • This does not affect GCP level resource lifecycle, as even if pods and services are in a cluster, the cluster can still be deleted.
  • Taking the k8s resources out of the Terraform lifecycle may make things easier when integrating with Argo, as Argo changes will not touch Terraform at all.
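
The separation described above could look roughly like the two-phase flow below. This is only a sketch, printed as a dry run rather than executed: the function name, the Terraform variable names (`project_id`, `location`), and the release name `dc-website` are all assumptions for illustration; only the chart path comes from this PR.

```shell
# Hypothetical dry run: print the two-phase deploy instead of executing it.
# Arguments: GCP project ID, cluster name, cluster location (zone or region).
print_deploy_plan() {
  project_id="$1"
  cluster_name="$2"
  location="$3"

  # Phase 1: GCP-level resources, including the GKE cluster, via Terraform.
  # The variable names here are assumed, not taken from the module.
  echo "terraform apply -var project_id=${project_id} -var location=${location}"

  # Between phases: fetch the kube config only after the cluster exists.
  echo "gcloud container clusters get-credentials ${cluster_name} --location ${location} --project ${project_id}"

  # Phase 2: k8s-level resources via Helm, outside of Terraform state.
  # The release name "dc-website" is a placeholder.
  echo "helm upgrade --install dc-website deploy/helm_charts/dc_website"
}

print_deploy_plan my-project my-cluster us-central1-a
```

Because Terraform never reads k8s state in this flow, it does not need the kube config before the cluster exists, which avoids the circular read described above.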

@Fructokinase (Collaborator, Author)

I used this to deploy to the RFF instance, but I'm not confident enough without E2E testing, so the version is not bumped yet.

@Fructokinase (Collaborator, Author)

@juliawu Also adding you as the reviewer, hope you can familiarize yourself with some of the changes happening in custom DC!

@shifucun (Contributor) left a comment

Separating Helm and Terraform is a good idea.

@@ -29,10 +29,9 @@ kind: Ingress
metadata:
  name: {{ .Values.ingress.name }}
  namespace: {{ .Values.namespace.name }}
  {{- with .Values.ingress.annotations }}
  annotations:
Contributor

this is good, would be nice to have more static resources like this (so it's easier to know what it is)

Collaborator Author

Agreed, will follow this style from now on!

deploy/helm_charts/dc_website/values.yaml (outdated, resolved)

node_count = var.num_nodes

node_config {
  machine_type = "e2-highmem-4"
Contributor

can you check if we can use a weaker machine_type?

Collaborator Author

Following our kustomize template, I'm changing the memory for the website container from 8G to 3G and for ESP from 2G to 1G. That comes to roughly 12.75G occupied of the 15.X G available on an e2-standard-4.

However, the cost saving is only $30 or so (from ~$132 to ~$98). If there's anything that we shouldn't cheap out on, imo it's the machine type, because k8s under pressure will give all sorts of weird errors. We may also need more resources for new services. I think keeping this machine type is appropriate, but I will leave the final decision to you.

Contributor

Then sounds good to keep this. In that case, can revert the mem change above?

Collaborator Author

Keeping the mem changes for now to conserve resources for other potential pods (like argo). Please see the response to Julia's comment.

@juliawu (Contributor) left a comment

Thanks for adding me, LGTM, just had some minor comments. Deferring approval to @shifucun.

- memory: "8G"
+ memory: "3G"
Contributor

For my own understanding, why are we lowering memory limits, both here and in lines 204 and 206?

Collaborator Author

We have two ways of generating these YAMLs: Kustomize and Helm (different tools). We deploy to autopush/staging/prod/stanford using Kustomize, and custom DC uses Helm.

Since the Kustomize template uses 3G and 1G respectively, that should be enough for custom DC as well. Since we're trying to save cost for custom DC users, we only have 1 machine for the entire GKE cluster. I would like to conserve resources in case we need to deploy more things.

Contributor

That makes sense, the flask server won't use too much memory, only the svg mixer needs high memory.

@@ -12,8 +12,13 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $LOCATION =~ ^[a-z]+-[a-z0-9]+$ ]]; then
Contributor

nit: add a quick comment on what the regex is looking for

Collaborator Author

done. thanks!


@Fructokinase Fructokinase merged commit ddc1dfd into datacommonsorg:master Mar 2, 2023
shifucun added a commit that referenced this pull request Mar 28, 2023
Bumps [redis](https://github.com/redis/redis-py) from 3.5.3 to 4.5.3.


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Bo Xu <shifucun@users.noreply.github.com>