Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade ClickHouse server image to v23.4 #306

Merged
merged 2 commits into from
Jun 8, 2023

Conversation

heanlan
Copy link
Contributor

@heanlan heanlan commented May 24, 2023

  • Upgrade ClickHouse server to v23.4
  • Upgrade ClickHouse operator and metrics exporter to v0.21.0

Notable changes:

  • The inner table name of Materialized View has
    been changed to .inner_id.<uuid>. We are no longer able to
    find the table by self-defined Materialized View name. Thus,
    in create_table.sh, we save the data of Materialized Views
    to pre-defined underlying tables.
  • ClickHouse upgrade test used to always apply the latest
    clickhouse-operator-install-bundle.yaml. In this upgrade,
    latest operator yaml does not work with version N-1 ClickHouse
    data schema. Therefore, we change the test to use the correct
    version of operator yaml corresponding to the tested ClickHouse
    data schema.
  • We removed the theia- prefix in clickhouse operator images in
  • We increased theia-manager ping timeout when connecting to
    ClickHouse Service to fix test flakiness.
  • We added a time wait before starting iPerf traffic in e2e test to
    avoid missing the first a few records.

Signed-off-by: heanlan hanlan@vmware.com

@codecov
Copy link

codecov bot commented May 24, 2023

Codecov Report

Merging #306 (14ade5a) into main (96a2009) will increase coverage by 0.12%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #306      +/-   ##
==========================================
+ Coverage   66.56%   66.69%   +0.12%     
==========================================
  Files          38       38              
  Lines        4783     4783              
==========================================
+ Hits         3184     3190       +6     
+ Misses       1452     1444       -8     
- Partials      147      149       +2     
Flag Coverage Δ *Carryforward flag
python-coverage 57.51% <ø> (ø) Carriedforward from 99e8b52
unit-tests 69.90% <ø> (+0.16%) ⬆️

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files Coverage Δ
pkg/util/clickhouse/clickhouse.go 74.28% <ø> (ø)

... and 2 files with indirect coverage changes

@elton-furtado elton-furtado added this to the Theia v0.7 release milestone May 24, 2023
@heanlan heanlan force-pushed the upgrade-clickhouse branch 5 times, most recently from 74958b4 to 40b2873 Compare May 29, 2023 21:09
test/e2e/framework.go Outdated Show resolved Hide resolved
ci/kind/CH_OPERATOR_VERSION_MAP Outdated Show resolved Hide resolved
ci/kind/test-e2e-kind.sh Outdated Show resolved Hide resolved
@heanlan heanlan force-pushed the upgrade-clickhouse branch from 9d5a1e3 to 9a53a17 Compare June 5, 2023 18:02
@heanlan heanlan marked this pull request as draft June 5, 2023 19:30
@heanlan heanlan force-pushed the upgrade-clickhouse branch 2 times, most recently from bc275de to b85b98d Compare June 5, 2023 21:09
@heanlan heanlan marked this pull request as ready for review June 5, 2023 22:53
@heanlan heanlan requested review from yanjunz97 and yuntanghsu June 5, 2023 22:53
ci/kind/test-upgrade-theia.sh Outdated Show resolved Hide resolved
ci/kind/test-upgrade-theia.sh Outdated Show resolved Hide resolved
docs/network-flow-visibility.md Show resolved Hide resolved
@heanlan heanlan force-pushed the upgrade-clickhouse branch from 13c3e37 to abb8346 Compare June 6, 2023 00:20
Copy link
Contributor

@yanjunz97 yanjunz97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@heanlan heanlan force-pushed the upgrade-clickhouse branch from abb8346 to 2b6e4b4 Compare June 6, 2023 01:09
@heanlan
Copy link
Contributor Author

heanlan commented Jun 6, 2023

/theia-test-e2e

Copy link
Contributor

@yuntanghsu yuntanghsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

test/e2e/framework.go Show resolved Hide resolved
@heanlan
Copy link
Contributor Author

heanlan commented Jun 6, 2023

Let's don't merge this PR for now as we are still trying to figure out the e2e flakiness in --run=TestFlowVisibility/IPv4/IntraNodeFlows

- Upgrade ClickHouse server to v23.4
- Upgrade ClickHouse operator and metrics exporter to v0.21.0

Notable changes:
- The inner table name of Materialized View has
been changed to `.inner_id.<uuid>`. We are no longer able to
find the table by self-defined Materialized View name. Thus,
in `create_table.sh`, we save the data of Materialized Views
to pre-defined underlying tables.
- ClickHouse upgrade test used to always apply the latest
clickhouse-operator-install-bundle.yaml. In this upgrade,
latest operator yaml does not work with version N-1 ClickHouse
data schema. Therefore, we change the test to use the correct
version of operator yaml corresponding to the tested ClickHouse
data schema.
- We removed the theia- prefix in clickhouse operator images in
- We increase theia-manager ping timeout when connecting to
ClickHouse Service to fix test flakiness.

Signed-off-by: heanlan <hanlan@vmware.com>
@heanlan heanlan force-pushed the upgrade-clickhouse branch from 2b6e4b4 to 99e8b52 Compare June 6, 2023 20:54
@heanlan heanlan marked this pull request as draft June 7, 2023 03:22
Signed-off-by: heanlan <hanlan@vmware.com>
@heanlan heanlan marked this pull request as ready for review June 8, 2023 06:35
@heanlan
Copy link
Contributor Author

heanlan commented Jun 8, 2023

Please help review if the latest commit looks good to you

Background: We found flakiness in TestFlowVisibility/IPv4/IntraNodeFlows. Sometimes it got either Error: "2" is not greater than or equal to "3". or Error: "1" is not greater than or equal to "3", which means we missed the first one or two records.

I added a debug log in ClickHouse client code and extend the ping time out from 10s to 30s for better observation. Below is an example log file when the client trying to ping the DB:

I0608 05:10:12.371000       1 clickhouse.go:48] "ClickHouse configuration" database="default" databaseURL="tcp://clickhouse-clickhouse.flow-visibility.svc:9000" debug=false compress=true commitInterval="1s"
I0608 05:10:12.393483       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:13.414479       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:14.411648       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:15.410131       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:16.409643       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:17.408258       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:18.413000       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:19.409809       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:20.411970       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:21.406826       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:22.406779       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:23.412605       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:24.411991       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:25.414337       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:26.413789       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:27.407402       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:28.414537       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"
I0608 05:10:29.405719       1 clickhouseclient.go:444] "Ping ClickHouse DB" connErr="failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host"

It turns out the testing iPerf traffic starts before the connection is ready. So we add a time wait as a workaround. Please check out the comment for details.

Copy link
Contributor

@yanjunz97 yanjunz97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, maybe we need to open an issue to track that we need another antctl command.

@heanlan
Copy link
Contributor Author

heanlan commented Jun 8, 2023

More info on FA Pod restart situation - we observe the Pod firstly in "Running" state, after some time it crashed and restarted. By checking the Pod log below, we found the error comes from timeout while pinging ClickHouse DB, and finally return to FA cmd/main.go

I0608 03:21:21.886566       1 clickhouse.go:48] "ClickHouse configuration" database="default" databaseURL="tcp://clickhouse-clickhouse.flow-visibility.svc:9000" debug=true compress=true commitInterval="1s"
F0608 03:21:31.951796       1 main.go:50] Error running flow aggregator: error when creating ClickHouse export process: failed to connect to ClickHouse after 10s: failed to ping ClickHouse: dial tcp: lookup clickhouse-clickhouse.flow-visibility.svc on 10.96.0.10:53: no such host
goroutine 1 [running]:
k8s.io/klog/v2/internal/dbg.Stacks(0x0)
	/go/pkg/mod/k8s.io/klog/v2@v2.80.1/internal/dbg/dbg.go:35 +0x89
k8s.io/klog/v2.(*loggingT).output(0x40ec120, 0x3, 0x0, 0xc000326cb0, 0x1, {0x33117a3?, 0x1?}, 0xc00007a800?, 0x0)
	/go/pkg/mod/k8s.io/klog/v2@v2.80.1/klog.go:935 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x40ec120, 0x0?, 0x0, {0x0, 0x0}, 0x0?, {0x27d8c57, 0x21}, {0xc000373e00, 0x1, ...})
	/go/pkg/mod/k8s.io/klog/v2@v2.80.1/klog.go:736 +0x1f3
k8s.io/klog/v2.(*loggingT).printf(...)
	/go/pkg/mod/k8s.io/klog/v2@v2.80.1/klog.go:718
k8s.io/klog/v2.Fatalf(...)
	/go/pkg/mod/k8s.io/klog/v2@v2.80.1/klog.go:1621
main.newFlowAggregatorCommand.func1(0xc0004ca000?, {0x27a259a?, 0x7?, 0x7?})
	/antrea/cmd/flow-aggregator/main.go:50 +0x1f4
github.com/spf13/cobra.(*Command).execute(0xc0004ca000, {0xc000124010, 0x7, 0x7})
	/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:944 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0xc0004ca000)
	/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992
main.main()
	/antrea/cmd/flow-aggregator/main.go:32 +0x1e

Copy link

@dreamtalen dreamtalen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@heanlan heanlan merged commit 61818ba into antrea-io:main Jun 8, 2023
@heanlan heanlan deleted the upgrade-clickhouse branch June 8, 2023 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants