Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[new feature] promtail: Add config reload endoint / signal to promtail #7247

Merged
merged 30 commits into from
Oct 11, 2022

Conversation

liguozhong
Copy link
Contributor

@liguozhong liguozhong commented Sep 26, 2022

What this PR does / why we need it:
Add config reload endoint / signal to promtail.

reload is a very dangerous feature, it is easy to make promtail panic, we can refer to the failure history of another log agent

vectordotdev/vector#10485
vectordotdev/vector#10412
vectordotdev/vector#13228

But "/reload" can make promtail more flexible, make loki easier to use with k8s, and provide a better foundation for log tail CRD (like prometheus ServiceMonitor).

config

server:
  enable_runtime_reload: true

detail config

server:
  http_listen_port: 9080
  grpc_listen_port: 0
  enable_runtime_reload: true

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs-reload-promtail-test
      __path__: /var/log/cloudprintupdate.log

Which issue(s) this PR fixes:
Fixes #16

Special notes for your reviewer:
old config
image
curl "/reload"
image
new config, "job: varlogs-reload-promtail-test" has change
image

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@liguozhong liguozhong requested a review from a team as a code owner September 26, 2022 13:10
Copy link
Collaborator

@trevorwhitney trevorwhitney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this addition. I think this is a good idea for promtail, I just have a couple thoughts and questions.

return promtail, nil
}

func (p *Promtail) reloadConfig(cfg config.Config) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not a pointer to config.Config? It seems like we're mutating it within this function (ie. cfg.PositionsConfig.ReadOnly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for such fast feedback,done.

promtailServer := p.server.(*server.PromtailServer)
for {
select {
case <-hup:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, not super familiar with these live reload patterns, can you explain the SIGHUP flow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refer to the reload documentation of prometheus, we should also provide this reload way.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/
A configuration reload is triggered by sending a SIGHUP.
image

@@ -693,7 +693,7 @@ func Test_DryRun(t *testing.T) {
PositionsFile: f.Name(),
SyncPeriod: time.Second,
},
}, clientMetrics, false, nil)
}, nil, clientMetrics, false, nil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want a test that this new configReload function is called? Not sure if we need to go so far as the whole config gets reloaded, but maybe at least the function is called, and test the flow where watch config is disabled when the function is not provided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.
hi, your review suggestion is great, through this test I found a bug that "PromtailServer.promtailCfg" was not updated correctly.

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice feature @liguozhong!

This is a very delicate feature which has to be handled and communicated very carefully. I've left a bunch of small nits relating to wording, and identified a couple areas that I'm a bit concerned about - most notably the lack of sad-day tests and the behaviour of invalid configs exiting the process on reload.

clients/pkg/promtail/promtail.go Outdated Show resolved Hide resolved
Comment on lines 178 to 181
level.Warn(p.logger).Log("msg", "disable watchConfig")
return
}
level.Warn(p.logger).Log("msg", "enable watchConfig")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These log messages should be more clear, please.

Comment on lines 188 to 195
cfg := p.newConfig()
if err := p.reloadConfig(cfg); err != nil {
level.Error(p.logger).Log("msg", "Error reloading config", "err", err)
}
case rc := <-promtailServer.Reload():
cfg := p.newConfig()
if err := p.reloadConfig(cfg); err != nil {
level.Error(p.logger).Log("msg", "Error reloading config", "err", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we centralised this logic rather than repeating it

@@ -51,6 +54,8 @@ type Config struct {
ExternalURL string `yaml:"external_url"`
HealthCheckTarget *bool `yaml:"health_check_target"`
Disable bool `yaml:"disable"`
Reload bool `yaml:"reload"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider naming this more clearly.

I would suggest enable_runtime_reload

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 77 to 81
return nil, err
}
server, err := server.New(cfg.ServerConfig, promtail.logger, promtail.targetManagers, cfg.String())
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please wrap these errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -51,6 +54,8 @@ type Config struct {
ExternalURL string `yaml:"external_url"`
HealthCheckTarget *bool `yaml:"health_check_target"`
Disable bool `yaml:"disable"`
Reload bool `yaml:"reload"`
NewByReload bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant, invalid variables left over from my development process

@@ -60,6 +65,7 @@ func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
cfg.Config.RegisterFlags(f)

f.BoolVar(&cfg.Disable, prefix+"server.disable", false, "Disable the http and grpc server.")
f.BoolVar(&cfg.Reload, prefix+"server.reload", false, "Enable reload via HTTP request.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can also be reloaded via SIGHUP, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, the SIGHUP way in the reload of prometheus cannot be disable.
I suggest we should keep the same behavior as prometheus.

clients/pkg/promtail/promtail.go Show resolved Hide resolved
clients/pkg/promtail/promtail_test.go Show resolved Hide resolved
var config Config
if err := cfg.DefaultUnmarshal(&config, args, flag.NewFlagSet(os.Args[0], flag.ExitOnError)); err != nil {
fmt.Println("Unable to parse config:", err)
os.Exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I certainly don't think this behaviour is desired; if a runtime config cannot be reloaded, it should not kill the process.
I think we need to distinguish between the initial load and subsequent (runtime) reloads.

This will also better match how Loki works with its overrides.

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@liguozhong
Copy link
Contributor Author

All the review tips, I have made targeted changes

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
- querier/queryrange	-0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@liguozhong
Copy link
Contributor Author

@dannykopping hi, this pr is ready for review .
please help me to review it again when you have time

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good @liguozhong - a few minor points to address then I think this is ready to gp

clients/cmd/promtail/main.go Outdated Show resolved Hide resolved
@@ -8,6 +8,7 @@ import (

"github.com/go-kit/log"
"github.com/go-kit/log/level"
"github.com/pkg/errors"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please use the stdlib errors where possible

@@ -65,20 +74,24 @@ func New(cfg config.Config, newConfig func() *config.Config, metrics *client.Met
metrics: metrics,
dryRun: dryRun,
}
err := promtail.reg.Register(reloadTotal)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please wrap this error

@@ -87,9 +100,14 @@ func New(cfg config.Config, newConfig func() *config.Config, metrics *client.Met
}

func (p *Promtail) reloadConfig(cfg *config.Config) error {
level.Info(p.logger).Log("msg", "Loading configuration file")
level.Info(p.logger).Log("msg", "Reloading configuration file")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think let's move this into the condition on L107 so that it only logs when a change is made.
Alternatively, you could change this to a debug level and add an info level log line in the condition on L107

}
promtailServer, ok := p.server.(*server.PromtailServer)
if !ok {
level.Warn(p.logger).Log("msg", "disable watchConfig", "reason", "promtailServer cast fail")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log message won't mean much to the user. Can we return the actual parse error here?

}
err = p.reloadConfig(cfg)
if err != nil {
reloadTotal.With(prometheus.Labels{"code": "500"}).Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these 200 and 500 codes meant to mimick HTTP status codes?
If so, I think that's confusing. Let's be clear here and have two separate metrics - one for successful reloads and one for failures.

liguozhong and others added 2 commits October 10, 2022 17:30
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
@liguozhong
Copy link
Contributor Author

Auto-merging clients/cmd/promtail/main.go
CONFLICT (content): Merge conflict in clients/cmd/promtail/main.go
Automatic merge failed; fix conflicts and then commit the result.

I have to merge master branch

# Conflicts:
#	clients/cmd/promtail/main.go
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
- querier/queryrange	-0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@liguozhong
Copy link
Contributor Author

Looking good @liguozhong - a few minor points to address then I think this is ready to gp

Thank you for investing so much time in helping to review this PR.
According to your suggestion, I have commit new code, and the cicd also passed.

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last couple of small nits around errors, and then we can get this merged 👍

if err != nil {
return nil, err
return nil, errors.Wrap(err, "error register prometheus collector reloadSuccessTotal")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace these with fmt.Errorf

cfg, err := p.newConfig()
if err != nil {
reloadFailTotal.Inc()
return errors.Wrap(err, "Error new Config")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use fmt.Errorf

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@liguozhong
Copy link
Contributor Author

done ,and cicd has passed

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add config reload endoint / signal to promtail
4 participants