
Event not sent when scrubbing sensitive data in before_send on sentry-ruby 5.22.2 #2526

Closed
davidwessman opened this issue Jan 27, 2025 · 11 comments · Fixed by #2529

@davidwessman

Issue Description

  • We have some tests checking our filtering before sending data to Sentry.
  • After upgrading to 5.22.2, we no longer get any events in our tests.

Our filtering matches the documentation
https://docs.sentry.io/platforms/ruby/guides/rails/data-management/sensitive-data/#scrubbing-data

I think this change is the culprit.

Reproduction Steps

config/initializers/sentry.rb

SENTRY_FILTER = %i[
  identification_number
  ...
]

Sentry.init do |config|
  # https://docs.sentry.io/platforms/ruby/guides/rails/data-management/sensitive-data/#scrubbing-data
  filter = ActiveSupport::ParameterFilter.new(
    SENTRY_FILTER
  )

  config.before_send = lambda do |event, hint|
    filter.filter(event.to_hash)
  end
end
require "test_helper"
require "sentry/test_helper"

class SentryTest < ActiveSupport::TestCase
  include Sentry::TestHelper

  setup do
    setup_sentry_test
  end

  test "exception sends to Sentry" do
    exception = RuntimeError.new("This is a test")
    Sentry.capture_exception(exception)

    event = last_sentry_event # this returns nil on 5.22.2
    message = event[:exception][:values][0][:value]

    assert_match(/This is a test/, message)
  end
end

Expected Behavior

The documented before_send filter should still work (or there should be another supported way to write it)

Actual Behavior

before_send returns a hash, and the event is therefore never sent
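
In other words (a reading of the reported behavior, not a claim about the SDK internals): the lambda's return value is a plain Hash, and on 5.22.2 only a Sentry::Event object (or nil, to deliberately drop the event) appears to make it through before_send.

config.before_send = lambda do |event, hint|
  filter.filter(event.to_hash) # last expression is the return value: a plain Hash, not a Sentry::Event
end
# On 5.22.1 and earlier the hash was accepted; on 5.22.2 the event never reaches
# the transport, which is why last_sentry_event returns nil in the test above.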

Ruby Version

3.3.7

SDK Version

5.22.2

Integration and Its Version

Rails 7.1, Sidekiq 7.3.8

Sentry Config

Minimal example that reproduces the error

SENTRY_FILTER = %i[
  body
  identification_number
]

Sentry.init do |config|
  # https://docs.sentry.io/platforms/ruby/guides/rails/data-management/sensitive-data/#scrubbing-data
  filter = ActiveSupport::ParameterFilter.new(
    SENTRY_FILTER
  )

  config.before_send = lambda do |event, hint|
    filter.filter(event.to_hash)
  end
end
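
For context on what the scrubbing itself does: ActiveSupport::ParameterFilter replaces the values of matching keys, including nested ones, with a mask ("[FILTERED]" by default). A minimal standalone sketch of that behavior, independent of Sentry:

require "active_support/parameter_filter"

filter = ActiveSupport::ParameterFilter.new(%i[body identification_number])

filter.filter(request: { data: { identification_number: "19900101-1234", name: "Jane" } })
# => { request: { data: { identification_number: "[FILTERED]", name: "Jane" } } }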
@sl0thentr0py
Member

@st0012 @solnic this seems to be because of #2504

@st0012
Collaborator

st0012 commented Jan 28, 2025

Sorry for the regression. I'm taking a look.

#2529

@aaronjensen
Contributor

This has caused Sentry to be completely useless for us. We had a production error on Friday that went unnoticed until today, and was only discovered by accident, because of this. This release should have been pulled immediately. It is absolutely unacceptable that a patch release would cause such a significant regression, and even more unacceptable that the release was not immediately removed and that customers were not notified. I see a 5 hour old PR that's just sitting there.

What is the status of this? Right now we have to redeploy all of our web applications to ensure that they can actually report errors properly. I'm beyond frustrated with the decline in service and reliability that Sentry has continued to undergo in recent years.

@sl0thentr0py
Member

sl0thentr0py commented Jan 29, 2025

Apologies for the oversight on our side here; 5.22.3 has now been released.

@aaronjensen Someone from the product side will follow up with you on the support ticket you opened regarding feedback on how this issue was tackled and other complaints you might have on our overall system stability.

As for

I see a 5 hour old PR that's just sitting there.

We all work internationally across many different time zones, and some humans sleep while others are awake. Oversights happen in complex software with multiple collaborators, and we will improve our process around quicker resolution of regressions.

@davidwessman
Author

I am glad that my colleague wrote a test that exercises the data scrubbing.

Is there any guidance on what we should return from our before_send going forward?
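
One possibility, pending official guidance, is to scrub the event object's own attributes and return the event itself instead of a hash. This is a hedged sketch, not the documented approach; the accessor names (extra, user, request.data) and the key list are assumptions:

SENTRY_SCRUB_KEYS = %i[identification_number].freeze # hypothetical key list

Sentry.init do |config|
  filter = ActiveSupport::ParameterFilter.new(SENTRY_SCRUB_KEYS)

  config.before_send = lambda do |event, _hint|
    # Scrub mutable attributes on the event object instead of converting it to a hash.
    event.extra = filter.filter(event.extra) if event.extra.is_a?(Hash)
    event.user  = filter.filter(event.user)  if event.user.is_a?(Hash)

    request = event.respond_to?(:request) ? event.request : nil
    request.data = filter.filter(request.data) if request && request.data.is_a?(Hash)

    event # return the event object (or nil to drop it) rather than a plain hash
  end
end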

@aaronjensen
Contributor

@sl0thentr0py

With all due respect, I'm well aware that humans sleep. What concerns me is that someone from Sentry (you) acknowledged the issue at 5 AM Pacific time yesterday but did not appear to actually address it or take any of the actions I would expect from a company that is meant to be the "watcher". I'm reminded of "Who watches the watchers?"

I see no explanation for why the release was not pulled, or why on call staff was not called to repair the issue and issue an emergency release. Furthermore, I know you have data on what versions people are using, so why didn't I get an email telling me the version I was using was faulty? Why didn't your operations team notice a drop in error report rates associated with that particular version?

When your response to me pointing out the harm that you (the company, Sentry) caused me and my clients is to patronize me, it does not restore a lick of confidence.

@sl0thentr0py
Member

I see you are (understandably) annoyed, but I will give you clear and transparent answers to your questions as an engineer.

I see no explanation for why the release was not pulled

Our library release flow involves a fair amount of compliance and process, and as defined by that process, we do not revoke releases but fix and release an updated version.

why on call staff was not called to repair the issue

We do not have on call staff for SDKs (which are things users install and run on their machines), but only for our SaaS offering.
It is my (and other SDK engineers') responsibility to ensure SDK stability, and I take responsibility for this particular fuck up.

I know you have data on what versions people are using, so why didn't I get an email telling me the version I was using was faulty? Why didn't your operations team notice a drop in error report rates associated with that particular version?

At the scale we operate, a broken before_send even for a few hundred customers is unfortunately a drop in the ocean, and there is no feasible statistical way I can see to detect something broken at this size.
We get a lot of alerts when something is seriously broken, but I would be lying if I said it were possible to detect every problem of every shape and size across our SDKs and our product. That is a serious engineering challenge.
Just to give an example, it would be very hard for us to distinguish a drop in events caused by this bug from a drop caused by something else, such as spike protection kicking in for a particular user. We can only guess why a particular client suddenly stopped sending us events; it is not a deterministic process.
What we do instead is try to be fully transparent about event drops on our Stats page.

As an open source company, especially for SDKs, we absolutely rely on users like you to notify us on GitHub when something is wrong so that we can fix it as soon as possible.

@aaronjensen
Contributor

I see you are (understandably) annoyed, but

As a point of order, you are continuing to be patronizing. My annoyance or lack thereof is irrelevant. What is relevant is that a vendor I have been using and recommending for years dropped the ball in a major way, and now I am forced to reevaluate my stance on that vendor. I have to decide whether or not I am going to recommend to my clients that we transition hundreds of apps from Sentry to something else. I have to stand by my next recommendation and eat crow for the current one. That is, unless I can get some sense from Sentry that this is highly unlikely to happen again.

I spent yesterday, instead of assisting the rest of my team, focused on diagnosing why it appeared errors were not being tracked, then updating 30+ web applications to have a pinned version of Sentry.
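
(For anyone else pinning while they wait: a Gemfile constraint along these lines, a sketch whose gem list depends on your stack, excludes the affected release while still allowing the fixed one.)

# Gemfile -- hedged sketch: skip the release with the before_send regression.
gem "sentry-ruby",    "~> 5.22", "!= 5.22.2"
gem "sentry-rails",   "~> 5.22", "!= 5.22.2"
gem "sentry-sidekiq", "~> 5.22", "!= 5.22.2"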

I'm not annoyed. I'm affected and continue to be so.

Our library release flow involves a fair amount of compliance and process, and as defined by that process, we do not revoke releases but fix and release an updated version.

I would personally call this into question at this point if I were in your shoes. I recognize that I am not, and I don't know everything about your situation; I'm not trying to. I'm trying to point to my expectations as a customer, user, and recommender of your product.

...is no feasible statistical way I can see to detect something broken at this size.

If the impact was that small, then I totally understand this.

We do not have on call staff for SDKs...

Is this a "yet"?

I haven't heard anything about the testing lapse here, the cultural concerns, or anything else that lets me know that this isn't going to just happen again in the next release.

@sbellware

sbellware commented Jan 29, 2025

Irrespective of the particularities of normal operating procedures, this was categorically not a matter of normal operations. It was a matter of emergency operations, which require emergency procedures.

You don't need to execute elaborate compliance protocols in order to revert to a previous version of a client unless that previous version is incompatible with the prior version. Data about compatibility is something that anyone should reasonably expect a company of Sentry's scale to have and to maintain, and that data is a critical input to its emergency operations.

Failing to be able to execute an emergency reversion of a client package betrays a lack of preparedness and thoughtfulness toward emergency situations and operations. As bad as that is, what's more shocking in all this is the tacit admission of engineering cultural problems evidenced by the equivocation evident in the responses from Sentry's staff.

This incident isn't a matter of organizational scale, of open source, of off-shoring, or of fault detection, and it's barely an issue of compliance processes. It's a straightforward matter of executing a transition to an emergency response posture in the face of an emergency and, key to emergency response processes, of decentralizing decision-making to trusted lieutenants during the emergency so that they can respond quickly and sensibly based on their experience, training, and the trust that Sentry has invested in them.

The root cause in all of this is a serious cultural dysfunction that would not just allow for the defect to escape, but that would seek to defer accountability, and to even victim-shame users and customers.

I suspect that this matter is being contained by lower-and-mid-level people at Sentry, and being hidden from senior leadership. Either that, or the cultural problem is rooted in the senior leadership itself.

Either way, Sentry's response to this is wholly the wrong response. On the other hand, it's a response in public, which allows the public to witness the deflection and equivocation indulged in while addressing a problem that only ever needed a response conveying accountability and clear communication of the commitment, ways, and means of impending remediation.

And still, it's pretty typical of mid-curve software development culture, so at least Sentry isn't really displaying out-of-norm character and behavior in all of this.

Nevertheless, we know that Sentry could have done better, could still do better, and has a clear path to doing better (it even has a ready issue at hand that it can use to shape the needed changes), but it refuses to acknowledge the fullness of the harm that it has invited into its users' lives.

When Sentry advances beyond the impulse to evade and to equivocate, it will also have matured its engineering processes beyond the middling point of failing to have a practiced and practicable emergency response plan in place to respond to the kind of failings illustrated in this issue.

We don't want to know that you face challenges. Challenges are a common denominator for every single person and organization in the audience here. We want to know why you didn't switch to an emergency response posture resulting, at the very least, in an immediate evaluation of the option to yank the defective client package release.

From a business perspective, Sentry has squandered the hard-won goodwill of customers and users. It will be far more expensive to recover that asset than it would have been to invest in a more capable emergency preparedness regime.

@azranel

azranel commented Jan 29, 2025

@sl0thentr0py and @st0012 thanks for the fix!

@davidwessman
Author

I absolutely prefer a tool that develops and follows the trends of web development. I can recommend the test included in the issue 🙂
