Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Http/sURLConnection auto instrumentation #133

Conversation

surbhiia
Copy link
Contributor

@surbhiia surbhiia commented Oct 30, 2023

What's covered in this PR

As the Android SDK classes can't be instrumented with ByteBuddy Android Gradle Plugin, this PR instruments all calls to relevant URLConnection/HttpURLConnection/HttpsURLConnection APIs in android applications source code.

APIs Instrumented:

APIs that connect but do not read bytes:

  • connect()
  • getOutputStream()

APIs that connect and read bytes:

  • getContent()
  • getContent(Class[])
  • getContentEncoding()
  • getContentLength()
  • getContentType()
  • getDate()
  • getExpiration()
  • getLastModified()
  • getHeaderFieldDate(name, Default)
  • getHeaderFields()
  • getHeaderFieldKey(int)
  • getHeaderField(int)
  • getHeaderField(name)
  • getHeaderFieldInt(name, Default)
  • getResponseMessage()
  • getResponseCode()
  • getInputStream()
  • getErrorStream()
  • getContentLengthLong() ( added in API level 24)
  • getHeaderFieldLong() ( added in API level 24)

Approach Taken

  • Request header can only be added before connecting. So, trace context header should be added before any of the above API calls are made.
  • All above API calls are replaced with our own replacement methods which do the following:
  1. Add connection to our ongoing URLConnections map if not added already. If it is the first time connecting, trace context is added as a header to the outgoing request
  2. Original API call is made. If method can throw an exception, we wrap the method around a try/catch, so we can close the ongoing spans with exception reporting before re-throwing exception back to client.
  3. We update the last seen timestamp for the connection. If API is amongst the ones that read bytes also, we mark this URLConnection as harvestable.

Harvestable meaning here - if harvestable and if not already reported and if left idle for more than CONNECTION_KEEP_ALIVE_INTERVAL ms, connection is reported by a connection killer thread. So, once connection is made and any data is read from the connection, we mark it as harvestable because we don't anticipate connection to be sitting idle for longer after this.

As noted above, We keep track of all ongoing URLConnections and three other properties for it - last seen time, harvestable (boolean), reported (boolean)

When are spans closed and reported:

  • If any exception is thrown anywhere, spans are closed and reported.
  • As a general behavior, either getInputStream or getErrorStream is always called and called in the end. We close spans and report when these two methods are called.
  • A connection reporter is scheduled which periodically check if any tracked connection, which is marked harvestable and not reported yet, was last seen more than CONNECTION_KEEP_ALIVE_INTERVAL ms before, then it closes the spans and reports the connection. [Currently, couldn't think of a way to proactively schedule this for clients, customers can manually call our API to schedule this runnable for now.]

Sample App to test this PR
Added an android sample app with many test scenarios for this PR here.

If you would like to test this PR using the above sample app:

  • Pull this PR in your local repo.
  • Update the return value to some dummy value for the API you would like to test in HttpUrlReplacements.java
  • Update gradle.properties with relevant version number
  • Use ./gradlew publishToMavenLocal

In sample app:

  • Update the version of httpURLConnection agent and lib dependencies to the version number you just published
  • Build and run the app. Click the "ClickMe" button.
  • See output in logcat.

@surbhiia surbhiia requested a review from a team October 30, 2023 23:10
@LikeTheSalad
Copy link
Contributor

Thanks @surbhiia I think it's a good start, although I noticed that there are some comments and todos in the wrapper class that seem like we should discuss further, probably in a call if needed.

I've also created this issue to prevent API level restrictions on this module. I think that if we can verify that no runtime issues would happen when the wrapper class is loaded in envs where some methods aren't available, then it should be fine to ignore API restrictions for these cases.

@surbhiia
Copy link
Contributor Author

surbhiia commented Nov 2, 2023

I've also created this issue to prevent API level restrictions on this module. I think that if we can verify that no runtime issues would happen when the wrapper class is loaded in envs where some methods aren't available, then it should be fine to ignore API restrictions for these cases.

Thanks @LikeTheSalad! I added those 2 APIs in the wrapper class and instrumented the sample app with that and ran the instrumented app on an API level 21 image and it worked perfectly! No issues at runtime!

@surbhiia surbhiia closed this Nov 2, 2023
@surbhiia
Copy link
Contributor Author

surbhiia commented Nov 2, 2023

Oops, mistakenly closed, re-opening again!

@surbhiia surbhiia reopened this Nov 2, 2023
@breedx-splk
Copy link
Contributor

Thanks @surbhiia for the contribution!
Please add a README.md for the instrumentation. At a minimum it should contain:

  • Status: Experimental
  • What a user needs to do to opt into this instrumentation
    • Build changes
    • Starting the harvester
  • A very brief description of what this does (bytecode instrumentation wrapping user calls to the http call sites)

@breedx-splk
Copy link
Contributor

As discussed, I've created #148 so that we can defer adding test to this (large!) PR. I'll assign it to you @surbhiia if you think you'd like to also take that on after this goes in. 🙏🏻 Thanks.

@surbhiia
Copy link
Contributor Author

surbhiia commented Nov 15, 2023

Thanks @breedx-splk! I've added a readme now! Do let me know if anything looks amiss.

Thanks for creating the issue. Do assign it to me. I'll add the tests in next PR. :)

@surbhiia surbhiia force-pushed the develop/HttpURLConnection-auto-instrumentation branch from 15504e8 to fa389ed Compare November 15, 2023 02:03
@breedx-splk
Copy link
Contributor

@surbhiia @LikeTheSalad We chatted about some edge cases, specifically related to use cases where the client just pumps data (like a POST) and doesn't bother to read the response. I have put together a test gist that does exactly this: https://gist.github.com/breedx-splk/246962f7d1c180f924a530630a966c9f

In my testing with that, I run the collector locally and then I used netcat (nc) to see the request payload. I then ran that main method with the java agent attached and observed the following:

  • No traces sent to the collector (debug logger enabled in the collector)
  • No input seen by nc.

This indicates to me that the upstream instrumentation is behaving correctly -- it doesn't create a span because the request/response has not taken place. I think the HttpURLConnection "apis" (if you can even call them that!) expect the user to read the response in some way -- it is very very lazy.

If you add int responseCode = conn.getResponseCode(); after the flush() for example, the request is actually sent.

With that in mind, I definitely don't think it's something we should worry about covering this early in the instrumentation development.

@breedx-splk
Copy link
Contributor

breedx-splk commented Nov 15, 2023

Thanks for creating the issue. Do assign it to me. I'll add the tests in next PR. :)

😎 @surbhiia You have to comment on the issue before I can assign. Just a simple "assign me please" will do. 👍🏻

@surbhiia
Copy link
Contributor Author

surbhiia commented Nov 15, 2023

@surbhiia @LikeTheSalad We chatted about some edge cases, specifically related to use cases where the client just pumps data (like a POST) and doesn't bother to read the response. I have put together a test gist that does exactly this: https://gist.github.com/breedx-splk/246962f7d1c180f924a530630a966c9f
...

@breedx-splk Thanks for testing this! I created this issue about it in that repo yesterday and wanted to test this.

Other than getInputStream/getErrorStream, getResponseCode and getResponseMessage are other two methods generally always called. Java instrumentation is closing spans for the latter two as well, as those methods inherently call getInputStream. We're instrumenting the call sites itself and we've not ended spans for the latter two right now. We can perhaps do that but that isn't a good idea as then we will close early in many cases and miss out on any exceptions occurring after that as usually first getResponseCode is checked and based on that either getInputStream/getErrorStream is called and we want to capture any IOExceptions there. Sort of a trade-off between waiting to close in order to capture any possible exceptions vs closing early in order to have a span ended for more possible scenarios. I still think we should go with the former as we can have the correct telemetry reported.

Right now in the readme we've mentioned scheduling of the harvester as a required configuration. We can perhaps make it optional and call out how it affects the behavior. And in future we can figure out how to schedule that too automatically if possible :)

@trask
Copy link
Member

trask commented Nov 15, 2023

We can perhaps do that but that isn't a good idea as then we will close early in many cases and miss out on any exceptions occurring after that

is this any worse than the Java agent instrumentation? (I'm guessing we're probably closing the span early in these cases too?)

@surbhiia
Copy link
Contributor Author

is this any worse than the Java agent instrumentation? (I'm guessing we're probably closing the span early in these cases too?)

@trask No, we're not closing spans early, instead waiting till the end and we're scheduling a thread to periodically look for idle connections and close spans.

@breedx-splk breedx-splk removed the stale label Jun 6, 2024
Comment on lines 31 to 34
static {
activeURLConnections = new ConcurrentHashMap<>();
logger = Logger.getLogger("HttpUrlReplacements");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just move these to the corresponding declaration lines.

@breedx-splk
Copy link
Contributor

Thanks again @surbhiia for circling back on this. I think it's looking like a great start. I had a few comments, but I think it's close.

surbhiia and others added 3 commits June 7, 2024 11:07
…ntelemetry/instrumentation/library/httpurlconnection/internal/HttpUrlConnectionSingletons.java


Update instrumentation name

Co-authored-by: jason plumb <75337021+breedx-splk@users.noreply.github.com>
@surbhiia
Copy link
Contributor Author

surbhiia commented Jun 7, 2024

Thanks @breedx-splk for such a detailed review! I've addressed your comments now. Do take a look and let me know if anything seems missing. :)

@breedx-splk
Copy link
Contributor

@open-telemetry/android-approvers Please have a look!

Copy link
Contributor

@bidetofevil bidetofevil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there tests forthcoming?


> The minimum supported Android SDK version is 21, though it will also instrument APIs added in the Android SDK version 24 when running on devices with API level 24 and above.

> If your project's minSdk is lower than 26, then you must enable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Desugaring is necessary for minSdk 24 now because of the dependency on the semconv module. You may just want to point to the reference in main README instead instead of putting this and the AGP requirements here?

Copy link
Contributor Author

@surbhiia surbhiia Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think of that, but I was/am more inclined to keep this here as this module can be used separately to the main module. Customers do not need to navigate to and read the other readme.

If not, maybe we can pull out the similar text mentioned in main readme to a separate section in main readme itself, something like "Critical Notes"(there is a slight more addition to that we can make - for AGP < 8.3.0, the property to be used for correct dexing is different). And link to that section from this readme. But then, all the incremental callouts that would go in that section might not apply to this module always.

Which one sounds better to you? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a similar duplication in the Java sem conv repo earlier but was advised to put the details in one place so if we need to update, we didn't have to do it multiple places. But I think both ways have advantages, so it's up to you which one you prefer. I think merging as is and figuring it out later is perfectly reasonable.


// Time (ms) to wait before assuming that an idle connection is no longer
// in use and should be reported.
private static final long CONNECTION_INACTIVITY_TIMEOUT = 10000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to include the unit in the constant name to make it clear, especially since OTel is mostly in nanoseconds?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, 10s is actually not a very long period for mobile apps - requests routinely exceed this. If this value is looking at the time between the connection was first attempted to when it's checked, this will timeout legit requests that are still going.

Do we want to up this value substantially, or perhaps do this timeout in a different way?

Copy link
Contributor Author

@surbhiia surbhiia Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, 10s is the time for which connection has been found to be idle after becoming harvestable (last time it successfully read from the server). Every time any successful read API (getContent etc or even the inputStream individual read APIs) is triggered the last seen time is updated. So, the idea was - only those connections that get stuck in between - not reading anything from server for 10s, no throwable, not disconnected - would be picked up after 10s of being idle.

But I see your point, there could be cases when there could be longer than 10s delay between two read APIs (specially the consecutive inputStream.read calls ) due to various reasons.

I did plan on discussing about making it configurable, so consumers can set it closer to their connection_timeout setting or based on their application expected behavior. Will bring it up in Android sig meeting, we can perhaps add that in next PRs along with making the span name extractor function configurable.

We can still keep 10s as the default or increase it if you think it's very less. The downside of going with default for the consumer right now, would be that they will miss capturing any throwable that would have occurred from consecutive reads from the connection, as the connection span will get reported as successful by the harvester after 10s of being idle. There's always option for consumer to not schedule this thread if they have good implementation where they finally call disconnect on the connection. That will always make sure the span gets reported. The only span that will not be closed for them then would be for cases when the connection is legit stuck. If they're handling such scenarios too in their application code nicely, by disconnecting, they don't have to worry about the harvester. For scenarios, when they're not able to disconnect a stuck connection, they will need to schedule harvester with a timeout of their choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion, aligning this with the default would be nice, and making it configurable would be even nicer. That said, it's not blocking, so feel free to take care of this as you please.

@bidetofevil
Copy link
Contributor

Hi @surbhiia - sorry for taking so long for adding my comments. My main question is regarding the timeout value - 10 seconds doesn't seem sufficient if it's for the duration of the entire request-response cycle for an Android app. What would you think about a higher value, or maybe make it configurable?

@surbhiia
Copy link
Contributor Author

Are there tests forthcoming?

Yes! I have the test module ready already in my local in the same branch. As soon as this PR is merged, will put up another PR for that.

@surbhiia
Copy link
Contributor Author

Hi @surbhiia - sorry for taking so long for adding my comments. My main question is regarding the timeout value - 10 seconds doesn't seem sufficient if it's for the duration of the entire request-response cycle for an Android app. What would you think about a higher value, or maybe make it configurable?

No problem at all! I'm glad it's getting such good reviews. :) I've answered your question about timeout in previous comment. We can continue the discussion further in our Android Sig this week.

Copy link
Contributor

@bidetofevil bidetofevil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Comments added for potential future work, but nothing blocking.

Copy link
Contributor

@LikeTheSalad LikeTheSalad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@LikeTheSalad LikeTheSalad merged commit 475f08f into open-telemetry:main Jun 25, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants