Overly detailed User-Agent #4265

Closed
jwilk opened this issue Feb 4, 2017 · 11 comments
Labels
auto-locked Outdated issues that have been locked by automation

Comments

@jwilk
Contributor

jwilk commented Feb 4, 2017

pip's User-Agent field looks like this:

pip/9.0.1 {"cpu":"i686","distro":{"id":"jessie","libc":{"lib":"glibc","version":"2.19"},"name":"Debian GNU/Linux","version":"8"},"implementation":{"name":"CPython","version":"3.4.2"},"installer":{"name":"pip","version":"9.0.1"},"openssl_version":"OpenSSL 1.0.1t  3 May 2016","python":"3.4.2","system":{"name":"Linux","release":"3.16.0-4-686-pae"}}

That's a lot of information about my system.
PyPI has no business knowing my OpenSSL version, my libc version, or my kernel version.
Please trim this down.
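
To make the structure of that header easier to see, here is a minimal sketch that splits the string quoted above into its product token and its JSON payload. The helper name `split_user_agent` is made up for illustration and is not part of pip; the sketch assumes the header always has the shape `pip/<version> <json>` seen in this report.

```python
import json

# User-Agent string copied verbatim from the report above.
USER_AGENT = (
    'pip/9.0.1 {"cpu":"i686","distro":{"id":"jessie","libc":{"lib":"glibc",'
    '"version":"2.19"},"name":"Debian GNU/Linux","version":"8"},'
    '"implementation":{"name":"CPython","version":"3.4.2"},'
    '"installer":{"name":"pip","version":"9.0.1"},'
    '"openssl_version":"OpenSSL 1.0.1t  3 May 2016","python":"3.4.2",'
    '"system":{"name":"Linux","release":"3.16.0-4-686-pae"}}'
)


def split_user_agent(ua):
    """Split 'product/version <json>' into (product, version, details).

    Hypothetical helper: assumes a single space separates the product
    token from a JSON object, as in the string above.
    """
    product_token, _, payload = ua.partition(" ")
    product, _, version = product_token.partition("/")
    return product, version, json.loads(payload)


product, version, details = split_user_agent(USER_AGENT)

# List every top-level field that the client reports about itself.
for key in sorted(details):
    print(key, "=", details[key])
```

Running it lists the top-level fields (`cpu`, `distro`, `implementation`, `installer`, `openssl_version`, `python`, `system`), which is the set of values under discussion in this issue.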

@dstufft
Member

dstufft commented Feb 4, 2017

This information is used to provide metrics on what folks are using, so that we can make informed decisions about where we draw lines of support for a variety of features. For example, you called out the libc version, and that is used when deciding where to draw the lines of support for features like manylinux1 and such. Removing data from this would make it harder to progress Python packaging (and is unlikely to actually be useful to anyone else, particularly since pip makes it difficult to accidentally send this information in cleartext).

@dstufft
Member

dstufft commented Mar 31, 2017

I'm going to close this since I don't think we're going to be changing this at this time.

@lkarsten

lkarsten commented Sep 8, 2018

Hello.

I came across this today, and I'm very surprised that this is the case. This is a terrible default setting, and it should definitely be changed.

At the very least it must be clear from the --help text that any command run will leak details to third parties that you have no control over.

If you need analytics, make it opt in. Ask people if it is ok that this is sent. You have enough traffic that sampled traffic will be good enough for this use.

Please reconsider this horrible practice. Make it right.

@supakeen

supakeen commented Sep 8, 2018

I don't agree with @lkarsten's vibe: you are installing software from that exact same 3rd party you have no control over.

However, I do agree with the general idea. PyPA probably wants this information because it is useful, but defaulting to opted out, with a question on pip install or first run asking whether you want to submit your system information, is probably a good idea and seems to be the way many package managers and distros tackle this same problem.

@lkarsten

lkarsten commented Sep 8, 2018

@pfmoore Hi. I think you're reading too much into my comment. Also, since you said you don't care about discussing it or my opinion on it, why are you chiming in at all? Not very constructive. :-(

I'm happy to contribute a patch fixing this if it is wanted.

@Ekultek

Ekultek commented Sep 8, 2018

Cool can’t wait to use this to my advantage. Thanks for ignoring the fact that you’re releasing information into the wild that could potentially compromise a computer. That’s a ridiculous amount of information to put into a UA.

@supakeen

supakeen commented Sep 8, 2018

@Ekultek How is this released into the wild? I assume PyPI (which pip connects to) still uses TLS so that would need to be subverted as well. Or do you mean that anyone can have this information if someone installs from a direct URL?

@dstufft
Member

dstufft commented Sep 8, 2018

So there are two potential arguments I see here.

One is some nebulous concern about security. To the absolute best of my knowledge there is absolutely no real security concern here other than FUDish nonsense. If you disagree, then I challenge you to come up with a coherent threat model where this information can be used to attack someone and where removing it would prevent the attack-- remembering that the primary use case for pip contacting a remote service is that it's about to download and execute arbitrary Python code-- thus anyone in a position to see this information, to my knowledge, is already capable of directly executing arbitrary Python on a user's machine if they were nefarious. Present a means of attack where that isn't true, or stop with the FUD.

The other is that of privacy concerns, that some of this information is more than some subset of users would want to share with a third party service, and thus those users should have a means to control what information is sent. To my mind, this is the only concern that's actually valid here. If someone wants to make a privacy related argument, they're free to do so, but like any feature they'll have to make a case not only that someone would want this, but that enough people want it to add the maintenance burden of a new feature, and that the other possible mechanisms for solving the problem are not suitable and why.

I'm going to otherwise ignore any security related arguments as well as the implication that people are going to hack other commenters until someone comes forward with a meaningful, coherent threat model where a reduction in this information helps.

@lkarsten

lkarsten commented Sep 9, 2018

@dstufft Hi. Thanks for the comprehensive answer. My concern is the privacy aspect. I've been using pip for many years, and I had no idea this was in place. I expect this to be the case for the majority of other pip users as well. I just assumed that some of the supposedly many eyes reviewing open source would have caught this much earlier. Perhaps this latest interest is those very eyes, we'll see.

The privacy concern is that pip is sharing data pip doesn't own, by default, for some perceived benefit that the end user doesn't get any part of.

This is the same tracking/ad argument we've seen a thousand times before, and probably will see a lot more in the future. Standards are slipping; "Everyone is doing it, so we can too". It is so easy, just collect and store everything! It is still wrong. It must be possible to run a client somewhere without the computer telling everyone what you are doing.

I'm curious what kind of reports you are doing with this data. Is it actually used, or is it just kept around until someone wants to play with data mining? For how long are you storing this information, and have you thought about what happens when you lose it?

https://github.com/pypa/pip/blob/master/src/pip/_internal/download.py#L67 looks to be the code that collects the information. What possible use do you have for the exact kernel version I'm running?

You collected this data without user consent. You did not tell pip users that it was being done. The right thing to do here is to make it opt-in (ask once), and to publicly state what steps you will take to clean this unnecessary data from your logs.

Please make this right.
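
For readers who want to see how a string like this can be assembled, here is a rough sketch that uses only the Python standard library. It is not pip's code from the file linked above: the function name `build_user_agent`, the installer name `example-installer`, and the exact set of fields are assumptions made for illustration, loosely mirroring the keys visible in the User-Agent from the original report.

```python
import json
import platform
import ssl


def build_user_agent(installer_name="example-installer", installer_version="0.0"):
    """Assemble a 'name/version <json>' string describing this machine.

    Hypothetical sketch: the field names imitate the header quoted in
    this issue, but the logic is not taken from pip.
    """
    data = {
        "installer": {"name": installer_name, "version": installer_version},
        "python": platform.python_version(),
        "implementation": {
            "name": platform.python_implementation(),
            "version": platform.python_version(),
        },
        # Kernel name and release, e.g. "Linux" and "3.16.0-4-686-pae".
        "system": {"name": platform.system(), "release": platform.release()},
        "cpu": platform.machine(),
        "openssl_version": ssl.OPENSSL_VERSION,
    }

    # libc_ver() returns empty strings when it cannot detect a libc,
    # so only report it when something was found.
    lib, libc_version = platform.libc_ver()
    if lib:
        data["distro"] = {"libc": {"lib": lib, "version": libc_version}}

    payload = json.dumps(data, separators=(",", ":"), sort_keys=True)
    return "{}/{} {}".format(installer_name, installer_version, payload)


print(build_user_agent())
```

Printing the result on your own machine shows which values a client built this way would reveal, including the kernel release asked about above.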

@dstufft
Member

dstufft commented Sep 13, 2018

I don't have time right now for a full response, but I wanted to quickly answer this:

> I'm curious what kind of reports you are doing with this data. Is it actually used, or is it just kept around until someone wants to play with data mining? For how long are you storing this information, and have you thought about what happens when you lose it?
>
> https://github.com/pypa/pip/blob/master/src/pip/_internal/download.py#L67 looks to be the code that collects the information. What possible use do you have for the exact kernel version I'm running?

For PyPI itself, we store this information in our public BigQuery database to allow people to query it. That table is effectively just the data you see in the UA, plus the URL, a timestamp, some data about the file you're downloading, the TLS protocol and cipher that was used, as well as the country code for the country that the IP address being used geolocated to.

Thus there is nothing linking this data back to a specific person with any more specificity than "someone in this country has these values".

As far as what other people are querying this data for, there are a wide range of uses, some I know off the top of my head:

  • Deciding what versions of Python to support.
  • Prioritizing different platforms for producing wheels and/or supporting them at all.
  • Deciding what kernel or libc features a C library can depend on existing.
  • Deciding where to focus efforts in terms of supporting certain platforms/features/etc in things like manylinux support.
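
As an illustration of the kind of question listed above, here is a small sketch that aggregates a handful of download records locally. The rows are invented sample data, not real PyPI statistics, and the field names and helper functions (`share_by_python_minor`, `share_with_min_libc`) are assumptions for this example rather than the schema of the actual BigQuery table.

```python
from collections import Counter

# Hypothetical sample rows; real data would come from the public dataset.
ROWS = [
    {"python": "3.4.2", "libc": "2.19", "country_code": "DE"},
    {"python": "2.7.13", "libc": "2.24", "country_code": "US"},
    {"python": "3.6.1", "libc": "2.24", "country_code": "US"},
    {"python": "3.4.5", "libc": "2.12", "country_code": "FR"},
]


def minor_version(version):
    """Reduce '3.4.2' to '3.4' so patch releases are counted together."""
    return ".".join(version.split(".")[:2])


def share_by_python_minor(rows):
    """Return the fraction of downloads per Python minor version."""
    counts = Counter(minor_version(row["python"]) for row in rows)
    total = sum(counts.values())
    return {version: count / total for version, count in counts.items()}


def share_with_min_libc(rows, minimum):
    """Return the fraction of rows whose libc version is >= minimum.

    Versions are compared as integer tuples, so (2, 19) > (2, 5).
    """
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))

    matching = sum(1 for row in rows if as_tuple(row["libc"]) >= minimum)
    return matching / len(rows)


print(share_by_python_minor(ROWS))
print(share_with_min_libc(ROWS, (2, 17)))
```

With real data, the first function answers "which Python versions are still worth supporting" and the second answers "how many users would a given libc baseline exclude"; the `(2, 17)` threshold here is an arbitrary example value.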

@lock

lock bot commented May 30, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label May 30, 2019
@lock lock bot locked as resolved and limited conversation to collaborators May 30, 2019