yarn classic: Offline mirror collision #786
Conversation
Based on the commit message (and a missing issue) it is difficult to understand why we're providing a hotfix for what at first glance appears to be a project configuration error. I'd suggest creating an issue first with all the necessary context and a reproducer to shed more light on this PR, since the commit message also hints at a genuine bug in our tool.
I changed the logic completely: the exception is now suppressed only when two identical registry packages collide. That should not be a problem in my opinion. I would rather finish this problem here and now than create a temporary solution with a promise of a proper fix in the future.
Heavy +1 to this! I'll have a look some time this week (earlier rather than later).
all_tarballs_counter: Counter[str] = Counter()
registry_tarballs_counter: Counter[str] = Counter()

for p in packages:
    if isinstance(p, (RegistryPackage, UrlPackage)):
        tarball_name = get_tarball_mirror_name(p.url)
        tarballs.append(tarball_name)
        all_tarballs_counter[tarball_name] += 1
        if isinstance(p, RegistryPackage):
            registry_tarballs_counter[tarball_name] += 1
I'm afraid you don't leverage the benefit of counters this way, since you're using them as plain dictionaries. By initializing the counter with an iterable, it'll do the heavy lifting for you.
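To illustrate the reviewer's point, here is a minimal sketch of the difference; the tarball names are made up for the example, and the real code would derive them from the package objects:

```python
from collections import Counter

# Hypothetical tarball names standing in for the real package data.
all_tarballs = ["foo-1.0.0.tgz", "bar-2.0.0.tgz", "foo-1.0.0.tgz"]
registry_tarballs = ["foo-1.0.0.tgz", "foo-1.0.0.tgz"]

# Initializing from an iterable lets Counter do the counting itself,
# instead of incrementing entries one by one in a loop.
all_counter = Counter(all_tarballs)
registry_counter = Counter(registry_tarballs)

assert all_counter["foo-1.0.0.tgz"] == 2
assert all_counter["bar-2.0.0.tgz"] == 1
assert registry_counter["foo-1.0.0.tgz"] == 2
```

In the real function the iterables would be generator expressions over the filtered package list, so no manual `+= 1` bookkeeping is needed.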
pytest.param(
    [
        RegistryPackage(
            name="foo",
            version="1.0.0",
            url="https://registry.yarnpkg.com/same/-/same-1.0.0.tgz",
        ),
        RegistryPackage(
            name="bar",
            version="1.0.0",
            url="https://registry.yarnpkg.com/same/-/same-1.0.0.tgz",
        ),
    ],
    id="registry_packages",
So I'm confused here. This definitely seems like it's going to cover the alias problem, but we'd be neglecting other (potentially malicious) registry packages that yield the same tarball name as a regular dependency. IIUC, that could lead to the "other" dependency silently overwriting the tarball from a "legit" dependency in the offline mirror, wouldn't it? I guess I don't understand the NPM ecosystem well enough to make an educated call on whether this fix is preferable to one scanning package aliases.
What do you mean by regular dependency?
A legit one... I was merely wondering if and why a legitimate dependency would have a collision in the tarball name. But then again, it is possible even with legit dependencies, and this proposed solution is somewhat greedy: it not only takes aliases into consideration and allows them, it also waives other registry package conflicts. That's all.
# flag if there is at least 1 non-registry package OR a cross-type collision
if non_registry_count >= 1:
    duplicate_tarballs.append(f"{name} ({count}x)")

if len(duplicate_tarballs) > 0:
    raise PackageManagerError(f"Duplicate tarballs detected: {', '.join(duplicate_tarballs)}")
Although preexisting, as I'm looking at it, this isn't a very useful error message: the user may not easily know which package the tarball in the error message belongs to, especially with registry package types.
c = Counter(tarballs)
duplicate_tarballs = [f"{name} ({count}x)" for name, count in c.most_common() if count > 1]
# check for duplicates, considering cross-type collisions
duplicate_tarballs = []
for name, count in all_tarballs_counter.items():
    if count <= 1:
        continue

    registry_count = registry_tarballs_counter.get(name, 0)
    non_registry_count = count - registry_count

    # flag if there is at least 1 non-registry package OR a cross-type collision
    if non_registry_count >= 1:
        duplicate_tarballs.append(f"{name} ({count}x)")
Despite my other comment on my uncertainty about the correctness of this fix (i.e. ignoring tarball collisions for registry packages completely), the whole logic in this function can be replaced and simplified with the following:
diff --git a/cachi2/core/package_managers/yarn_classic/main.py b/cachi2/core/package_managers/yarn_classic/main.py
index d445c01a..7478612f 100644
--- a/cachi2/core/package_managers/yarn_classic/main.py
+++ b/cachi2/core/package_managers/yarn_classic/main.py
@@ -1,7 +1,6 @@
import logging
-from collections import Counter
from pathlib import Path
-from typing import Any, Iterable
+from typing import Any, Iterable, Union
from cachi2.core.errors import PackageManagerError, PackageRejected
from cachi2.core.models.input import Request
@@ -201,41 +200,31 @@ def _verify_corepack_yarn_version(source_dir: RootedPath, env: dict[str, str]) -
log.info("Processing the request using yarn@%s", installed_yarn_version)
+def _get_tarball_mirror_name(obj: Union[RegistryPackage, UrlPackage, GitPackage]) -> str:
+ if isinstance(obj, GitPackage):
+ return get_git_tarball_mirror_name(obj.url)
+ return get_tarball_mirror_name(obj.url)
+
+
def _verify_no_offline_mirror_collisions(packages: Iterable[YarnClassicPackage]) -> None:
"""Verify that there are no duplicate tarballs in the offline mirror."""
- all_tarballs_counter: Counter[str] = Counter()
- registry_tarballs_counter: Counter[str] = Counter()
-
- for p in packages:
- if isinstance(p, (RegistryPackage, UrlPackage)):
- tarball_name = get_tarball_mirror_name(p.url)
- all_tarballs_counter[tarball_name] += 1
- if isinstance(p, RegistryPackage):
- registry_tarballs_counter[tarball_name] += 1
-
- elif isinstance(p, GitPackage):
- tarball_name = get_git_tarball_mirror_name(p.url)
- all_tarballs_counter[tarball_name] += 1
+ _pkgs = [p for p in packages if isinstance(p, (RegistryPackage, UrlPackage, GitPackage))]
+ tarball_collisions: dict[str, list] = {}
+ for pkg in _pkgs:
+ tarball_name = _get_tarball_mirror_name(pkg)
+ if tarball_collisions.get(tarball_name) is None:
+ tarball_collisions[tarball_name] = [pkg]
else:
- # file, link, and workspace packages are not copied to the offline mirror
- continue
+ tarball_collisions[tarball_name].append(pkg)
- # check for duplicates, considering cross-type collisions
- duplicate_tarballs = []
- for name, count in all_tarballs_counter.items():
- if count <= 1:
+ for tarball_name, collisions in tarball_collisions.items():
+ if len(collisions) < 2 or all((isinstance(obj, RegistryPackage) for obj in collisions)):
continue
- registry_count = registry_tarballs_counter.get(name, 0)
- non_registry_count = count - registry_count
-
- # flag if there is at least 1 non-registry package OR a cross-type collision
- if non_registry_count >= 1:
- duplicate_tarballs.append(f"{name} ({count}x)")
-
- if len(duplicate_tarballs) > 0:
- raise PackageManagerError(f"Duplicate tarballs detected: {', '.join(duplicate_tarballs)}")
-
+ names = ', '.join((obj.name for obj in collisions))
+ raise PackageManagerError(
+ f"Offline mirror tarball collision found with dependencies: {names}"
+ )
# References
# [yarn_classic_trait]: https://github.com/yarnpkg/berry/blob/13d5b3041794c33171808fdce635461ff4ab5c4e/packages/yarnpkg-core/sources/Project.ts#L434
It's a little bit shorter and, more importantly, it is more expressive than plain counters because it'll report the actual package names instead of just tarball names; see my other comment on the pre-existing logic.
The diff above also uncovered an interesting bug we have in the Yarn v1 code. If you try running it with the integration tests (this one: https://github.com/cachito-testing/cachi2-yarn/tree/offline-mirror-collision) without first modifying the test, you'll see an error on mismatching error strings. The important bit in that error log will be this:
PackageManagerError: Offline mirror tarball collision found with dependencies: package-b, package-b
weird huh? Why is it reporting the same package twice? Then you look at the input to the collision detecting function and you'll see:
ipdb> p packages
[WorkspacePackage(path=<RootedPath root='/var/tmp/cachi2-yarn/tmpfqlgte43.cachi2-source-copy' subpath='.'>, name='yarn-test-project', version='1.0.0', integrity=None, dev=False), UrlPackage(url='https://github.com/cachito-testing/cachi2-yarn/raw/refs/heads/offline-mirror-collision/package-a/example.tgz#8622b0e42e8fc610d63f95ca2127c1a954f1f4bb', name='package-a', version='1.0.0', integrity=None, dev=False), UrlPackage(url='https://github.com/cachito-testing/cachi2-yarn/raw/refs/heads/offline-mirror-collision/package-b/example.tgz#e457c96791a275050811ad23720ff770185a5c78', name='package-a', version='1.0.0', integrity=None, dev=False)]
so there must be something shady going on between our package factory function and this collision function because we used the same package name for both URL packages despite the URL being different.
Drive-by chime-in:
If you replace a regular dictionary with a defaultdict primed with list here:
tarball_collisions: dict[str, list] = {}
you won't need to explicitly check if a key is present, just append.
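A small sketch of that suggestion, using hypothetical tarball/package pairs in place of the real package objects:

```python
from collections import defaultdict

# (tarball_name, package_name) pairs standing in for real packages.
packages = [
    ("pkg-a.tgz", "package-a"),
    ("pkg-a.tgz", "package-b"),  # collides with the first entry
    ("pkg-c.tgz", "package-c"),
]

# Priming defaultdict with list means a missing key is created as an
# empty list on first access, so no membership check is needed.
tarball_collisions: defaultdict[str, list[str]] = defaultdict(list)
for tarball_name, pkg_name in packages:
    tarball_collisions[tarball_name].append(pkg_name)

assert tarball_collisions["pkg-a.tgz"] == ["package-a", "package-b"]
assert tarball_collisions["pkg-c.tgz"] == ["package-c"]
```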
weird huh? Why is it reporting the same package twice? Then you look at the input to the collision detecting function and you'll see:
So I ran fetch-deps multiple times, and the output changes every time.
Explanation:
For certain packages we read the "real" name from the actual tarball - we unpack the tarball, read package.json, and get the "real" name from that, e.g - https://github.com/containerbuildsystem/cachi2/blob/main/cachi2/core/package_managers/yarn_classic/resolver.py#L264
Since both URL packages have the same tarball name, the second one overwrites the first one and there will be only one tarball in the offline mirror. Either package-a and its tarball or package-b and its tarball. That's why they share the same name.
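The overwrite behavior described above can be modeled with a toy sketch (this is an illustration of the mechanism, not the real resolver code; the filenames and names are made up):

```python
# Toy model of the offline mirror: two URL packages map to the same
# mirror filename, so the later download overwrites the earlier one.
mirror: dict[str, str] = {}

downloads = [
    ("example.tgz", "package-a"),  # name stored inside the tarball
    ("example.tgz", "package-b"),  # same mirror filename -> clobbers it
]

for mirror_name, real_name in downloads:
    mirror[mirror_name] = real_name  # second write replaces the first

# Whichever tarball was written last survives, so reading package.json
# back from the mirror yields the same name for both lockfile entries,
# which is why the error reported "package-b, package-b".
assert mirror["example.tgz"] == "package-b"
assert len(mirror) == 1
```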
So I run the fetch-deps multiple times, and the output is changing all the time.
Expected, the order in which this gets processed IMO isn't strictly defined in our code.
Since both URL packages have the same tarball name, the second one overwrites the first one and there will be only one tarball in the offline mirror. Either package-a and its tarball or package-b and its tarball. That's why they share the same name.
Spot on! I have to say I doubted that the correlation would imply causality like this, but you're absolutely right, that's the reason! In that case, I'm afraid our code is incorrect. We can't first download/overwrite artifacts already in the offline cache and only THEN use that information to extract package names, leading to an incorrect data representation on our side. We have to detect the collisions sooner, before they happen, not after.
So, based on output from pyarn, the packages in question (from the linked issue) are represented as follows:
alias: wrap-ansi-cjs
checksum: sha512-YVGIj2kamLSTxw6NsZjoBxfSwsn0ycdesmc4p+Q21c5zPuZ1pl+NfxVdxPtdHvmNVOQ6XSYG4AUtyt/Fi7D16Q==
dependencies: {'ansi-styles': '^4.0.0', 'string-width': '^4.1.0', 'strip-ansi': '^6.0.0'}
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>
name: wrap-ansi
path: None
relpath: None
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-7.0.0.tgz#67e145cff510a6a6984bdf1152911d69d2eb9e43
version: 7.0.0
alias: None
checksum: sha512-r6lPcBGxZXlIcymEu7InxDMhdW0KDxpLgoFLcguasxCaJ/SOIZwINatK9KY/tf+ZrlywOKU0UDj3ATXUBfxJXA==
dependencies: {'ansi-styles': '^4.0.0', 'string-width': '^4.1.0', 'strip-ansi': '^6.0.0'}
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>
name: wrap-ansi
path: None
relpath: None
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-6.2.0.tgz#e9393ba07102e6c91a3b221478f0257cd2856e53
version: 6.2.0
alias: None
checksum: sha512-YVGIj2kamLSTxw6NsZjoBxfSwsn0ycdesmc4p+Q21c5zPuZ1pl+NfxVdxPtdHvmNVOQ6XSYG4AUtyt/Fi7D16Q==
dependencies: {'ansi-styles': '^4.0.0', 'string-width': '^4.1.0', 'strip-ansi': '^6.0.0'}
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>
name: wrap-ansi
path: None
relpath: None
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-7.0.0.tgz#67e145cff510a6a6984bdf1152911d69d2eb9e43
version: 7.0.0
alias: None
checksum: sha512-si7QWI6zUMq56bESFvagtmzMdGOtoxfR+Sez11Mobfc7tm+VkUckk9bW2UeffTGVUbOksxmSw0AA2gs8g71NCQ==
dependencies: {'ansi-styles': '^6.1.0', 'string-width': '^5.0.1', 'strip-ansi': '^7.0.1'}
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>
name: wrap-ansi
path: None
relpath: None
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-8.1.0.tgz#56dc22368ee570face1b49819975d9b9a5ead214
version: 8.1.0
which means that we could either base our detection logic on the alias field OR (maybe more preferably) define our package models as frozen and hashable, so that we could construct a set from the results yielded by _YarnClassicPackageFactory and perform basic filtering early on.
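A rough sketch of the frozen-and-hashable idea; plain frozen dataclasses stand in here for the project's actual package models, so field names are illustrative only:

```python
from dataclasses import dataclass


# Frozen dataclasses are hashable by default, so identical entries
# collapse naturally when collected into a set.
@dataclass(frozen=True)
class RegistryPackage:
    name: str
    version: str
    url: str


url = "https://registry.yarnpkg.com/same/-/same-1.0.0.tgz"
pkg_foo = RegistryPackage("foo", "1.0.0", url)
pkg_bar = RegistryPackage("bar", "1.0.0", url)
pkg_foo_dup = RegistryPackage("foo", "1.0.0", url)

# Exact duplicates are filtered out for free; genuinely different
# packages sharing a tarball URL survive and can still be flagged
# by the collision check later.
unique = {pkg_foo, pkg_bar, pkg_foo_dup}
assert len(unique) == 2
```

With pydantic models the equivalent would be enabling frozen model config, which likewise makes instances hashable.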
@@ -203,20 +203,36 @@ def _verify_corepack_yarn_version(source_dir: RootedPath, env: dict[str, str]) -

def _verify_no_offline_mirror_collisions(packages: Iterable[YarnClassicPackage]) -> None:
    """Verify that there are no duplicate tarballs in the offline mirror."""
    tarballs = []
    all_tarballs_counter: Counter[str] = Counter()
I think the commit message needs a little more information on why registry packages are sort of exempt from the collision check (when the collision is only among the same type) while other packages are not. I'm saying this because the issue you linked only describes the problem from the user perspective.
I will improve on that, but it is still a draft, so I didn't bother with it yet.
tarball_collisions[tarball_name].append(p)

for tarball_name, collisions in tarball_collisions.items():
    if all(pkg == collisions[0] for pkg in collisions):
Did you mean that all packages in a collisions list must be identical?
Yes, otherwise raise an exception
Maybe a simple function wrapper around the check would be more readable 🤔
[non-blocking]
If you decide reworking this consider making collisions a set and then checking that it is a unit-set.
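The unit-set idea could be sketched roughly like this; the tuples stand in for the (hashable) package models, and the helper name is made up:

```python
# "All colliding packages are identical" expressed as a unit-set check:
# build a set from the collision group and verify it has exactly one
# distinct element. Requires the package models to be hashable.
def is_safe_collision(collisions: list) -> bool:
    """Return True when every entry in the collision group is identical."""
    return len(set(collisions)) == 1


# Identical packages resolving to the same tarball: safe.
assert is_safe_collision([("foo", "1.0.0"), ("foo", "1.0.0")])
# Different packages resolving to the same tarball: a real collision.
assert not is_safe_collision([("foo", "1.0.0"), ("bar", "1.0.0")])
```

This also reads a bit more directly than comparing every element against `collisions[0]`.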
If you decide reworking this consider making collisions a set and then checking that it is a unit-set.
Maybe a slightly preferred alternative that may pay off in the long run would be to start the package filtering early on, in the _YarnPackageFactory factory: by turning the models into hashables, combined with some minor tweaks to this function, we'd gain a native, fast filtering mechanism while keeping this function only as a safety net for cross-type collisions.
    f"Tarball collision in the offline mirror: {tarball_name} ({len(collisions)}x)"
)
I'm not sure this is an improvement over the existing solution: you're still only reporting a tarball name instead of the package that caused the collision (actually ALL packages that happen to be in a collision), and that doesn't seem to be a helpful user-facing error message.
Do we want to print package names even though it does not work as it should?
#786 (comment)
Well, if we know it doesn't work as it's supposed to, then we should IMO fix that, because emitting a seemingly unhelpful error message won't make the underlying problem go away.
I don't see too much value in that. Users can find the reported tarball name in the yarn.lock just as easily as the package name.
So, just to clarify my understanding, I'll paste an example when processing kiali. The
We're assuming it is safe because we recognize it is exactly the same tarball, source URL and hash included. Even if we had a GitHub tarball, for example, we would still verify all the details before concluding the "collision" was safe (that is, we are downloading the exact same tarball from the same place twice). IIUIC, this solution LGTM. I'd just expand the docstring of the function to make this behavior clear; I'll certainly forget it soon 😅
Replace the previous duplicate tarball check with a more accurate collision detection mechanism. The new implementation verifies that all packages mapped to the same tarball name in the offline mirror are actually identical. The previous Counter-based approach only checked for duplicate filenames without validating package equivalence. The new defaultdict-based solution properly detects cases where different packages would overwrite each other's tarballs in the mirror. Update test cases to cover valid duplicate scenarios (identical packages) and invalid collisions (different packages with same tarball name). Adjust error messages to better reflect the nature of detected issues. Signed-off-by: Michal Šoltis <msoltis@redhat.com>
Added a more descriptive docstring.
Maintainers will complete the following section
Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:
/ok-to-test
(as is the standard for Pipelines as Code)