
yarn classic: Offline mirror collision #786

Open · wants to merge 1 commit into base: main
Conversation

slimreaper35
Member

@slimreaper35 slimreaper35 commented Jan 17, 2025

Maintainers will complete the following section

  • Commit messages are descriptive enough
  • Code coverage from testing does not decrease and new code is covered
  • Docs updated (if applicable)
  • Docs links in the code are still valid (if docs were updated)

Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:

@eskultety
Member

Based on the commit message (and the lack of a linked issue), it is difficult to understand why we're providing a hotfix for a problem that at first glance appears to be a project configuration error. Since the commit message also hints at a genuine bug in our tool, I'd suggest first creating an issue with all the necessary context and a reproducer to shed more light on this PR.

@slimreaper35 slimreaper35 changed the title from "yarn classic: Drop offline mirror collision exception raise" to "yarn classic: Offline mirror collision" on Jan 20, 2025
@slimreaper35
Member Author

I changed the logic completely: the exception is now suppressed only when two identical registry packages collide. That should not be a problem, in my opinion. I would rather finish this problem here and now than create a temporary solution with a promise of a proper fix in the future.

@eskultety
Member

I would rather finish this problem here and now than create a temporary solution with a promise of a proper fix in the future.

Heavy +1 to this! I'll have a look some time this week (sooner rather than later).

Comment on lines 206 to 214
all_tarballs_counter: Counter[str] = Counter()
registry_tarballs_counter: Counter[str] = Counter()

for p in packages:
if isinstance(p, (RegistryPackage, UrlPackage)):
tarball_name = get_tarball_mirror_name(p.url)
tarballs.append(tarball_name)
all_tarballs_counter[tarball_name] += 1
if isinstance(p, RegistryPackage):
registry_tarballs_counter[tarball_name] += 1
Member

I'm afraid you don't leverage the benefit of counters this way, since you're using them as plain dictionaries. If you initialize the counter with an iterable, it will do the heavy lifting for you.
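To illustrate the point, here is a standalone sketch (with made-up tarball names, not the actual cachi2 code) comparing manual increments against initializing the `Counter` from an iterable:

```python
from collections import Counter

tarball_names = ["same-1.0.0.tgz", "other-2.0.0.tgz", "same-1.0.0.tgz"]

# Manual increments treat the Counter as a plain dict:
manual: Counter[str] = Counter()
for name in tarball_names:
    manual[name] += 1

# Initializing from an iterable does the counting in one step:
concise = Counter(tarball_names)

assert manual == concise
assert concise["same-1.0.0.tgz"] == 2
```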

Comment on lines 256 to 269
pytest.param(
[
RegistryPackage(
name="foo",
version="1.0.0",
url="https://registry.yarnpkg.com/same/-/same-1.0.0.tgz",
),
RegistryPackage(
name="bar",
version="1.0.0",
url="https://registry.yarnpkg.com/same/-/same-1.0.0.tgz",
),
],
id="registry_packages",
Member

So I'm confused here. This definitely seems like it's going to cover the alias problem, but we'd be neglecting other (potentially malicious) registry packages that yield the same tarball name as a regular dependency. IIUC, that could lead to the "other" dependency silently overwriting the tarball from a "legit" dependency in the offline mirror, wouldn't it? I guess I don't understand the NPM ecosystem well enough to make an educated call on whether this fix is preferable to one that scans package aliases.

Member Author

What do you mean by a regular dependency?

Member

A legit one... I was merely wondering if and why a legitimate dependency would have a collision in the tarball name. But then again, it is possible even with legit dependencies, and this proposed solution is somewhat greedy: it not only takes aliases into consideration and allows them, it also waves through other registry package conflicts. That's all.

# flag if there is at least 1 non-registry package OR a cross-type collision
if non_registry_count >= 1:
duplicate_tarballs.append(f"{name} ({count}x)")

if len(duplicate_tarballs) > 0:
raise PackageManagerError(f"Duplicate tarballs detected: {', '.join(duplicate_tarballs)}")
Member

Although pre-existing, as I look at it, this isn't a very useful error message, because the user may not easily know which package the tarball from the error message belongs to, especially with registry package types.

Comment on lines 218 to 228
c = Counter(tarballs)
duplicate_tarballs = [f"{name} ({count}x)" for name, count in c.most_common() if count > 1]
# check for duplicates, considering cross-type collisions
duplicate_tarballs = []
for name, count in all_tarballs_counter.items():
if count <= 1:
continue

registry_count = registry_tarballs_counter.get(name, 0)
non_registry_count = count - registry_count

# flag if there is at least 1 non-registry package OR a cross-type collision
if non_registry_count >= 1:
duplicate_tarballs.append(f"{name} ({count}x)")

Member

Despite my other comment on my uncertainty about the correctness of this fix (i.e. ignoring tarball collisions for registry packages completely), the whole logic in this function can be replaced and simplified with the following:

diff --git a/cachi2/core/package_managers/yarn_classic/main.py b/cachi2/core/package_managers/yarn_classic/main.py
index d445c01a..7478612f 100644
--- a/cachi2/core/package_managers/yarn_classic/main.py
+++ b/cachi2/core/package_managers/yarn_classic/main.py
@@ -1,7 +1,6 @@
 import logging
-from collections import Counter
 from pathlib import Path
-from typing import Any, Iterable
+from typing import Any, Iterable, Union
 
 from cachi2.core.errors import PackageManagerError, PackageRejected
 from cachi2.core.models.input import Request
@@ -201,41 +200,31 @@ def _verify_corepack_yarn_version(source_dir: RootedPath, env: dict[str, str]) -
     log.info("Processing the request using yarn@%s", installed_yarn_version)
 
 
+def _get_tarball_mirror_name(obj: Union[RegistryPackage, UrlPackage, GitPackage]) -> str:
+    if isinstance(obj, GitPackage):
+        return get_git_tarball_mirror_name(obj.url)
+    return get_tarball_mirror_name(obj.url)
+
+
 def _verify_no_offline_mirror_collisions(packages: Iterable[YarnClassicPackage]) -> None:
     """Verify that there are no duplicate tarballs in the offline mirror."""
-    all_tarballs_counter: Counter[str] = Counter()
-    registry_tarballs_counter: Counter[str] = Counter()
-
-    for p in packages:
-        if isinstance(p, (RegistryPackage, UrlPackage)):
-            tarball_name = get_tarball_mirror_name(p.url)
-            all_tarballs_counter[tarball_name] += 1
-            if isinstance(p, RegistryPackage):
-                registry_tarballs_counter[tarball_name] += 1
-
-        elif isinstance(p, GitPackage):
-            tarball_name = get_git_tarball_mirror_name(p.url)
-            all_tarballs_counter[tarball_name] += 1
+    _pkgs = [p for p in packages if isinstance(p, (RegistryPackage, UrlPackage, GitPackage))]
+    tarball_collisions: dict[str, list] = {}
+    for pkg in _pkgs:
+        tarball_name = _get_tarball_mirror_name(pkg)
+        if tarball_collisions.get(tarball_name) is None:
+            tarball_collisions[tarball_name] = [pkg]
         else:
-            # file, link, and workspace packages are not copied to the offline mirror
-            continue
+            tarball_collisions[tarball_name].append(pkg)
 
-    # check for duplicates, considering cross-type collisions
-    duplicate_tarballs = []
-    for name, count in all_tarballs_counter.items():
-        if count <= 1:
+    for tarball_name, collisions in tarball_collisions.items():
+        if len(collisions) < 2 or all((isinstance(obj, RegistryPackage) for obj in collisions)):
             continue
 
-        registry_count = registry_tarballs_counter.get(name, 0)
-        non_registry_count = count - registry_count
-
-        # flag if there is at least 1 non-registry package OR a cross-type collision
-        if non_registry_count >= 1:
-            duplicate_tarballs.append(f"{name} ({count}x)")
-
-    if len(duplicate_tarballs) > 0:
-        raise PackageManagerError(f"Duplicate tarballs detected: {', '.join(duplicate_tarballs)}")
-
+        names = ', '.join((obj.name for obj in collisions))
+        raise PackageManagerError(
+            f"Offline mirror tarball collision found with dependencies: {names}"
+        )
 
 # References
 # [yarn_classic_trait]:  https://github.com/yarnpkg/berry/blob/13d5b3041794c33171808fdce635461ff4ab5c4e/packages/yarnpkg-core/sources/Project.ts#L434

It's a little bit shorter and, more importantly, more expressive than plain counters, because it reports the actual package names instead of just tarball names; see my other comment on the pre-existing logic.

Member

The diff above also uncovered an interesting bug we have in the Yarn v1 code. If you run it against the integration tests (this one: https://github.com/cachito-testing/cachi2-yarn/tree/offline-mirror-collision) without first modifying the test, you'll see an error about mismatching error strings. The important bit in that error log is this:

PackageManagerError: Offline mirror tarball collision found with dependencies: package-b, package-b

Weird, huh? Why is it reporting the same package twice? Then you look at the input to the collision-detecting function and you see:

ipdb> p packages
[WorkspacePackage(path=<RootedPath root='/var/tmp/cachi2-yarn/tmpfqlgte43.cachi2-source-copy' subpath='.'>, name='yarn-test-project', version='1.0.0', integrity=None, dev=False), UrlPackage(url='https://github.com/cachito-testing/cachi2-yarn/raw/refs/heads/offline-mirror-collision/package-a/example.tgz#8622b0e42e8fc610d63f95ca2127c1a954f1f4bb', name='package-a', version='1.0.0', integrity=None, dev=False), UrlPackage(url='https://github.com/cachito-testing/cachi2-yarn/raw/refs/heads/offline-mirror-collision/package-b/example.tgz#e457c96791a275050811ad23720ff770185a5c78', name='package-a', version='1.0.0', integrity=None, dev=False)]

So there must be something shady going on between our package factory function and this collision function, because we ended up with the same package name for both URL packages despite the URLs being different.

Collaborator

Drive-by chime-in:
If you replace the regular dictionary with a defaultdict primed with list here:

tarball_collisions: dict[str, list] = {}

you won't need to explicitly check whether a key is present; just append.
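A minimal sketch of that suggestion, with made-up tarball and package names standing in for the real data:

```python
from collections import defaultdict

# (tarball_name, package_name) pairs; all names are illustrative only
entries = [
    ("same-1.0.0.tgz", "foo"),
    ("same-1.0.0.tgz", "bar"),
    ("other-2.0.0.tgz", "baz"),
]

# defaultdict(list) creates the empty list on first access,
# so there is no need to check whether the key already exists:
tarball_collisions: defaultdict[str, list[str]] = defaultdict(list)
for tarball_name, package_name in entries:
    tarball_collisions[tarball_name].append(package_name)

assert tarball_collisions["same-1.0.0.tgz"] == ["foo", "bar"]
```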

Member Author

weird huh? Why is it reporting the same package twice? Then you look at the input to the collision detecting function and you'll see:

So I ran fetch-deps multiple times, and the output changes every time.

Explanation:
For certain packages we read the "real" name from the actual tarball: we unpack the tarball, read package.json, and take the "real" name from there, e.g. https://github.com/containerbuildsystem/cachi2/blob/main/cachi2/core/package_managers/yarn_classic/resolver.py#L264

Since both URL packages have the same tarball name, the second one overwrites the first, and there will be only one tarball in the offline mirror: either package-a and its tarball, or package-b and its tarball. That's why they share the same name.
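The overwrite can be sketched with plain files standing in for downloads (hypothetical URLs; the real mirror-naming logic lives in `get_tarball_mirror_name` and is more involved):

```python
import tempfile
from pathlib import Path

# Two distinct upstream URLs whose final path component is identical,
# mirroring the package-a/package-b example above:
urls = [
    "https://example.com/package-a/example.tgz",
    "https://example.com/package-b/example.tgz",
]

mirror = Path(tempfile.mkdtemp())
for i, url in enumerate(urls):
    # If the mirror filename is derived only from the URL basename,
    # both downloads land on the same path:
    target = mirror / url.rsplit("/", 1)[-1]
    target.write_text(f"tarball #{i} from {url}")

# Only one file survives: whichever download happened last.
assert [p.name for p in mirror.iterdir()] == ["example.tgz"]
assert "package-b" in (mirror / "example.tgz").read_text()
```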

Member

@eskultety eskultety Jan 22, 2025

So I run the fetch-deps multiple times, and the output is changing all the time.

Expected; the order in which this gets processed isn't, IMO, strictly defined in our code.

Since both URL packages have the same tarball name, the second one overwrites the first one and there will be only one tarball in the offline mirror. Either package-a and its tarball or package-b and its tarball. That's why they share the same name.

Spot on! I have to say I doubted that the correlation would imply causality like this, but you're absolutely right; that's the reason. In that case, I'm afraid our code is incorrect. We can't first download/overwrite things already in the offline cache like that and only THEN use that information to extract package names, which leads to an incorrect data representation on our side. We have to detect the collisions before they happen, not after.

Member

So, based on output from pyarn the packages in question (from the linked issue) are represented as follows:

alias: wrap-ansi-cjs                                                                                      
checksum: sha512-YVGIj2kamLSTxw6NsZjoBxfSwsn0ycdesmc4p+Q21c5zPuZ1pl+NfxVdxPtdHvmNVOQ6XSYG4AUtyt/Fi7D16Q==
dependencies: {'ansi-styles': '^4.0.0', 'string-width': '^4.1.0', 'strip-ansi': '^6.0.0'}
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>
name: wrap-ansi
path: None
relpath: None
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-7.0.0.tgz#67e145cff510a6a6984bdf1152911d69d2eb9e43
version: 7.0.0

alias: None                                                                                               
checksum: sha512-r6lPcBGxZXlIcymEu7InxDMhdW0KDxpLgoFLcguasxCaJ/SOIZwINatK9KY/tf+ZrlywOKU0UDj3ATXUBfxJXA== 
dependencies: {'ansi-styles': '^4.0.0', 'string-width': '^4.1.0', 'strip-ansi': '^6.0.0'}                 
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>                           
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>     
name: wrap-ansi                                                                                           
path: None                                                                                                
relpath: None                                                                                             
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-6.2.0.tgz#e9393ba07102e6c91a3b221478f0257cd2856e53
version: 6.2.0

alias: None                                                                                               
checksum: sha512-YVGIj2kamLSTxw6NsZjoBxfSwsn0ycdesmc4p+Q21c5zPuZ1pl+NfxVdxPtdHvmNVOQ6XSYG4AUtyt/Fi7D16Q== 
dependencies: {'ansi-styles': '^4.0.0', 'string-width': '^4.1.0', 'strip-ansi': '^6.0.0'}                 
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>                           
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>     
name: wrap-ansi                                                                                           
path: None
relpath: None
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-7.0.0.tgz#67e145cff510a6a6984bdf1152911d69d2eb9e43
version: 7.0.0

alias: None                                          
checksum: sha512-si7QWI6zUMq56bESFvagtmzMdGOtoxfR+Sez11Mobfc7tm+VkUckk9bW2UeffTGVUbOksxmSw0AA2gs8g71NCQ==
dependencies: {'ansi-styles': '^6.1.0', 'string-width': '^5.0.1', 'strip-ansi': '^7.0.1'}                 
from_dict: <bound method Package.from_dict of <class 'pyarn.lockfile.Package'>>
get_path_from_version_specifier: <function Package.get_path_from_version_specifier at 0x7f9d328f5080>
name: wrap-ansi                                                                                           
path: None
relpath: None            
url: https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-8.1.0.tgz#56dc22368ee570face1b49819975d9b9a5ead214
version: 8.1.0

which means that we could either base our detection logic on the alias field OR (maybe preferably) define our package models as frozen and hashable, so that we could construct a set from the results yielded by _YarnClassicPackageFactory and perform basic filtering early on.
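A sketch of the frozen-model idea, using a simplified stand-in for the real package models (only three fields here; the actual models carry more):

```python
from dataclasses import dataclass

# A frozen dataclass is hashable, so exact duplicates collapse
# naturally when the records are collected into a set:
@dataclass(frozen=True)
class Pkg:
    name: str
    version: str
    url: str

pkgs = [
    Pkg("wrap-ansi", "7.0.0", "https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-7.0.0.tgz"),
    Pkg("wrap-ansi", "7.0.0", "https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-7.0.0.tgz"),
    Pkg("wrap-ansi", "6.2.0", "https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-6.2.0.tgz"),
]

unique = set(pkgs)
assert len(unique) == 2  # the exact duplicate is filtered out early
```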

@@ -203,20 +203,36 @@ def _verify_corepack_yarn_version(source_dir: RootedPath, env: dict[str, str]) -

def _verify_no_offline_mirror_collisions(packages: Iterable[YarnClassicPackage]) -> None:
"""Verify that there are no duplicate tarballs in the offline mirror."""
tarballs = []
all_tarballs_counter: Counter[str] = Counter()
Member

I think the commit message needs a little more information on why registry packages are, in a sense, exempt from the collision check (as long as the collision is only among the same type) while other packages are not. I'm saying this because the issue you linked only describes the problem from the user's perspective.

Member Author

I will improve on that; it is still a draft, so I haven't polished it yet.

tarball_collisions[tarball_name].append(p)

for tarball_name, collisions in tarball_collisions.items():
if all(pkg == collisions[0] for pkg in collisions):
Collaborator

Did you mean that all packages in a collisions list must be identical?

Member Author

Yes; otherwise, raise an exception.

Member Author

@slimreaper35 slimreaper35 Jan 22, 2025

Maybe a simple function wrapper around the check would be more readable 🤔

Collaborator

[non-blocking]
If you decide to rework this, consider making collisions a set and then checking that it is a unit set.
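Both formulations can be sketched side by side (made-up values; the set variant additionally requires the items to be hashable):

```python
collisions = ["strip-ansi-6.0.1.tgz"] * 3

# Comparison against the first element, as in the PR:
all_identical = all(pkg == collisions[0] for pkg in collisions)

# Equivalent "unit set" formulation:
unit_set = len(set(collisions)) == 1

assert all_identical and unit_set
```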

Member

@eskultety eskultety Jan 23, 2025

If you decide to rework this, consider making collisions a set and then checking that it is a unit set.

A perhaps slightly preferable alternative, which may pay off in the long run, would be to start the package filtering early on, in the _YarnPackageFactory factory: by turning the models into hashables, combined with some minor tweaks to this function, we'd gain a native fast filtering mechanism while keeping this function only as a safety net for cross-type collisions.

Comment on lines 224 to 231
f"Tarball collision in the offline mirror: {tarball_name} ({len(collisions)}x)"
)
Member

I'm not sure this is an improvement over the existing solution. You're still only reporting a tarball name instead of the package that caused the collision (actually, ALL the packages involved in the collision), and that doesn't seem to be a helpful user-facing error message.

Member Author

Do we want to print package names even though that does not work as it should?
#786 (comment)

Member

Well, if we know it doesn't work as it's supposed to, then IMO we should fix that, because emitting a seemingly unhelpful error message won't make the underlying problem go away.

Member Author

I don't see much value in that. Users can find the reported tarball name in yarn.lock just as easily as the package name.

@brunoapimentel
Contributor

So, just to clarify my understanding, I'll paste an example from processing kiali. The strip-ansi-6.0.1 tarball is duplicated in that repo:

[RegistryPackage(url='https://registry.yarnpkg.com/strip-ansi/-/strip-ansi-6.0.1.tgz#9e26c63d30f53443e9489495b2105d37b67a85d9', name='strip-ansi', version='6.0.1', integrity='sha512-Y38VPSHcqkFrCpFnQ9vuSXmquuv5oXOKpGeT6aGrr3o3Gc9AlVa6JBfUSOCnbxGGZF+/0ooI7KrPuUSztUdU5A==', dev=True),

RegistryPackage(url='https://registry.yarnpkg.com/strip-ansi/-/strip-ansi-6.0.1.tgz#9e26c63d30f53443e9489495b2105d37b67a85d9', name='strip-ansi', version='6.0.1', integrity='sha512-Y38VPSHcqkFrCpFnQ9vuSXmquuv5oXOKpGeT6aGrr3o3Gc9AlVa6JBfUSOCnbxGGZF+/0ooI7KrPuUSztUdU5A==', dev=True)]

We're assuming it is safe because we can see it is exactly the same tarball, source URL and hash included. Even if we had a GitHub tarball, for example, we would still verify all the details before assuming this "collision" was safe (that is, we are downloading the exact same tarball from the same place twice).

IIUIC, this solution LGTM. I'd just expand the docstring of the function to make this behavior clear; I'll certainly forget it soon 😅

@slimreaper35 slimreaper35 marked this pull request as ready for review January 24, 2025 13:45
Replace the previous duplicate tarball check with a more accurate collision
detection mechanism. The new implementation verifies that all packages mapped
to the same tarball name in the offline mirror are actually identical.

The previous Counter-based approach only checked for duplicate filenames
without validating package equivalence. The new defaultdict-based solution
properly detects cases where different packages would overwrite each other's
tarballs in the mirror.

Update test cases to cover valid duplicate scenarios (identical packages) and
invalid collisions (different packages with same tarball name). Adjust error
messages to better reflect the nature of detected issues.

Signed-off-by: Michal Šoltis <msoltis@redhat.com>
@slimreaper35
Member Author

IIUIC, this solution LGTM. I'd just expand the docstring of the function to make this behavior clear, I'll certainly forget it soon 😅

Added more descriptive docstring.

Successfully merging this pull request may close these issues.

Offline mirror collision with npm aliased package name