Are we managing extension files efficiently #980

Open
kineticsquid opened this issue Sep 5, 2024 · 14 comments

@kineticsquid

The origin of this is EclipseFdn/open-vsx.org#2317. Ultimately our objective is, for a variety of reasons, to keep the size of our DB manageable. The file_resource table is by far the largest. I poked at this a bit in a Gitpod workspace. In that sample workspace, there are 22 extensions, 40 versions, and a little over 99K entries in the file table. It looks like, in addition to the files listed in the extension package.json file, e.g. readme, download, license, icon, all the files included in the .vsix file are also listed in this table. The type is resource: https://raw.githubusercontent.com/kineticsquid/openvsx/master/output.txt.

Separately, looking at a sample of the open-vsx.org access logs, I can see /file API calls requesting these files.

I can understand the access to the icon, license, readme files for all the versions of an extension. I don't understand the logic that results in access requests to these other files. Unfortunately, I don't understand the code enough to figure out where these calls are coming from, UI or server.

@amvanbaren @filiptronicek @spoenemann Any insight or background on this?

@amvanbaren
Contributor

amvanbaren commented Sep 9, 2024

Some context: #432

I've deployed a proof of concept to staging where resource files are extracted from the vsix package on the fly. The response to the initial request is slower, but reasonable (2 - 3 seconds). The response is cached for 30 days, so subsequent requests are faster.
The upside is fewer rows in the file_resource table and fewer files in blob storage. The downside is slower response times and most likely higher bandwidth usage. This can be an acceptable trade-off if only a limited set of resource files is requested, e.g. 80% cached responses and 20% responses generated on the fly.
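
For illustration, a minimal sketch of reading a single resource out of a .vsix package on demand. It relies only on the fact that a .vsix file is a ZIP archive. The class and method names (`VsixResourceReader`, `readEntry`) are hypothetical, and this is not the code deployed to staging.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for illustration; not the actual Open VSX implementation. */
final class VsixResourceReader {

    private VsixResourceReader() {}

    /**
     * Reads one entry (e.g. "extension/package.json") from a .vsix archive.
     * Returns an empty Optional when the entry is missing or is a directory.
     */
    static Optional<byte[]> readEntry(Path vsixFile, String entryPath) throws IOException {
        try (ZipFile zip = new ZipFile(vsixFile.toFile())) {
            ZipEntry entry = zip.getEntry(entryPath);
            if (entry == null || entry.isDirectory()) {
                return Optional.empty();
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return Optional.of(in.readAllBytes());
            }
        }
    }
}
```

The archive is opened and closed on every call, which keeps the sketch simple but is part of why the first, uncached request costs more than a lookup of a pre-extracted file.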

@kineticsquid
Author

@amvanbaren Thanks, this seems a reasonable approach. But before we go there, I'd like to understand a bit more about the use case(s).

  • In Some extension resource url return 404 or 500 #432, I see calls to \open-vsx.org\*\asset\... and \open-vsx.org\*\gallery\.... Where are these calls coming from and why is the info returned by our API insufficient?
  • In this comment you refer to a web type of extension. Is this specified in package.json like this?
        "extensionKind": [
            "workspace",
            "web"
        ],
  • What web resource files are extracted (presumably and entered in the file_resource table) and not extracted for other types of extensions?
  • I could imagine optimizations for extension management in IDE UIs, but wouldn't that require only the files from the latest version?
  • In the sample of rows from the file_resource table in https://raw.githubusercontent.com/kineticsquid/openvsx/master/output.txt, I can see things like JavaScript dependencies. Is there a use case for this, or is it a byproduct of how we're processing the extensions?

@amvanbaren
Contributor

Where are these calls coming from and why is the info returned by our API insufficient?

These calls are coming from VS Code based editors. The info returned by the API was insufficient, because it was only returned for extensions with web in their tag list.

Is this specified in package.json like this?

The tags list in the extension.vsixmanifest file was used.

What web resource files are extracted (presumably and entered in the file_resource table) and not extracted for other types of extensions?

All files in the extension folder were recursively added to the file_resource table. So basically all files in the vsix package.

I could imagine optimizations for extension management in IDE UIs, but wouldn't that require only the files from the latest version?

This is not on the extension level. A resource is requested for a specific version of an extension.

Is there a use case for this?

Yes, it is to keep feature parity with the MS VS Code API:
https://ms-python.vscode-unpkg.net/ms-python/python/2024.14.0/extension/out/client/node_modules/
https://open-vsx.org/vscode/unpkg/ms-python/python/2024.14.0/extension/out/client/node_modules/

@kineticsquid
Author

@amvanbaren Thanks for the additional info, this is helpful. A couple of follow up questions.

  • Once an extension is installed, an editor presumably has all the files, so are these calls made for information for extensions that are not installed in the IDE?
  • Does Theia make similar calls?

@amvanbaren
Contributor

Once an extension is installed, an editor presumably has all the files.

Yes, I think the desktop editor uses local files. It looks like this functionality is used by VS Code server deployments, like the Gitpod openvscode-server.

so are these calls made for information for extensions that are not installed in the IDE?

That could be possible too.

Does Theia make similar calls?

It can through the /api/file/... endpoints by file path, but a quick look at the Theia source code makes me think it only uses predefined file urls (download, icon, manifest, etc.) https://github.com/eclipse-theia/theia/blob/19556f4d90c1b661ba53caea9b6a035a714e112d/dev-packages/ovsx-client/src/ovsx-types.ts#L198

@kineticsquid
Author

@amvanbaren we believe we're seeing calls (not through the /file API) from Gitpod. What about VSCodium?

Ultimately I think we're going to need to sample our access logs again to get a better picture. Can you recommend a text filter to limit the entries we're looking for?

@amvanbaren
Contributor

File endpoints
  • /api/{namespace}/{extension}/{version}/file/**
  • /api/{namespace}/{extension}/{targetPlatform}/{version}/file/**

The last part of the url can be a file type (download, icon, license) or a file path (e.g. extension/package.json). Here it is pretty hard to make a distinction between calls that return a resource and calls that return another file type.
regex:

\/api\/[\w\-\+\$~]+\/[\w\-\+\$~]+(\/[\w\-\+\$~]+)?\/[\w\-\+\$\.~]+\/file\/.*

Resource endpoint
  • /vscode/unpkg/{namespaceName}/{extensionName}/{version}/**

Every call to this endpoint uses the resource file type.
regex:

\/vscode\/unpkg\/.*

VSIX package download endpoint
  • /vscode/gallery/publishers/{namespaceName}/vsextensions/{extensionName}/{version}/vspackage

Returns redirect to download vsix package.
regex:

\/vscode\/gallery\/publishers\/[\w\-\+\$~]+\/vsextensions\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+\/vspackage

Asset endpoint
  • /vscode/asset/{namespaceName}/{extensionName}/{version}/{assetType}/**

Returns asset file.
regex to get any asset:

\/vscode\/asset\/[\w\-\+\$~]+\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+\/Microsoft\.VisualStudio\.((Services\.((Content\.(Details|Changelog|License))|Icons\.Default|VSIXPackage|VsixManifest|VsixSignature|PublicKey))|(Code\.(Manifest|WebResources)))\/.*

regex to get only resources:

\/vscode\/asset\/[\w\-\+\$~]+\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+\/Microsoft\.VisualStudio\.Code\.WebResources\/.*

other asset types:

  • Microsoft.VisualStudio.Services.Content.Details
  • Microsoft.VisualStudio.Services.Content.Changelog
  • Microsoft.VisualStudio.Services.Content.License
  • Microsoft.VisualStudio.Services.Icons.Default
  • Microsoft.VisualStudio.Services.VSIXPackage
  • Microsoft.VisualStudio.Services.VsixManifest
  • Microsoft.VisualStudio.Services.VsixSignature
  • Microsoft.VisualStudio.Services.PublicKey
  • Microsoft.VisualStudio.Code.Manifest
  • Microsoft.VisualStudio.Code.WebResources
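
As a rough sketch, the expressions above could be applied to access log lines as follows. It assumes each log line contains the request path as plain text. The class name, the `Kind` enum, and its category names are made up for this example, and the regexes are the ones listed above with the escapes that Java does not need removed.

```java
import java.util.regex.Pattern;

/** Hypothetical log filter built from the regexes listed above. */
final class AccessLogClassifier {

    enum Kind { API_FILE, UNPKG_RESOURCE, VSIX_PACKAGE, ASSET_WEB_RESOURCE, OTHER }

    // Shared fragments: namespace/extension/target platform names, and versions.
    private static final String NAME = "[\\w\\-+$~]+";
    private static final String VERSION = "[\\w\\-+$.~]+";

    private static final Pattern API_FILE = Pattern.compile(
            "/api/" + NAME + "/" + NAME + "(/" + NAME + ")?/" + VERSION + "/file/.*");
    private static final Pattern UNPKG = Pattern.compile("/vscode/unpkg/.*");
    private static final Pattern VSIX_PACKAGE = Pattern.compile(
            "/vscode/gallery/publishers/" + NAME + "/vsextensions/" + NAME + "/" + VERSION + "/vspackage");
    private static final Pattern ASSET_WEB_RESOURCE = Pattern.compile(
            "/vscode/asset/" + NAME + "/" + NAME + "/" + VERSION
                    + "/Microsoft\\.VisualStudio\\.Code\\.WebResources/.*");

    private AccessLogClassifier() {}

    /**
     * Classifies a log line by the first endpoint pattern found in it.
     * Other asset types (icon, license, ...) fall through to OTHER here.
     */
    static Kind classify(String logLine) {
        if (ASSET_WEB_RESOURCE.matcher(logLine).find()) return Kind.ASSET_WEB_RESOURCE;
        if (UNPKG.matcher(logLine).find()) return Kind.UNPKG_RESOURCE;
        if (VSIX_PACKAGE.matcher(logLine).find()) return Kind.VSIX_PACKAGE;
        if (API_FILE.matcher(logLine).find()) return Kind.API_FILE;
        return Kind.OTHER;
    }
}
```

For example, `AccessLogClassifier.classify("GET /vscode/unpkg/ms-python/python/2024.14.0/extension/out/client/node_modules/")` returns `UNPKG_RESOURCE`. As noted above, `API_FILE` cannot tell a file type such as `download` apart from a file path, so that category still mixes both.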

@kineticsquid
Author

@amvanbaren Thanks, this is really helpful. It looks like these paths are defined here: https://github.com/eclipse/openvsx/blob/master/server/src/main/java/org/eclipse/openvsx/web/WebConfig.java.

Do all of these URLs and asset types result in references to the file_resource table?

I also noticed a path \documents. I can't figure out where that's processed. What does it return and does it also hit the file_resource table?

It seems like the way to get a handle on URLs (not part of the API) that cause references to the file_resource table would be to filter the access logs to \vscode and (maybe) \documents. Is that right?

@amvanbaren
Contributor

WebConfig is to configure extra features on top, like CORS and interceptors for mirror mode.
You can find the actual endpoints defined in: VSCodeAPI and RegistryAPI

The /documents endpoint serves static content, like the publisher agreement and terms of use.

@kineticsquid
Author

@amvanbaren I think I understand. Ultimate goal is to reduce the size of the file_resource table. The next step will be to get another sample of the access logs that cause references to this table. Based on the above, I think what we want are all references to \api\...file and \vscode. That sound right?

@amvanbaren
Contributor

Yes, sounds right. You can further narrow down /vscode requests to /vscode/asset and /vscode/unpkg.

@kineticsquid
Author

What about these? Do they not cause a file lookup?

                            "/vscode/item",
                            "/vscode/gallery/publishers/**",

@amvanbaren
Contributor

/vscode/item redirects to the extension page in the webui:

return UrlUtil.createApiUrl(webuiUrl, "extension", extension.getNamespace().getName(), extension.getName());

/vscode/gallery/publishers/** returns a link to a vsix package. You could include it, but extension downloads are pretty non-negotiable.

return storageUtil.getLocation(resource).toString();

@kkistm

kkistm commented Sep 30, 2024

As discussed with @kineticsquid, I am putting my thoughts about a caching implementation here, just to save them in a place where everybody can see them.

The question is about getting rid of the file_resource table without having to unpack the .vsix every time a new file is needed from it. I think we can reduce (or even avoid) the need to unpack the extension several times if files from it are requested within a short period of time. The obvious idea is to use some form of caching on the Java side. I see two viable options:

  • The first is an in-memory cache. It can be done quite easily using Guava Cache: https://github.com/google/guava/wiki/CachesExplained. It makes it easy to specify eviction policies to keep the cache small. The cache will be fast and fully under our control. The only drawback I see is that there would be one cache per Java application instance, so a .vsix could potentially be unpacked several times.
  • The other option is to use Postgres as a cache. It can be done using UNLOGGED tables (https://www.crunchydata.com/blog/postgresl-unlogged-tables). In this case the cache will be shared among the instances. The eviction could be implemented with a stored procedure run by pg_cron, for example. The drawback is a certain speed penalty, but it looks like the performance of UNLOGGED tables is quite good.
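
A minimal sketch of the in-memory option. To stay dependency-free it uses a plain `LinkedHashMap` in access order as a size-bounded LRU cache instead of Guava, so it only illustrates the eviction idea; the class name `ResourceLruCache` and the loader callback are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Size-bounded LRU cache; a stand-in for a Guava cache, for illustration only. */
final class ResourceLruCache<K, V> {

    private final Map<K, V> entries;

    ResourceLruCache(int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, or calls the loader and caches its result. */
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Each application instance would hold its own copy of such a cache, which is the per-instance drawback mentioned in the first option.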

I haven't looked at scenarios that add the cache as a separate application, like Redis or memcached, because that might make the setup unnecessarily complex. I also don't know if Elastic could be used as a key/value store.
A separate topic is how to populate the cache(s), but an ExecutorService instance could help there.
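
One way the ExecutorService idea could look, as a sketch only: requests for the same key share a single background extraction, so a .vsix is not unpacked twice by concurrent requests within one instance. The class name `AsyncResourceLoader`, the string key, and the `extractor` callback are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

/** Hypothetical loader that populates a cache on a background executor. */
final class AsyncResourceLoader {

    private final ConcurrentMap<String, CompletableFuture<byte[]>> results = new ConcurrentHashMap<>();
    private final ExecutorService executor;
    private final Function<String, byte[]> extractor;

    AsyncResourceLoader(ExecutorService executor, Function<String, byte[]> extractor) {
        this.executor = executor;
        this.extractor = extractor;
    }

    /**
     * Returns the future for the given key, scheduling the extraction on the
     * executor only if no extraction for that key has been started yet.
     */
    CompletableFuture<byte[]> load(String key) {
        return results.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> extractor.apply(k), executor));
    }
}
```

This sketch never evicts entries and keeps failed futures, so a real version would need to combine it with one of the eviction approaches above.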
