
NPM package is very huge #68

Open

anzemur opened this issue Aug 29, 2023 · 6 comments

anzemur commented Aug 29, 2023

I just noticed that this package is around 13 MB unpacked, and it pushed me over my AWS Lambda package size limit. This is absolutely too big for serverless deployment.

So my questions are:

  • Why are there encoder files saved inside the package (and in 3 different formats! Is this really necessary?), each ranging up to 1 MB? Does the Python version handle encoders the same way, or are they always downloaded from the repository?
  • Which of the three files is actually used for encoding (js, cjs, or json), so I can manually remove the others from the build?
  • Is there a possibility to create a smaller package containing only the encoders the user needs?
anzemur changed the title from "NPM packaga is very huge" to "NPM package is very huge" on Aug 29, 2023

anzemur commented Aug 30, 2023

@dqbd Even in the JS version there is an index file with all of the encoders inside (3 MB+), and there are also separate encoder files in the ranks directory and in chunks. Why is that? The "light" version doesn't do anything about package size; a light version should include only the code for the tokenizer, not the actual encoders, which could be loaded from a CDN or from a local directory.

dqbd (Owner) commented Aug 30, 2023

Hi @anzemur!
Regarding the size of the dependency: the default entrypoint does include all BPE ranks for each of the encoders, whereas js-tiktoken/lite and tiktoken/lite include only the core logic without the ranks.

The unpacked size reported by npm is the raw size of the package folder in node_modules, which may not reflect what actually ships in your project. Your bundler should be able to perform basic tree shaking to avoid bundling unnecessary code.

Consider the following code snippet, which can be successfully deployed to Vercel on the Hobby plan with its 1 MB code size limit (as of 30/08/2023).

import { Tiktoken } from "js-tiktoken/lite";
import cl100k_base from "js-tiktoken/ranks/cl100k_base";

export const config = { runtime: "edge" };

export default async function () {
  // Only the cl100k_base ranks get bundled; the other encoders are left out.
  const encoding = new Tiktoken(cl100k_base);
  const tokens = encoding.encode("hello world");
  return new Response(`${tokens}`);
}

Ideally though, the ranks should be fetched via a CDN, as seen in the Langchain PR langchain-ai/langchainjs#1239, which drops the bundle size down to 4.5 kB (measured with esbuild, which is also used internally by the vercel dev command).

dqbd (Owner) commented Aug 30, 2023

Regarding the extensions: the .js and .cjs files for the ranks are there mostly for compatibility reasons, covering interop between ESM and CJS modules, while .json is offered for users who might want to fetch the BPE ranks from other CDNs such as esm.sh.
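
Roughly, the two module flavors map to the two import styles (a sketch only; the exact CJS interop shape may vary):

// ESM — resolves to the .js build
import cl100k_base from "js-tiktoken/ranks/cl100k_base";

// CJS — resolves to the .cjs build
const cl100k_base = require("js-tiktoken/ranks/cl100k_base");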

Coming back to your initial question: you might be zipping the entire project together with node_modules. You may want to bundle and minify your code first, as seen in the AWS samples repo: https://github.com/aws-samples/lambda-nodejs-esbuild
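
As a rough sketch of that approach (the entry point path and Node target here are assumptions, not taken from the sample repo), an esbuild build script could look like:

// build.mjs — bundle and minify the handler before zipping it for Lambda
import { build } from "esbuild";

await build({
  entryPoints: ["src/handler.ts"], // hypothetical entry point
  bundle: true,   // pull in only the code that is actually imported
  minify: true,
  platform: "node",
  target: "node18", // assumed Lambda runtime
  outfile: "dist/handler.js",
});

With bundling enabled, only the ranks you actually import end up in the artifact, instead of the whole node_modules tree.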


seyfer commented Sep 26, 2023

@anzemur there is also another package you might consider using: https://github.com/niieani/gpt-tokenizer

ajayvignesh01 commented

For anyone coming to this now, this is the new way to do it:

import { Tiktoken } from 'js-tiktoken/lite'

// Fetch the BPE ranks from a CDN at runtime instead of bundling them
const getTokenModel = async () => {
  const response = await fetch('https://tiktoken.pages.dev/js/cl100k_base.json')
  return await response.json()
}

const rank = await getTokenModel()
const tokenizer = new Tiktoken(rank)
const tokens = tokenizer.encode('Hello World').length // number of tokens

You could also save the JSON file to your app directory and import it into the function.
This works on the Vercel Hobby plan.
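
For that local-file variant, a minimal sketch (assuming the ranks were saved as ./cl100k_base.json next to this module; with TypeScript you may also need "resolveJsonModule": true in tsconfig.json):

import { Tiktoken } from 'js-tiktoken/lite'
// Assumes ./cl100k_base.json was downloaded once from the CDN above
import cl100k_base from './cl100k_base.json'

const tokenizer = new Tiktoken(cl100k_base)
const tokens = tokenizer.encode('Hello World')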

dingyi222666 commented

Could we release a lightweight js-tiktoken/lite package without the tokenization tables bundled?
