
NPM package is very huge #68

Open

anzemur opened this issue Aug 29, 2023 · 6 comments

anzemur commented Aug 29, 2023

I just noticed that this package is around 13 MB unpacked, and it pushed me over my AWS Lambda package size limit. This is absolutely too big for serverless deployment.

So my questions are:

  • Why are there encoder files saved inside the package (and in 3 different formats! Is this really necessary?), each ranging up to 1 MB? Does the Python version handle encoders the same way, or are they always downloaded from the repository?
  • Which of the three files is actually used for encoding (js, cjs, or json), so I can manually remove the others from the build?
  • Is there a possibility to create a smaller package containing only the encoders the user needs?
anzemur changed the title from "NPM packaga is very huge" to "NPM package is very huge" on Aug 29, 2023

anzemur commented Aug 30, 2023

@dqbd Even in the JS version there is an index file with all of the encoders inside (3 MB+), and there are also separate encoder files in the ranks directory and in chunks. Why is that? The "light" version doesn't do anything about package size; a light version should include only the code for the tokenizer, not the actual encoders, which could be loaded from a CDN or from a local directory.

dqbd (Owner) commented Aug 30, 2023

Hi @anzemur!
Regarding the size of the dependency: the default entrypoint does include all BPE ranks for each of the encoders, whereas js-tiktoken/lite and tiktoken/lite include only the core logic without the ranks.

The unpacked size reported by npm is the raw size of the package folder in node_modules, which may not reflect what actually ships in your project. Your bundler should be able to perform basic tree shaking to avoid bundling unnecessary code.

Consider the following code snippet, which can be successfully deployed to Vercel on the Hobby plan with its 1 MB code size limit (as of 30/08/2023).

import { Tiktoken } from "js-tiktoken/lite";
import cl100k_base from "js-tiktoken/ranks/cl100k_base";

export const config = { runtime: "edge" };

export default async function () {
  // Only the cl100k_base ranks get bundled; the other encoders are left out.
  const encoding = new Tiktoken(cl100k_base);
  const tokens = encoding.encode("hello world");
  return new Response(`${tokens}`);
}

Ideally though, the ranks should be fetched via a CDN, as seen in the Langchain PR langchain-ai/langchainjs#1239, which drops the bundle size down to 4.5 kB (measured with esbuild, which is also used internally by the vercel dev command).

dqbd (Owner) commented Aug 30, 2023

Regarding the extensions: the .js and .cjs files for the ranks are there mostly for compatibility reasons, covering interop between ESM and CJS modules, while .json is offered for users who might want to fetch the BPE ranks from other CDNs such as esm.sh.
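
Roughly, the two module flavors map to the two import styles (a sketch only; the exact CJS interop shape may vary):

// ESM — resolves to the .js build
import cl100k_base from "js-tiktoken/ranks/cl100k_base";

// CJS — resolves to the .cjs build
const cl100k_base = require("js-tiktoken/ranks/cl100k_base");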

Coming back to your initial question: you might be zipping the entire project together with node_modules. You may want to bundle and minify your code first, as seen in the AWS samples repo: https://github.com/aws-samples/lambda-nodejs-esbuild
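
As a rough sketch of that approach (the entry point path and Node target here are assumptions, not taken from the sample repo), an esbuild build script could look like:

// build.mjs — bundle and minify the handler before zipping it for Lambda
import { build } from "esbuild";

await build({
  entryPoints: ["src/handler.ts"], // hypothetical entry point
  bundle: true,   // pull in only the code that is actually imported
  minify: true,
  platform: "node",
  target: "node18", // assumed Lambda runtime
  outfile: "dist/handler.js",
});

With bundling enabled, only the ranks you actually import end up in the artifact, instead of the whole node_modules tree.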


seyfer commented Sep 26, 2023

@anzemur there is also another package you might consider using: https://github.com/niieani/gpt-tokenizer

ajayvignesh01 commented

For anyone coming to this now, this is the new way to do it:

import { Tiktoken } from 'js-tiktoken/lite'

// Fetch the BPE ranks from a CDN at runtime instead of bundling them
const getTokenModel = async () => {
  const response = await fetch('https://tiktoken.pages.dev/js/cl100k_base.json')
  return await response.json()
}

const rank = await getTokenModel()
const tokenizer = new Tiktoken(rank)
const tokens = tokenizer.encode('Hello World').length // number of tokens

You could also save the JSON file to your app directory and import it into the function.
This works on the Vercel Hobby plan.
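
For that local-file variant, a minimal sketch (assuming the ranks were saved as ./cl100k_base.json next to this module; with TypeScript you may also need "resolveJsonModule": true in tsconfig.json):

import { Tiktoken } from 'js-tiktoken/lite'
// Assumes ./cl100k_base.json was downloaded once from the CDN above
import cl100k_base from './cl100k_base.json'

const tokenizer = new Tiktoken(cl100k_base)
const tokens = tokenizer.encode('Hello World')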

dingyi222666 commented

Could we release a lightweight js-tiktoken/lite package without the tokenization tables bundled?
