
getFile fails on large files #6

Closed
sirinsidiator opened this issue Jan 6, 2024 · 3 comments · Fixed by #7
Labels
enhancement New feature or request

Comments

@sirinsidiator

I was trying to list the files and their sizes in a directory containing some large (>1GB) files and noticed that the extension throws an exception, even though I didn't actually attempt to load any content and simply wanted to get the size via getFile (which works fine in Chrome):
DOMException: The requested file could not be read, the file size exceeded the allowed limit.

After looking through the extension code I noticed that in https://github.com/ichaoX/ext-file/blob/main/src/lib/api/fs.js#L302C27-L302C27 the whole file is loaded into memory, and there is a config value to limit it. Removing that value gave me another error which seems to come from the Python side:
Error Too much data for base64 line

I'm not entirely sure what Chrome is doing internally, but seeing how getFile takes literally no time to return regardless of the file size, I feel it doesn't actually load the content until it is accessed via text(), stream() or arrayBuffer().
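For reference, here is roughly what I was doing, as a minimal sketch using the File System Access API in Chrome (assuming the async-iterable directory handle methods are available):

// Minimal sketch (assumed usage): iterate a directory and only read the sizes.
// In Chrome, getFile() returns immediately even for multi-GB files, which
// suggests the content is not loaded until text()/stream()/arrayBuffer() is called.
async function listSizes(dir: FileSystemDirectoryHandle) {
    for await (const [name, handle] of dir.entries()) {
        if (handle.kind === 'file') {
            const file = await handle.getFile(); // returns instantly, no content read
            console.log(name, file.size);
        }
    }
}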

So I played around a bit with creating modified File objects (seeing how File is really just an interface), and while that does allow replacing the functions above with custom code, it seems that Chrome still somehow loads the content when trying to store the file in IndexedDB, without using any of the custom functions. I guess in the end it would again require special care in the application code to ensure that it behaves the same across browsers if it was done in this fashion.

Here is the code I wrote to see if I can replace the functions with my own:

const fileContents = new Map<string, string>();
fileContents.set('test.txt', 'Hello World');
const backingFileCache = new Map<string, BackingFile>();

interface BackingFile {
    buffer: ArrayBuffer;
    position: number;
    size: number;
}

function getFileSize(path: string): number {
    return fileContents.get(path)?.length ?? 0;
}

// lazily create one in-memory backing buffer per path
function getBackingFile(path: string): BackingFile {
    let backingFile = backingFileCache.get(path);
    if (!backingFile) {
        const size = getFileSize(path);
        backingFile = {
            buffer: new ArrayBuffer(size),
            position: 0,
            size
        };
        backingFileCache.set(path, backingFile);
    }
    return backingFile;
}

// tiny chunk size so the streaming path is exercised even for the short test string
const CHUNK_SIZE = 2;

// read the next CHUNK_SIZE bytes into the backing buffer; returns [done, chunk]
async function loadChunk(path: string): Promise<[boolean, Uint8Array | null]> {
    const backingFile = getBackingFile(path);
    const start = backingFile.position;
    const end = Math.min(start + CHUNK_SIZE, backingFile.size);
    backingFile.position = end;
    const value = fileContents.get(path)?.slice(start, end);
    if (value) {
        const buffer = new TextEncoder().encode(value);
        new Uint8Array(backingFile.buffer).set(buffer, start);
        return [false, buffer];
    }
    return [true, null];
}

// build a File whose read methods are replaced with chunked reads from our own data source
function createProxyFile(path: string, start?: number, end?: number, contentType?: string): File {
    const backingFile = getBackingFile(path);
    const buffer = backingFile.buffer.slice(start ?? 0, end ?? backingFile.buffer.byteLength);
    const file = new File([buffer], path, { type: contentType });
    file.text = async () => {
        const stream = file.stream().pipeThrough(new TextDecoderStream());
        const reader = stream.getReader();
        let result = '';
        // eslint-disable-next-line no-constant-condition
        while (true) {
            const { done, value } = await reader.read();
            if (done) {
                break;
            }
            result += value;
        }
        return result;
    };
    file.stream = () => {
        const reader = new ReadableStream({
            start(controller) {
                return pump();
                async function pump() {
                    const [done, value] = await loadChunk(path);
                    if (done) {
                        controller.close();
                        return;
                    }
                    controller.enqueue(value);
                    return pump();
                }
            }
        });
        return reader;
    };
    file.slice = (begin?: number, end?: number, contentType?: string) => {
        return createProxyFile(path, begin, end, contentType);
    };
    file.arrayBuffer = async () => {
        if (backingFile.position !== backingFile.size) {
            await file.text();
        }
        return backingFile.buffer;
    };

    return file;
}

(async () => {
    const file = createProxyFile('test.txt');
    console.log("proxyFile", file);

    const content = await file.text();
    console.log("content", content);
})();
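To illustrate the IndexedDB behavior mentioned above, a store operation along these lines is enough (a hypothetical repro; the database and store names are made up):

// Hypothetical repro: structured-cloning the proxy File into IndexedDB appears
// to read its data through internal paths, bypassing the overridden
// text()/stream()/arrayBuffer() methods entirely.
const openRequest = indexedDB.open('proxy-file-test', 1);
openRequest.onupgradeneeded = () => openRequest.result.createObjectStore('files');
openRequest.onsuccess = () => {
    const db = openRequest.result;
    const tx = db.transaction('files', 'readwrite');
    tx.objectStore('files').put(createProxyFile('test.txt'), 'test.txt');
    tx.oncomplete = () => db.close();
};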

Maybe you could still try loading the content in getFile in smaller segments to avoid the base64 error.
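Purely as a sketch of that idea (readSegment below is invented and would have to be backed by the extension's actual native messaging to the helper app):

// Hypothetical sketch of segmented loading; each request covers a bounded range,
// so no single base64 payload from the helper has to encode the whole file.
declare function readSegment(path: string, start: number, end: number): Promise<Uint8Array>;

const SEGMENT_SIZE = 8 * 1024 * 1024; // arbitrary; keeps each payload well below the limit

async function assembleFile(path: string, size: number): Promise<Uint8Array> {
    const result = new Uint8Array(size);
    for (let offset = 0; offset < size; offset += SEGMENT_SIZE) {
        const end = Math.min(offset + SEGMENT_SIZE, size);
        result.set(await readSegment(path, offset, end), offset);
    }
    return result;
}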

@ichaoX
Owner

ichaoX commented Jan 7, 2024

The error on the Python side is likely because helper-app-full-windows was built in a 32-bit environment.
According to the code in https://github.com/python/cpython/blob/main/Modules/binascii.c#L102C9-L102C49, BASE64_MAXBIN is probably less than 1GB. Using helper-app-lite with 64-bit Python may eliminate this limitation, but it can run into other limits, such as Firefox throwing an exception when concatenating strings that exceed 1GB:
InternalError: allocation size overflow

There are at least two implementations of File within the browser. One is based on memory, and the other is based on a snapshot of the file's metadata.

I have considered a similar approach to yours, which can be useful in some cases, but many APIs use internal methods to read data from File instances, which can lead to exceptions or silent data corruption.

await new Response(new Blob([NONSTANDARD_File,'data'])).text()

Therefore, if this feature is implemented, it may have to be opted into explicitly, e.g. by calling handle.getFile({_allowNonNative:true}).

Do you need to read the contents of large files for your use case? If not, you can skip processing those files.

@sirinsidiator
Author

Sorry for the late reply. Yes, I plan to read those files and upload them somewhere, but since it's just for a small local experiment, if there's no easy solution I'll just live with having to use Chrome or Edge for now.

@ichaoX ichaoX added the enhancement New feature or request label Jan 24, 2024
ichaoX added a commit that referenced this issue Jan 24, 2024
@ichaoX ichaoX mentioned this issue Jan 24, 2024
@ichaoX ichaoX closed this as completed in #7 Jan 26, 2024
@ichaoX
Owner

ichaoX commented Jan 26, 2024

Starting from v0.9.4, there will be three ways to read large files:

  1. Web developers can use handle.getFile({_allowNonNative:true}) to obtain the File.
  2. Change Content Script's FS_CONFIG.NON_NATIVE_FILE value to 'auto' or 'always'.
  3. Change Content Script's FS_CONFIG.FILE_SIZE_LIMIT value.
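For example, option 1 can be used like this (a rough usage sketch; the cast is only needed because the option is not part of the standard TypeScript typings):

// Rough usage sketch, assuming a FileSystemFileHandle obtained through the
// extension's polyfilled API; reading via stream() is just one way to consume
// the content incrementally.
declare const handle: FileSystemFileHandle;

const file = (await (handle as any).getFile({ _allowNonNative: true })) as File;
const reader = file.stream().getReader();
let chunk = await reader.read();
while (!chunk.done) {
    console.log('read', chunk.value.byteLength, 'bytes'); // process each chunk here
    chunk = await reader.read();
}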
