
getFile fails on large files #6

Closed
sirinsidiator opened this issue Jan 6, 2024 · 3 comments · Fixed by #7
Labels
enhancement New feature or request

Comments

@sirinsidiator

I was trying to list the files and their sizes in a directory containing some large (>1GB) files and noticed that the extension throws an exception, even though I didn't actually attempt to load any content and simply wanted to get the size via getFile (which works fine in Chrome):
DOMException: The requested file could not be read, the file size exceeded the allowed limit.

After looking through the extension code I noticed that in https://github.com/ichaoX/ext-file/blob/main/src/lib/api/fs.js#L302C27-L302C27 the whole file is loaded into memory, and there is a config value to limit it. Removing that value gave me another error which seems to come from the Python side:
Error Too much data for base64 line

I'm not entirely sure what Chrome is doing internally, but seeing how getFile takes literally no time to return regardless of the file size, I feel it doesn't actually load the content until it is accessed via text(), stream() or arrayBuffer().
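For reference, here is roughly what I was doing, as a minimal sketch using the File System Access API in Chrome (assuming the async-iterable directory handle methods are available):

// Minimal sketch (assumed usage): iterate a directory and only read the sizes.
// In Chrome, getFile() returns immediately even for multi-GB files, which
// suggests the content is not loaded until text()/stream()/arrayBuffer() is called.
async function listSizes(dir: FileSystemDirectoryHandle) {
    for await (const [name, handle] of dir.entries()) {
        if (handle.kind === 'file') {
            const file = await handle.getFile(); // returns instantly, no content read
            console.log(name, file.size);
        }
    }
}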

So I played around a bit with creating modified File objects (seeing how File is really just an interface), and while that does allow replacing the functions above with custom code, it seems that Chrome still somehow loads the content when trying to store the file in IndexedDB, without using any of the custom functions. I guess in the end it would again require special care in the application code to ensure that it behaves the same across browsers if it was done in this fashion.

Here is the code I wrote to see if I can replace the functions with my own:

const fileContents = new Map<string, string>();
fileContents.set('test.txt', 'Hello World');
const backingFileCache = new Map<string, BackingFile>();

interface BackingFile {
    buffer: ArrayBuffer;
    position: number;
    size: number;
}

function getFileSize(path: string): number {
    return fileContents.get(path)?.length ?? 0;
}

// lazily create one in-memory backing buffer per path
function getBackingFile(path: string): BackingFile {
    let backingFile = backingFileCache.get(path);
    if (!backingFile) {
        const size = getFileSize(path);
        backingFile = {
            buffer: new ArrayBuffer(size),
            position: 0,
            size
        };
        backingFileCache.set(path, backingFile);
    }
    return backingFile;
}

// tiny chunk size so the streaming path is exercised even for the short test string
const CHUNK_SIZE = 2;

// read the next CHUNK_SIZE bytes into the backing buffer; returns [done, chunk]
async function loadChunk(path: string): Promise<[boolean, Uint8Array | null]> {
    const backingFile = getBackingFile(path);
    const start = backingFile.position;
    const end = Math.min(start + CHUNK_SIZE, backingFile.size);
    backingFile.position = end;
    const value = fileContents.get(path)?.slice(start, end);
    if (value) {
        const buffer = new TextEncoder().encode(value);
        new Uint8Array(backingFile.buffer).set(buffer, start);
        return [false, buffer];
    }
    return [true, null];
}

// build a File whose read methods are replaced with chunked reads from our own data source
function createProxyFile(path: string, start?: number, end?: number, contentType?: string): File {
    const backingFile = getBackingFile(path);
    const buffer = backingFile.buffer.slice(start ?? 0, end ?? backingFile.buffer.byteLength);
    const file = new File([buffer], path, { type: contentType });
    file.text = async () => {
        const stream = file.stream().pipeThrough(new TextDecoderStream());
        const reader = stream.getReader();
        let result = '';
        // eslint-disable-next-line no-constant-condition
        while (true) {
            const { done, value } = await reader.read();
            if (done) {
                break;
            }
            result += value;
        }
        return result;
    };
    file.stream = () => {
        const reader = new ReadableStream({
            start(controller) {
                return pump();
                async function pump() {
                    const [done, value] = await loadChunk(path);
                    if (done) {
                        controller.close();
                        return;
                    }
                    controller.enqueue(value);
                    return pump();
                }
            }
        });
        return reader;
    };
    file.slice = (begin?: number, end?: number, contentType?: string) => {
        return createProxyFile(path, begin, end, contentType);
    };
    file.arrayBuffer = async () => {
        if (backingFile.position !== backingFile.size) {
            await file.text();
        }
        return backingFile.buffer;
    };

    return file;
}

(async () => {
    const file = createProxyFile('test.txt');
    console.log("proxyFile", file);

    const content = await file.text();
    console.log("content", content);
})();
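To illustrate the IndexedDB behavior mentioned above, a store operation along these lines is enough (a hypothetical repro; the database and store names are made up):

// Hypothetical repro: structured-cloning the proxy File into IndexedDB appears
// to read its data through internal paths, bypassing the overridden
// text()/stream()/arrayBuffer() methods entirely.
const openRequest = indexedDB.open('proxy-file-test', 1);
openRequest.onupgradeneeded = () => openRequest.result.createObjectStore('files');
openRequest.onsuccess = () => {
    const db = openRequest.result;
    const tx = db.transaction('files', 'readwrite');
    tx.objectStore('files').put(createProxyFile('test.txt'), 'test.txt');
    tx.oncomplete = () => db.close();
};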

Maybe you could still try loading the content in getFile in smaller segments to avoid the base64 error.
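Purely as a sketch of that idea (readSegment below is invented and would have to be backed by the extension's actual native messaging to the helper app):

// Hypothetical sketch of segmented loading; each request covers a bounded range,
// so no single base64 payload from the helper has to encode the whole file.
declare function readSegment(path: string, start: number, end: number): Promise<Uint8Array>;

const SEGMENT_SIZE = 8 * 1024 * 1024; // arbitrary; keeps each payload well below the limit

async function assembleFile(path: string, size: number): Promise<Uint8Array> {
    const result = new Uint8Array(size);
    for (let offset = 0; offset < size; offset += SEGMENT_SIZE) {
        const end = Math.min(offset + SEGMENT_SIZE, size);
        result.set(await readSegment(path, offset, end), offset);
    }
    return result;
}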

@ichaoX
Owner

ichaoX commented Jan 7, 2024

The error on the Python side is likely because helper-app-full-windows was built in a 32-bit environment.
According to the code in https://github.com/python/cpython/blob/main/Modules/binascii.c#L102C9-L102C49, BASE64_MAXBIN is probably less than 1GB. Using helper-app-lite with 64-bit Python may eliminate this limitation, but it can run into other limits, such as Firefox throwing an exception when concatenating strings that exceed 1GB:
InternalError: allocation size overflow

There are at least two implementations of File within the browser. One is based on memory, and the other is based on a snapshot of the file's metadata.

I have considered a similar approach to yours, which can be useful in some cases, but many APIs use internal methods to read data from File instances, which can lead to exceptions or silent data corruption.

await new Response(new Blob([NONSTANDARD_File,'data'])).text()

Therefore, if this feature is implemented, it may have to be opted into explicitly, e.g. by calling handle.getFile({_allowNonNative:true}).

Do you need to read the contents of large files for your use case? If not, you can skip processing those files.

@sirinsidiator
Author

Sorry for the late reply. Yes, I plan to read those files and upload them somewhere, but since it's just for a small local experiment, if there's no easy solution I'll just live with having to use Chrome or Edge for now.

@ichaoX ichaoX added the enhancement New feature or request label Jan 24, 2024
ichaoX added a commit that referenced this issue Jan 24, 2024
@ichaoX ichaoX mentioned this issue Jan 24, 2024
@ichaoX ichaoX closed this as completed in #7 Jan 26, 2024
@ichaoX
Owner

ichaoX commented Jan 26, 2024

Starting from v0.9.4, there will be three ways to read large files:

  1. Web developers can use handle.getFile({_allowNonNative:true}) to obtain the File.
  2. Change Content Script's FS_CONFIG.NON_NATIVE_FILE value to 'auto' or 'always'.
  3. Change Content Script's FS_CONFIG.FILE_SIZE_LIMIT value.
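For example, option 1 can be used like this (a rough usage sketch; the cast is only needed because the option is not part of the standard TypeScript typings):

// Rough usage sketch, assuming a FileSystemFileHandle obtained through the
// extension's polyfilled API; reading via stream() is just one way to consume
// the content incrementally.
declare const handle: FileSystemFileHandle;

const file = (await (handle as any).getFile({ _allowNonNative: true })) as File;
const reader = file.stream().getReader();
let chunk = await reader.read();
while (!chunk.done) {
    console.log('read', chunk.value.byteLength, 'bytes'); // process each chunk here
    chunk = await reader.read();
}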
