Updating to use compositional generator/parser pairs #2

aryanjassal · 2025-02-12T03:21:39Z

Description

Implement a custom solution for generating tar archives using compositional generator/parser pairs. The primary use case of this is to transfer file tree data from the efs to the local system and vice-versa for Polykey.

Issues Fixed

Relates to Use tar archival format to transfer file tree structure and data between Polykey vaults and local file systems Polykey#811
Fixes Update implementation to use compositional generator/parser pairs #1

Tasks

1. Update the library to new templating
2. Create compositional generators for archiving file
3. Create parsers to extract file tree data from archive
4. Add options for deterministic tar
5. Add options to provide custom implementation of fs
6. Add rigorous tests for everything
7. Add header for extended metadata

Final checklist

aryanjassal · 2025-02-12T07:15:03Z

When creating a file in the archive, if the header contains nested paths like /home/aryanj/Downloads/FILE.txt as the file name, then that is traditionally an incorrect archive. Some stricter archival tools (like perhaps WinRAR/7z) might fail to parse this. However, I tested this on File Roller (Ubuntu's default archival software) and GNU tar. Supposedly it's also meant to work on BSD tar, but I haven't tested that yet. I also haven't tested it on WinRAR/7z either.

How do I know about it working on File Roller and GNU tar? Because I've tested it and it works!! For now, I have only implemented a simple implementation which takes in a file, chunks it by 512 bytes, then writes the headers and contents using composable async generators.

It's still basic in complexity, but it works and can be tested out by using the system's archival tools to check for the validity of the archives. This will also be implemented as part of some tests.

aryanjassal · 2025-02-12T07:55:02Z

The default limit of tar file names is 255 bytes. I don't think such a limitation is imposed on the file names in the efs for Polykey, so this might be an issue.

This is solved by GNU tar (and also one of the most widespread implementation) by using a special header for longer file names.
https://stackoverflow.com/questions/2078778/what-exactly-is-the-gnu-tar-longlink-trick

Basically, for file names longer than 255 characters, a new block is created just before the header block, which can store upto 8,589,934,591 bytes, or approximately 8.59 GiB of file path. This is achieved by the following structure.

[Long Filename Header]
[Long Filename Data Block]
...
[Actual File Header]
[File Content]
...

In the long file name header, most of the bits are ignored except the filename, which must be ././@LongLink, the size, which must be properly calculated, the magic values of ustar and 00, and the type being L signifying a long filename header. All other values can be null.

After this header block, the long file name follows, padded to 512 bytes. Note that this is the full path, and using a long file name header means the file name fields in the actual file header will be null, as the file name came just before the header. Everything else is the same.

Should I also implement this, or make an issue for this and leave it for later @tegefaulkes @CMCDragonkai?

aryanjassal · 2025-02-12T08:28:09Z

I have also implemented support for relative paths now. This is mostly complete. With a parser and some updates to the generator code, this can be finished rather quickly now.

The performance is also very good. I have tested this on the local file system (as efs isn't available for running tests at the moment) and I was able to archive a 600 MiB directory within 50 seconds. This is without any sort of optimisation and with a lot of debug code, so this number might drop yet further as this approaches completion.

aryanjassal · 2025-02-12T09:40:18Z

Should I write benches for this? I've never written them or looked into it so I'm not sure how I would approach it, or if I even need to for this repo.

src/types.ts

CMCDragonkai · 2025-02-12T16:29:32Z

Should I write benches for this? I've never written them or looked into it so I'm not sure how I would approach it, or if I even need to for this repo.

Just look at how it's done in js-logger and js-id or similar. Copy those into chatgpt and ask how best to bench the code you have.

Btw the new copilot extension to vscode supports computer operating so it can kind of inspect the full context and code up things like this.

CMCDragonkai · 2025-02-12T19:16:32Z

I suspect you need a utility function to deterministically operate over a filesystem to get a sorted list of files. Again that might not exactly be a virtualtar problem, so you should keep that in mind in the case of PK though.

src/Generator.ts

src/types.ts

src/Generator.ts

src/types.ts

aryanjassal · 2025-02-17T05:41:56Z

After discussing the structure of this project further, we have decided to pursue a functional style instead of a generator style. Instead of being provided a directory and performing all the operations ourselves, small utilities will be written instead.

For the generator, this utility will take in a path, a type, and optionally the file data. For directories, a single header will be returned. For files, a header will be returned populated with the data size and other metadata.

For the parser, a Uint8Array will be passed in by reference, then a 512-byte chunk will be sliced out of the array and a token will be generated. I can use a DataView here.

Basically, js-virtualtar will consist of two utility functions, one which generates headers and one which parses them into tokens. Is this the direction we want js-virtualtar to go in? It will become difficult to make changes to the structure down the line, so I'm making this clear in advance. What do you think, @tegefaulkes @CMCDragonkai?

CMCDragonkai · 2025-02-17T16:15:52Z

After discussing the structure of this project further, we have decided to pursue a functional style instead of a generator style. Instead of being provided a directory and performing all the operations ourselves, small utilities will be written instead.

For the generator, this utility will take in a path, a type, and optionally the file data. For directories, a single header will be returned. For files, a header will be returned populated with the data size and other metadata.

For the parser, a Uint8Array will be passed in by reference, then a 512-byte chunk will be sliced out of the array and a token will be generated. I can use a DataView here.

Basically, js-virtualtar will consist of two utility functions, one which generates headers and one which parses them into tokens. Is this the direction we want js-virtualtar to go in? It will become difficult to make changes to the structure down the line, so I'm making this clear in advance. What do you think, @tegefaulkes @CMCDragonkai?

You should be able to export multiple functions not just 2. Plus their types.

aryanjassal · 2025-02-19T02:23:39Z

There are some operations which would require logging or communicating with the user. For example, a header can be of type directory but can have a non-zero size. While this breaks convention, it is technically allowed and is parsable by some tar software. In cases like this, I would like to log out a warning.

The idea I have is to pass in a parameter for logger. If it is unset, then all logging will be disabled. Otherwise, the logger will be used for logging out messages. Should I implement this?

aryanjassal · 2025-02-19T04:09:43Z

I have updated the structure for the generator to expose utility functions which generate the relevant blocks instead. I still haven't implemented proper CI tests, but I have tested this out by archiving a local directory on my system and viewing and unarchiving it using other utilities.

I ran into two major issues when doing this. Firstly, I got some weird issues with the headers containing invalid data, but I then realised that instead of converting the digits to radix-8, I was converting them based on the number of bytes the chunk would occupy. This was because of a typo when extracting that method into its own utility.

Another issue was me forgetting that a buffer is memory in heap, so I tried to add buffers to an array (for later writing to a tar file) but was using the same buffer to write multiple blocks, which was overwriting data and resulting in the files corrupting.

After some debugging, I finally got this working perfectly. After some cleaning up of the Generator and the test, I will start work on the parser.

aryanjassal · 2025-02-19T04:42:41Z

After discussing with Brian, then conclusion we have now is that the generator/parser is too low-level to deal with logging, and it would also impact performance. So, if the structure deviates too much from expectation, the parser should throw, and if it is a minor issue like the size field being set for directories, that is out of scope for the parser, which just generates tokens. It is the responsibility of whatever is consuming the tokens to analyse if the tokens contain invalid data and deal with it accordingly. As such, no logger is required for virtualtar. It needs to be as performant as possible.

This probably also means writing benchmarks. I will reference other repos and ask chatgpt about writing benchmarks.

aryanjassal · 2025-02-19T08:53:12Z

Just added simple parsing. I tested parsing by archiving a directory using the generator then parsing using the parser. It was able to do the conversion both ways pretty well.

Next steps include cleaning up the code and bring it up to library standards, adding proper tests, and adding benchmarks.

CMCDragonkai · 2025-02-20T15:41:04Z

After discussing with Brian, then conclusion we have now is that the generator/parser is too low-level to deal with logging, and it would also impact performance. So, if the structure deviates too much from expectation, the parser should throw, and if it is a minor issue like the size field being set for directories, that is out of scope for the parser, which just generates tokens. It is the responsibility of whatever is consuming the tokens to analyse if the tokens contain invalid data and deal with it accordingly. As such, no logger is required for virtualtar. It needs to be as performant as possible.

This probably also means writing benchmarks. I will reference other repos and ask chatgpt about writing benchmarks.

Parsers should just parse. It's not really stateful. Therefore logging is not required. Logging is best needed for stateful systems.

src/Generator.ts

src/Parser.ts

tegefaulkes · 2025-02-24T01:50:35Z

src/utils.ts

+function extractOctal(
+  view: DataView,
+  offset?: number,
+  length?: number,
+): number {
+  const value = extractChars(view, offset, length);
+  return value.length > 0 ? parseInt(value, 8) : 0;
+}


I'm pretty sure parseInt can return NaN if it wasn't an actual number. Be careful of that.

tests/index.test.ts

CMCDragonkai · 2025-02-24T05:53:23Z

We are ok using Buffer. Because of the feross/buffer. It's fine.

CMCDragonkai · 2025-02-24T05:54:19Z

In fact Buffer will be faster than Uint8Array.

Please talk to chatgpt about how to improve your buffer speeds with better concatenation and avoiding memory allocation.

CMCDragonkai · 2025-02-24T05:54:36Z

Reduce function calling and jumps in your hot path.

CMCDragonkai · 2025-02-25T21:09:28Z

Some feedback on type naming.

Prefer ObjectVerb. Or GenericSpecific over SpecificGeneric.

aryanjassal · 2025-02-26T06:46:14Z

We are ok using Buffer. Because of the feross/buffer. It's fine.

In fact Buffer will be faster than Uint8Array.

I have investigated feross/buffer and while it is an interesting project, I don't think it works for our use case.

The performance tests show that on browser, Uint8Array are actually faster in most cases, reaching upto 200% faster than the buffer implementation. For node usage, the difference increased yet further, reaching upto 500% slower than other methods for some operations. Note that while using Uint8Array is pretty fast, many operations were actually faster when Node's buffer was used by upto 200% than with Uint8Array.

Due to these reasons, I think that using Uint8Array and DataView, which I already have been doing now, is sufficient for this project, and switching to web buffer won't give me much of a performance boost.

Please talk to chatgpt about how to improve your buffer speeds with better concatenation and avoiding memory allocation.

I am doing only a single allocation for generating headers. There is no concatenation for buffers going on. I wrap the raw bytestream in a data view if I want to interact with it, and I do it by specifying the buffer address and the offset/length, avoiding needless copying.

I am yet to run performance tests and analysing hot paths, so that will come when I write performance benchmarks to investigate hot paths.

CMCDragonkai · 2025-02-26T16:22:43Z

There are some operations which would require logging or communicating with the user. For example, a header can be of type directory but can have a non-zero size. While this breaks convention, it is technically allowed and is parsable by some tar software. In cases like this, I would like to log out a warning.

The idea I have is to pass in a parameter for logger. If it is unset, then all logging will be disabled. Otherwise, the logger will be used for logging out messages. Should I implement this?

Generally for libraries like this, no logging should be done.

Remember there is NO IO here.

So you should be able to communicate purely through construction and deconstruction. It is pure and easy to test.

CMCDragonkai · 2025-02-26T16:23:58Z

We are ok using Buffer. Because of the feross/buffer. It's fine.
In fact Buffer will be faster than Uint8Array.

I have investigated feross/buffer and while it is an interesting project, I don't think it works for our use case.

The performance tests show that on browser, Uint8Array are actually faster in most cases, reaching upto 200% faster than the buffer implementation. For node usage, the difference increased yet further, reaching upto 500% slower than other methods for some operations. Note that while using Uint8Array is pretty fast, many operations were actually faster when Node's buffer was used by upto 200% than with Uint8Array.

Due to these reasons, I think that using Uint8Array and DataView, which I already have been doing now, is sufficient for this project, and switching to web buffer won't give me much of a performance boost.

Please talk to chatgpt about how to improve your buffer speeds with better concatenation and avoiding memory allocation.

I am doing only a single allocation for generating headers. There is no concatenation for buffers going on. I wrap the raw bytestream in a data view if I want to interact with it, and I do it by specifying the buffer address and the offset/length, avoiding needless copying.

I am yet to run performance tests and analysing hot paths, so that will come when I write performance benchmarks to investigate hot paths.

Weird, Buffer uses Uint8Array. But if you don't need the complex buffer concatenation and slicing operations, then you don't need it.

However generally if you were in fact operation/mutating over the same buffer, and slicing out parts then it would generally more easier. It might depend on how you coded this.

aryanjassal · 2025-02-27T08:43:37Z

So far, I have added basic but functional tests. On the generator, they check the ability to generate valid file and directory headers. On the parser, I check if the parser can parse headers properly, then check if it can parse headers with data correctly.

I am aware that these tests aren't robust enough to properly test out everything in virtual tar. I am working on that. However, I am using fastcheck to generate the header blocks for parser and virtual file systems for generator.

I have also written an integration testing where I create a deeply nested virtual file system as a JSON object (of course, using fast check), then archive it using the generator and unarchive it using the parser. This took me most of today.

My next steps is to improve the generators and parsers even further. Currently, it does not verify the header validity, which it needs to do by checking for magic values and checksum. The generator needs to support extended metadata, which also needs its own support in the parser.

Currently, virtual tar is perfectly functional, but not yet complete for advanced usage. I will knock these tasks off next week. The current ETA for the completion of this PR which matches the library standards of Matrix AI is approximately two weeks.

CMCDragonkai · 2025-03-02T03:56:11Z

Ask chatgpt to generate tests using fastcheck and ask it to test bad inputs and fuzz test it with all the potential security risks. Use o1 and provide the repo as the entire context. Look online how to provide repo as context. These sorts of programs the AI is superior at generating code.

I am aware that these tests aren't robust enough to properly test out everything in virtual tar. I am working on that. However, I am using fastcheck to generate the header blocks for parser and virtual file systems for generator.

aryanjassal · 2025-03-03T05:32:04Z

I have realised that I don't actually use DataViews all that much. It is useful for extracting singular values or converting formats, like parsing a Uint8 value or Float16 value, but that is irrelevant in this case. As such, using DataView is only adding to the confusion in the code. All the operations are being done more cleanly when I use Uint8Arrays directly, as they have better utility functions like reduce which allows for simple checksum calculation or slice for easy subarraying.

aryanjassal · 2025-03-03T05:37:23Z

The parser has become a class now, as it is a stateful machine which changes parsing behaviour based on a couple variables stored as internal state. The generators can just return a single header block, so they don't need such state, but this creates a difference for the generator and the parser. For the parser, you need to instantiate a class then write to it which will instead give you tokens in return. For generators, you call a method, pass it the relevant data, and get a full block out. I would prefer if they both were classes. Making the generator into a class would unlock future possibilities for complex generations (so far, I have not seen a need for stateful generation). Should I make this change or leave it as-is?

Side note: should I also create a transform stream for the parser which takes in 512-byte blocks and does the same thing as the parser, or is that the concern of the user and I shouldn't bother with it?

@CMCDragonkai @tegefaulkes

CMCDragonkai · 2025-03-04T23:53:36Z

What took the longest time to do this?

aryanjassal · 2025-03-05T01:06:39Z

What took the longest time to do this?

Mostly the time taken is due to me writing a library for the first time, so I'm figuring some stuff as I go. However, if we're talking strictly about the code portion, then it was probably either the parser or the tests. The tests took some time as it was difficult to figure out a good way to test out the code completely virtually, and creating arbitraries for fuzz testing took a while even with chatgpt's help. It could be much faster if I was using fixtures but we would lose fuzz testing.

chore: updating to new templating

0573cd4

aryanjassal self-assigned this Feb 12, 2025

aryanjassal and others added 3 commits February 12, 2025 14:23

chore: updated LICENSE

cb62b01

chore: added flake.nix

b0db06e

chore: added placeholder tests

80947ab

feat: added generators for archiving a single file

f364d20

aryanjassal added 2 commits February 12, 2025 19:29

feat: added recursive directory walking

f10e844

chore: fixed up header type mappings

d706025

aryanjassal commented Feb 12, 2025

View reviewed changes

src/types.ts Outdated Show resolved Hide resolved

feat: added options to tar generator

166c2e9