Updating to use compositional generator/parser pairs #2
base: master
Conversation
When creating a file in the archive, the header can contain nested paths. How do I know this works on File Roller and GNU tar? Because I've tested it and it works! For now, I have only written a simple implementation which takes in a file, chunks it into 512-byte blocks, then writes the headers and contents using composable async generators. It's still basic in complexity, but it works and can be tested by using the system's archival tools to check the validity of the archives. This will also be covered by some tests.
The default limit of tar file names is 255 bytes. I don't think such a limitation is imposed on the file names in the efs for Polykey, so this might be an issue. This is solved by GNU tar (also one of the most widespread implementations) by using a special header for longer file names. Basically, for file names longer than 255 characters, a new block is created just before the header block, which can store up to 8,589,934,591 bytes, or approximately 8.59 GB, of file path. This is achieved by the following structure.
In the long file name header, most of the fields are ignored except the file name, which must be ././@LongLink. After this header block, the long file name follows, padded to 512 bytes. Note that this is the full path, and using a long file name header means the file name fields in the actual file header will be null, as the file name came just before the header. Everything else is the same. Should I also implement this, or make an issue for this and leave it for later @tegefaulkes @CMCDragonkai?
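For illustration, a minimal sketch (not the repo's actual code) of how such a GNU long-name block could be generated; the helper names writeOctal and encodeLongName are hypothetical, and the field offsets follow the standard ustar layout:

const BLOCK_SIZE = 512;

// Write a NUL-terminated, zero-padded octal number into a header field.
function writeOctal(
  block: Uint8Array,
  offset: number,
  length: number,
  value: number,
): void {
  const digits = value.toString(8).padStart(length - 1, '0');
  for (let i = 0; i < digits.length; i++) {
    block[offset + i] = digits.charCodeAt(i);
  }
  block[offset + length - 1] = 0;
}

// Produce the GNU long-name pseudo-header followed by the name blocks.
function encodeLongName(path: string): Uint8Array[] {
  const data = new TextEncoder().encode(path);
  const header = new Uint8Array(BLOCK_SIZE);
  // Name field (offset 0, 100 bytes) holds the GNU marker.
  const marker = '././@LongLink';
  for (let i = 0; i < marker.length; i++) header[i] = marker.charCodeAt(i);
  writeOctal(header, 124, 12, data.length); // size = long name length
  header[156] = 'L'.charCodeAt(0); // type flag 'L' = GNU long name
  // Checksum: sum of all header bytes with the checksum field (offset
  // 148, 8 bytes) treated as spaces; stored as 6 octal digits, NUL, space.
  header.fill(0x20, 148, 156);
  let sum = 0;
  for (const byte of header) sum += byte;
  writeOctal(header, 148, 7, sum);
  header[155] = 0x20;
  // The long file name itself follows, padded to 512-byte blocks.
  const blocks: Uint8Array[] = [header];
  for (let i = 0; i < data.length; i += BLOCK_SIZE) {
    const block = new Uint8Array(BLOCK_SIZE);
    block.set(data.subarray(i, i + BLOCK_SIZE));
    blocks.push(block);
  }
  return blocks;
}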
I have also implemented support for relative paths now. This is mostly complete. With a parser and some updates to the generator code, this can be finished rather quickly. The performance is also very good. I have tested this on the local file system.
Should I write benches for this? I've never written them or looked into benchmarking, so I'm not sure how I would approach it, or if I even need to for this repo.
Just look at how it's done in js-logger and js-id or similar. Copy those into chatgpt and ask how best to bench the code you have. Btw, the new Copilot extension for VSCode supports computer use, so it can kind of inspect the full context and code up things like this.
I suspect you need a utility function to deterministically operate over a filesystem to get a sorted list of files. Again, that might not exactly be a virtualtar problem, but you should keep it in mind in the case of PK.
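A hedged sketch of that kind of utility (not part of virtualtar; the name walkSorted is made up): recurse through a directory, sorting entries lexicographically so the traversal order is reproducible regardless of the order the filesystem returns them in.

import * as fs from 'fs';
import * as path from 'path';

// Return every path under `dir` in a deterministic, sorted order.
function walkSorted(dir: string): string[] {
  const results: string[] = [];
  const entries = fs
    .readdirSync(dir, { withFileTypes: true })
    .sort((a, b) => a.name.localeCompare(b.name));
  for (const entry of entries) {
    const full = path.join(dir, entry.name);
    results.push(full);
    if (entry.isDirectory()) results.push(...walkSorted(full));
  }
  return results;
}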
After discussing the structure of this project further, we have decided to pursue a functional style instead of a generator style. Instead of being provided a directory and performing all the operations ourselves, small utilities will be written instead. For the generator, this utility will take in a path, a type, and optionally the file data. For directories, a single header will be returned. For files, a header will be returned populated with the data size and other metadata. For the parser, a similar utility will take in a 512-byte block and return the parsed result.
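A rough sketch of what that functional surface might look like; the names generateHeader and parseBlock and the Token shape are illustrative, not the final API:

type EntryType = 'file' | 'directory';

type Token =
  | { type: 'header'; path: string; entryType: EntryType; size: number }
  | { type: 'data'; data: Uint8Array };

// Generator side: takes a path, a type, and optionally the file data,
// and returns the corresponding 512-byte header block.
declare function generateHeader(
  path: string,
  type: EntryType,
  data?: Uint8Array,
): Uint8Array;

// Parser side: takes a single 512-byte block and returns a parsed token.
declare function parseBlock(block: Uint8Array): Token;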
You should be able to export multiple functions, not just 2. Plus their types.
There are some operations which would require logging or communicating with the user. For example, a header can be of type directory but have a non-zero size. While this breaks convention, it is technically allowed and is parsable by some tar software. In cases like this, I would like to log a warning. The idea I have is to pass in an optional logger parameter. If it is unset, all logging will be disabled; otherwise, the logger will be used to log messages. Should I implement this?
I have updated the structure of the generator to expose utility functions which generate the relevant blocks instead. I still haven't implemented proper CI tests, but I have tested this by archiving a local directory on my system and viewing and unarchiving it using other utilities. I ran into two major issues when doing this. Firstly, I got some weird issues with the headers containing invalid data, but I then realised that instead of converting the digits to radix-8, I was converting them based on the number of bytes the chunk would occupy. This was caused by a typo when extracting that method into its own utility. The other issue was me forgetting that a buffer is a reference to memory on the heap, so when I pushed the same buffer into an array (for later writing to a tar file) while reusing it to write multiple blocks, earlier data was overwritten and the files were corrupted. After some debugging, I finally got this working perfectly. After some cleaning up of the Generator and the test, I will start work on the parser.
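A contrived illustration of that second bug (the real generator code may differ): pushing the same Uint8Array repeatedly stores multiple references to one piece of memory, so later writes clobber earlier blocks.

const blocks: Uint8Array[] = [];

// Buggy: one shared buffer, so every "block" aliases the same memory.
const shared = new Uint8Array(512);
for (let i = 0; i < 3; i++) {
  shared.fill(i);
  blocks.push(shared); // all three entries now read back the last write
}

// Fixed: allocate a fresh buffer per block so each entry owns its memory.
blocks.length = 0;
for (let i = 0; i < 3; i++) {
  const block = new Uint8Array(512);
  block.fill(i);
  blocks.push(block);
}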
After discussing with Brian, the conclusion we have now is that the generator/parser is too low-level to deal with logging, and it would also impact performance. So, if the structure deviates too much from expectations, the parser should throw; if it is a minor issue like the size field being set for directories, that is out of scope for the parser, which just generates tokens. It is the responsibility of whatever consumes the tokens to analyse whether they contain invalid data and deal with it accordingly. As such, no logger is required for virtualtar. It needs to be as performant as possible. This probably also means writing benchmarks. I will reference other repos and ask chatgpt about writing benchmarks.
Just added simple parsing. I tested parsing by archiving a directory using the generator, then parsing it using the parser. It was able to do the conversion both ways pretty well. Next steps include cleaning up the code and bringing it up to library standards, adding proper tests, and adding benchmarks.
Parsers should just parse. It's not really stateful. Therefore logging is not required. Logging is best needed for stateful systems.
src/utils.ts
function extractOctal(
  view: DataView,
  offset?: number,
  length?: number,
): number {
  const value = extractChars(view, offset, length);
  return value.length > 0 ? parseInt(value, 8) : 0;
}
I'm pretty sure parseInt can return NaN if it wasn't an actual number. Be careful of that.
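One possible guard, sketched against the snippet above (extractChars is declared here only so the example stands alone): validate the parse and throw on NaN instead of letting it leak into a parsed header.

declare function extractChars(
  view: DataView,
  offset?: number,
  length?: number,
): string;

function extractOctal(
  view: DataView,
  offset?: number,
  length?: number,
): number {
  const value = extractChars(view, offset, length);
  if (value.length === 0) return 0;
  const parsed = parseInt(value, 8);
  // parseInt returns NaN for non-numeric input; fail loudly instead of
  // silently propagating NaN through header fields.
  if (Number.isNaN(parsed)) {
    throw new Error(`Invalid octal value: ${JSON.stringify(value)}`);
  }
  return parsed;
}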
We are OK using Buffer because of feross/buffer. It's fine.
In fact Buffer will be faster than Uint8Array. Please talk to chatgpt about how to improve your buffer speeds with better concatenation and avoiding memory allocation.
Reduce function calling and jumps in your hot path.
Some feedback on type naming: prefer ObjectVerb, or GenericSpecific over SpecificGeneric.
I have investigated Buffer. The performance tests show that on the browser, where Buffer is provided by the feross/buffer polyfill, it performs worse than native Uint8Array. Due to these reasons, I think that using Uint8Array directly is the better option.
I am doing only a single allocation for generating headers. There is no concatenation of buffers going on. I wrap the raw byte stream in a DataView if I want to interact with it, specifying the underlying buffer and the offset/length, avoiding needless copying. I am yet to run performance tests and analyse hot paths, so that will come when I write performance benchmarks.
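A small sketch of that zero-copy pattern, assuming a single 512-byte header allocation (the offsets here are the ustar size field, used purely as an example):

// One allocation for the whole header block.
const header = new Uint8Array(512);
// A DataView window over the same underlying ArrayBuffer; reads and
// writes go straight to the original block, nothing is copied.
const sizeField = new DataView(header.buffer, header.byteOffset + 124, 12);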
Generally for libraries like this, no logging should be done. Remember, there is NO IO here. So you should be able to communicate purely through construction and deconstruction. It is pure and easy to test.
Weird, Buffer uses Uint8Array. But if you don't need the complex buffer concatenation and slicing operations, then you don't need it. However, if you were in fact operating on and mutating the same buffer and slicing out parts, then it would generally be easier. It might depend on how you coded this.
So far, I have added basic but functional tests. For the generator, they check the ability to generate valid file and directory headers. For the parser, I check if it can parse headers properly, then check if it can parse headers with data correctly. I am aware that these tests aren't robust enough to properly test everything in virtualtar; I am working on that. However, I am using fast-check to generate the header blocks for the parser and virtual file systems for the generator. I have also written an integration test where I create a deeply nested virtual file system as a JSON object (of course, using fast-check), then archive it using the generator and unarchive it using the parser. This took me most of today. My next steps are to improve the generator and parser even further. Currently, the parser does not verify header validity, which it needs to do by checking the magic values and checksum. The generator needs to support extended metadata, which also needs its own support in the parser. Currently, virtualtar is perfectly functional, but not yet complete for advanced usage. I will knock these tasks off next week. The current ETA for completing this PR to the library standards of Matrix AI is approximately two weeks.
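A hedged sketch of that round-trip property test, assuming Jest and a recent fast-check, with generateHeader and parseBlock as the hypothetical utilities from earlier and a deliberately simplified arbitrary:

import fc from 'fast-check';

test('parser inverts generator for file headers', () => {
  fc.assert(
    fc.property(
      fc.record({
        path: fc.stringMatching(/^[a-z]{1,20}$/),
        data: fc.uint8Array({ maxLength: 1024 }),
      }),
      ({ path, data }) => {
        // Archive a single file, then parse the header block back out.
        const block = generateHeader(path, 'file', data);
        const token = parseBlock(block);
        expect(token).toMatchObject({
          type: 'header',
          path,
          size: data.length,
        });
      },
    ),
  );
});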
Ask chatgpt to generate tests using fast-check, and ask it to test bad inputs and fuzz test it with all the potential security risks. Use o1 and provide the repo as the entire context. Look online for how to provide a repo as context. For these sorts of programs the AI is superior at generating code.
I have realised that I don't actually use DataViews all that much. They are useful for extracting singular values or converting formats, like parsing a Uint8 value or a Float16 value, but that is irrelevant in this case. As such, using DataView is only adding to the confusion in the code. All the operations are done more cleanly when I use Uint8Arrays directly, as they have better utility functions.
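For illustration, the Uint8Array methods that make direct byte manipulation convenient here are set and subarray (the field offsets below are just examples):

const block = new Uint8Array(512);
const name = new TextEncoder().encode('hello.txt');
block.set(name, 0); // copy bytes into the block at an offset
const nameField = block.subarray(0, 100); // zero-copy view of one field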
The parser has become a class now, as it is a stateful machine which changes parsing behaviour based on a couple of variables stored as internal state. The generator can just return a single header block, so it doesn't need such state, but this creates an asymmetry between the generator and the parser. For the parser, you instantiate a class then write to it, which gives you tokens in return. For the generator, you call a method, pass it the relevant data, and get a full block out. I would prefer if they both were classes. Making the generator into a class would unlock future possibilities for complex generation (so far, I have not seen a need for stateful generation). Should I make this change or leave it as-is? Side note: should I also create a transform stream for the parser which takes in 512-byte blocks and does the same thing as the parser, or is that the concern of the user and I shouldn't bother with it?
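On the side note, a minimal sketch of what such a transform-stream wrapper could look like, assuming a parser whose write method returns a token or undefined (the actual parser API may differ):

type Token = { type: string };

function parserToStream(
  parser: { write(block: Uint8Array): Token | undefined },
): TransformStream<Uint8Array, Token> {
  return new TransformStream<Uint8Array, Token>({
    transform(block, controller) {
      // Each incoming chunk is expected to be one 512-byte block.
      const token = parser.write(block);
      if (token != null) controller.enqueue(token);
    },
  });
}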
What took the longest time to do this?
Mostly the time taken is due to me writing a library for the first time, so I'm figuring some things out as I go. However, if we're talking strictly about the code, then it was probably either the parser or the tests. The tests took some time, as it was difficult to figure out a good way to test the code completely virtually, and creating arbitraries for fuzz testing took a while even with chatgpt's help. It could have been much faster if I was using fixtures, but we would lose fuzz testing.
Description
Implement a custom solution for generating tar archives using compositional generator/parser pairs. The primary use case of this is to transfer file tree data from the efs to the local system and vice versa for Polykey.
Issues Fixed
tar archival format to transfer file tree structure and data between Polykey vaults and local file systems Polykey#811
Tasks
fs
Final checklist