
When using different dumpers with yaml.dump, disk flushing behavior is inconsistent. #831

mark007 opened this issue Sep 10, 2024 · 1 comment

mark007 commented Sep 10, 2024

I've noticed that when a snippet of code writes an especially small piece of data to disk using dump or dump_all, and the file is read immediately afterwards, the contents are sometimes not yet there unless I call file.flush() first.

I can reproduce this issue only when specifying a Dumper, for example yaml.CSafeDumper. If I don't specify a Dumper, the contents of the file are always present immediately after the dump call completes.

For example, this snippet results in the file contents showing as empty when read back:

import yaml

output_file = "output.yaml"
with open(output_file, "w") as file:
    data = [{"x": 1}]  # dump_all expects an iterable of documents
    yaml.dump_all(data, stream=file, Dumper=yaml.CSafeDumper)
    with open(file.name, "r") as temp_file_read:
        contents = temp_file_read.read()
        print(f"File Contents:\n{contents}")

This snippet of code, where the Dumper is not given, results in the contents of the file being present as expected:

import yaml

output_file = "output.yaml"
with open(output_file, "w") as file:
    data = [{"x": 1}]
    yaml.dump_all(data, stream=file)
    with open(file.name, "r") as temp_file_read:
        contents = temp_file_read.read()
        print(f"File Contents:\n{contents}")

In this snippet, where I use both the Dumper and a file.flush(), the data is also present as expected:

import yaml

output_file = "output.yaml"
with open(output_file, "w") as file:
    data = [{"x": 1}]
    yaml.dump_all(data, stream=file, Dumper=yaml.CSafeDumper)
    file.flush()
    with open(file.name, "r") as temp_file_read:
        contents = temp_file_read.read()
        print(f"File Contents:\n{contents}")

If the file read happens after the with block / context manager, it's fine; the file gets flushed when the context manager exits. However, there are many cases, especially when dealing with tempfiles, where we want to write data to disk and then read or use it within the same context manager, so consistent flushing behavior across pyyaml's dumpers would be important. A sketch of the tempfile pattern I mean follows below.
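For illustration, a minimal sketch of that tempfile pattern (delete=False is used here so the file can be reopened by name even on Windows; clean-up is omitted for brevity):

import tempfile
import yaml

data = [{"x": 1}]
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    yaml.dump_all(data, stream=tmp, Dumper=yaml.CSafeDumper)
    tmp.flush()  # without this, the reader below may see an empty file
    # Read the data back within the same context manager.
    with open(tmp.name, "r") as reader:
        print(reader.read())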

I can't find this behavior documented anywhere. Is it expected, or is it a bug that can be resolved?

@nitzmahone (Member)

Yeah, I'm noting a distinct lack of flush() in the CEmitter output handler or anywhere else in CEmitter, where the pure-Python Emitter flushes after writing the stream end.
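For reference, the pure-Python Emitter's stream-end path looks roughly like this (paraphrased, not a verbatim quote from yaml/emitter.py):

# Pure-Python Emitter, approximately:
def flush_stream(self):
    # Only flush if the target stream supports it.
    if hasattr(self.stream, 'flush'):
        self.stream.flush()

def write_stream_end(self):
    self.flush_stream()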

Digging around in libyaml's flush code, it looks like it's managing its own layer of internal buffering in its "flush" impl, but in neither place do I see any explicit flushing of the underlying stream handle.

Assuming that's accurate, it might be reasonable to add an explicit flush to at least StreamEndEvent for the buffered + streamed cases in the Cython wrapper, but that code is currently blissfully unaware of the buffering (since libyaml's driving the stream interactions through its write_handler callback). Adding an unconditional flush in the write handler would be easy, but waaaay overkill.
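To make that trade-off concrete, a purely illustrative Python sketch (the names here are hypothetical, not the actual Cython wrapper's API):

class OutputHandler:
    # Hypothetical stand-in for the wrapper's libyaml write-handler glue.
    def __init__(self, stream):
        self.stream = stream

    def write(self, data):
        # libyaml calls the write handler with chunks from its own internal buffer.
        self.stream.write(data)
        # Unconditional flush here would be easy, but pays a flush per chunk:
        # self.stream.flush()

    def on_stream_end(self):
        # Flushing once at stream end would mirror the pure-Python Emitter.
        if hasattr(self.stream, "flush"):
            self.stream.flush()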

I'll mark this as something to look into for the next release, but in the meantime, if you really need unbuffered IO behavior, I'd suggest requesting an unbuffered binary stream when you call open().
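For example, something like this should make the contents visible immediately (buffering=0 is only valid for binary-mode streams, so the file is opened in "wb" and encoding= is passed so PyYAML encodes the output itself):

import yaml

data = [{"x": 1}]
# buffering=0 requests an unbuffered stream; binary mode is required for it.
with open("output.yaml", "wb", buffering=0) as f:
    yaml.dump_all(data, stream=f, Dumper=yaml.CSafeDumper, encoding="utf-8")
    with open("output.yaml", "r") as reader:
        print(reader.read())  # contents are on disk without an explicit flush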
