-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panic on non-UTF-8 files #18
Comments
Some thoughts below, from a semi-random passer-by: For other text encodings: Ropey (the rope library Helix is using) is designed for streaming transcoding on both load and save. So using it in combination with something like the encoding_rs crate to do the transcoding, and some reasonable auto-detection of text encodings, would handle most valid text files you might encounter. (It won't handle all esoteric corner-cases, of course, because the text encoding landscape is... complicated. But it will at least handle most common encodings in a reasonable way.) For binary files: there's not really an obvious "right" thing for a text editor to do, so it's more about just picking an option and going with it. Some possibilities include:
|
Optimally it would show it as UTF-8, but show some placeholder for incorrect encodings like in kakoune. Lossless round-trips aren't necessary IMO since often you don't want to edit the file, and if you do, it's often because you want to remove the incorrectly encoded parts. |
(Hey @cessen 👋🏻 Thank you for building Ropey, it's great!) So far I've made no effort to support anything other than UTF-8. I think moving forward we could use encoding_rs + hsivonen/chardetng for encoding detection, but I'd probably maintain a whitelist of encodings we'd allow. I've gotten hit by esoteric edge cases before where the encoding was under-specified. |
You're very welcome! If you end up running into any problems with it, please don't hesitate to file an issue. :-)
Yeah, I think that's a very reasonable approach. Supporting a curated set of encodings that you're confident work correctly is better than trying to support everything unreliably. |
I'll take a stab at this. |
A discussion happened at cessen/ropey#40, @cessen suggested that we use |
To expand on what @pickfire said: I think using For example, another good use-case for |
Even loading sqlite.c (5.2MB - 5385123 loc) is just taking roughly one second to load the file and start even in debug build. I wonder if we will get into cases where we need a progress bar for it. Anyone have any idea which file ls larger to test? |
Maybe a huge XML file? |
Note: we're also discussing large file loading with regards to Led here: #219 |
Here are a couple of test files I use when testing my own editor: 100mb.txt.zip (download is ~450 KB, unzipped ~100 MB) Having said that, even the 1 GB file loads in ~1 second in my editor (which also uses Ropey). But I've heard that some people open 10-20GB (or even larger) log files sometimes, so that would start to become pretty significant wait times. And when transcoding on load because a file isn't utf8, those times will almost certainly go up quite a bit as well. |
Seems like we might want a wrapper over |
This is completely new to me, so I have a few stupid questions to ask if you don't mind. How would you use |
The short answer is simply that For example, you could do something like this (pseudo-code): let mut reader = FancyBufReader::new(my_file);
let mut builder = RopeBuilder::new();
while let Ok(text_chunk) = reader.read_next_str_chunk_of_max_length(4096) {
builder.append(text_chunk);
ui_update_progress_bar();
if user_input_cancel() {
return None;
}
}
return Some(builder.finish()); This does IO incrementally, a chunk at a time, and updates the UI and checks user input as it goes. But
I'm not sure exactly which example you're referring to, but if it calls Hope that helps! |
For example trying to open the
hx
binary:I don't know if there is a plan to support other text encodings but it sure would be nice to just be able to view a raw binary file as a fallback. The current implementation of
Document
usesRope
internally, from what I've seen it accepts only valid UTF-8 so not really sure how it could be handled here.The text was updated successfully, but these errors were encountered: