-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce :trim_bom option for File.stream!/2 #5702
Introduce :trim_bom option for File.stream!/2 #5702
Conversation
Thanks @AndrewDryga. I would still like a discussion if this should be opt-in or not. I believe it should because otherwise File.read! will be different from File.stream!. UNLESS |
@josevalim there are no |
@AndrewDryga and i assume we don't strip it even on |
@AndrewDryga it seems from your comment though that we always strip it? |
@josevalim Looks like we don't, this test passes:
I guess your were right that behaviors of |
@AndrewDryga I believe the next step is to find out where stream is stripping out the bom, so we can consider the available options. |
It is probably something related to IO, such as IO.read. |
@josevalim looks like it's in file.ex.
|
vs
|
Or maybe not, i'll investigate further. |
@AndrewDryga FIle.stream! does not use utf8 by default though, does it? |
@josevalim sorry, spent a while to figure this out. My test was initially wrong, because of copy-pasting text from source I've managed to put BOM into test assertion. Both function (stream!/1 and read!/1) are returning strings with BOM. |
@josevalim I guess it would be better to implement
|
I would prefer |
@josevalim There are should not be multiple BOM's (second one should be part of binary document), so we need to remove it only once. However there are many different kinds of BOM's for different types of UTF encodings. We can add line counter/boolean flag to |
I guess test needs to be restarted, since it looks like a random VM issue. |
@@ -73,7 +73,8 @@ defmodule File.Stream do | |||
start_fun = | |||
fn -> | |||
case :file.open(path, read_modes(modes)) do | |||
{:ok, device} -> device | |||
{:ok, device} -> | |||
if :strip_bom in modes, do: strip_bom(device), else: device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be done better, I don't like nesting conditions. Should I match option value as second argument in strip_boom
fn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small possible refactoring:
strip_bom? = :strip_bom in modes
case :file.open(path, read_modes(modes)) do
{:ok, device} when strip_bom? ->
strip_bom(device)
{:ok, device} ->
device
{:error, reason} ->
...
end
Maybe better, but maybe not ¯\_(ツ)_/¯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is counter-intuitive usage of guard. What do you think about something like this:
case :file.open(path, read_modes(modes)) do
{:ok, device} ->
maybe_strip_boom(device)
{:error, reason} ->
...
end
and
def maybe_strip_boom(device, modes) when :strip_bom in modes, do: ..
def maybe_strip_boom(device, _modes), do: device
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code is fine. Unfortunately we cannot write the proposed approach because that's not an valid guard. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this solution. You don't need to know the logic of whether to stripping BOM in a function whose purpose is to strip the BOM, IMO. This usage of guards is not rare in Elixir though :) But if you don't want to go with when strip_bom?
, I think the if
solution looks alright.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there are any best practice on "when we should split function" in Elixir source? Level of nesting, ABC complexity, etc..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iex> strip_bom? = true
iex> case {:ok, 1} do
...> {:ok, _} when not strip_bom? -> :a
...> {:ok, _} -> :b
...> end
:b
🤔
@@ -73,7 +73,8 @@ defmodule File.Stream do | |||
start_fun = | |||
fn -> | |||
case :file.open(path, read_modes(modes)) do | |||
{:ok, device} -> device | |||
{:ok, device} -> | |||
if :strip_bom in modes, do: strip_bom(device), else: device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small possible refactoring:
strip_bom? = :strip_bom in modes
case :file.open(path, read_modes(modes)) do
{:ok, device} when strip_bom? ->
strip_bom(device)
{:ok, device} ->
device
{:error, reason} ->
...
end
Maybe better, but maybe not ¯\_(ツ)_/¯
defp read_modes(modes) do | ||
for mode <- modes, mode not in [:write, :append], do: mode | ||
for mode <- modes, mode not in [:write, :append, :strip_bom], do: mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're touching this, would Enum.filter/2
be a better fit here? 🤔
Enum.filter(modes, &(&1 not in [:write, :append, :strip_bom]))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally prefer Enum.uniq(modes) -- [:write, :append, :strip_bom]
A version with just --
would be ideal, but there's a possibility of duplicate entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the current version. ;)
@@ -1258,6 +1258,11 @@ defmodule File do | |||
One may also consider passing the `:delayed_write` option if the stream | |||
is meant to be written to under a tight loop. | |||
|
|||
## Byte order marks | |||
|
|||
If you pass `:skip_bom` in the modes parameter, the stream will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:strip_bom
. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering, maybe we should use trim_bom
instead? This prefix we use in the String
module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with @lexmag, it feels slightly inconsistent now that we removed all the "strip" stuff in String
😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, :trim_bom
. @AndrewDryga can you please amend the PR accordingly? Sorry for the late minute change. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll change it.
@@ -1388,4 +1393,18 @@ defmodule File do | |||
do: path | |||
defp maybe_to_string(path), | |||
do: IO.chardata_to_string(path) | |||
|
|||
@doc false | |||
def bom_length(<<239, 187, 191, _rest::binary>>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are not using this function here, we can move it to the File.Stream module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to place it here, because there may be other places that may use this function, for example some IO/File.read
fun may also benefit from :trim_bom
mode. But thats my bad, "premature refactoring".. :)
@@ -73,7 +73,8 @@ defmodule File.Stream do | |||
start_fun = | |||
fn -> | |||
case :file.open(path, read_modes(modes)) do | |||
{:ok, device} -> device | |||
{:ok, device} -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are changing this line, can you please remove the indentation of the ->
? We no longer do such alignments in Elixir.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally +1 for that.
I have added some final comments and this should be good to go! |
@@ -1322,6 +1322,26 @@ defmodule FileTest do | |||
assert Enum.count(stream) == 2 | |||
end | |||
|
|||
test "stream keeps BOM" do | |||
src = fixture_path("utf8_bom.txt") | |||
bom_line = src |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always move initial argument to the next line:
bom_line =
src
|> File.stream!()
|> Enum.take(1)
|> List.first()
|> Enum.take(1) | ||
|> List.first() | ||
|
||
assert <<239, 187, 191>> <> "Русский\n" = bom_line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use ==
operator in assertion, it gives better message on failure than with =
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same works for another test.
bom_line = src | ||
|> File.stream!() | ||
|> Enum.take(1) | ||
|> List.first() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we know the shape of this, I think the right way to do this is either use hd/1
in place of List.first/1
, or even do
[bom_line] =
src
|> File.stream!()
|> Enum.take(1)
Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for [bom_line] =
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for [bom_line]
. I'd rather wrap something with list, than argue how to extract value from it :).
@whatyouhide why hd/1
is better than List functions?
@josevalim everything is fixed 👍 |
❤️ 💚 💙 💛 💜 |
Relates to #5695.