Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce :trim_bom option for File.stream!/2 #5702

Merged
merged 7 commits into from
Jan 29, 2017

Conversation

AndrewDryga
Copy link
Contributor

Relates to #5695.

@josevalim
Copy link
Member

Thanks @AndrewDryga. I would still like a discussion if this should be opt-in or not. I believe it should because otherwise File.read! will be different from File.stream!. UNLESS File.read!("path/to/file", [:utf8]) already strips bom... but i don't think that's the case?

@AndrewDryga
Copy link
Contributor Author

@josevalim there are no File.read/2. Docs.

@josevalim
Copy link
Member

@AndrewDryga and i assume we don't strip it even on File.read!/1, right?

@josevalim
Copy link
Member

@AndrewDryga it seems from your comment though that we always strip it?

@AndrewDryga
Copy link
Contributor Author

@josevalim Looks like we don't, this test passes:

    test "read with UTF-8 BOM" do
      assert {:ok, <<239, 187, 191>> <> "Русский\n日\n"} = File.read(Path.expand('fixtures/utf8_bom.txt', __DIR__))
    end

I guess your were right that behaviors of File.stream/1 and File.read/1 don't match, since first one will skip BOM's, and second one don't. I would appreciate any thoughts how core-team want to see it fixed.

@josevalim
Copy link
Member

@AndrewDryga I believe the next step is to find out where stream is stripping out the bom, so we can consider the available options.

@josevalim
Copy link
Member

It is probably something related to IO, such as IO.read.

@AndrewDryga
Copy link
Contributor Author

AndrewDryga commented Jan 25, 2017

@josevalim looks like it's in file.ex. File.stream/1 uses functions to prepare modes, and they adding :utf8 to modes (and File.read doesn't call this funcs):

  defp normalize_modes([:utf8 | rest], binary?) do
    [encoding: :utf8] ++ normalize_modes(rest, binary?)
  end
  defp normalize_modes([:read_ahead | rest], binary?) do
    [read_ahead: @read_ahead_size] ++ normalize_modes(rest, binary?)
  end
  # TODO: Remove :char_list mode by 2.0
  defp normalize_modes([mode | rest], _binary?) when mode in [:charlist, :char_list] do
    if mode == :char_list do
      IO.warn "the :char_list mode is deprecated, use :charlist"
    end
    normalize_modes(rest, false)
  end
  defp normalize_modes([mode | rest], binary?) do
    [mode | normalize_modes(rest, binary?)]
  end
  defp normalize_modes([], true), do: [:binary]
  defp normalize_modes([], false), do: []

@AndrewDryga
Copy link
Contributor Author

  def stream!(path, modes \\ [], line_or_bytes \\ :line) do
    modes = normalize_modes(modes, true)
    File.Stream.__build__(IO.chardata_to_string(path), modes, line_or_bytes)
  end

vs

  def read(path) do
    F.read_file(IO.chardata_to_string(path))
  end

@AndrewDryga
Copy link
Contributor Author

Or maybe not, i'll investigate further.

@josevalim
Copy link
Member

@AndrewDryga FIle.stream! does not use utf8 by default though, does it?

@AndrewDryga
Copy link
Contributor Author

@josevalim sorry, spent a while to figure this out.

My test was initially wrong, because of copy-pasting text from source I've managed to put BOM into test assertion. Both function (stream!/1 and read!/1) are returning strings with BOM.

@AndrewDryga
Copy link
Contributor Author

AndrewDryga commented Jan 27, 2017

@josevalim I guess it would be better to implement :strip_bom modifier, so we won't break backwards compatibility. Any suggestions on it's implementation? At the moment I can see two ways:

  1. Pattern match BOM's in reduction function of Enumerable implementation.
  2. Use {encoding, utf8} while opening file and then read lines with something familiar to io:get_line(Pid, x). Possible to put it instead of IO.each_binstream/2 usage in :lines mode.

@josevalim
Copy link
Member

josevalim commented Jan 27, 2017

I would prefer 1. but ideally we want to do it only on the first line. How is UTF-8 encoding supposed to behave in face of multiple BOMs?

@AndrewDryga
Copy link
Contributor Author

AndrewDryga commented Jan 27, 2017

@josevalim There are should not be multiple BOM's (second one should be part of binary document), so we need to remove it only once. However there are many different kinds of BOM's for different types of UTF encodings.

We can add line counter/boolean flag to acc and remove them only once.

@AndrewDryga AndrewDryga changed the title [WIP] Do not deliver BOM when using File.stream/1 Introduce :strip_bom option for File.stream!/2 Jan 28, 2017
@AndrewDryga
Copy link
Contributor Author

I guess test needs to be restarted, since it looks like a random VM issue.

@@ -73,7 +73,8 @@ defmodule File.Stream do
start_fun =
fn ->
case :file.open(path, read_modes(modes)) do
{:ok, device} -> device
{:ok, device} ->
if :strip_bom in modes, do: strip_bom(device), else: device
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be done better, I don't like nesting conditions. Should I match option value as second argument in strip_boom fn?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small possible refactoring:

strip_bom? = :strip_bom in modes
case :file.open(path, read_modes(modes)) do
  {:ok, device} when strip_bom? ->
    strip_bom(device)
  {:ok, device} ->
    device
  {:error, reason} ->
    ...
end

Maybe better, but maybe not ¯\_(ツ)_/¯

Copy link
Contributor Author

@AndrewDryga AndrewDryga Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is counter-intuitive usage of guard. What do you think about something like this:

case :file.open(path, read_modes(modes)) do
  {:ok, device} ->
    maybe_strip_boom(device)
  {:error, reason} ->
    ...
end

and

def maybe_strip_boom(device, modes) when :strip_bom in modes, do: ..
def maybe_strip_boom(device, _modes), do: device

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code is fine. Unfortunately we cannot write the proposed approach because that's not an valid guard. :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this solution. You don't need to know the logic of whether to stripping BOM in a function whose purpose is to strip the BOM, IMO. This usage of guards is not rare in Elixir though :) But if you don't want to go with when strip_bom?, I think the if solution looks alright.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there are any best practice on "when we should split function" in Elixir source? Level of nesting, ABC complexity, etc..?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim

iex> strip_bom? = true
iex> case {:ok, 1} do
...>   {:ok, _} when not strip_bom? -> :a
...>   {:ok, _} -> :b
...> end
:b

🤔

@@ -73,7 +73,8 @@ defmodule File.Stream do
start_fun =
fn ->
case :file.open(path, read_modes(modes)) do
{:ok, device} -> device
{:ok, device} ->
if :strip_bom in modes, do: strip_bom(device), else: device
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small possible refactoring:

strip_bom? = :strip_bom in modes
case :file.open(path, read_modes(modes)) do
  {:ok, device} when strip_bom? ->
    strip_bom(device)
  {:ok, device} ->
    device
  {:error, reason} ->
    ...
end

Maybe better, but maybe not ¯\_(ツ)_/¯

defp read_modes(modes) do
for mode <- modes, mode not in [:write, :append], do: mode
for mode <- modes, mode not in [:write, :append, :strip_bom], do: mode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're touching this, would Enum.filter/2 be a better fit here? 🤔

Enum.filter(modes, &(&1 not in [:write, :append, :strip_bom]))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally prefer Enum.uniq(modes) -- [:write, :append, :strip_bom]

A version with just -- would be ideal, but there's a possibility of duplicate entries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the current version. ;)

@@ -1258,6 +1258,11 @@ defmodule File do
One may also consider passing the `:delayed_write` option if the stream
is meant to be written to under a tight loop.

## Byte order marks

If you pass `:skip_bom` in the modes parameter, the stream will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:strip_bom. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering, maybe we should use trim_bom instead? This prefix we use in the String module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @lexmag, it feels slightly inconsistent now that we removed all the "strip" stuff in String 😕

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, :trim_bom. @AndrewDryga can you please amend the PR accordingly? Sorry for the late minute change. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll change it.

@@ -1388,4 +1393,18 @@ defmodule File do
do: path
defp maybe_to_string(path),
do: IO.chardata_to_string(path)

@doc false
def bom_length(<<239, 187, 191, _rest::binary>>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not using this function here, we can move it to the File.Stream module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to place it here, because there may be other places that may use this function, for example some IO/File.read fun may also benefit from :trim_bom mode. But thats my bad, "premature refactoring".. :)

@@ -73,7 +73,8 @@ defmodule File.Stream do
start_fun =
fn ->
case :file.open(path, read_modes(modes)) do
{:ok, device} -> device
{:ok, device} ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are changing this line, can you please remove the indentation of the ->? We no longer do such alignments in Elixir.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally +1 for that.

@josevalim
Copy link
Member

I have added some final comments and this should be good to go!

@@ -1322,6 +1322,26 @@ defmodule FileTest do
assert Enum.count(stream) == 2
end

test "stream keeps BOM" do
src = fixture_path("utf8_bom.txt")
bom_line = src
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always move initial argument to the next line:

 bom_line =
  src
  |> File.stream!()
  |> Enum.take(1)
  |> List.first()

|> Enum.take(1)
|> List.first()

assert <<239, 187, 191>> <> "Русский\n" = bom_line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use == operator in assertion, it gives better message on failure than with =.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same works for another test.

bom_line = src
|> File.stream!()
|> Enum.take(1)
|> List.first()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we know the shape of this, I think the right way to do this is either use hd/1 in place of List.first/1, or even do

[bom_line] =
  src
  |> File.stream!()
  |> Enum.take(1)

Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for [bom_line] =.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for [bom_line]. I'd rather wrap something with list, than argue how to extract value from it :).

@whatyouhide why hd/1 is better than List functions?

@AndrewDryga
Copy link
Contributor Author

@josevalim everything is fixed 👍

@josevalim josevalim changed the title Introduce :strip_bom option for File.stream!/2 Introduce :trim_bom option for File.stream!/2 Jan 29, 2017
@josevalim josevalim merged commit 8a8e367 into elixir-lang:master Jan 29, 2017
@josevalim
Copy link
Member

❤️ 💚 💙 💛 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants