vobject does not handle UTF-8 file with BOM #175

Closed
dagbdagb opened this issue Jan 4, 2015 · 18 comments · Fixed by #178

@dagbdagb

dagbdagb commented Jan 4, 2015

I have a UTF-8 vcf file that vobject appears unable to handle: vobject flat out does not recognize the file. The problem appears to be the initial byte order mark (BOM), see http://en.wikipedia.org/wiki/Byte_order_mark.
Removing the BOM enables vobject to parse the file.
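
To make the symptom concrete, here is a minimal Python sketch (not vobject code) of what a BOM does to the first bytes of a vCard. The sample card and the "starts with `BEGIN:VCARD`" check are illustrative assumptions standing in for a strict parser; they are not how vobject itself decides.

```python
import codecs

# Hypothetical sample card, used only for illustration.
vcard = "BEGIN:VCARD\r\nVERSION:4.0\r\nFN:Example Person\r\nEND:VCARD\r\n"

plain = vcard.encode("utf-8")         # no BOM
with_bom = vcard.encode("utf-8-sig")  # the "utf-8-sig" codec prepends the BOM

# The UTF-8 BOM is the three bytes EF BB BF.
print(with_bom[:3].hex())                   # efbbbf
print(with_bom[:3] == codecs.BOM_UTF8)      # True

# A strict check on the first bytes passes without the BOM and fails with it.
print(plain.startswith(b"BEGIN:VCARD"))     # True
print(with_bom.startswith(b"BEGIN:VCARD"))  # False
```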

Seeing how I found an Android utility which created this file, this is a real-world example.

I have not found anything expressly prohibiting the use of a BOM in RFC 6350, but I'll admit to being neither a skilled reader of RFCs nor a programmer. In any case, it does not appear to be a very hard thing to fix?

@staabm
Member

staabm commented Jan 4, 2015

Ok, so we could just strip the BOM?

@evert
Member

evert commented Jan 4, 2015

I don't think using the BOM these days is appropriate. Could you share the name of the application that produced this, and did you also submit a bug report there?

@afflux

afflux commented Jan 4, 2015

The UTF-8 specification RFC 3629 says

It is therefore RECOMMENDED to avoid stripping an initial U+FEFF interpreted as a signature without a good reason, to ignore it instead of stripping it when appropriate (such as for display) and to strip it only when really necessary.

It is worth noting that it also says

A protocol SHOULD forbid use of U+FEFF as a signature for those textual protocol elements that the protocol mandates to be always UTF-8, the signature function being totally useless in those cases.

While this would apply to vCards, as RFC 6350 requires them to always be UTF-8, RFC 6350 itself does not say whether the BOM is allowed or forbidden.

@dagbdagb
Author

dagbdagb commented Jan 4, 2015

@evert
From the Wikipedia page above: "The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use."

But please read http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 for yourself.

I retraced my steps to find the guilty application, and as it turns out, it was not the Android application which added the BOM. 'My Contacts Backup' by OBSS Mobile appears to produce a plain, regular UTF-8 file, with no BOM.

But by editing the resulting file in LibreOffice (4.3.1.2) and saving it as plain text, it has gained a BOM.

@staabm
For internal processing and validation, I can't see any problem with stripping the BOM first. In fact, all you need to do is to recognize the BOM and realize that this is in fact a valid UTF-8 file.
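
As a rough illustration of "recognize the BOM, then treat the rest as ordinary UTF-8", here is a minimal Python sketch. The function name `strip_utf8_bom` and the sample bytes are hypothetical, and this is not the vobject API; it only shows the idea on raw bytes before they reach a parser.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical input: a vCard saved by an editor that prepends a BOM.
raw = codecs.BOM_UTF8 + b"BEGIN:VCARD\r\nVERSION:4.0\r\nFN:Example\r\nEND:VCARD\r\n"

# After stripping, the bytes decode as plain UTF-8 and start with BEGIN:VCARD.
text = strip_utf8_bom(raw).decode("utf-8")
print(text.startswith("BEGIN:VCARD"))  # True
```

Input without a BOM passes through unchanged, so the helper can be applied unconditionally.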

@dagbdagb dagbdagb closed this as completed Jan 4, 2015
@dagbdagb dagbdagb reopened this Jan 4, 2015
@Hywan Hywan added the bug label Jan 4, 2015
@Hywan
Member

Hywan commented Jan 4, 2015

@evert: +1, BOM is a design error…

@dagbdagb
Author

dagbdagb commented Jan 4, 2015

@Hywan: You may be right. However, there is also: http://en.wikipedia.org/wiki/Robustness_principle

@afflux

afflux commented Jan 4, 2015

Since the vCard standard does not explicitly forbid it, the BOM should be ignored as per the UTF-8 RFC.

@evert
Member

evert commented Jan 5, 2015

I did a bit of searching, and this is the only 'official' word I saw around the BOM in vCard 4:

http://www.ietf.org/mail-archive/web/ietf-types/current/msg00958.html

vCard 4 does not use a BOM. vCard 4 is always UTF-8, period.

RFC 3629 also states:

A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those
cases.

To be fair though, while vCard 4 specifically mentions that only UTF-8 is acceptable, it does not explicitly state that the BOM is forbidden.

I'm a little bit on the fence with this one. I don't think BOMs are as widely supported as they were once intended to be, and I don't think it's a common expectation anymore to have to deal with them.

Usually I just take the pragmatic route and simply do my best to support the formats that appear in the wild (within reason), but in this case we just have one person who manually opened a vcard in libreoffice, which is kind of the epitome of 'edge case'.

So I'm leaning towards not providing support for this, for two reasons:

  1. It adds code that will practically only ever be executed by unit tests.
  2. I think, in the context of CardDAV, I would want my parser to throw an error. If we stored this vCard and considered it 'valid', it's almost guaranteed that other CardDAV clients that don't universally support the BOM will trip on this.

As an aside, I am not a proponent of the robustness principle. I think, from a server perspective it's good to be strict and throw hard failures when clients mess up. If we don't, a client developer may have the assumption that what they're doing is valid, and assume other servers will also consider it valid. I think being strict in what you accept ultimately creates a more interoperable world.

@dagbdagb
Author

dagbdagb commented Jan 5, 2015

I honestly did not think about the validate and store file case, only validate and "import records into database".

In the validate and store file case, I think I must agree with not adhering to the robustness principle.

I would still appreciate it if you would recognize the BOM and complain about that specifically, rather than just rejecting the file and leaving the end-user in the dark.
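
A strict-but-helpful variant of that request could look like the following Python sketch: reject the input, but name the BOM as the reason. The exception class `BomNotAllowedError`, the function `reject_bom`, and the message wording are all hypothetical; nothing here reflects what vobject actually does.

```python
import codecs


class BomNotAllowedError(ValueError):
    """Hypothetical error type for input that starts with a UTF-8 BOM."""


def reject_bom(data: bytes) -> bytes:
    """Return data unchanged, or raise a specific error if it starts with a BOM."""
    if data.startswith(codecs.BOM_UTF8):
        raise BomNotAllowedError(
            "Input starts with a UTF-8 byte order mark (EF BB BF); "
            "re-save the file as UTF-8 without BOM and try again."
        )
    return data


try:
    reject_bom(codecs.BOM_UTF8 + b"BEGIN:VCARD\r\n")
except BomNotAllowedError as exc:
    message = str(exc)
    print(message)
```

This keeps the server strict while still telling the user what went wrong.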

@dagbdagb
Author

dagbdagb commented Jan 5, 2015

Rethinking this. Again.
I am torn. The file with the BOM is, strictly speaking, valid. And LibreOffice is likely a widely used editor among the great unwashed masses. I am honestly not sure what course of action best adheres to POLA (http://en.wikipedia.org/wiki/Principle_of_least_astonishment).

But it is not for me to decide anyway. Good luck!

@dagbdagb
Author

dagbdagb commented Jan 5, 2015

Apparently, notepad.exe also adds a BOM when saving a utf-8 file.

@dagbdagb
Author

dagbdagb commented Jan 5, 2015

I believe @afflux' quote from RFC 3629 applies here:

It is therefore RECOMMENDED to avoid stripping an initial
U+FEFF interpreted as a signature without a good reason, to ignore it
instead of stripping it when appropriate (such as for display) and to
strip it only when really necessary.

As far as we know, no vcard-accepting clients require a BOM to be present, and we know that some clients cannot accept a vcard with an initial BOM. (https://code.google.com/p/android/issues/detail?id=10107)

We also know that two widely used editors add a BOM when producing UTF-8 files.

I'd say we are safely within the threshold for stripping the BOM, if vobject has the capability to edit files.

I'll try really hard to leave this issue for now. :-)

@Hywan
Member

Hywan commented Jan 5, 2015

On the other hand, it's not a big deal to skip or drop the BOM :-/.

@evert
Member

evert commented Jan 6, 2015

I think I would like to close this issue for the moment. If this issue arises again because more people are running into this, I would certainly reconsider.

In the meantime, you can indeed pretty simply skip the BOM. The vobject parsers accept streams for input, and you can just use fgets to skip the stream three bytes ahead (the UTF-8 BOM is the three bytes EF BB BF).
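
The stream-based workaround can be sketched as follows, in Python rather than PHP and without any vobject calls. The helper name `skip_utf8_bom` is hypothetical, and the sketch assumes a seekable binary stream. It reads the first three bytes (the length of the UTF-8 BOM, EF BB BF) and rewinds if they turn out not to be a BOM, so BOM-less input is left untouched.

```python
import codecs
import io
from typing import BinaryIO


def skip_utf8_bom(stream: BinaryIO) -> bool:
    """Advance a seekable binary stream past a leading UTF-8 BOM.

    Returns True if a BOM was found and skipped, False otherwise
    (in which case the stream position is restored).
    """
    start = stream.tell()
    head = stream.read(len(codecs.BOM_UTF8))
    if head == codecs.BOM_UTF8:
        return True
    stream.seek(start)
    return False


# Hypothetical input stream that begins with a BOM.
stream = io.BytesIO(codecs.BOM_UTF8 + b"BEGIN:VCARD\r\nEND:VCARD\r\n")
skipped = skip_utf8_bom(stream)
first_line = stream.readline()
print(skipped, first_line)
```

After the call, the stream can be handed to whatever parser consumes it; the parser never sees the BOM.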

@evert evert closed this as completed Jan 6, 2015
@dagbdagb
Author

dagbdagb commented Jan 6, 2015

Googling 'utf-8 bom vcard' or 'utf-8 bom vcf' certainly gives the impression that this is an issue. Worse, Joe Sixpack will most likely have no clue whatsoever why it fails, and will leave no clues on the Internet that this was an issue. He'll either just give up or, more likely, fiddle with various tools until it 'magically' just works. But it is your call.

I'll try to convince the owncloud guys to look for and strip the BOM.

@evert
Member

evert commented Jan 6, 2015

I could be persuaded otherwise if there was a fully functional and unit-tested patch in place =)

@jbtbnl

jbtbnl commented Jan 6, 2015

@evert so basically, this issue could be reopened again?
It would at least persuade me to close owncloud/contacts#635 :)

@Hywan Hywan reopened this Jan 7, 2015
@Hywan
Member

Hywan commented Jan 7, 2015

@evert Done ;-).

@Hywan Hywan closed this as completed Jan 8, 2015