Validating internationalized mail addresses in <input type="email"> #4562
Comments
Hey, coming here from this chrome bug. |
@josepharhar I agree that some servers can break (old ones, but e.g. in Poland the most popular e-mail providers are ... not working as they should) but please remember that we are still talking about client-side e-mail field validation. RFC 6532 was not supported for a long time in many software apps (e.g. Thunderbird does really strange things when it receives non-encoded UTF-8 mail compliant with RFC 6532 - it's still open in Bugzilla) but up-to-date mail servers allow creating such accounts and sending such mails (Postfix has had support for it since ~2015). It's a complex problem, as e.g. delivery of UTF-8 mail to an old mailbox can lead to some problems, but what else can we do other than progressively upgrade the technologies in use to support it? :) Anyway, I don't think that it's the browser's responsibility to "protect the backend from problematic e-mail addresses", so if the RFC allows it and up-to-date software supports it, we should allow it. |
It's more complex than that, and it's not about ß which is an odd special case. EAI (internationalized) mail can handle addresses like пример@Бориса.РФ. While the domain part can turn into ASCII A-labels xn--80abvxkh.xn--p1ai (sometimes called punycode), the mailbox cannot, and only an EAI mail system can handle that address. The treatment of ß has nothing to do with this. The obsolete IDN2003 and current IDN2008 internationalized domain names are almost the same but one of the few differences is that 2003 normalizes (not punycodes) ß to ss while 2008 makes it a valid character. An address with an ASCII mailbox like |
A few tiny additions and clarifications to John Levine's note (we do not disagree about the situation in any important way; the issues are just a bit more complex, with potentially broader implications, than one might infer from his message and they may call part of his suggestion into question). In particular, "eaimail" or something like it may be the wrong solution to the problem and may dig us in even deeper. For those who lack the time or inclination to read a fairly long analysis and explanation, skip to the last paragraph. First, while his explanation of the difficulty with ß is correct, it is perhaps useful to also note that the ß -> ss transformation is often brought about by the improper or premature application of NFKC, which may have been the source of the recent dust-up about phishing attacks using Mathematical special characters. In the latter case, IDNA2008 imposes a requirement on "lookup applications" (including browsers) to check for and reject such things but they obviously cannot do so if the characters the IDNA interface sees are already transformed to something valid. The current version of Charmod-norm discusses, and recommends against, general application of compatibility mappings. It is perhaps also worth noting that UTS #46 is still recommending the use of NFKC (as part of NFKC_Casefold and its associated tables (see Section 5 of that document)) but also calls out the problem of reaching some IDNA2008-conformant domain names if the IDNA2003 rules are followed. Because, from observation, some (perhaps many or most) browsers look to UTS #46 for authority in interpreting domain names in, e.g., URLs while most or all SMTPUTF8 implementations (incorrectly, but commonly, known as "EAI") are strictly conformant to IDNA2008, the differences between the two introduce additional complications. John mentions that a browser cannot tell what MTA and configuration a remote server might have, but it is even worse than that.
In general, the browser is unlikely to know very much about the precise capabilities of the local MTA or Message Submission Server (MSA) unless those functions are actually built into the browser. The web page designer is even less likely to know and is in big trouble if different browsers behave differently. If the browser does not know, or cannot be configured to know, the distinction between an input type="email" and one of "eaimail" (which I hope would be called something else, perhaps "i18nemail") would not be as useful as his message implies. Thinking about these issues in terms of what mail systems do with the addresses may miss an important issue. In many cases, web pages are trying to accept and validate something that looks like an email address but is not headed immediately into a mail system. Instead, it is destined for insertion into a database or comparison with something already there, validation by some other process entirely, or is actually an email address (or something that looks like one) used as a personal identifier such as a user ID. For the latter case, conversion of the part of the string following the "@" via the Punycode algorithm may not produce a useful result whether IDNA2008, IDNA2003, or UTS #46 rules are used. I would think it would be dumb, but if someone wanted to allow 3!!!@#$%^&.ØØØ as a user ID and some system wants to allow that, we should probably stay out of their way (perhaps by insisting they use a type that does not imply an email address). However, the other side of that example is probably relevant to the discussion. The operator or administrator of a mail server, or the administrator of a system that uses email addresses as IDs, gets to pick the addresses they will allow.
Especially in the ID case, if they use a set of rules narrower than what RFC 5321 allows (and that are allowed in addresses on many mail systems), then they open themselves up to many frustrations and complaints from users whose email addresses are valid according to the standards and work perfectly well on most of the Internet but that are rejected by their systems. Internationalized addresses open up a different problem. As an example, I don't know how many mail servers identified by domains subsidiary to the 公益 TLD have allowed registration of local parts in Tamil or Syriac scripts, but I suspect that "zero" wouldn't be a bad guess. Someone designing a web site for users in China might know that and, for the best quality user experience, might want to reject or produce messages about non-Chinese local parts for that domain or perhaps even for any Chinese-script and China-based TLD. Similar rules might be applied in other places to tie the syntax of the local part to the script of the TLD but, for example in countries where multiple scripts are in use and "official", such rules might be a disaster. And, because almost anyone can set up an email server and there are clearly people on the Internet who prioritize being clever or cute or exhibiting a maximum of their freedom of expression over what others might consider sensible or rational, most of us who have been around email for many years have seen some truly bizarre (but valid) local parts of all-ASCII addresses and see no reason to believe we won't see even worse excesses as the Internet becomes increasingly internationalized. This leads me to a conclusion that is a bit different from when this was discussed at length over a year ago.
As we have seen when web sites reject legitimate ASCII local parts because people somehow got it into their heads that most non-alphanumeric characters were forbidden or were stand-ins for something else and, more broadly, because it is generally impossible to know what a remote MTA with email accounts on it will allow in those accounts, trying to validate email addresses by syntax alone is hard and may not be productive. When one starts considering email addresses (or things that look like them) that contain non-ASCII characters, things get much more difficult. IDNA2008, IDNA2003, and UTS#46 (in either profile) each have slightly different ideas about what they consider valid. Whatever any of them allow is going to be a superset of what any sensible domain or mail administrator will allow in practice. In general, a browser does not know what conventions back-end systems or a mail system at the far end of the Internet are following, much less whether they will be doing the same thing next month. So my suggestion would be that input type="email" be interpreted and tested only as "sort of looks like an all-ASCII email address", that a new input type="i18nmail" be introduced as "looks like 'email' but with some non-ASCII characters strewn around", and that the notion of validating beyond those really general rules be left to the back-end systems, the remote "delivery" MTAs, and so on. In addition, to the extent to which one cares about the quality of the user experience, it may be time to start redesigning the APIs associated with various libraries and interfaces so that they can report back real information about why putative email addresses didn't work for them, more precise than "failed" or "invalid address". Good luck to us all, |
FYI, new installs of Postfix get EAI enabled by default. My take is that a new input type is not required. An attribute by which to reject EAI is fair (e.g., because the site's MTAs don't support EAI on outbound). |
s/reject/accept/ and I agree |
Validation on the front-end creates more ways to lose rather than more ways to win, and doesn't really protect the backend from vulnerabilities. So I'm just not very keen on the browser doing much validation here. If the site operator has / does not have a limitation as to outbound email, I'm fine with stating it, but I'm also fine with allowing whatever, and making it the backend's job (or any scripts' on the page) to do any validation. My take is that the default should be permissive. This should be how it is in general. Consider what happens otherwise. You might have a page and site that can handle EAI just fine but a developer forgot to update their email inputs on their pages to say so: now you have a latent bug to be found by the first user who tries to enter an internationalized address. This might mean losing user engagement, and you might never find out because why would the users tell you? But, really, why do we need the input to do so much validation? The input has to be plausibly an email address -- a subset of RFC5322, |
The user should be able to enter an email address verbatim, with no second-guessing by input forms. If that address is known a priori to be unworkable by the server's backend system, it can be rejected with an appropriate error message on the initial POST. Otherwise, if the address vaguely resembles mailbox syntax, it should be accepted and used verbatim. It may not be deliverable, but that's also true of many addresses that are syntactically boring
https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value. |
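The sanitization step quoted above can be sketched in a few lines of Python. This is only an illustration of the two quoted steps for a single-address field, not the spec's own code; the function name is hypothetical, and the set of ASCII whitespace characters (tab, LF, FF, CR, space) is taken from the WHATWG Infra definition as I understand it.

```python
# Assumption: "ASCII whitespace" means tab, LF, FF, CR and space.
ASCII_WHITESPACE = "\t\n\x0c\r "


def sanitize_email_value(value: str) -> str:
    """Strip newlines, then strip leading and trailing ASCII whitespace."""
    # Step 1: remove every CR and LF, wherever it appears in the value.
    without_newlines = value.replace("\r", "").replace("\n", "")
    # Step 2: trim ASCII whitespace only; other characters are left alone.
    return without_newlines.strip(ASCII_WHITESPACE)
```

Note that nothing here normalizes or transcodes the address: a value such as `пример@Бориса.РФ` passes through unchanged apart from the trimming.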
Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses. |
The PCRE pattern behind the link is rather busted. It fails to properly validate dot-atoms, allowing multiple consecutive periods in unquoted local-parts (invalid addresses), while disallowing quoted local-parts (valid addresses). EAI-aside, this sort of fuzzy approximation of the actual requirements is harmful. |
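To make the complaint above concrete, here is a small Python sketch that runs a few addresses through the pattern. The regular expression is transcribed by hand from the HTML Standard's "valid e-mail address" definition as I remember it, so treat it as an assumption and check it against the spec before relying on it; the helper name is hypothetical.

```python
import re

# Assumption: hand-transcribed from the HTML Standard; verify before use.
WHATWG_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def spec_pattern_accepts(address: str) -> bool:
    """Return True if the transcribed pattern matches the whole address."""
    return WHATWG_EMAIL.fullmatch(address) is not None


# Not a valid dot-atom (consecutive periods), yet accepted by the pattern.
double_dot = spec_pattern_accepts("a..b@example.com")
# A quoted local-part, valid in RFC 5321, yet rejected by the pattern.
quoted = spec_pattern_accepts('"john doe"@example.com')
# Any non-ASCII local-part is rejected as well.
non_ascii = spec_pattern_accepts("пример@example.com")
```

With the pattern as transcribed, `double_dot` is true while `quoted` and `non_ascii` are false, which is the mismatch described in the comment above.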
Hi. Maybe it would be helpful to back up a little bit and look at this from the perspective of a fairly common use case. Suppose I have a web site that sets up or uses user accounts and that I've decided to use email addresses as user IDs (there are lots of reasons why that isn't a good idea, but the horse has left the barn and vanished over the horizon). Now, while it would probably not be a good practice, there is no inherent requirement that my system ever send email to that address -- it can be, as far as I'm concerned, just a funny-looking user ID. On the other hand, if I tell a user who has been successfully using a particular email address for a long time that their address is invalid, I am going to have one very annoyed user on my hands. If I am operating in an environment in which "user" is spelled "customer", and I don't have a better reason for rejecting that address than "W3C and WHATWG said it was ok to reject it" I may also be able to have various sales types, managers, and executives in my face. The fact that the email address is being used as a user ID probably answers another question. Suppose the user registers with an email address using native Unicode characters in both the local part and the domain part. Now suppose they come back a few weeks later and try to sign in using the same local part but a domain part that contains A-labels. Should the two be considered to match? Remembering that this is a user ID that has the syntax of an email address, not something that is going to be used exclusively in an email context, I'd say that is a business decision and not something HTML (or browsers, or similar tools) should get into the middle of. There is one exception. One of the key differences between IDNA2003 and IDNA2008 is that, in the latter, U-labels and A-labels are guaranteed to be duals of each other.
If the browser or the back-end database system are stuck in IDNA2003 or most interpretations of UTR#46, then the fact that multiple source labels can map to a single punycode-encoded form opens the door to a variety of attacks and anyone deciding that the two are interchangeable in that environment had best be quite careful about what user names they allow and how they are treated. It may also be a reasonable business decision in some cases for a site to say "we don't accept non-ASCII email addresses as user IDs/account identifiers" or even "we accept addresses that use these characters, or characters from a particular set of scripts, and not others". But nothing in the HTML rules about the valid syntax for email addresses should be in the middle of that decision. Beyond that, as others have suggested, one just can't know whether an email address is valid without somehow asking the server that hosts the relevant mailbox (or its front end). It may not be possible to ask that question in real time and, even if it is, doing so is likely to require significantly more time (user-visible delay) than browser implementers have typically wanted to invest. So let's stick to syntax. That scenario by itself argues strongly for what I think John, Nico, and others are suggesting: the only validation HTML should be performing on something that is claimed to be an email address is conformity to the syntax restrictions in RFC 6531. Could one be even more liberal than that? Yes, but why bother. |
I was actioned by the W3C I18N WG with replying to this thread with a sense of the group. Generally, we concur with @klensin's comment just above ⬆️. We think that type=email should accept non-ASCII addresses, the better to permit adoption of EAI and IDNA. One reason for the low adoption of these is the barriers to using them across the Web/Internet. Removing these types of artificial barriers will not only encourage adoption, but will support those users who are already using these. Users of this feature in HTML expect that the input value follow the structural requirements of an email address but don't expect the value to be validated to be an actual valid address. At best this amounts to ensuring that there is an @ sign and maybe some other structure that can be found with a regex. Users who want to impose an ASCII restriction or do additional validation are free to do so and mostly have to do this anyway. In our opinion, HTML would thus be best off to provide minimal validation. User agents can use type=input as a hint for additional features (such as prompting the user with their own email address or providing access to the user's address book), but this is outside the realm of HTML itself. |
I played with this a bit and it seems the current state is rather subpar, though that also leaves more room for changes. Example input: One thing that would help here is a precise definition of the validation browsers would be expected to perform if we changed the current definition as well as tests for that. I can't really commit for Mozilla though if we can make this a bit more concrete I'd be happy to advocate for change. |
@aphillips @annevk just about the only thing worth validating here is the RHS of the What is the most minimal mailbox validation? Certainly: that it's not empty. Validating that the mailbox is not some garbage like just ASCII periods, and so on, might help, but getting that right is probably difficult. So that's my advice: validate that the given address is of any RFC 5322 form that is ultimately of the form |
@annevk, I think your examples actually point out the problem. In order: it would be rare, but not impossible (details on request but I want to keep this relatively short) to see on on the RHS of the "@", and % is prohibited by the syntax in RFC 5321, but I'd generally recommend the use of percent-encoding in any part of email addresses. Pushing a domain-part through Punycode is prohibited by IDNA unless the labels it contains are validated to be U-labels. I can't tell from your example but if, e.g., the domain-part of the mailbox was \u1D7AA\u1D7C2 then it should be rejected, not encoded with punycode: doing otherwise invites errors down the line, errors for which the user gets obscure and/or misleading messages. The problem is that email addresses with non-ASCII characters in the local-part and/or domain part are now valid and increasing numbers of people who can use them for email are expecting to use them through web interfaces. (1) If a mailbox consists of a string of between 1 and 64 octets, an "@", and at least 2 and up to 255 more octets, treat it as acceptable and move on, understanding that all sorts of things may apply additional restrictions in actual email handling. (2) In addition, if you wanted to and the domain-part contained non-ASCII characters, you could verify that any labels were valid IDNA2008 U-labels and reject the name if they were not ("invalid domain name in email address:" would be a much better message than "invalid email address") AND, optionally iff the local-part was entirely ASCII, convert those U-labels to A-labels. The SMTPUTF8 ("EAI") specs strongly recommend against making that conversion if the local-part is all-ASCII. When the local part is all-ASCII, the conversion will allow some valid cases to go through but, over time, it seems likely that those cases will become, percentagewise, less frequent so whether it is worth the effort is somewhat questionable.
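Rule (1) above is simple enough to sketch. The following Python is a minimal illustration under stated assumptions, not a proposal for spec text: octets are counted on the UTF-8 encoding, the address is split at the last "@" (an assumption, chosen because a quoted local-part may itself contain "@"), and the function name and reason strings are hypothetical. Returning a reason rather than a bare boolean follows the earlier suggestion in this thread that interfaces should report something more precise than "invalid address".

```python
from typing import Optional


def check_mailbox_shape(address: str) -> Optional[str]:
    """Return None if the address has a plausible shape, else a reason."""
    # Assumption: split at the LAST "@"; everything before it is local-part.
    local, at, domain = address.rpartition("@")
    if not at:
        return "missing '@'"
    try:
        local_octets = len(local.encode("utf-8"))
        domain_octets = len(domain.encode("utf-8"))
    except UnicodeEncodeError:
        # Raised for unpaired surrogates, which cannot be encoded as UTF-8.
        return "address is not well-formed Unicode"
    if not 1 <= local_octets <= 64:
        return f"local-part is {local_octets} octets; expected 1 to 64"
    if not 2 <= domain_octets <= 255:
        return f"domain-part is {domain_octets} octets; expected 2 to 255"
    return None
```

Step (2), validating IDNA2008 U-labels, is deliberately left out: it needs Unicode property data that this sketch does not carry.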
FWIW, the above was written in parallel with @nicowilliams's comment rather than after studying it, but his recommendation and mine are not significantly different except for that one marginal case of an ASCII local-part and a non-ASCII (but IDNA2008-valid) domain part. |
I should have added, as @vdukhovni more or less points out, that if one is going to try to validate the syntax of the local-part (even all-ASCII local-parts) it is important to actually get it right. As he shows, getting it right is a moderately complicated process, perhaps best left to email systems that are doing those checks anyway (which is what @nicowilliams and I essentially suggest above). But, if one is going to try to do it, it should be done right because halfway attempts (fuzzy approximations) are harmful, including letting some local-parts with invalid syntax through and prohibiting some valid ones. |
@klensin I'm not sure what you're trying to convince me of. I was offering to help. (Percent-encoding is just part of the MIME type form submission uses by default, it's immaterial. Chrome's Punycode handling is what is encouraged by HTML today. That browsers do incompatible things suggests it might be possible to change the current handling.) |
@annevk I drew an action item (during part of I18N's meeting when @klensin was not available) to propose changes and I'd appreciate your thoughts on how to approach this. Looking at the current text, I guess a question is whether we should attempt to preserve the current behavior for ASCII email addresses (or their LHS/RHS parts) while simultaneously allowing labels that use non-ASCII Unicode? I18N WG participants seem to agree that we don't want to get into deep validation of the address's validity and should limit ourselves to "structurally valid" addresses. |
Right, e.g., at a minimum we should probably require that the string contains a |
It certainly has to be valid Unicode (e.g., no unpaired UTF-16 surrogates, no invalid UTF-8 bytes), and follow the rules like no unpaired quotes. Restricting it more than that is not likely to help. |
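The two checks named above can be sketched as follows. This is an illustrative approximation with hypothetical function names: the Unicode check relies on Python refusing to encode unpaired surrogates as UTF-8, and the quote check is a simplified scanner that treats a backslash as escaping the next character anywhere in the string, which is looser than the actual quoted-string grammar.

```python
def is_well_formed_unicode(text: str) -> bool:
    """True if the string contains no unpaired surrogate code points."""
    try:
        text.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates
    except UnicodeEncodeError:
        return False
    return True


def has_paired_quotes(local_part: str) -> bool:
    """True if every unescaped double quote in the local-part is closed."""
    in_quotes = False
    escaped = False
    for ch in local_part:
        if escaped:
            escaped = False          # this character was escaped; skip it
        elif ch == "\\":
            escaped = True           # simplification: escape applies anywhere
        elif ch == '"':
            in_quotes = not in_quotes
    # Reject a dangling open quote or a trailing lone backslash.
    return not in_quotes and not escaped
```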
Even if people are just using things that look like email addresses for purposes other than sending email, do you really want to allow unnormalized Unicode or leading or trailing white space in the LHS? |
@masinter Absolutely this must allow unnormalized Unicode because users cannot be counted on to produce normalized Unicode. Regarding whitespace, trimming it is fine. I don't think there are any security concerns regarding client-side validation -- if there is a site where relaxing client-side validation of email addresses creates a security concern, then the site is already vulnerable. |
RFC 6531 was published 12 years ago, and its predecessor RFC 5336 was published 16 years ago, so we have been waiting for quite a while. Mail systems have complete latitude about what they do with the local parts you hand them. In a lot of cases it would make sense to normalize the input to NFC before comparing it to local addresses. Or maybe NFD depending on how the internal character processing works. Or even NFKC or NFKD. But to point out something we've said a dozen times, YOU DON'T KNOW. So don't guess, just pass along whatever the user enters. |
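A short Python illustration of why the form cannot know: the same visible local part can arrive as different code point sequences, and whether the two are "the same mailbox" depends on which normalization, if any, the receiving system applies. The sample strings and the helper name are made up for the example.

```python
import unicodedata

composed = "jos\u00e9"      # "é" as a single code point (U+00E9)
decomposed = "jose\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def same_after(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Compared code point by code point, the two strings differ...
raw_equal = composed == decomposed
# ...and so do their UTF-8 octet lengths, which matters for length limits.
octet_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
# A mail system that normalizes before lookup would treat them as equal.
nfc_equal = same_after("NFC", composed, decomposed)
nfd_equal = same_after("NFD", composed, decomposed)
```

A back end that compares raw bytes sees two different mailboxes here; one that normalizes to NFC or NFD sees one. The form has no way to tell which kind it is talking to, which is the argument for passing the input along unchanged.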
That's unfortunate. Postfix, which is the most widely used open source MTA, does not automatically make A-label and U-label versions of domains equivalent. I expect that most mail systems are configured to do so, but they don't have to and once again, YOU DON'T KNOW. So please do not imagine you are cleverer than the people who wrote the specs, and just pass along anything that complies with the RFC, which is IDNA2008 in this case. By the way, a lot of implementations screw up handling of German ß. Point your browser at https://fuß.standcore.com/ to see how it does. |
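The ß difference is easy to reproduce with the Python standard library, whose built-in `idna` codec implements the older IDNA2003 rules. The second helper below is only a stand-in for IDNA2008 behavior: it applies raw Punycode plus the `xn--` prefix with none of the IDNA2008 validity checks, so it is an illustration of "ß is kept as a character", not a conforming implementation. Both function names are hypothetical.

```python
def to_ascii_idna2003(label: str) -> str:
    """Convert one label using Python's built-in IDNA2003 codec."""
    # Nameprep runs first, and it maps "ß" to "ss" before any encoding.
    return label.encode("idna").decode("ascii")


def to_a_label_unchecked(label: str) -> str:
    """Raw Punycode with the ACE prefix; NO IDNA2008 validation is done."""
    return "xn--" + label.encode("punycode").decode("ascii")


def from_a_label_unchecked(a_label: str) -> str:
    """Inverse of to_a_label_unchecked, for round-trip comparison."""
    return a_label[len("xn--"):].encode("ascii").decode("punycode")


old_style = to_ascii_idna2003("fuß")       # the ß is gone: plain "fuss"
new_style = to_a_label_unchecked("fuß")    # an xn-- label that keeps the ß
```

Under the 2003 rules the label collapses to `fuss` and the original spelling cannot be recovered, while the Punycode form round-trips back to `fuß`. A client that applies the older mapping therefore ends up at a different name than the one the user typed.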
Right, and in my opinion this is the fundamental problem with the current email specs and the likely reason why there's been very little adoption in the last 12 years. In my opinion the specs are too lax and give mail systems too much latitude. The "YOU DON'T KNOW" is a problem that the email specs should be solving so we can actually have a sane universal definition of a standards-compliant email address. |
To the html people: I agree submitting the ToASCII form is probably wrong for an ideal internationalized email field and browsers probably call ToASCII only for backward compatibility. input type="url" doesn't do that, right? I'd suggest being consistent with url field. |
Indeed any equivalence is up to the administrator to implement by installing suitable lookup table key/value pairs (for virtual domains) or just listing both domains in And no equivalence can be 100% complete via Postfix alone, because delivery may be delegated to non-Postfix delivery agents (LMTP, or "pipe" commands), where Postfix can't know which address forms the delivery agent supports. Leaving the input form as-is facilitates onward relaying in a form that is plausibly most likely to be understood by the ultimate MDA. |
To point out what should be obvious, we're not offering advice on how to run mail systems. People already have mail addresses that they got somewhere else, and if we have aesthetic objections to those addresses, too bad. The only relevant question for what goes in an email address box is whether the string is a plausible address. Not a great address, not a pretty address, just whether it's an address that a mail system might have assigned them. If so, then we should accept it. To be further obvious, most addresses that are syntactically valid are not actual addresses. There are over 3 quadrillion possible addresses of the form |
Collin, What has brought us to that conclusion each time has been some set of edge cases. Each time, it would be nice to ban the particular case globally. The earliest, and perhaps most obvious, example involves quoting: the first example I can remember involved systems on the early ARPANET whose "user names" and hence mailboxes were of the form "GroupNumber UserNumber" -- without some way to represent the space(s) that separated the two, there were big problems because there was no guarantee that simply mashing the two together would produce something unambiguous. Worse, some operating systems and applications would figure out various ways to un-quote strings because they thought they knew what they meant (hence the rather complex, potentially redundant, quoting rules in RFC 821 and its successors). And then there are characters, still in ASCII, that have special meanings in some mail environments, are ordinary characters in others, and prohibited in still more. All of "!%$&+-_." have been important examples. In general only the delivery MTA and the system hosting the mailboxes know. Trying to guess what is or is not valid, or what might have special meanings, just invites trouble. As we moved past ASCII, things got more complicated, not because we could not make up restrictive rules but because almost any we could come up with caused problems with some reasonable (to them) set of conventions. I note that every single example in this thread has been Latin script -- characters of the variety sometimes called "decorated Latin" or even "decorated ASCII". Well, too bad, but, for historical reasons, Latin script is easy... and so are Greek and Cyrillic as long as one is careful about shared character graphemes among them. Can one make a simple global rule like "no local-part strings with mixed scripts"? Nope: many systems have come up with reasons to do that. And that is still near the top of the slippery slope with lots of sliding opportunities I have not covered.
An issue that probably should be quite separate further complicates the situation for the "domain-part" of an email address. Most, if not all, browsers follow UTS#46 as the standard for IDNs. However, most, if not all, mail transport systems follow IDNA2008. For selected cases, they are not compatible no matter what options are chosen for the former. Bottom line remains the same: only systems that actually host mailboxes and determine what strings they will assign or allow can determine what is, or is not, a valid mailbox. |
It seems to me that it wouldn't be HTML creating a security risk if adhering to a "SHOULD" from RFC 6532 results in delivery to the wrong mailbox. It's worth noting that if HTML doesn't normalize, the same security risk would be actualized by the user seeing the email address rendered as visual text on a screen or in print and writing what they see in I think addresses that create a security risk when seen and written by users are entirely unsuited for email addresses that are communicated to humans, and I believe Normalizing to NFC is very minimal compared to what https://www.unicode.org/reports/tr39/#Email_Security_Profiles says. It says "The goal is to flag addresses that are structurally unsound or contain unexpected detritus." That is, the Unicode document covering Unicode security issues characterizes addresses whose local part isn't in NFKC (among other criteria) as "structurally unsound".
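For readers who want to see what the NFKC criterion mentioned above looks like in practice, here is a small Python sketch. It checks only that one criterion; the TR39 email security profiles list others that are not modeled here, and the function name is hypothetical. `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def local_part_is_nfkc(local_part: str) -> bool:
    """True if NFKC normalization would leave the local-part unchanged."""
    return unicodedata.is_normalized("NFKC", local_part)


# U+1D41A MATHEMATICAL BOLD SMALL A has a compatibility mapping to "a",
# so a local part containing it is not in NFKC.
math_bold_a = "\U0001d41a"
folded = unicodedata.normalize("NFKC", math_bold_a)

# A decomposed "é" (e + U+0301) is not in NFKC either, because NFKC also
# applies canonical composition.
decomposed_ok = local_part_is_nfkc("jose\u0301")
composed_ok = local_part_is_nfkc("jos\u00e9")
```

Whether a form should flag, rewrite, or simply pass along such input is exactly the disagreement in this thread; the sketch only shows what the check itself detects.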
Do you consider a piece of software that knows that U-labels are a possibility but that doesn't treat the corresponding A-labels and U-labels as equivalent as meeting the goals of Universal Acceptance?
I said earlier that I think requiring browsers to carry extra data to implement IDNA 2008 restrictions for
Consider these six cases:
Cases 4, 5, and 6 currently cannot be submitted if entered into Case 3 currently cannot be submitted in WebKit. Currently, Web sites that use I think submitting the ToASCII form of the domain is the least risky approach and the shortest path to interop between Web engines, because to do otherwise would be asking the engine with the largest market share and the closest to current spec behavior to start treating inputs that can be submitted today (inputs of type 3) in a different way in submission. This logically involves Web compat (site breakage) risk, and asking Blink to take such risk for case 3 seems counter-productive to the goal of enabling cases 4, 5, and 6. (Asking the domain handling to differ depending on the local part seems like asking for trouble. I do think we should ask Blink to change its Notably, submitting the ToASCII form makes case 3 work with sites that don't specifically handle IDNA. On the other hand, sites whose back end is internationalized email-aware can transform the Punycode form to the Unicode form if they want to show the Unicode form or treat the Unicode form as canonical. (I think HTML shouldn't cater to the Postfix brokenness described above or to other setups that allow Unicode forms of domains but don't treat the corresponding Unicode and ASCII forms as equivalent. AFAICT, such Postfix configurations would already fail e.g. with Thunderbird senders when the submission SMTP server doesn't negotiate SMTPUTF8 with Thunderbird.)
Can you give an example of a domain that IDNA 2008 accepts but that doesn't result in the same Punycode to be passed to DNS with UTS 46 in non-transitional mode? |
I don't know about "most" mail systems, but FWIW, Postfix uses "libicu" for IDN support, without transitional processing, which is largely a UTS#46 superset of IDNA2008. I am not aware of a ubiquitous library that implements IDNA2008. :-( |
Yes, because it is up to the system administrator to decide which address forms are supported and deliverable. This may mean support for either or both of the A-label and U-label variants of the address domain part. Trying each lookup twice is not attractive (especially when multiple aliases are chained), and normalisation to a form that isn't what came in can hamper downstream deliverability. Postfix supports non-ASCII addresses, but does not attempt any built-in normalisation to either A-label or U-label form. DNS lookups (MX, A, AAAA, ...) of course use the A-label form. |
While not part of the transport system per se, Thunderbird also uses UTS 46 in non-transitional mode when it needs to turn a domain name into an ASCII-only form (notably, when the submission SMTP server does not negotiate SMTPUTF8 but the email address to send to has been entered as ascii@unicode).
Much of the disconnect in this discussion arises from differing views on whether HTML should facilitate downstream processing that gives semantic differences to bit-wise differences that, Unicode-wise, are not supposed to have semantic differences: whether HTML should cater to local parts that Unicode considers canonically equivalent being delivered to potentially different mailboxes, and whether HTML should cater to software giving a semantic difference to different domain representations that would map the same way under (non-transitional) ToASCII. Let's take a step back: What's the goal here? Is it to enable people around the world to use their native script in email addressing? Or is it to cater to email server configurations that create semantic differences where, Unicode-wise (I'm counting UTS 46 under "Unicode-wise" here), there are not supposed to be any? I think it's worthwhile to change |
On 2024-05-06 23:01, Henri Sivonen wrote:
It seems to me that it wouldn't be HTML creating a security risk if
adhering to a "SHOULD" from RFC 6532 results in delivery to the wrong
mailbox.
The "SHOULD" of RFC-6532 addresses normalization of email headers within
messages, not of addresses used to send and receive mail via the SMTP
protocol. There is no such "SHOULD" direction to normalize in RFC-6531.
If HTML would stick with the requirements of RFC-5321+RFC-6531 `Mailbox`
syntax, and trust that what people enter into input form fields is in
fact what they want to use, this section of the spec can become
straightforward.
best regards
|
We don't have to guess. I tested a dozen mail systems for the UASG and Postfix passed with flying colors. Read all about it: https://uasg.tech/download/uasg-030-evaluation-of-eai-support-in-email-software-and-services-report-en/ People who actually write mail software and run mail systems have told you over and over why it is a bad idea to imagine that mail addresses follow any pattern beyond what the RFCs require. If you insist on NFC or whatever, you will reject some real mail addresses, with no benefit to anyone. But if you insist you know better than everyone else, there's not much we can do about it. |
There's certainly a lot of argument by authority, but not a lot of engagement with the identified issues and questions put forward. |
Viktor is one of the people who maintain Postfix. Klensin and I have been working on mail standards for decades, in his case many more decades. I have done work for the UASG, and talk with people at large mail systems at M3AAWG and other places. Can you say more about why our experience is irrelevant here? |
I'm not saying your experience is irrelevant, but it's not a great conversation when one side asks questions and the other side is essentially saying "trust us". E.g., you claim Postfix is perfect but above there's also an undisputed claim it uses UTS46 and not IDNA2008 (and still nobody addressed the question for which inputs those might be different). So I guess that was not tested then? |
The document says: I interpret "trap" as not meeting the goals.
I've tried to avoid engaging on the "who knows better" topic, but: More to the point, it's not just saying "trust us" instead of engaging with the specific implementation-relevant questions raised but saying "trust us over the Unicode spec writers on Unicode matters". What I've said about NFC is very mild (and, when it comes to e.g. non-ASCII control characters, possibly insufficient) compared to https://www.unicode.org/reports/tr39/#Email_Security_Profiles . |
Ah, thanks for explaining to me what I meant when I wrote that. In any event, it's your choice. If you want to do something useful and allow people to enter their actual addresses, you'll allow whatever the RFCs allow. Or if you only want to allow addresses that meet your aesthetic preferences and if someone's address isn't one of those, too bad, they're not going to fill out a web form today, that seems bizarre but we haven't had much luck getting that point through. |
Hmm. I think we are successfully talking past each other. To illustrate from "the other side"
What I have written and seen others write/say is not "trust us" but things closer to "there are well-established, written, email specs, written as IETF Standards Track RFCs, and you should believe them rather than trying to make up your own rules". The only thing I've seen that comes close to "trust us" is when we have tried to assure you that, in the process of developing, and periodically reviewing and updating, those standards, proposals that would have resulted in much more restricted syntax have been carefully considered and then rejected. What I've seen instead is a good deal of "not listening" on the part of those who think that their ideas of what email syntax should be ought to prevail, in HTML and elsewhere over what those standards, and assorted actual implementations, establish as the actual practice. So, if you and some of your colleagues are hearing "us" saying "trust us" rather than "don't try to make rules more restrictive than the actual email standards and practices based on them" and "we" are seeing signs of what we actually do say not getting through, the discussion is doomed to go around in circles... which seems, to me at least, to be a good description. I think that, without any hint of "trust us", Gene's comment above: "At this time, I would say that web forms rejecting perfectly valid email addresses is more of an impediment to adoption of Internationalized Email than any problems with the email standards." is just right.
I haven't made any claims about the perfection of Postfix. The only claims I've heard others make are about its being a popular open source implementation that conforms to the standards. As far as UTS46 and IDNA2008 are concerned, first remember that they have nothing to do with the discussion above, which, at least AFAICT, has been about the local-part of an email address and not the domain part. Second, the differences in the status and interpretation of various inputs are well documented, with much of the documentation being part of UTS#46 itself. To state this as neutrally as I can, most of the differences exist because UTS#46 tried to maintain a higher degree of compatibility with IDNA2003, while in many areas the consensus in the IETF about how to make the DNS work better, together with some rather explicit agreements with DNS registries with needs for specific scripts, concluded that allowing certain incompatibilities was much wiser for both the short and long term. Some of the most glaring differences result from the use of Case Folding in UTS#46 (and IDNA2003), for instance, to use an example that came up in this discussion, whether "ß" is a letter or just a funny way to write "ss". And so on -- more examples on request, but I doubt they will accomplish anything. |
I wasn't telling you what you meant. I was saying what I understood as a reader. To me as a reader, the quoted bit looks like it is calling out a Universal Acceptance-relevant problem in Postfix's behavior. If there was a communication failure, I think that giving me a direct answer instead of pointing me to an external document might have avoided a communication failure. In particular, I think it's evident in this discussion that reading what discussion participants have written elsewhere isn't predictive of specific opinions here. For example, as a reader (again, not claiming author intent) of RFC 5198, I wouldn't have predicted this position (which I find anti-persuasive): "On the other hand, if some mail administrator with mailboxes in Vietnamese wants to make it difficult for someone who does not have a locally-normal Vietnamese input device to send mail to a particular mailbox and decides to only allow the non-NFC form, the standards deliberately do not tell them they cannot do that and we shouldn't either."
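To make the "NFC versus non-NFC form" point concrete, here is a minimal sketch of two canonically equivalent spellings of the same text. The sample string and the helper name `sameAfterNfc` are chosen for illustration; the sketch only shows what the disagreement is about and does not imply that any mail system or browser performs this comparison.

```javascript
// Two spellings of the same Vietnamese syllable "thế":
const composed = "th\u1EBF";           // precomposed U+1EBF (e with circumflex and acute)
const decomposed = "the\u0302\u0301";  // "e" + combining circumflex + combining acute

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameAfterNfc(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(composed === decomposed);            // code-unit comparison
console.log(sameAfterNfc(composed, decomposed)); // comparison after NFC
```

A system that compares local parts code unit by code unit treats these as two different mailboxes; one that normalizes first treats them as the same mailbox.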
As far as I can tell, you (plural) are asking for an HTML spec change and Web engine implementation changes. The issue has been open for years without success at getting the changes. Now when you've had the attention of folks who can change the spec or an implementation, it seems to me you haven't been using the attention effectively. It seems to me that no matter how wrong you think we are, it would be more productive to reply to specific questions instead of talking past.
I suggest that you focus on the impediments. However, I've seen statements that to me have looked like requests not just to expand the value space that If being unable to submit Unicode in the local part is an impediment, I suggest focusing on expanding the value space but treating dot specifics as a distinct point out of scope here. If Chrome turning "ß" in input into "ss" in output is an impediment, I suggest focusing on specifying the non-transitional mode of UTS 46 instead of suggesting either IDNA 2008 or suggesting not using ToASCII. If Safari not allowing Unicode-form domain input is an impediment, I suggest changing the spec to be more obviously algorithmic to make it clearer that the spec intends to enable the input of the Unicode (or mixed) form despite the logical value space of the domain part being ASCII. On the other hand, if submitting the ToASCII form like Chrome does and the current spec requires is seen as an impediment, it seems to me the impediment is elsewhere (in software that fails to treat Unicode and ASCII forms equivalently). Furthermore, to maximize success of removing impediments, it makes sense to make the kind of spec changes that to a browser implementor don't look risky to be the first one to ship. If you ask for changes that would make implementors prefer that someone else go first, success is less likely.
First, that's not an example of UTS 46 in non-transitional mode differing from IDNA 2008. Second, are more examples really available on request? I have already requested twice and Anne has pointed out that I've requested. Yet, there's been talking past instead of showing examples. I reiterate my request: Can you tell me what kind of inputs are there that IDNA 2008 accepts but that UTS 46 in non-transitional mode either rejects or maps to a different ASCII form than IDNA 2008? As a UTS 46 implementor, my current understanding is that there are none, but if there are some, it would be useful for me to know. |
That Postfix code is indeed easy to misconfigure: I wrote the code and later misconfigured it. Sigh. That said, it's not the kind of bug that you can fix easily once you've seen a user make a mistake. When I wrote the code I thought that Postfix had to choose between three different suboptimal behaviours, and I chose the least harmful possibility. Details offtopic in this thread. I've also done a thorough study of the domains in the production DNS and probably have at least half of the real-world examples. They're boring. Most of the real-world domains that are treated differently between any pair of IDNA2003, IDNA2008 and UTS46(x/y) look like tests to my eyes. I should be able to dig up the list if anyone really wants them, but be warned, it's splendid material for a bikeshed discussion. My advice, if anyone were to listen, would be to consider IDNA2008 and UTS46 as practically interchangeable and go work on something important. I agree with John Klensin's comment above. We have large problems here, we should IMO not allow ourselves to be distracted by fascinating exceptions and edge cases. (Don't misunderstand, I love edge cases and bikeshedding discussions, but we shouldn't engage in that kind of thing, even if I'm as guilty as anyone, every time we do.) |
This is more or less the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 but I think it's worth another look since a lot of things have changed.
The issue is that the e-mail address validation pattern in sec 4.10.5.1.5 only accepts ASCII addresses, not EAI addresses. Since last time, large hosted mail systems including Gmail, Hotmail/Outlook, Yahoo/AOL (soon if not yet), and Coremail handle EAI mail. On smaller systems Postfix and Exim have EAI support enabled by a configuration flag.
On the other side, writing a JavaScript pattern to validate EAI addresses has gotten a lot easier since JS now has Unicode character class patterns like /(\p{L}|\p{N})+/u, which matches a string of Unicode letters and digits.
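As a rough illustration of what such a pattern could look like, here is a minimal sketch built from Unicode property escapes. It is not the pattern from the HTML spec and not the full RFC 6531 `Mailbox` grammar (no quoted local parts, no address literals, no length limits); the names `ATOM`, `LABEL` and `eaiLike` are made up for this example.

```javascript
// Local-part atom: Unicode letters, marks and digits, plus the ASCII symbols
// that the existing ASCII-only pattern already tolerates.
const ATOM = "[\\p{L}\\p{M}\\p{N}!#$%&'*+/=?^_`{|}~-]+";

// Domain label: starts and ends with a letter, mark or digit; hyphens inside.
const LABEL = "[\\p{L}\\p{M}\\p{N}](?:[\\p{L}\\p{M}\\p{N}-]*[\\p{L}\\p{M}\\p{N}])?";

// Dot-separated atoms, "@", dot-separated labels. The "u" flag is what
// enables the \p{...} property escapes.
const eaiLike = new RegExp(`^${ATOM}(?:\\.${ATOM})*@${LABEL}(?:\\.${LABEL})*$`, "u");

console.log(eaiLike.test("пример@Бориса.РФ"));
console.log(eaiLike.test("user@example.com"));
console.log(eaiLike.test("no-at-sign"));
```

A check like this only filters obvious typos on the client side; whether an address is deliverable is still up to the mail system that receives it.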
Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it'll be a while before all mail systems handle EAI.
For the avoidance of doubt, when I say EAI, I mean both Unicode local parts and Unicode domain names, since that's what EAI mail systems handle. There is no benefit to translating IDNs to A-labels (the ones with punycode) since that's all handled deep inside the mail system.