Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BytesGenerator breaks UTF8 string #92081

Closed
Yuribtr opened this issue Apr 30, 2022 · 8 comments
Closed

BytesGenerator breaks UTF8 string #92081

Yuribtr opened this issue Apr 30, 2022 · 8 comments
Assignees
Labels
topic-email type-bug An unexpected behavior, bug, or error

Comments

@Yuribtr
Copy link

Yuribtr commented Apr 30, 2022

Hi!
I found an issue when sending emails with Cyrillic letters in Subject header. Some spaces at Subject header are trimmed when sent.

Example:
When sending email with below subject:
Уведомление о принятии в работу обращения

at SMTP server logs I see subject that differs from original:
Уведомление о принятиив работу обращения

  • As you can see, space between words "принятии в" was stripped.

During research I've found that problem relates to small piece of code which encodes EmailMessage instance to byte string.
Python versions tested and problem confirmed: 3.8, 3.9, 3.10

Here is minimal reproducible example. Code can be used "as is", without any third party packages.

Minimal reproducible example
import io
import email.generator
from email.message import EmailMessage
from email.header import decode_header


def encode_decode(subject: str):
    # preparing EmailMessage
    msg = EmailMessage()
    msg['Subject'] = subject

    # below code sample was taken from "send_message" function (lib/python3.8/smtplib.py)
    # this is the place where problem actually appears
    with io.BytesIO() as bytesmsg:
        g = email.generator.BytesGenerator(bytesmsg)
        g.flatten(msg, linesep='\r\n')
        flatmsg = bytesmsg.getvalue()

    # assembling string and cutting off beginning part ('Subject: ')
    result = ''
    for string, encoding in decode_header(flatmsg.decode()):
        result += string.decode(encoding=encoding or 'utf8')
    return result[9:]


if __name__ == '__main__':
    test_cases = [
        'ффффффффффффффффффффффффф',   # ok
        'фффффффффффффффффффффффф ',   # ok
        'ффффффффффффффффффффффф ф',   # ok
        'фффффффффффффффффффффф фф',   # ok
        'ффффффффффффффффффффф ф ф',   # broken
        'фффффффффффффффффффф фф ф',   # ok
        'ффффффффффффффффффф ф ф ф',   # ok
        'фффффффффффффффффф ф ф ф ф',  # broken
        'ффффффффффффффффф ф фф ф ф',  # broken
        'фффффффффффффффф ф ффф ф ф',  # broken
        'ффффффффффффффф ф фффф ф ф',  # broken
        'фффффффффффффф ф ффффф ф ф',  # broken
        'ффффффффффффф ф фффффф ф ф',  # broken
        'фффффффффффф ф ффффффф ф ф',  # broken
        'ффффффффффф ф фффффффф ф ф',  # broken
        'фффффффффф ф ффффффффф ф ф',  # broken
        'ффффффффф ф фффффффффф ф ф',  # broken
        'фффффффф ф ффффффффффф ф ф',  # broken
        'ффффффф ф фффффффффффф ф ф',  # broken
        'фффффф ф ффффффффффффф ф ф',  # broken
        'ффффф ф фффффффффффффф ф ф',  # broken
        'фффф ф ффффффффффффффф ф ф',  # broken
        'ффф ф фффффффффффффффф ф ф',  # broken
        'фф ф ффффффффффффффффф ф ф',  # broken
        'ф ф фффффффффффффффффф ф ф',  # broken
        ' ф ффффффффффффффффффф ф ф',  # broken
        'ф фффффффффффффффффффф ф ф',  # ok
        ' ффффффффффффффффффффф ф ф',  # broken
        'фффффффффффффффффффффф ф ф',  # ok
    ]
    for in_ in test_cases:
        out_ = encode_decode(in_)
        res = 'ok' if out_ == in_ else 'broken'
        print(f'In  | {in_}', f'Out | {out_}', f'Res | {res}\n', sep='\n')

Above code demonstrates inequality of input and output strings after encoding message with BytesGenerator. Please note that not all strings with Cyrillic letters are broken. Only those strings that have word with single Cyrillic char only are affected under some conditions.
Small additional list of string with explanations you can find below:

Additional strings with explanations
# Example 1 - below string will be broken
'Уведомление о принятии в работу обращения для подключения услуги',
# fixed version (removed one cyrillic UTF8 "и" from word "принятии")
'Уведомление о приняти в работу обращения для подключения услуги',
# fixed version (changed cyrillic UTF8 letter "о" to ASCII letter "o")
'Уведомление o принятии в работу обращения для подключения услуги',
#
# Example 2 - below string will be broken
'Уведомление принятии в работу обращения для подключения услуги',
# fixed version (removed preposition "в" that consist from single cyrillic UTF8 letter)
'Уведомление принятии работу обращения для подключения услуги',
# fixed version (changed preposition "в" with cyrillic UTF8 letter to ASCII letter "B")
'Уведомление принятии B работу обращения для подключения услуги',

Linked PRs

@Yuribtr Yuribtr added the type-bug An unexpected behavior, bug, or error label Apr 30, 2022
@mrabarnett
Copy link

The problem has something to do with the maximum header length when the message is encoded to ASCII.

Increasing the maximum header length (the default is 78) makes the problem disappear for the sample tests:

g = email.generator.BytesGenerator(bytesmsg, maxheaderlen=160)

@Yuribtr
Copy link
Author

Yuribtr commented May 1, 2022

The problem has something to do with the maximum header length when the message is encoded to ASCII.

Increasing the maximum header length (the default is 78) makes the problem disappear for the sample tests:

g = email.generator.BytesGenerator(bytesmsg, maxheaderlen=160)

Thanks for hint. As temporary workaround I set maxheaderlen to zero:
g = email.generator.BytesGenerator(bytesmsg, maxheaderlen=0)

UPD.
For sending emails you should set max_line_length greater than 3. This is necessary in order to avoid ValueError at /Lib/email/quoprimime.py::body_encode() in some cases. I prefer to set max_line_length to maximum.

msg = EmailMessage()
msg.policy = msg.policy.clone(max_line_length=sys.maxsize)

@abadger
Copy link
Contributor

abadger commented May 3, 2022

Note: This problem will occur with Generator as well as BytesGenerator.

What I believe is happening in the problem cases is that the Generator is creating two or more encoded-words. A space character present in the input is placed between the encoded words rather than becoming a part of the encoded words. When the decoder is run, the decoder discards the linear whitespace that occurs between the encoded words. This leads to the omission of the space in the round tripped data.

Here's the output from the Generator for one of the problem cases:

Subject: =?utf-8?b?0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE?=
  =?utf-8?b?0YQg0YQ=?=

You can see that there are two spaces (in addition to the \r\n) in between the two encoded blocks.

After reading the examples in https://www.rfc-editor.org/rfc/rfc2047 I believe that the generator is at fault here. The space characters should be added to either the leading or trailing encoded-word rather than being emitted as a literal between the words.

@abadger
Copy link
Contributor

abadger commented May 3, 2022

Still exploring precisely what is going on but I have been able to modify the code to fix all of the cases reported by forcing whitespace to be encoded more frequently in _header_value_parser.py::_refold_parse_tree() and preventing leading and trailing whitespace from being detected in _fold_as_ew() . I think my changes are too broad but I'll post them here in case something happens to me before I have a chance to finish diagnosing this:

EDIT: Only leading whitespace needs to be undetected in _fold_as_ew()

diff --git a/Lib/email/_header_value_parser.py b/Lib/email/_header_value_parser.py
index 8a8fb8bc42..498a3c1e01 100644
--- a/Lib/email/_header_value_parser.py
+++ b/Lib/email/_header_value_parser.py
@@ -2781,6 +2781,8 @@ def _refold_parse_tree(parse_tree, *, policy):
         if part.token_type == 'ptext' and set(tstr) & SPECIALS:
             # Encode if tstr contains special characters.
             want_encoding = True
+        elif part.token_type == 'fws' and last_ew:
+            want_encoding = True
         try:
             tstr.encode(encoding)
             charset = encoding
@@ -2877,19 +2879,19 @@ def _fold_as_ew(to_encode, lines, maxlen, last_ew, ew_combine_allowed, charset):
         to_encode = str(
             get_unstructured(lines[-1][last_ew:] + to_encode))
         lines[-1] = lines[-1][:last_ew]
-    if to_encode[0] in WSP:
-        # We're joining this to non-encoded text, so don't encode
-        # the leading blank.
-        leading_wsp = to_encode[0]
-        to_encode = to_encode[1:]
-        if (len(lines[-1]) == maxlen):
-            lines.append(_steal_trailing_WSP_if_exists(lines))
-        lines[-1] += leading_wsp
+    #if to_encode[0] in WSP:
+    #    # We're joining this to non-encoded text, so don't encode
+    #    # the leading blank.
+    #    leading_wsp = to_encode[0]
+    #    to_encode = to_encode[1:]
+    #    if (len(lines[-1]) == maxlen):
+    #        lines.append(_steal_trailing_WSP_if_exists(lines))
+    #    lines[-1] += leading_wsp
     trailing_wsp = ''
     if to_encode[-1] in WSP:
         # Likewise for the trailing space.

@abadger
Copy link
Contributor

abadger commented May 3, 2022

Closer..... This change in _fold_as_ew fixes all but two of the newly reported problems while all the unittests continue to pass:

-    if to_encode[0] in WSP:
+    if last_ew is None and to_encode[0] in WSP:

The two which are broken are: ффффффффффффффффффффф ф ф and ф ффффффффффффффффффф ф ф

I believe something like the above should make it into the final fix as the comment for this block says that it should only be invoked if we're joining this encoded-word to a non-encoded word but there's nothing in this condition which checks that the previous word was non-encoded.

And elif to_encode[0]: might also do the right thing here.

abadger added a commit to abadger/cpython that referenced this issue May 4, 2022
email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

Test case from python#92081
@abadger
Copy link
Contributor

abadger commented May 4, 2022

I've opened a PR that fixes the cases where we need spaces between encoded words.

I've figured out that the remaining two problems start with a space. I'm searching the RFCs but so far haven't found anything that says whether the generator should handle this by including the initial space in the initial encoded word or if the decoder should handle it by displaying the space (or as a third option, that initial spaces in a Subject should be ignored).

abadger added a commit to abadger/cpython that referenced this issue Apr 26, 2023
email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

Test case from python#92081
abadger added a commit to abadger/cpython that referenced this issue Apr 26, 2023
email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from python#92081
@abadger
Copy link
Contributor

abadger commented Apr 27, 2023

I believe #92281 now fixes all the problems reported here. If anyone would like to test that the problems are resolved by it, that would be appreciated!

abadger added a commit to abadger/cpython that referenced this issue Jul 21, 2023
email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from python#92081
warsaw pushed a commit that referenced this issue May 20, 2024
…ncoded words. (#92281)

* Fix for email.generator.Generator with whitespace between encoded words.

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from #92081

* Rename a variable so it's not confused with the final variable.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 20, 2024
…ween encoded words. (pythonGH-92281)

* Fix for email.generator.Generator with whitespace between encoded words.

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from pythonGH-92081

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 20, 2024
…ween encoded words. (pythonGH-92281)

* Fix for email.generator.Generator with whitespace between encoded words.

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from pythonGH-92081

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
warsaw pushed a commit that referenced this issue May 20, 2024
…tween encoded words. (GH-92281) (#119245)

* Fix for email.generator.Generator with whitespace between encoded words.

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from GH-92081

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
warsaw pushed a commit that referenced this issue May 20, 2024
…tween encoded words. (GH-92281) (#119246)

* Fix for email.generator.Generator with whitespace between encoded words.

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from GH-92081

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
@warsaw warsaw self-assigned this May 20, 2024
@warsaw
Copy link
Member

warsaw commented May 20, 2024

Thanks for the fix @abadger

@warsaw warsaw closed this as completed May 20, 2024
estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024
…ween encoded words. (python#92281)

* Fix for email.generator.Generator with whitespace between encoded words.

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are filded on syntactic breaks by encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from python#92081

* Rename a variable so it's not confused with the final variable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-email type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants