BytesGenerator breaks UTF8 string #92081

Yuribtr · 2022-04-30T13:07:38Z

Hi!
I found an issue when sending emails with Cyrillic letters in the Subject header: some spaces in the Subject header are stripped when the message is sent.

Example:
When sending an email with the subject below:
Уведомление о принятии в работу обращения

the SMTP server logs show a subject that differs from the original:
Уведомление о принятиив работу обращения

As you can see, the space between the words "принятии в" was stripped.

While investigating, I found that the problem lies in a small piece of code that encodes an EmailMessage instance to a byte string.
Python versions tested, with the problem confirmed on each: 3.8, 3.9, 3.10

Here is a minimal reproducible example. The code can be used as is, without any third-party packages.

Minimal reproducible example

import io
import email.generator
from email.message import EmailMessage
from email.header import decode_header


def encode_decode(subject: str):
    # preparing EmailMessage
    msg = EmailMessage()
    msg['Subject'] = subject

    # below code sample was taken from "send_message" function (lib/python3.8/smtplib.py)
    # this is the place where problem actually appears
    with io.BytesIO() as bytesmsg:
        g = email.generator.BytesGenerator(bytesmsg)
        g.flatten(msg, linesep='\r\n')
        flatmsg = bytesmsg.getvalue()

    # assembling string and cutting off beginning part ('Subject: ')
    result = ''
    for string, encoding in decode_header(flatmsg.decode()):
        result += string.decode(encoding=encoding or 'utf8')
    return result[9:]


if __name__ == '__main__':
    test_cases = [
        'ффффффффффффффффффффффффф',   # ok
        'фффффффффффффффффффффффф ',   # ok
        'ффффффффффффффффффффффф ф',   # ok
        'фффффффффффффффффффффф фф',   # ok
        'ффффффффффффффффффффф ф ф',   # broken
        'фффффффффффффффффффф фф ф',   # ok
        'ффффффффффффффффффф ф ф ф',   # ok
        'фффффффффффффффффф ф ф ф ф',  # broken
        'ффффффффффффффффф ф фф ф ф',  # broken
        'фффффффффффффффф ф ффф ф ф',  # broken
        'ффффффффффффффф ф фффф ф ф',  # broken
        'фффффффффффффф ф ффффф ф ф',  # broken
        'ффффффффффффф ф фффффф ф ф',  # broken
        'фффффффффффф ф ффффффф ф ф',  # broken
        'ффффффффффф ф фффффффф ф ф',  # broken
        'фффффффффф ф ффффффффф ф ф',  # broken
        'ффффффффф ф фффффффффф ф ф',  # broken
        'фффффффф ф ффффффффффф ф ф',  # broken
        'ффффффф ф фффффффффффф ф ф',  # broken
        'фффффф ф ффффффффффффф ф ф',  # broken
        'ффффф ф фффффффффффффф ф ф',  # broken
        'фффф ф ффффффффффффффф ф ф',  # broken
        'ффф ф фффффффффффффффф ф ф',  # broken
        'фф ф ффффффффффффффффф ф ф',  # broken
        'ф ф фффффффффффффффффф ф ф',  # broken
        ' ф ффффффффффффффффффф ф ф',  # broken
        'ф фффффффффффффффффффф ф ф',  # ok
        ' ффффффффффффффффффффф ф ф',  # broken
        'фффффффффффффффффффффф ф ф',  # ok
    ]
    for in_ in test_cases:
        out_ = encode_decode(in_)
        res = 'ok' if out_ == in_ else 'broken'
        print(f'In  | {in_}', f'Out | {out_}', f'Res | {res}\n', sep='\n')

The code above shows that the input and output strings differ after the message is encoded with BytesGenerator. Please note that not all strings with Cyrillic letters are broken: only strings containing a word that consists of a single Cyrillic character are affected, and only under some conditions.
A short additional list of strings, with explanations, is below:

Additional strings with explanations

# Example 1 - below string will be broken
'Уведомление о принятии в работу обращения для подключения услуги',
# fixed version (removed one cyrillic UTF8 "и" from word "принятии")
'Уведомление о приняти в работу обращения для подключения услуги',
# fixed version (changed cyrillic UTF8 letter "о" to ASCII letter "o")
'Уведомление o принятии в работу обращения для подключения услуги',
#
# Example 2 - below string will be broken
'Уведомление принятии в работу обращения для подключения услуги',
# fixed version (removed the preposition "в", which consists of a single cyrillic UTF8 letter)
'Уведомление принятии работу обращения для подключения услуги',
# fixed version (replaced the cyrillic UTF8 preposition "в" with the ASCII letter "B")
'Уведомление принятии B работу обращения для подключения услуги',

mrabarnett · 2022-04-30T19:27:30Z

The problem has something to do with the maximum header length when the message is encoded to ASCII.

Increasing the maximum header length (the default is 78) makes the problem disappear for the sample tests:

g = email.generator.BytesGenerator(bytesmsg, maxheaderlen=160)

Yuribtr · 2022-05-01T19:02:27Z

> The problem has something to do with the maximum header length when the message is encoded to ASCII.
>
> Increasing the maximum header length (the default is 78) makes the problem disappear for the sample tests:
> g = email.generator.BytesGenerator(bytesmsg, maxheaderlen=160)

Thanks for the hint. As a temporary workaround I set maxheaderlen to zero:
g = email.generator.BytesGenerator(bytesmsg, maxheaderlen=0)

UPD.
When sending emails, you should also set max_line_length to a value greater than 3. This avoids a ValueError raised in /Lib/email/quoprimime.py::body_encode() in some cases. I prefer to set max_line_length to the maximum:

import sys

msg = EmailMessage()
msg.policy = msg.policy.clone(max_line_length=sys.maxsize)
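
The maxheaderlen=0 workaround above can be sketched end to end roughly as follows. This is only an illustration, not the code from smtplib: the helper name flatten_message is hypothetical, and the round trip is checked by re-parsing with email.policy.default rather than with decode_header as in the original report. Whether the comparison holds may depend on the Python version in use.

```python
import io
from email import message_from_bytes, policy
from email.generator import BytesGenerator
from email.message import EmailMessage


def flatten_message(msg, maxheaderlen=None):
    """Serialize msg to bytes (hypothetical helper, mirrors the repro above)."""
    with io.BytesIO() as buf:
        generator = BytesGenerator(buf, maxheaderlen=maxheaderlen)
        generator.flatten(msg, linesep="\r\n")
        return buf.getvalue()


subject = "Уведомление о принятии в работу обращения"
msg = EmailMessage()
msg["Subject"] = subject

# maxheaderlen=0 asks the generator not to wrap headers, so the Subject
# is not split across several lines.
flat = flatten_message(msg, maxheaderlen=0)

# Parse the bytes back and compare the decoded Subject with the input.
roundtrip = str(message_from_bytes(flat, policy=policy.default)["Subject"])
print(roundtrip == subject)
```

Passing maxheaderlen to the generator leaves the message's own policy untouched, which is why it is convenient as a temporary workaround.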

abadger · 2022-05-03T15:51:17Z

Note: This problem will occur with Generator as well as BytesGenerator.

What I believe is happening in the problem cases is that the Generator is creating two or more encoded-words. A space character present in the input is placed between the encoded words rather than becoming a part of the encoded words. When the decoder is run, the decoder discards the linear whitespace that occurs between the encoded words. This leads to the omission of the space in the round tripped data.

Here's the output from the Generator for one of the problem cases:

Subject: =?utf-8?b?0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE0YTRhNGE?=
  =?utf-8?b?0YQg0YQ=?=

You can see that there are two spaces (in addition to the \r\n) in between the two encoded blocks.

After reading the examples in https://www.rfc-editor.org/rfc/rfc2047 I believe that the generator is at fault here. The space characters should be added to either the leading or trailing encoded-word rather than being emitted as a literal between the words.
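
The decoder behaviour described here can be reproduced in isolation. The sketch below is illustrative only: the helpers encoded_word and decode are hypothetical names, and the encoded-words are built by hand with base64 rather than taken from the generator.

```python
import base64
from email.header import decode_header, make_header


def encoded_word(text):
    """Build a base64 RFC 2047 encoded-word for UTF-8 text (hypothetical helper)."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?utf-8?b?{payload}?="


def decode(value):
    """Decode a raw header value into a plain string."""
    return str(make_header(decode_header(value)))


# The space is a literal between two encoded-words: the decoder discards it.
literal_space = encoded_word("ф") + " " + encoded_word("ф")

# The space is carried inside the second encoded-word: it survives decoding.
encoded_space = encoded_word("ф") + " " + encoded_word(" ф")

print(repr(decode(literal_space)))
print(repr(decode(encoded_space)))
```

The first value decodes to the two letters run together, while the second keeps the space, which matches the suggestion that the space should belong to one of the encoded-words.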

abadger · 2022-05-03T17:48:43Z

Still exploring precisely what is going on, but I have been able to modify the code to fix all of the reported cases by forcing whitespace to be encoded more frequently in _header_value_parser.py::_refold_parse_tree() and preventing leading and trailing whitespace from being detected in _fold_as_ew(). I think my changes are too broad, but I'll post them here in case something happens to me before I have a chance to finish diagnosing this:

EDIT: Only leading whitespace needs to be undetected in _fold_as_ew()

diff --git a/Lib/email/_header_value_parser.py b/Lib/email/_header_value_parser.py
index 8a8fb8bc42..498a3c1e01 100644
--- a/Lib/email/_header_value_parser.py
+++ b/Lib/email/_header_value_parser.py
@@ -2781,6 +2781,8 @@ def _refold_parse_tree(parse_tree, *, policy):
         if part.token_type == 'ptext' and set(tstr) & SPECIALS:
             # Encode if tstr contains special characters.
             want_encoding = True
+        elif part.token_type == 'fws' and last_ew:
+            want_encoding = True
         try:
             tstr.encode(encoding)
             charset = encoding
@@ -2877,19 +2879,19 @@ def _fold_as_ew(to_encode, lines, maxlen, last_ew, ew_combine_allowed, charset):
         to_encode = str(
             get_unstructured(lines[-1][last_ew:] + to_encode))
         lines[-1] = lines[-1][:last_ew]
-    if to_encode[0] in WSP:
-        # We're joining this to non-encoded text, so don't encode
-        # the leading blank.
-        leading_wsp = to_encode[0]
-        to_encode = to_encode[1:]
-        if (len(lines[-1]) == maxlen):
-            lines.append(_steal_trailing_WSP_if_exists(lines))
-        lines[-1] += leading_wsp
+    #if to_encode[0] in WSP:
+    #    # We're joining this to non-encoded text, so don't encode
+    #    # the leading blank.
+    #    leading_wsp = to_encode[0]
+    #    to_encode = to_encode[1:]
+    #    if (len(lines[-1]) == maxlen):
+    #        lines.append(_steal_trailing_WSP_if_exists(lines))
+    #    lines[-1] += leading_wsp
     trailing_wsp = ''
     if to_encode[-1] in WSP:
         # Likewise for the trailing space.

abadger · 2022-05-03T23:55:30Z

Closer..... This change in _fold_as_ew fixes all but two of the newly reported problems while all the unittests continue to pass:

-    if to_encode[0] in WSP:
+    if last_ew is None and to_encode[0] in WSP:

The two which are still broken are ' ффффффффффффффффффффф ф ф' and ' ф ффффффффффффффффффф ф ф' (both start with a space).

I believe something like the above should make it into the final fix as the comment for this block says that it should only be invoked if we're joining this encoded-word to a non-encoded word but there's nothing in this condition which checks that the previous word was non-encoded.

And elif to_encode[0]: might also do the right thing here.

email.generator.Generator currently does not handle whitespace between encoded words correctly when the encoded words span multiple lines. The current generator will create an encoded word for each line. If the end of the line happens to correspond with the end of a real word in the plaintext, the generator will place an unencoded space at the start of the subsequent line to represent the whitespace between the plaintext words. A compliant decoder will strip all the whitespace from between two encoded words, which leads to missing spaces in the round-tripped output.

The fix for this is to make sure that whitespace between two encoded words ends up inside one or the other of the encoded words. This fix places the space inside the second encoded word.

Test case from python#92081

abadger · 2022-05-04T01:12:25Z

I've opened a PR that fixes the cases where we need spaces between encoded words.

I've figured out that the remaining two problems start with a space. I'm searching the RFCs but so far haven't found anything that says whether the generator should handle this by including the initial space in the initial encoded word or if the decoder should handle it by displaying the space (or as a third option, that initial spaces in a Subject should be ignored).

email.generator.Generator currently does not handle whitespace between encoded words correctly when the encoded words span multiple lines. The current generator will create an encoded word for each line. If the end of the line happens to correspond with the end of a real word in the plaintext, the generator will place an unencoded space at the start of the subsequent line to represent the whitespace between the plaintext words. A compliant decoder will strip all the whitespace from between two encoded words, which leads to missing spaces in the round-tripped output.

The fix for this is to make sure that whitespace between two encoded words ends up inside one or the other of the encoded words. This fix places the space inside the second encoded word.

A second problem happens with continuation lines. A continuation line that starts with whitespace and is followed by a non-encoded word is fine, because the newline between such continuation lines is defined as condensing to a single space character. When the continuation line starts with whitespace followed by an encoded word, however, the RFCs specify that the word is run together with the encoded word on the previous line. This is because normal words are folded on syntactic breaks but encoded words are not. The solution to this is to add the whitespace to the start of the encoded word on the continuation line.

Test cases are from python#92081
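
The difference between the two kinds of continuation line can be seen by parsing hand-written headers. This is a small sketch under stated assumptions: subject_of is a hypothetical helper, ENCODED_F is a hand-built encoded-word for the single letter "ф", and parsing is done with email.policy.default.

```python
from email import message_from_string, policy

# Base64 encoded-word for the single letter "ф" (built by hand for illustration).
ENCODED_F = "=?utf-8?b?0YQ=?="


def subject_of(raw_header):
    """Parse a raw header block and return the decoded Subject (hypothetical helper)."""
    msg = message_from_string(raw_header + "\n\n", policy=policy.default)
    return str(msg["Subject"])


# Continuation line with a plain word: the fold collapses to a single space.
plain = subject_of("Subject: hello\n world")

# Continuation line with an encoded word: the whitespace between the two
# encoded words is dropped, so the letters are run together.
encoded = subject_of(f"Subject: {ENCODED_F}\n {ENCODED_F}")

print(repr(plain))
print(repr(encoded))
```

This is why a space that must survive has to sit inside one of the encoded words instead of in the folding whitespace.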

abadger · 2023-04-27T16:18:11Z

I believe #92281 now fixes all the problems reported here. If anyone would like to test that the problems are resolved by it, that would be appreciated!

…ncoded words. (#92281) * Fix for email.generator.Generator with whitespace between encoded words. * Rename a variable so it's not confused with the final variable.

…ween encoded words. (pythonGH-92281) (cherry picked from commit a6fdb31) Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>

…tween encoded words. (GH-92281) (#119245) (cherry picked from commit a6fdb31) Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>

…tween encoded words. (GH-92281) (#119246) (cherry picked from commit a6fdb31) Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>

warsaw · 2024-05-20T20:14:05Z

Thanks for the fix @abadger

Yuribtr added the type-bug label (An unexpected behavior, bug, or error) Apr 30, 2022

hugovk added the topic-email label Apr 30, 2022

abadger mentioned this issue May 4, 2022

gh-92081: Fix for email.generator.Generator with whitespace between encoded words. #92281

Merged

alex-pobeditel-2004 mentioned this issue Jan 12, 2023

email.policy.SMTP TypeError: 'Header' object is not subscriptable #100261

Open

miss-islington mentioned this issue May 20, 2024

[3.13] gh-92081: Fix for email.generator.Generator with whitespace between encoded words. (GH-92281) #119245

Merged

miss-islington mentioned this issue May 20, 2024

[3.12] gh-92081: Fix for email.generator.Generator with whitespace between encoded words. (GH-92281) #119246

Merged

warsaw self-assigned this May 20, 2024

warsaw closed this as completed May 20, 2024

JelleZijlstra mentioned this issue May 28, 2024

backport a9a74da 3.13 #119642

Closed

dtrodrigues mentioned this issue Jun 26, 2024

email module generates wrong MIME header with quoted-printable encoded extra space with Python 3.12.4 #120930

Closed

bitdancer mentioned this issue Jul 15, 2024

gh-120930: Remove extra blank occuring in wrapped encoded words in email headers #121747

Merged
