gh-92081: Fix for email.generator.Generator with whitespace between encoded words. #92281

Merged (2 commits) on May 20, 2024


abadger
Contributor

@abadger abadger commented May 4, 2022

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines. The
current generator will create an encoded word for each line. If the end
of the line happens to correspond with the end of a real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words. This
fix places the space inside of the second encoded word.

Test case from #92081
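
The effect of where the space is placed can be sketched as follows. This is only an illustration of the principle, not the patch itself: `fold_one_word_per_line` and `decode` are hypothetical helpers written for this example, each word is given its own encoded word for simplicity, and quoted-printable is forced through `Charset.header_encoding` purely so that the `_` standing for the space is visible.

```python
from email.charset import QP, Charset
from email.header import decode_header, make_header


def fold_one_word_per_line(words, space_inside):
    """Hypothetical helper: emit one encoded word per folded line."""
    charset = Charset("utf-8")
    charset.header_encoding = QP  # force quoted-printable for readability
    encoded = []
    for index, word in enumerate(words):
        if index and space_inside:
            # Fixed behaviour: the separating space travels inside the
            # second encoded word, where a decoder has to preserve it.
            word = " " + word
        encoded.append(charset.header_encode(word))
    # The newline plus space between encoded words is only folding
    # whitespace, which a compliant decoder discards.
    return "\n ".join(encoded)


def decode(value):
    """Hypothetical helper: decode a folded header value back to text."""
    return str(make_header(decode_header(value)))


words = ["L\u00f4rem", "d\u00f4l\u00f4r"]
broken = fold_one_word_per_line(words, space_inside=False)
fixed = fold_one_word_per_line(words, space_inside=True)

print(repr(decode(broken)))  # the two words run together
print(repr(decode(fixed)))   # the space survives the round trip
```

With `space_inside=False` the decoded text loses the space between the two words; with `space_inside=True` the second encoded word starts with `_` and the space is preserved.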

@abadger abadger requested a review from a team as a code owner May 4, 2022 00:57
@abadger abadger marked this pull request as draft May 4, 2022 00:57
@abadger
Contributor Author

abadger commented May 5, 2022

This is a work in progress because #92081 has one other issue that also needs to be fixed. Whitespace at the start of the Subject is being omitted as well. I'm leaning towards fixing that one in the decoder, but I'm still reading through the RFCs to see if they have anything to say about that.

Note: the other issue has been fixed here as well and this is ready for review.

@abadger abadger force-pushed the email-bytes-generator-breakage branch from bc7a42d to 80f5cfa Compare April 26, 2023 22:03
@abadger abadger marked this pull request as ready for review April 26, 2023 22:24
@abadger abadger force-pushed the email-bytes-generator-breakage branch from bec90d8 to 69b205d Compare April 26, 2023 22:48
@@ -1628,7 +1629,7 @@ def test_address_display_names(self):
'Lôrem ipsum dôlôr sit amet, cônsectetuer adipiscing. '
'Suspendisse pôtenti. Aliquam nibh. Suspendisse pôtenti.',
'=?utf-8?q?L=C3=B4rem_ipsum_d=C3=B4l=C3=B4r_sit_amet=2C_c'
-'=C3=B4nsectetuer?=\n =?utf-8?q?adipiscing=2E_Suspendisse'
+'=C3=B4nsectetuer?=\n =?utf-8?q?_adipiscing=2E_Suspendisse'
Contributor Author

Note: the data in the unittest was buggy. If you run the original output through decode_header(), you'll find that it is missing the space between cônsectetuer and adipiscing.
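
A quick way to check this is sketched below. The two strings are shortened fragments modelled on the old and new expected values in the test (only the words on either side of the fold are kept), and `decoded_text` is a hypothetical helper name used just for this example.

```python
from email.header import decode_header, make_header

# Shortened fragments of the old and new expected test output.
old = "=?utf-8?q?c=C3=B4nsectetuer?=\n =?utf-8?q?adipiscing=2E?="
new = "=?utf-8?q?c=C3=B4nsectetuer?=\n =?utf-8?q?_adipiscing=2E?="


def decoded_text(value):
    """Hypothetical helper: decode an encoded header value to a str."""
    return str(make_header(decode_header(value)))


old_text = decoded_text(old)
new_text = decoded_text(new)

print(repr(old_text))  # space between the two words is missing
print(repr(new_text))  # space is preserved by the leading "_"
```

The only difference between the two inputs is the `_` at the start of the second encoded word, which is how a space is written inside a quoted-printable encoded word.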

@abadger
Contributor Author

abadger commented Apr 26, 2023

This fix is now ready to be reviewed.

@abadger abadger force-pushed the email-bytes-generator-breakage branch from 69b205d to 4104b8e Compare July 21, 2023 14:47
@abadger
Contributor Author

abadger commented Jul 21, 2023

Hey @warsaw, this fix is ready for review if you have some cycles to spare for thinking about email and the stdlib.

@alex-pobeditel-2004

@abadger I tested this fix on my project and it worked like a charm. Inexpressible thanks! I will use your patch until this is merged into CPython :)

email.generator.Generator currently does not handle whitespace between
encoded words correctly when the encoded words span multiple lines.  The
current generator will create an encoded word for each line.  If the end
of the line happens to correspond with the end of a real word in the
plaintext, the generator will place an unencoded space at the start of
the subsequent lines to represent the whitespace between the plaintext
words.

A compliant decoder will strip all the whitespace from between two
encoded words which leads to missing spaces in the round-tripped
output.

The fix for this is to make sure that whitespace between two encoded
words ends up inside of one or the other of the encoded words.  This
fix places the space inside of the second encoded word.

A second problem happens with continuation lines.  A continuation line that
starts with whitespace and is followed by a non-encoded word is fine because
the newline between such continuation lines is defined as condensing to
a single space character.  When the continuation line starts with whitespace
followed by an encoded word, however, the RFCs specify that the word is run
together with the encoded word on the previous line.  This is because normal
words are folded on syntactic breaks but encoded words are not.

The solution to this is to add the whitespace to the start of the encoded word
on the continuation line.

Test cases are from python#92081
@abadger abadger force-pushed the email-bytes-generator-breakage branch from 4104b8e to 5071b52 Compare May 20, 2024 17:40
@warsaw warsaw self-assigned this May 20, 2024
@warsaw warsaw added 3.12 bugs and security fixes, 3.13 bugs and security fixes, needs backport to 3.12 bug and security fixes, needs backport to 3.13 bugs and security fixes labels May 20, 2024
Member

@warsaw warsaw left a comment

Thanks for your contribution to Python!

@warsaw warsaw enabled auto-merge (squash) May 20, 2024 18:07
@warsaw warsaw merged commit a6fdb31 into python:main May 20, 2024
39 checks passed
@miss-islington-app

Thanks @abadger for the PR, and @warsaw for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 20, 2024
…ween encoded words. (pythonGH-92281)

* Fix for email.generator.Generator with whitespace between encoded words.

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
@abadger abadger deleted the email-bytes-generator-breakage branch May 20, 2024 19:11
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 20, 2024
…ween encoded words. (pythonGH-92281)

* Fix for email.generator.Generator with whitespace between encoded words.

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
@bedevere-app

bedevere-app bot commented May 20, 2024

GH-119245 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label May 20, 2024
@bedevere-app

bedevere-app bot commented May 20, 2024

GH-119246 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label May 20, 2024
warsaw pushed a commit that referenced this pull request May 20, 2024
…tween encoded words. (GH-92281) (#119245)

* Fix for email.generator.Generator with whitespace between encoded words.

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
warsaw pushed a commit that referenced this pull request May 20, 2024
…tween encoded words. (GH-92281) (#119246)

* Fix for email.generator.Generator with whitespace between encoded words.

* Rename a variable so it's not confused with the final variable.
(cherry picked from commit a6fdb31)

Co-authored-by: Toshio Kuratomi <a.badger@gmail.com>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
…ween encoded words. (python#92281)

* Fix for email.generator.Generator with whitespace between encoded words.

* Rename a variable so it's not confused with the final variable.
Labels: 3.12 bugs and security fixes, 3.13 bugs and security fixes, topic-email