Handling msg within an msg #14

Open
tballison opened this issue May 9, 2024 · 5 comments


@tballison

Thank you so much for an awesome library. While writing a wrapper for readpst for Apache Tika, we noticed a small number of cases where there were fewer attachments when selecting the .msg output option. Tika's Jira issue: https://issues.apache.org/jira/browse/TIKA-4250

We were able to reproduce this with a test file we have in our unit tests: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPST.pst

The last email "8" is an email with an embedded email, and inside that embedded email is a docx file.

This is processed correctly with rfc822 and mbox output. However, there is no msg attachment within the 8.msg file.

@tballison
Author

test-pst.zip

I'm including the original pst, the mbox, the msg, the .eml, and the debug file.

@tballison
Author

Separately, we noticed that we're getting non-deterministic output when we select the .msg option. Sometimes we get 7 files and sometimes we get 8.

@pabs3
Member

pabs3 commented Jul 1, 2024

To be clear: the libpst library has a long history with many contributors. The current maintainers didn't create the library, but they try to merge patches promptly and work on it when they are able to.

Thanks for the report and the test files, we'll take a look when we can.

The issue with non-deterministic output is known and has a workaround in git master. Please comment on that issue if you still see it with the latest commit:

#7

@vsessink

vsessink commented Dec 9, 2024

Hi, I stumbled on this - thanks @pabs3 for directing me here. The reason for the bug is that readpst has the wrong idea about "From quoting".

The cause is a commit from 2012, d8ddc13, which introduced From quoting for anything that is an attachment. However, message/rfc822 does not ask for From quoting as far as I can see.

I'd guess that From quoting is needed when readpst outputs mbox files (as mbox uses From as a message delimiter), but individual mail files should not have their From quoted.
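To make the distinction concrete, here is a minimal Python sketch of the idea: quote lines starting with "From " only when writing into an mbox, and leave standalone files untouched. This is not libpst's code. The names quote_from and render and the separator line are made up for illustration.

```python
import re

# Matches "From " at the start of any line.
_FROM_LINE = re.compile(rb"^From ", re.MULTILINE)


def quote_from(data: bytes) -> bytes:
    """Prefix every line that starts with 'From ' with '>'."""
    return _FROM_LINE.sub(b">From ", data)


def render(message: bytes, as_mbox: bool) -> bytes:
    """Serialize one message (hypothetical helper).

    In an mbox, 'From ' at the start of a line delimits messages, so such
    lines inside a message have to be quoted. A standalone file has no
    delimiter, so the message is written unchanged.
    """
    if not as_mbox:
        return message
    # Placeholder separator line, an assumption for this sketch.
    separator = b"From MAILER-DAEMON Thu Jan  1 00:00:00 1970\n"
    return separator + quote_from(message) + b"\n"


sample = b"Subject: hi\n\nFrom here on, things change.\n"

if __name__ == "__main__":
    print(render(sample, as_mbox=False))
    print(render(sample, as_mbox=True))
```

The header line "From: ..." is not affected, because the pattern requires a space right after "From".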

The test file is useful (this is the link to the raw file). Here's another test:
backup.pst.gz

To reproduce:

wget https://github.com/user-attachments/files/18064791/backup.pst.gz
gunzip backup.pst.gz
readpst -e -D -8 -cv backup.pst

Now start Python and use the following script (getting the right modules at hand is left as an exercise to the user):

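from email.parser import BytesParser
from email.policy import default

# Keep the original line folding of the source while parsing.
plcy = default.clone(refold_source='none')
# BytesParser needs a binary file object, so open the file with 'rb'.
with open('Outlook Data File/Concepten/1.eml', 'rb') as mail:
    msg = BytesParser(policy=plcy).parse(mail)
# Print the content type of every MIME part, depth first.
for part in msg.walk():
    print(part.get_content_type())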

The result will be:

multipart/mixed
text/html
message/rfc822
text/plain

This isn't correct: readpst (libpst) adds a superfluous > to the From line in the message/rfc822 attachment, so the attached message ends up being parsed as plain text. To see the difference, run sed -i 's/^>From /From /' Outlook\ Data\ File/Concepten/1.eml and then run the Python script again. It will now see a whole bunch of image files:

multipart/mixed
text/html
message/rfc822
multipart/mixed
text/html
image/png
image/png
image/png
image/png
image/png
...
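The same effect can be reproduced without a PST file. The sketch below builds a small synthetic message (all addresses, boundaries and payloads are invented) whose message/rfc822 part starts with a quoted >From line, then compares the MIME tree before and after undoing the quoting. Like the sed command, unquote_from rewrites every line starting with ">From ", so it is a diagnostic aid rather than a safe general fix.

```python
import re
from email import message_from_bytes
from email.policy import default

# Synthetic stand-in for a readpst-generated .eml with an attached mail.
RAW_QUOTED = b"\n".join([
    b"From: a@example.com",
    b"Subject: outer",
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/mixed; boundary="OUTER"',
    b"",
    b"--OUTER",
    b"Content-Type: text/html",
    b"",
    b"<p>outer body</p>",
    b"--OUTER",
    b"Content-Type: message/rfc822",
    b"",
    b'>From "sender" Mon Jan  1 00:00:00 2024',  # the superfluous quoting
    b"From: c@example.com",
    b"Subject: inner",
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/mixed; boundary="INNER"',
    b"",
    b"--INNER",
    b"Content-Type: text/html",
    b"",
    b"<p>inner body</p>",
    b"--INNER",
    b"Content-Type: image/png",
    b"Content-Transfer-Encoding: base64",
    b"",
    b"iVBORw0KGgo=",
    b"--INNER--",
    b"",
    b"--OUTER--",
    b"",
])


def unquote_from(raw: bytes) -> bytes:
    """Python equivalent of: sed 's/^>From /From /'."""
    return re.sub(rb"^>From ", b"From ", raw, flags=re.MULTILINE)


def content_types(raw: bytes) -> list:
    """Return the content type of every MIME part, depth first."""
    msg = message_from_bytes(raw, policy=default)
    return [part.get_content_type() for part in msg.walk()]


if __name__ == "__main__":
    print("quoted:  ", content_types(RAW_QUOTED))
    print("unquoted:", content_types(unquote_from(RAW_QUOTED)))
```

With the quoted line in place, the parser finds no headers in the attached message and falls back to text/plain. Once the line reads "From ", it is accepted as an envelope line and the nested parts become visible.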

Now that I'm here, I'll try and send a fix.

@vsessink

OK, these are actually two separate bugs, both of which have been addressed.

One is the quoting problem that got me here in the first place. @tballison please see your "8.eml" message: the attached mail, a message/rfc822 attachment, starts with >From (note the greater-than sign >). This bug has been fixed in 234ac91 (closing #3).

The other is #7 (as mentioned by @pabs3)

The reason I didn't see the fix is that, instead of giving the mode variable a proper value, the commit renamed the mode parameter of write_normal_email() to current_mode, so the function body was suddenly using a global named mode. Yes, I should have looked more carefully, but I still think that this is bug fix obfuscation :-S
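For readers unfamiliar with this kind of change, here is a small Python illustration of the mechanism. It is not libpst's C code. The function names and the value "separate" are invented; only the names mode and current_mode come from the comment above.

```python
# Hypothetical module-level setting, standing in for the global `mode`.
mode = "separate"


def write_email_shadowed(mode=None):
    # The parameter shadows the global. If the caller passes an unset
    # value, the body works with that instead of the global setting.
    return mode


def write_email_renamed(current_mode=None):
    # The parameter was renamed, so `mode` in the body now resolves to
    # the module-level variable. The body itself did not change.
    return mode


if __name__ == "__main__":
    print(write_email_shadowed())
    print(write_email_renamed())
```

The two function bodies are identical, which is why a fix of this shape is easy to overlook when reading a diff.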

The readpst tools in Debian 12 and Ubuntu 24.04 are still v0.6.76 and don't have these fixes.

Anyway: @pabs3 I think this can be closed.

3 participants