Handling msg within an msg #14
Comments
I'm including the original pst, the mbox, the msg, the .eml and the debug file.
Separately, we noticed that we're getting non-deterministic output when we select the .msg option. Sometimes we get 7 files and sometimes we get 8.
To be clear: the libpst library has a long history with many contributors. The current maintainers didn't create the library, but they try to merge patches promptly and work on it when they are able to. Thanks for the report and the test files; we'll take a look when we can. The issue with non-deterministic output is known and has a workaround in git master; please comment on the issue if you still see it with the latest commit:
Hi, I stumbled on this. Thanks @pabs3 for directing me here.

The reason for the bug is that readpst has the wrong idea about "From quoting". The issue is a commit from 2012, d8ddc13, where From quoting is introduced whenever something is an attachment. However, message/rfc822 does not ask for From quoting as far as I can see. I'd guess that From quoting is asked for if

The test file is useful (this is the link to the raw file, here's another test:

To reproduce:
Now start Python and use the following script (getting the right modules at hand is left as an exercise to the user):
The result will be:
This isn't correct, due to the fact that readpst (libpst) will add a superfluous
Now that I'm here, I'll try and send a fix.
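For anyone not familiar with the term: "From quoting" here refers to the mbox convention of prefixing body lines that begin with `From ` with a `>`, so that they are not mistaken for the `From ` line that separates messages. Below is a minimal sketch of that convention in Python. It is not readpst's code; `mbox_quote` and the sample text are hypothetical names made up for illustration.

```python
def mbox_quote(body: str) -> str:
    """Prefix every body line that starts with 'From ' with '>'.

    This mirrors the general mbox quoting convention; it is an
    illustration only and not taken from libpst/readpst.
    """
    return "\n".join(
        ">" + line if line.startswith("From ") else line
        for line in body.split("\n")
    )


# Hypothetical sample body, for illustration only.
sample = "Hi,\nFrom now on the build is green.\nBye"
quoted = mbox_quote(sample)
print(quoted)
```

The quoting only serves a purpose where a line starting with `From ` could be read as a message separator, i.e. in an mbox file. In a standalone .eml file, or inside a message/rfc822 part, there is no such separator to protect, which fits the observation above that message/rfc822 does not ask for From quoting.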
OK, this actually is two separate bugs, both of which have been addressed.

One is the quoting problem that got me here in the first place. @tballison please see your "8.eml" message: the attached mail, a message/rfc822 attachment, starts with

The other is #7 (as mentioned by @pabs3). The reason I didn't see the fix is that instead of giving the

The readpst tools in Debian 12 and Ubuntu 24.04 are still v0.6.76 and don't have these fixes.

Anyway: @pabs3 I think this can be closed.
Thank you so much for an awesome library. While writing a wrapper for readpst for Apache Tika, we noticed a small number of cases where there were fewer attachments when selecting the .msg output option. Tika's jira issue: https://issues.apache.org/jira/browse/TIKA-4250
We were able to reproduce this with a test file we have in our unit tests: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPST.pst
The last email "8" is an email with an embedded email, and inside that embedded email is a docx file.
This is processed correctly with rfc822 and mbox output. However, there is no msg attachment within the 8.msg file.
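One way to check whether an exported .eml still carries the embedded email and the file inside it is to walk its MIME tree with Python's standard `email` package. The sketch below is only an illustration: `build_nested_sample` constructs a stand-in message in memory (it does not read testPST.pst or any readpst output), and the function names, file name and payload bytes are assumptions.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

DOCX_SUBTYPE = "vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_nested_sample() -> bytes:
    """Build a stand-in email that embeds another email holding a docx part."""
    inner = EmailMessage()
    inner["Subject"] = "inner"
    inner.set_content("inner body")
    # Placeholder bytes, not a real docx file.
    inner.add_attachment(
        b"placeholder",
        maintype="application",
        subtype=DOCX_SUBTYPE,
        filename="report.docx",
    )

    outer = EmailMessage()
    outer["Subject"] = "outer"
    outer.set_content("outer body")
    # Attaching a message object yields a message/rfc822 part.
    outer.add_attachment(inner)
    return outer.as_bytes()


def content_types(raw: bytes) -> list:
    """Return the content type of every part, including nested messages."""
    msg = message_from_bytes(raw, policy=policy.default)
    # walk() also descends into message/rfc822 parts.
    return [part.get_content_type() for part in msg.walk()]


types = content_types(build_nested_sample())
print(types)
```

Running `content_types` on the bytes of an exported .eml should list a `message/rfc822` entry followed by the parts of the embedded email if the nesting survived the export. This only covers the rfc822 output; .msg files use a different container format and would need a different reader.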