I had a few large collection of emails in the MBOX format, and I needed to filter out emails that contained a particular keyword in either the subject or the body of the email. The goal was to extract these emails and store them in a structured format, such as CSV, for further analysis and reference.
I used Python to create a script that processes the MBOX file, that i downloaded from my apple mail and extracts the relevant emails. Key steps:
-
Extracting Emails from MBOX File: Using the
mailbox
module, the script reads the MBOX file and iterates through each email message. -
Filtering Emails by Keyword: The script checks if the keyword ("bewerbung") is present in either the subject or the body of the email. This way, only relevant emails are selected.
-
Extracting Important Information: The script extracts essential details from each email, such as the date, sender, recipient, subject, and message body.
-
Saving Data to CSV: After filtering and extracting the necessary data, the script writes the relevant information into a CSV file for easy access and analysis.
The filtering process was successfully completed, and a CSV file containing only the relevant emails was generated. This allowed me to isolate and review the specific emails that matched the criteria, enabling a more efficient way to manage and process my inbox.
- processed multipart emails by iterating through different content types.
- used
BeautifulSoup
to extract text from HTML content, removing unnecessary formatting. - handled "quoted-printable" encoding with the
quopri
module to correctly decode email bodies.
- a preprocessing step was added to ensure extracted text contained only printable ASCII characters.
- error handling mechanisms were implemented to skip problematic emails instead of terminating the script.
- converted both the subject and body to lowercase for case-insensitive matching.
- checked if the keyword appeared in either field before extracting the email.
- extracted key fields: date, sender, recipient, subject, and body.
- saved the filtered data into a CSV file for easy access and further analysis.
- encoding errors were resolved by ensuring proper decoding of email content.
- virtual environment issues and incorrect Python installations were addressed to enable smooth execution.
- error handling was improved to ensure robustness, allowing the script to continue running even when encountering problematic emails.