Time detection #65

Draft · wants to merge 14 commits into base: main


kimlee87
Contributor

Main features: detect time in multiple formats (issue #49)

  1. All phrases in the email content that match predefined patterns are extracted.
  2. Only phrases that can be parsed to a datetime are kept as candidates.
  3. Two candidates are merged into one phrase if needed.
  4. A list of date-time phrases and their corresponding indices is returned.
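
The four steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the regex patterns, the format list, and the names `PATTERN`, `FORMATS`, `parses`, and `detect_times` are all assumptions, and the merge rule (two neighbours separated only by spaces or commas whose combination still parses) is one plausible reading of "if needed".

```python
import re
from datetime import datetime

# Hypothetical patterns and formats; the PR's real ones are not shown here.
DATE_RE = r"\d{1,2}\.\d{1,2}\.\d{4}"
TIME_RE = r"\d{1,2}:\d{2}"
PATTERN = re.compile(rf"{DATE_RE}|{TIME_RE}")
FORMATS = ("%d.%m.%Y", "%H:%M", "%d.%m.%Y %H:%M")


def parses(phrase):
    """Return True if the phrase matches at least one known datetime format."""
    for fmt in FORMATS:
        try:
            datetime.strptime(phrase, fmt)
        except ValueError:
            continue
        return True
    return False


def detect_times(text):
    """Return a list of (phrase, start, end) tuples found in text."""
    # Step 1: extract every phrase that matches a predefined pattern.
    matches = [(m.group(), m.start(), m.end()) for m in PATTERN.finditer(text)]
    # Step 2: keep only the phrases that can be parsed to a datetime.
    candidates = [c for c in matches if parses(c[0])]
    # Step 3: merge neighbours separated only by spaces or commas,
    # provided the combined phrase still parses.
    merged = []
    for phrase, start, end in candidates:
        if merged:
            prev_phrase, prev_start, prev_end = merged[-1]
            gap = text[prev_end:start]
            if gap and gap.strip(" ,") == "" and parses(f"{prev_phrase} {phrase}"):
                merged[-1] = (text[prev_start:end], prev_start, end)
                continue
        merged.append((phrase, start, end))
    # Step 4: return the phrases together with their indices.
    return merged
```

With these assumed patterns, `detect_times("Treffen am 17.04.2024, 17:30 in Raum 99:99.")` yields a single merged candidate, `("17.04.2024, 17:30", 11, 28)`; `99:99` matches the time pattern but is dropped in step 2 because it cannot be parsed.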

Note: 100% accuracy is not possible.

There are two detection modes: strict and non-strict.

  • With the non-strict mode

    • Pros: Most time formats are detected, including absolute, relative, and incomplete formats
      • E.g.: Mittwoch, 17, April 2024, 17:30, 30 avril, 24 a las 3
    • Cons: Single numbers in the email content may be detected if they can be parsed to a datetime
  • With the strict mode

    • Pros: Most time formats in the headers of forwarded emails are detected. Single numbers in the email content are not detected.
    • Cons: Dates in the email content are also detected if they are in the full date-time format (e.g. 17, April 2024, 17:30)
  • The script at the end of utils.py can be used to generate results for emails from heiBOX.

  • Script for running on emails from CSV: tbu.
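
To make the trade-off between the two modes concrete, here is a minimal sketch, assuming that strict mode accepts only full date-time formats while non-strict mode additionally accepts partial formats and bare numbers. The names `FULL_FORMATS`, `PARTIAL_FORMATS`, `TOKEN_RE`, and `detect_with_mode` are hypothetical and do not reflect the actual TimeDetector interface.

```python
import re
from datetime import datetime

# Hypothetical format tables; the real detector's formats are not shown in this PR text.
FULL_FORMATS = ("%d.%m.%Y %H:%M",)
PARTIAL_FORMATS = ("%d.%m.%Y", "%H:%M", "%d")

# Longest alternatives first, so a full date-time is matched as one token.
TOKEN_RE = re.compile(
    r"\d{1,2}\.\d{1,2}\.\d{4} \d{1,2}:\d{2}"
    r"|\d{1,2}\.\d{1,2}\.\d{4}"
    r"|\d{1,2}:\d{2}"
    r"|\b\d{1,2}\b"
)


def can_parse(phrase, formats):
    """Return True if the phrase parses with at least one of the formats."""
    for fmt in formats:
        try:
            datetime.strptime(phrase, fmt)
        except ValueError:
            continue
        return True
    return False


def detect_with_mode(text, strict=False):
    """Return matched phrases; strict mode keeps only full date-time phrases."""
    formats = FULL_FORMATS if strict else FULL_FORMATS + PARTIAL_FORMATS
    return [m.group() for m in TOKEN_RE.finditer(text) if can_parse(m.group(), formats)]
```

For the input `"Call 3 people before 17.04.2024 17:30, room 45"`, the non-strict call returns `["3", "17.04.2024 17:30"]`: the single number `3` is picked up because it parses as a day of the month, while `45` does not parse and is dropped. The strict call returns only `["17.04.2024 17:30"]`.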

Other updates

  • Move init_spacy() in parse.py to utils.py to avoid circular import
    • Modify init_spacy() to select the morphologizer in the spaCy pipeline (used to get POS tags for the time detection features)
    • Set "de" as the default language of the default model
    • Note: utils.py is quite long now. Consider moving LangDetector and TimeDetector to new files
  • Add pytest markers to pyproject.toml to manage the test cases in test_utils.py
  • Add a function to clean up email content


codecov bot commented Mar 21, 2025

Codecov Report

Attention: Patch coverage is 94.54225% with 31 lines in your changes missing coverage. Please review.

Project coverage is 94.35%. Comparing base (68f8bd0) to head (393e4df).

Files with missing lines | Patch % | Lines
mailcom/utils.py | 86.09% | 31 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #65      +/-   ##
==========================================
+ Coverage   93.01%   94.35%   +1.34%     
==========================================
  Files           6        6              
  Lines         859     1400     +541     
==========================================
+ Hits          799     1321     +522     
- Misses         60       79      +19     


@kimlee87 kimlee87 requested a review from iulusoy March 21, 2025 15:47