A spam filter for emails accessed via the IMAP protocol. The following features are used to classify spam:
- Mail-Body via Bayesian Network
- Mail-Subject via Bayesian Network
- URLs in the mail via Google Safe Browsing API
- Mail-Sender via a Blacklist
After you have specified all required configuration values in the spamfilter.ini file (see next section), you can start the application via Docker or with your local Python installation (see requirements.txt for dependencies).
If the application is started in the 'USERMAIL_TRAINING' or 'ONLINE_TRAINING' start mode, you can train it further with new mails by entering 'train' in the console. This training is synchronized with the mail check.
The application can be configured through the spamfilter.ini file or through command line arguments. If a value is specified in both places, the command line argument overrides the value in the config file. To see the possible configuration values, run the application with the -h argument.
Some configuration values are required. Optional values fall back to a default if they are omitted. The config file is divided into the following sections:
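
The precedence rule can be illustrated with a minimal sketch using Python's standard `configparser` and `argparse` modules. This is not the application's actual loader: the section name `mail`, the `--key` flag spelling, and the function `load_settings` are assumptions made for illustration.

```python
import argparse
import configparser

# Hypothetical excerpt of spamfilter.ini; the section name "mail" is an assumption.
INI_TEXT = """
[mail]
username = user@example.com
host = imap.example.com
"""

# Documented defaults for two optional keys (kept as strings, like INI values).
DEFAULTS = {"port": "993", "ssl": "True"}


def load_settings(ini_text, argv):
    """Merge defaults, config-file values and command line arguments."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)

    settings = dict(DEFAULTS)
    for section in parser.sections():
        settings.update(parser[section])  # config-file values replace defaults

    cli = argparse.ArgumentParser()
    for key in ("username", "password", "host", "port", "ssl"):
        cli.add_argument(f"--{key}")  # flag names are assumed, check -h for the real ones
    args = cli.parse_args(argv)

    # Command line values win over config-file values when both are given.
    settings.update({k: v for k, v in vars(args).items() if v is not None})
    return settings


if __name__ == "__main__":
    print(load_settings(INI_TEXT, ["--port", "143"]))
```

In this sketch, a key given only in the file keeps the file value, a key given on the command line takes the command line value, and a key given nowhere keeps its default.
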
Config-Key | required | default value | description |
---|---|---|---|
username | yes | - | The username / email-address of the email-account to be checked |
password | yes | - | The password for the email-account to be checked |
host | yes | - | The imap-host-address of the mail-server |
port | no | 993 | The imap-port of the mail-server |
ssl | no | True | A flag that indicates whether or not you want to connect to the mail-server using SSL |
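
As a rough illustration of how these values could map onto a connection, the following sketch uses the standard-library `imaplib` module. The helper names `imap_class` and `connect` are hypothetical and the application may connect differently.

```python
import imaplib


def imap_class(ssl=True):
    """Pick the imaplib client class that matches the ssl flag."""
    return imaplib.IMAP4_SSL if ssl else imaplib.IMAP4


def connect(username, password, host, port=993, ssl=True):
    """Open and authenticate an IMAP connection (performs network I/O)."""
    client = imap_class(ssl)(host, port)
    client.login(username, password)
    return client
```

Port 993 is the conventional port for IMAP over SSL/TLS. If `ssl` is set to False, the server typically expects a different port (conventionally 143), so `port` would usually need to be set explicitly in that case.
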
Config-Key | required | default value | description |
---|---|---|---|
inbox | yes | - | The mailbox to check for spam |
spam_mailbox | yes | - | The mailbox into which spam mails are to be moved |
train_ham_mailbox | yes | - | The mailbox that you want to use to train ham mails |
train_spam_mailbox | yes | - | The mailbox that you want to use to train spam mails |
Config-Key | required | default value | description |
---|---|---|---|
score_threshold | no | 0.5 | The total-score threshold from which an email is treated as spam |
check_interval | no | 15 | The interval at which the inbox is regularly checked for spam |
The weight with which each feature contributes to the total score is configurable. The sum of all weights must be at least 1; it may also be higher.
Config-Key | required | default value | description |
---|---|---|---|
body_weight | no | 0 | The weight with which the mail body contributes to the total score |
subject_weight | no | 0 | The weight with which the mail subject contributes to the total score |
url_weight | no | 0 | The weight with which URLs in the mail contribute to the total score |
from_weight | no | 0 | The weight with which the sender contributes to the total score |
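
One plausible reading of how weights and threshold interact is a weighted sum that is compared against `score_threshold`. The sketch below assumes exactly that: per-feature scores in the range 0 to 1, a plain weighted sum, and a "greater than or equal" comparison. The actual formula used by the application may differ, and the names `total_score` and `is_spam` are hypothetical.

```python
def total_score(feature_scores, weights):
    """Weighted sum of per-feature spam scores (each assumed to lie in [0, 1])."""
    if sum(weights.values()) < 1:
        raise ValueError("the sum of all weights must be at least 1")
    # Features without a score contribute nothing (assumption).
    return sum(weights[name] * feature_scores.get(name, 0.0) for name in weights)


def is_spam(feature_scores, weights, score_threshold=0.5):
    """Treat a mail as spam once its total score reaches the threshold (assumed >=)."""
    return total_score(feature_scores, weights) >= score_threshold


if __name__ == "__main__":
    weights = {"body": 0.5, "subject": 0.5, "url": 0.0, "from": 0.0}
    scores = {"body": 0.8, "subject": 0.4}
    print(total_score(scores, weights), is_spam(scores, weights))
```

Under this reading, a weight sum above 1 makes it easier for a single strongly suspicious feature to push a mail over the threshold.
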
Config-Key | required | default value | description |
---|---|---|---|
start_mode | no | USERMAIL_TRAINING | The mode in which the application is started and trained. See section 'Modes' for details |
check_mode | no | NORMAL | The mode in which spam mails are handled. See section 'Modes' for details |
max_train_mails | no | 500 | The maximum number of mails used for training from each mailbox |
batch_size | no | 100 | The number of mails that are retrieved at once from the mail server |
console_log_level | no | INFO | The log level used for console output |
create_logfiles | no | False | A flag that indicates whether a logfile should be created |
Config-Key | required | default value | description |
---|---|---|---|
google_api_token | only if url_weight is greater than 0 | - | The API-key to access the Google Safe Browsing API. See the API documentation for details |
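
For orientation, a URL lookup against the Safe Browsing API can be sketched as follows. The sketch assumes the v4 Lookup endpoint (`threatMatches:find`); which API version, threat types and client identifiers the application actually uses is not stated here, so the `clientId`, `clientVersion` and threat type values are placeholders and both function names are hypothetical.

```python
import json
import urllib.request

# Safe Browsing v4 Lookup API endpoint (assumed to be the one in use).
LOOKUP_URL = "https://safebrowsing.googleapis.com/v4/threatMatches:find"


def build_lookup_payload(urls):
    """Build the JSON request body for a threatMatches:find call."""
    return {
        "client": {"clientId": "spamfilter-example", "clientVersion": "0.1"},
        "threatInfo": {
            "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING"],
            "platformTypes": ["ANY_PLATFORM"],
            "threatEntryTypes": ["URL"],
            "threatEntries": [{"url": url} for url in urls],
        },
    }


def find_threat_matches(urls, api_token):
    """Send the lookup request and return the list of matches (network I/O)."""
    request = urllib.request.Request(
        f"{LOOKUP_URL}?key={api_token}",
        data=json.dumps(build_lookup_payload(urls)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The response body is an empty JSON object when no URL matched.
        return json.load(response).get("matches", [])
```
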
It is possible to specify the Start-Mode (config-key start_mode) and the Check-Mode (config-key check_mode). The possible values and their effects are described here:
Start-Mode | description |
---|---|
PRETRAINED | The Bayesian Network is deserialized from a previously trained run |
USERMAIL_TRAINING | The mails from the specified mailboxes are used for the training |
ONLINE_TRAINING | The Bayesian Network is first deserialized from a previously trained run and then trained further on the mails in the specified mailboxes |
TESTDATA_TRAINING | The provided test mails will be used for training |
NO_TRAINING | No training will be performed. This only makes sense if the weights for body and subject are both set to 0 |
LIST_MAIL_FOLDERS | Available mailboxes for the mailbox settings will be listed. The application then shuts down |
Check-Mode | description |
---|---|
NONE | Only the training will be performed. No mails are checked |
NORMAL | Detected spam mails are moved to the specified mailbox |
FLAGGING | Instead of moving mails, they are only flagged |
DRYRUN | Mails are checked, but neither flagged nor moved |
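
The difference between the check modes can be sketched as a small dispatch over an enum. This is only an illustration under assumptions: the function `handle_spam` is hypothetical, "moving" is modelled as the classic IMAP copy, mark deleted, expunge sequence, and "flagging" as setting the `\Flagged` flag. The application may use different IMAP commands or a different flag.

```python
from enum import Enum


class CheckMode(Enum):
    NONE = "NONE"
    NORMAL = "NORMAL"
    FLAGGING = "FLAGGING"
    DRYRUN = "DRYRUN"


def handle_spam(client, uid, mode, spam_mailbox):
    """Apply the check mode to one detected spam mail and return the action taken.

    `client` is expected to behave like an imaplib.IMAP4 object with a
    mailbox already selected.
    """
    if mode is CheckMode.NORMAL:
        # Move = copy to the spam mailbox, mark the original deleted, expunge.
        client.uid("COPY", uid, spam_mailbox)
        client.uid("STORE", uid, "+FLAGS", "(\\Deleted)")
        client.expunge()
        return "moved"
    if mode is CheckMode.FLAGGING:
        # Leave the mail in place and only set a flag (\Flagged is an assumption).
        client.uid("STORE", uid, "+FLAGS", "(\\Flagged)")
        return "flagged"
    # DRYRUN changes nothing on the server; with NONE no mails are checked at all.
    return "none"
```

DRYRUN is useful for observing what the filter would classify as spam before letting it modify any mailbox.
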