
duplicate handling #106

Closed
mc0e opened this issue Aug 21, 2015 · 9 comments

Comments

@mc0e

mc0e commented Aug 21, 2015

I've been doing some filtering operations on largish mailboxes, and hit a situation where my copy operation was timing out. imapfilter repeatedly reconnected and failed to complete the copy. I had to kill imapfilter, which left me with a large number of messages that had been copied, but not yet deleted from the source.

So what I now want to do is sort out duplicates, based on the Message-Id header. I imagine it might be possible to do this with some processing in Lua, but I don't see explicit support in imapfilter.

What I'd like would be functions like:

  • messages:no_duplicates() - filter a message table (which might or might not refer to multiple mailboxes) so that only the first message with a given Message-Id header remains.
  • messages:duplicates() - filter a message table so that only the second and subsequent messages with a given Message-Id remain.

It would be possible to calculate messages:no_duplicates() using messages:select_all() - messages:duplicates(), but this might be less efficient, and it would be less user-friendly.
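
To make the intent concrete, here's a rough sketch of the semantics (untested; the helper names are placeholders, written as standalone functions rather than methods, and I'm assuming each element of a message table is a {mailbox, uid} pair and that fetch_field() returns the raw header line):

-- Sketch only: these helpers are not part of imapfilter.
function duplicates(messages)
    local seen = {}
    local result = Set {}
    for _, message in ipairs(messages) do
        local mailbox, uid = table.unpack(message)
        local id = mailbox[uid]:fetch_field('Message-Id')
        if id == nil then
            -- no Message-Id header at all; treat the message as unique
        elseif seen[id] then
            -- second or subsequent occurrence: keep it
            table.insert(result, message)
        else
            seen[id] = true
        end
    end
    return result
end

function no_duplicates(messages)
    local seen = {}
    local result = Set {}
    for _, message in ipairs(messages) do
        local mailbox, uid = table.unpack(message)
        local id = mailbox[uid]:fetch_field('Message-Id')
        if id == nil or not seen[id] then
            -- first occurrence (or no Message-Id): keep it
            if id ~= nil then seen[id] = true end
            table.insert(result, message)
        end
    end
    return result
end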

@mc0e
Author

mc0e commented Aug 21, 2015

OK, so I started doing my own de-dup'ing in Lua. I have no Lua experience, but with a little googling, it looks like the following should be close to right:

messages = myaccount[folder]:select_all()
seen = {}

for _, message in ipairs(messages) do
    mailbox, uid = table.unpack(message)
    -- fetch_field() fetches a single header field, e.g. "Message-Id: <...>"
    messageId = mailbox[uid]:fetch_field('Message-Id')
    if seen[messageId] then
        -- delete message
    else
        seen[messageId] = true
    end
end

At this stage, I don't know how to delete a message when I have it outside the context of a result set.

@lefcha
Owner

lefcha commented Aug 21, 2015

First, I'd be interested to know more about the copy operation failing, ideally with debug file output. You can get a debug file with -d debug.log; please copy/paste the relevant parts of it. Thanks!

Then, regarding deleting duplicates: your example looks very nice, and you can see examples of how to construct a message set, in order to use it to delete the duplicate messages, in the https://github.com/lefcha/imapfilter/blob/master/samples/extend.lua file. Specifically, the 2nd example there is what you want to do.

But I'll just summarize here that you need to do more or less something like:

seen = {}
results = Set {}
for _, message in ipairs(messages) do
    mailbox, uid = table.unpack(message)
    messageId = mailbox[uid]:fetch_field('Message-Id')
    if seen[messageId] then
        -- collect the duplicate as a {mailbox, uid} pair
        table.insert(results, {mailbox, uid})
    else
        seen[messageId] = true
    end
end
results:delete_messages()
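
Note that each element inserted into the set is a {mailbox, uid} pair, the same representation that select_all() returns; that is the form the set methods such as delete_messages() expect.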

@mc0e
Author

mc0e commented Aug 24, 2015

There's a sanitised debug log here, starting after the bit full of message content data, and with user/domain/folder names removed:

https://gist.github.com/mc0e/d70dae9f639637c0f8c7

It'd be nice if the standard copy operation handled batching behind the scenes, breaking the work into batches of some maximum number of messages, with messages flagged for deletion as each batch completes.

I'm just about to have another go at the Lua for removing the duplicates, so nothing to add there yet.

@lefcha
Owner

lefcha commented Aug 25, 2015

Regarding the problem with the COPY request failing, and then imapfilter recovering the connection, and so on... Try setting the following line at the top of your configuration file, and then try again:

options.limit = 50
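
(With that set, imapfilter breaks long requests into chunks of at most 50 messages instead of sending everything at once.)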

@mc0e
Author

mc0e commented Aug 28, 2015

I wound up implementing my own batch processing as follows:

seen = {}
delete_these = Set {}
counter = 0
for _, message in ipairs(messages) do
    mailbox, uid = table.unpack(message)
    messageId = mailbox[uid]:fetch_field('Message-Id')

    if seen[messageId] then
        -- collect the duplicate for deletion
        io.write(string.format("To delete     : %s\n", messageId))
        table.insert(delete_these, message)
        counter = counter + 1
        if counter == 500 then
            -- delete in batches of 500 to keep requests short
            io.write("deleting messages\n")
            delete_these:delete_messages()
            delete_these = Set {}
            counter = 0
        end
    else
        io.write(string.format("First Sighting: %s\n", messageId))
        seen[messageId] = true
    end
end
delete_these:delete_messages()

@lefcha
Owner

lefcha commented Aug 28, 2015

Well, that is already implemented inside imapfilter in mailbox.lua, and has been for many years now; it's just that I disabled this functionality, because IMAP-compliant servers should have no limitation on the number of items passed to them.

As a workaround for servers that can't cope with that, there is the options.limit option, which basically brings back the old way of handling many items. So imapfilter will work the same as versions prior to 2.6 when you use:

options.limit = 50

That's why I suggested you use it, in order to overcome the problems with the COPY request and possibly the deletions failing.

@mc0e
Author

mc0e commented Aug 30, 2015

I've just been trying out options.limit = 50. It seems to be working for moving messages, but not for unmark_seen(). It'd be nice if there was a setting to not alter the \Seen flag of messages accessed for filtering.

I don't see options.limit in the documentation; it seems like it should be there. It also seems to me that if you set a timeout by default, then that implies the limit should also be set. The size of a task that can be sent to the server might not be limited, but the size of a task that can complete in finite time is.

@lefcha
Owner

lefcha commented Aug 30, 2015

What do you mean that it's not working for unmark_seen()? What happens?

Also, the \Seen flag should not be altered; we need to investigate what is happening here, and a debug log is necessary. Please provide the relevant parts of the debug log, where you think something incorrect is taking place.

Regarding not seeing options.limit in the man page: what version are you using? You can see that with imapfilter -V.

Also, options.limit and options.timeout are not connected: the first is there as a workaround for problematic servers that can't cope with long requests, while the latter is there to cope with network failures, and it is not related to the server's or client's bandwidth or how fast they can receive/send data.

For example, let's say imapfilter is sending a request to the IMAP server; it then waits up to options.timeout seconds for an answer, even a partial one. If imapfilter doesn't receive any network traffic for that time, it considers the connection lost. If the server sends partial results at a slow rate, imapfilter considers the connection healthy, and no network timeout occurs.
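
So, as an illustration, the two options can be set independently at the top of the configuration file (the values below are only examples):

-- Illustrative values only; the two options are independent of each other.
options.timeout = 120 -- seconds without any network traffic before the connection is considered lost
options.limit = 50    -- maximum number of messages per request, for servers that can't cope with long requests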

@lefcha lefcha closed this as completed Jan 17, 2016
@jpbrucker

The above example, which handles duplicate messages, seems to require minor changes with recent versions of imapfilter. I use:

seen = {}
duplicates = Set {}
results = account["Inbox"]:select_all()
for _, message in ipairs(results) do
    mailbox, uid = table.unpack(message)
    messageId = mailbox[uid]:fetch_field("Message-Id")
    -- Remove prefix to ignore Id/ID difference.
    messageId = string.sub(messageId, 12)
    if seen[messageId] then
        table.insert(duplicates, {mailbox, uid})
    else
        seen[messageId] = true
    end
end
duplicates:mark_seen()
duplicates:move_messages(account["dups"])
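
One possible caveat (an assumption, not something the example above guards against): fetch_field() returns the whole header line, e.g. "Message-Id: <...>", which is why a fixed-length prefix is stripped; if a message has no Message-Id header at all, fetch_field() may return nil, and the string.sub() call would then need a guard.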
