Today I got tasked with removing duplicate mails from a mail folder with over 100,000 mails in it. Doing this from a mail client is so impractical that it's not worth considering. Fortunately, the mailbox is on a mail server using Maildir-style mailboxes, so I knew this could be done with minimal effort.
I discovered the 'reformail' utility, which comes from the Courier project (it is normally packaged with maildrop), and after a few trial runs, I settled on the following:
# cd /path/to/mailbox/Maildir/cur
# for i in $(find . -type f); do reformail -D 10000000 /tmp/duplicates <"$i" && rm "$i"; done
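-D detects duplicates. reformail itself deletes nothing: it exits with status 0 when the message ID has already been seen, and that is what lets the && rm remove the file.
10000000 is the approximate size, in bytes, of the temporary file where the list of message IDs will be written.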
/tmp/duplicates is the aforementioned temporary file.
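Since the loop above deletes files for good, it may be worth previewing what counts as a duplicate first. The following is only a rough sketch of the same idea in plain find/awk, not what reformail does internally: it remembers every Message-ID it has seen and prints the files whose ID already turned up in an earlier file. list_duplicates is a name I made up for illustration, and the sketch assumes the Message-ID header is not folded onto a second line.

```shell
#!/bin/sh
# Dry run: print files whose Message-ID was already seen in an earlier file.
# list_duplicates is a hypothetical helper, not part of reformail.
# Assumes one unfolded Message-ID header per mail and no tabs or newlines in file names.
list_duplicates() {
    find "${1:-.}" -type f -exec awk '
        FNR == 1 { in_header = 1 }            # a new file starts: back in the header block
        in_header && /^$/ { in_header = 0 }   # the first blank line ends the headers
        in_header && tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")          # keep only the ID itself
            print $0 "\t" FILENAME
            in_header = 0                     # only the first Message-ID per file counts
        }
    ' {} + | sort -t "$(printf '\t')" -k 2 | awk -F '\t' 'seen[$1]++ { print $2 }'
}
```

Running list_duplicates /path/to/mailbox/Maildir/cur prints one file name per line and removes nothing. Which copy counts as the "first" depends on the sort order of the file names, so treat the output as a preview rather than an exact prediction of what the reformail loop will delete.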
The temporary file needs to be big enough to hold the message ID of every mail. In this particular case, I found the average message ID to be 54 characters long, but I would suggest budgeting around double that per mail to be safe. For roughly 100,000 mails that works out to about 10,000,000 bytes, which is the value used above. Adjust to your needs.
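To get that number for your own mailbox, something along these lines should do. It is a minimal sketch, assuming unfolded Message-ID headers; suggest_cache_size is a hypothetical helper name, and the "double the average" factor is just the rule of thumb from above, not anything reformail requires.

```shell
#!/bin/sh
# Estimate a value for reformail's -D length from the mails in a directory.
# suggest_cache_size is a hypothetical helper; the 2x safety factor is a rule of thumb.
suggest_cache_size() {
    find "${1:-.}" -type f -exec awk '
        FNR == 1 { in_header = 1 }            # a new file starts: back in the header block
        in_header && /^$/ { in_header = 0 }   # the first blank line ends the headers
        in_header && tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")          # keep only the ID itself
            print length($0)
            in_header = 0                     # only the first Message-ID per file counts
        }
    ' {} + | awk '
        { total += $1; n++ }
        END {
            if (n == 0) { print "no Message-ID headers found"; exit 1 }
            # suggested size = number of mails x double the average ID length
            printf "messages=%d avg_len=%.0f suggested_len=%d\n", n, total / n, 2 * total
        }'
}
```

Call it as suggest_cache_size /path/to/mailbox/Maildir/cur and use suggested_len (rounded up generously) as the number after -D.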
In a big mail folder, and especially on ext3, this will take a long time to complete.