Today I got tasked with removing duplicate mails from a mail folder with over 100,000 mails in it.  Doing this from a mail client is so impractical, it’s not even worth giving any thought at all.  Fortunately, the mailbox is on a mail server using Maildir style mailboxes, so I knew this could be done with minimum effort.

I discovered the ‘reformail’ utility, part of courier-imap, and after a few trial runs, I settled on the following:

# cd /path/to/mailbox/Maildir/cur

# for i in `find . -type f`; do reformail -D 10000000 /tmp/duplicates <$i && rm $i; done

-D looks for, and deletes duplicates.

10000000 is the length of the temporary file where a list of message IDs will be written

/tmp/duplicates is the aforementioned temporary file.

The temporary file needs to be big enough to accommodate the message ID of each mail.  In this particular case, I have found the average length to be 54 characters, but I would suggest using around double that to be safe.  So adjust to your needs.

In a big mail folder, and especially on ext3, this will take a long time to complete.