Simple SPAM solution with auto-learn and recovery options

Perhaps you’ve come across those fancy, branded solutions that allow quarantining mail messages, delivery on demand (when you click a link in a suspected message), filtering incoming messages and so on. Upon some research it appears that these come with a price tag of over $60,000 in typical configurations. Expensive, huh? This got me thinking: how hard would it be to deliver such functionality using open source solutions that come under the GPL license? Well, I’m a bit rusty now, but every now and then I try to do something like this, for mental hygiene. It did not take me more than a couple of hours, so it is simple and minimalistic.

Side note: generally I think off-the-shelf solutions most of the time lack flexibility, while when it comes to their advantages, more often than not it is the packaging, the branding and some poor fellow on on-call support ready to answer basic questions. On the other hand, with the right team you can do wonders using GPL tools.

For the purpose of presenting the system in this article, I will assume you were able to set up your server(s) with an MTA, clamav and spamassassin using whatever configuration you need to efficiently deliver email to your customers – internal or external. For me – I use exim, spamassassin and clamav, but you could use any other MTA you prefer.

Usually an MTA plus spamassassin allow spam filtering by default using a collection of rules written in Perl. On top of that, you can optionally enable a Bayesian classifier that learns from the mail it sees. Such a setup offers the following functionality:

  • Unsolicited mail gets filtered out to some extent by the Perl rules which look for certain combinations of words and score penalty points.
  • Other messages can be learned by the classifier to improve the efficiency of the filter.
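
For reference, the Bayesian part is controlled from SpamAssassin’s local.cf. A minimal sketch might look like this – the threshold values are illustrative only, tune them to your own traffic:

```
# local.cf sketch - enable the Bayesian classifier and auto-learning
use_bayes 1
bayes_auto_learn 1
# score boundaries at which messages are auto-learned as spam/ham (illustrative values)
bayes_auto_learn_threshold_spam 12.0
bayes_auto_learn_threshold_nonspam 0.1
```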

Something is missing, right? For example:

  • You need manual intervention in case the filters missed a spam message
  • Nothing handles false-positive hits

This guide will explain the most minimalistic method I could think of for delivering this missing functionality using open source solutions. Let’s start with the easiest one – auto-learn without admin intervention.

Enabling auto-learn per maildir folder

This will allow me to further train the filter to reach decent values. With the filter well trained, I was able to achieve an observed efficiency of 100% for many months, while processing an average of up to 1000 messages per day.

In this setup, I use the maildir format for storage, so that each IMAP folder is a filesystem folder and each message is a file. This allows me to create for every user I support a special folder called SPAM directly inside the INBOX. As part of the design, I can then ask users to move all messages they consider spam to that folder, and have a script that will periodically (use your favorite implementation of cron; I use cronie) scan these folders for new messages, learn them as spam and delete them.

Additional ideas – you can make this smarter by:
1. Adding more controls to prevent mistaken drag-and-drop-and-forget.
2. Moving instead of deleting, to some quarantine folder.

A script performing such activity can look like this:

#!/bin/bash
# Learn messages thrown to the SPAM folder as SPAM and delete them
# Script by Patryk Rzadzinski

# Configurables
maildir="put path to your maildir here";
spamdir="put the name of your SPAM maildir folder here";
log="/usr/bin/logger -t spamd";

# For whom should this run? Fill the array with all your users
mailers=(Alice Bob Charlie)

for user in "${mailers[@]}"; do
	if [ -d "$maildir/$user/$spamdir" ]; then
		find "$maildir/$user/$spamdir" -type f ! -name "dovecot*" ! -name "maildirfolder" | while read -r spam; do
			sa-learn --spam "${spam}" >/dev/null 2>&1 && ${log} "Learned ${spam} message as spam.";
			rm -f "${spam}" && ${log} "Deleted ${spam}.";
		done
	fi
done

Since this is a new thing, I have enabled simple logging using the logger tool, available on any Linux system. The tag used here is “spamd”, which is arbitrary; however, I have previously configured my syslog-ng to catch all system messages using this tag and keep 30 days of files in a specific folder. This might come in handy for debugging, but is not the subject of this article.
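
For completeness, the syslog-ng side of that can be as simple as the following sketch – the source name and file path are assumptions, adapt them to your existing configuration:

```
# syslog-ng sketch: route everything tagged "spamd" to its own file
filter f_spamd { program("spamd"); };
destination d_spamd { file("/var/log/spam/spamd.log"); };
log { source(s_src); filter(f_spamd); destination(d_spamd); };
```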

Naturally, we need a crontab entry for this to work. How often should this script run? I think it depends on your load and number of users. For home solutions you can have this run every minute, with the benefit of acting quickly and the downside that someone moving a file by mistake will have little time to react. I would recommend starting with hourly runs and then fine-tuning as required.
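
For example, a hypothetical crontab entry for hourly runs (the script path is an assumption):

```
# m h dom mon dow command - run the learn-and-delete script at the top of every hour
0 * * * * /usr/local/bin/spam-learn.sh
```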

Allowing the user to deal with false positives

One thing I really wanted at some point was to make sure an important email does not simply get filtered out. At the same time, I wanted the vast majority of the spam that reaches my mailbox to simply disappear. I designed the following compromise: a weekly digest of all the messages that were not delivered to me because the system considered them spam (in other words, they scored a sufficient number of penalty points), plus an option to recover each such message with 1-2 clicks. To achieve that, I wrote the following script that I have cron run for me once a week to generate a report about messages that were not delivered to me but thrown into the spam directory instead.

#!/bin/bash
# Configurables
spamdir="path to your spam folder";
spam_report="put path for the report file here";
rcpt="put the digest recipient address here";
# The link opens a pre-addressed message; put your recovery address here
recovery="<a href=\"mailto:put-your-recovery-address-here?subject=";

cd "${spamdir}" || exit 1;
rm -f "${spam_report}";

# Removal of the base64 encoded text
Mail_decode () {
        decoded="$(echo "${1}" | cut -d' ' -f2)";
        if [[ "${decoded}" =~ "UTF-8" ]]; then
                stripped="$(echo "${decoded}" | sed -e 's:=?UTF-8?[BQ]?::g' -e 's:?=::')";
                decoded="$(echo "${stripped}" | base64 -d)";
        fi
        printf "${decoded}\n";
}

# Generate labels from spam message headers (last 7 days = 604800 seconds)
for spam in *; do
        if [[ ! -d ${spam} ]] && (( ($(date +%s) - $(/usr/bin/stat -c %Y "${spam}")) < 604800 )); then
                (printf "Message ID: ${spam}\n";
                printf "RECV: $(grep -i received: "${spam}")\n";
                printf "$(grep -i from: "${spam}" | tr -d '<>')\n";
                printf "$(grep -i to: "${spam}" | tr -d '<>')\n";
                printf "Subject: $(Mail_decode "$(grep -i subject: "${spam}")")";
                printf "${recovery}${spam}\">RECOVER</a>\n";
                printf "=========================================\n";) >> "${spam_report}";
        fi
done

if [[ ! -s "${spam_report}" ]]; then
        echo "no new messages" | /usr/bin/mailx -s "spam-digest: no new messages" "${rcpt}";
else
        cat "${spam_report}" | /usr/bin/mailx -a 'Content-Type: text/html' -s "spam-digest" "${rcpt}";
fi

This script goes over the messages and, in case they arrived over the last 7 days, collects the information that would allow me to distinguish them from actual spam and sends over a summary. It also adds an HTML link which allows me to recover such messages. Last but not least, I change the MIME type of the report message to HTML to allow processing of the a href part by an MUA – this makes the solution work from roundcube and from mail clients on my phone or laptop. I think it will also work with pine or mutt.
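
To illustrate what Mail_decode does with an RFC 2047 encoded header, here is a stand-alone sketch using one of the encoded subjects from the sample report:

```shell
# Stand-alone sketch of the Mail_decode logic: strip the RFC 2047 wrapper, then base64-decode
encoded="=?UTF-8?B?SmVycm9sZCBTb3Rv?="
decoded="$(echo "${encoded}" | sed -e 's:=?UTF-8?[BQ]?::g' -e 's:?=::' | base64 -d)"
printf '%s\n' "${decoded}"   # prints: Jerrold Soto
```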

In any case, the script results in the following report:

ID: q1aYxhW-74796
Envelope: (envelope-from
From: Contact
To: info
Subject: Re: Information
ID: q1aYzMJ-74797
Envelope: (envelope-from
From: =?UTF-8?B?SmVycm9sZCBTb3Rv?=
To: =?UTF-8?B?cGF0cnlr?= Reply-To: =?UTF-8?B?cGF0cnlr?=
Subject: Free watches
ID: q1aYzTR-74849
Envelope: (envelope-from
From: someone trying to send spam
To: =?UTF-8?B?cGF0cnlr?= Reply-To: =?UTF-8?B?cGF0cnlr?=
Subject: Inheritance

This way, at a glance, I know what hit the spam bin. Each “RECOVER” text is actually an HTML link pointing at an email address using the mailto: directive. Such a directive on most systems is configured to open a new message in the MUA the user (or the corporation) has chosen, and it takes additional parameters. In my example, I set the Subject to the file name of the spam message I want to recover. When I click (or tap) on RECOVER under a message that I consider not to be spam, it will open a new “compose email” window with a pre-defined subject and send it to a special mail address configured on my system, which will then trigger a script re-delivering the falsely-spambinned message to the original recipient.
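
Assuming a hypothetical recovery address of RecoverSpam@example.com, the generated link for one of the messages above would look like this:

```
<a href="mailto:RecoverSpam@example.com?subject=q1aYzTR-74849">RECOVER</a>
```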

How to configure exim to understand and process such messages? This can be done at the ACL level, using the “run” command, but I keep such functionality in the system filter. I use the following logic: when a message comes from an address that has authenticated (make sure you deny non-authenticated submissions) to the pre-defined RecoverSpam recipient, execute a shell script with the following 2 arguments: name_of_file (we pass that in the subject, right?) and original sender (so if I want something recovered, the system sends the message back to me – this is the envelope-from field).

Here’s an example of exim system filter configuration:

# Exim filter
logfile /var/log/exim/filter.log

# Recovery requests: the To: header matches the pre-defined RecoverSpam recipient
if "$h_to:" contains "RecoverSpam" then
	pipe "/path/to/recovery-script \"$h_subject:\" \"$sender_address\""
endif

# regular spam action depends on the custom header line injected by the ACL
if $message_headers contains "X-ACL-Warn: message looks like spam" then
	save /path/to/spamdir 0640
endif

Very simple and minimalistic, which was the choice set in the beginning of this article.

If you have a CISO looking at all this, then you might want to secure it a bit more. The easiest way would be adding more checks. One nice idea that comes to mind is adding soft tokens, which could come from any system daemon based on, say, normalized 8-character strings from /dev/urandom. These can be injected into headers of the messages; you can define whatever you want and then have the script check their validity when processing the message for recovery. Be careful in larger installations, where you might run out of entropy. In such cases you could simply install a dedicated token-issuing daemon.
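
A sketch of such a token mint, assuming an 8-character alphanumeric format and a hypothetical X-Recovery-Token header name:

```shell
# Mint an 8-character soft token from /dev/urandom
token="$(tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 8)"
# Inject it as a (hypothetical) header the recovery script can later validate
printf 'X-Recovery-Token: %s\n' "${token}"
```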

The last element of this system is the script that processes the recoveries. Again, this is the simplest approach, which can be nicely expanded with additional security and sanity checks as needed.

#!/bin/bash
# Recover files from spam, by Patryk Rzadzinski

# Configurables
log="/usr/bin/logger -t spamd";
spamdir="/var/spool/spam";
# Sendmail-compatible delivery command of your MTA (exim ships one)
mailer="/usr/sbin/sendmail";
# Local part of the original sender address - used for the passwd check below
userid="$(echo "${2}" | cut -d'@' -f1)";

# Change this to match your spam tag or token
clean="/bin/sed -i '/X-ACL-Warn: message looks like spam/d'";

# Conditioning - tighten as required.
# Replace the last check with whatever authentication method you use - this one is for passwd
if [ -n "$1" ] && [ "$#" -eq 2 ] && [[ "${userid}" =~ ^[a-zA-Z]{4,}$ ]] && id -u "${userid}" >/dev/null 2>&1; then
	# remove the spam identificator from the message
	eval "${clean}" "${spamdir}/${1}" && ${log} "Removed spam tag from message $1.";

	# teach spamd that this message is not spam
	sa-learn --ham "${spamdir}/${1}" >/dev/null 2>&1 && ${log} "Spamd learned message $1 as ham.";

	# re-deliver
	cat "${spamdir}/${1}" | ${mailer} "${2}" && ${log} "Re-delivered message $1 to recipient.";

	# delete message from spamdir
	rm -f "${spamdir}/${1}" && ${log} "Removed message file $1 from ${spamdir}.";

	# return 0 to confirm script completed successfully
	exit 0;
else
	printf "Arguments passed to spam recovery are not OK - 1: $1 2: $2 userid: ${userid}.\n";
	exit 1;
fi

As before, there is again space for improvement. For example, instead of a single RECOVER “button”, you could make the action depend on arguments and offer the user a choice: delete, deliver, learn as HAM and deliver, etc. In my case, RECOVER means (in this order): remove the spam tag AND learn as HAM AND re-deliver to the original recipient AND remove from the spamdir. This is simple but can be improved for flexibility. Even better if you expand the mail-digest script with some branding to make your users happy and give them that “expensive solution” feel – add some graphics.

And that’s all! Note that you could move this whole logic to PHP and process it over the web, but that makes it much less minimalistic, and you end up securing PHP, which is always fun and full of pitfalls.

Another thing worth doing is moving the recovery process to a Docker container running only exim and sharing the spamdir folder, which would be a nice piece of sandboxing. I did not do this only because my VPS is low on disk space, but this solution should make CISOs happy and also offer some peace of mind.

Comment on vision & strategy in light of the Apple vs FBI case

Originally I published this on LinkedIn; here’s the comment.

The [lost] ability to define long term [IT] strategy

One thing I absolutely love about the US presidential elections is that the candidates are actually challenged to provide opinions on theoretical subjects that matter. This allows a decent insight into their ability to define long-term strategies, often on subjects which are very abstract and require a certain type of mental discipline and the ability to imagine multiple levels of implications. The most recent example is the demand that Apple introduce backdoors into their software.

Long story short, the candidates remaining in the race, with the exception of the Libertarian Party’s, agree that the government needs to have some sort of a “master key” in order to tap into communications where required [1]. As commenters [2] have already noticed, such a requirement leads to a number of issues. First of all, the next vendor might not be US-based, and thus the whole effort will be futile. Second, the suspects so far have used software as provided by the vendor, but what stops them from creating a cryptographic application that encrypts the communication for them? Nevertheless, the presidential candidates make it quite clear in their speeches: security above liberty, because “something must be done”.

So that leads me to the long-term part. By the rule of induction – should cryptography be banned altogether? No ciphers, all communication in the open? I think it is safe to say such an approach will simply not work. In the vast majority of cases cryptography is there to secure our information, payment card information included.

So the residual facts that remain seem to be:

  • Backdoors or master keys are not the answer – they do not solve the problem, though they are likely to win some votes. And they open a whole new bag of problems.
  • Cryptography is here to stay.

But the long-term question stands: what should we be doing to avoid these situations? And how does all this tie to IT? In my line of work, I often lead technical incident response teams challenged to find a solution to an actual problem. One thing I have learned over time is that sometimes having the best minds on the team is simply not enough to solve a case. Sometimes you need to take a few steps back and realize the root cause is outside the picture everyone is focusing on. Sometimes a long-term strategy of doing (or not doing) certain things in a certain (standardization!) way offers the solution – the catch is, it might seem completely unrelated!

Does that mean the FBI should just allow San Bernardino to happen? Of course not; it is simply that the root cause lies completely outside the scope of the discussion, and cryptography has nothing to do with it. The problem will not be solved in this area – but who will be the leader who can still notice that, and can democratic elections still give us such leaders?