facileOCR

Plugin for SpamAssassin


What is it?

facileOCR uses convert (ImageMagick) and ocrad to extract words from image spam and compares them with a configurable list of spamwords. Every matched word adds 0.8 points to the score.

It handles PNG, JPEG and GIF images (including animated GIFs), whether one or several, inline or attached to an email.
To keep resource usage as low as possible, the plugin only scans images between a minimum (6 kB) and a maximum file size. The maximum image size is configurable (10 kB to 100 kB). In addition, there is a maximum number of layers for animated images (9).
The convert + ocrad call has a timeout of 10 seconds, which should never be reached in practice. Normally, running the spam tests with the plugin takes 1 to 3 seconds longer per mail than without facileOCR.
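
The limits described above can be pictured with a small Python sketch. This is not the plugin's code (facileOCR itself is a Perl module), and every name in it is hypothetical. It also makes assumptions the text does not confirm: that 1 kB means 1024 bytes, that the limits are inclusive, that an out-of-range maximum is clamped into the 10 kB to 100 kB range, and that an animation with more than 9 layers is skipped entirely.

```python
import subprocess

# Hypothetical names; values taken from the description above.
MIN_IMAGE_BYTES = 6 * 1024                        # fixed minimum (assumes 1 kB = 1024 bytes)
MAX_IMAGE_BYTES_RANGE = (10 * 1024, 100 * 1024)   # allowed range for the configurable maximum
MAX_LAYERS = 9                                    # limit for animated images
OCR_TIMEOUT_SECONDS = 10                          # timeout for the external OCR call


def clamp_max_size(configured_bytes):
    """Force the configured maximum into the allowed range (an assumption)."""
    low, high = MAX_IMAGE_BYTES_RANGE
    return max(low, min(high, configured_bytes))


def should_scan(size_bytes, layers, configured_max_bytes):
    """Return True if an image is inside the size and layer limits."""
    max_bytes = clamp_max_size(configured_max_bytes)
    if size_bytes < MIN_IMAGE_BYTES or size_bytes > max_bytes:
        return False
    # Assumption: images with too many layers are skipped, not truncated.
    return layers <= MAX_LAYERS


def run_with_timeout(command, timeout=OCR_TIMEOUT_SECONDS):
    """Run an external command; return its stdout, or None on timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

With a configured maximum of 50 kB, should_scan(8 * 1024, 1, 50 * 1024) is True, while a 4 kB image, a 60 kB image, or an animation with 10 layers is rejected before any external program is started. The actual convert and ocrad command lines are deliberately left out here, because the text above does not state them.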

The plugin can write debugging messages to a logfile, which makes it easier to find out what it is doing and how the score is reached. It has three ocr_profiles to conveniently switch the way ocrad extracts text.

facileOCR was tested on Linux with SpamAssassin version 3.2.4 running on Perl version 5.10.0,
with SpamAssassin version 3.2.3 running on Perl version 5.8.8 (with amavisd-new-2.4.3),
and with SpamAssassin version 3.2.5 running on Perl version 5.8.8 (with spamd).

Important note: Linux means Linux and NOT Unix! Whoever gave currency to that rumour was wrong.

Yet another OCR plugin?

Yes. None of the existing plugins was exactly what I needed, and some of them are no longer under active development. So I sat down and wrote my own, inspired by plugins like BayesOCR.

Dependencies

facileOCR needs
  1. Linux
  2. spamassassin
  3. convert and identify (ImageMagick)
  4. ocrad

License

This software is released under the Apache Software License (version 2.0).

Download and Installation

Download the files from facileOCR-[current].tar.gz
Unpack the archive and follow the instructions in the INSTALL file.

Contact

Send feedback, bug reports, feature requests... to shee2ne [at] yahoo.com.
Keep in mind that this software was made in my spare time. No commercial support is available.

Troubleshooting

Error message in maillog:
print() on closed filehandle FILE at /etc/mail/spamassassin/facileOCR.pm
==> remove or chown /tmp/sa_OCR.log
This happens after testing as root: the logfile is then owned by root, and the spamd user has no write permission.

No score, no words found:
Turn on the debug log and check what is actually extracted.
You can copy repeatedly recognized words over to your spamwords list. A "word" in the spamwords list doesn't have to be a real word; it can also be something like "notcllck,Fvorle,www,óS Wq óS,Favorl". Upper and lower case don't matter.
If the text in most spam images is too obfuscated, try a different ocr_profile, e.g. 3.
 
Sometimes spammers use only a few words. To increase the score, you can use a little trick in your spamwords list: add parts of a word as separate entries.
spamwords cialis,cial,alis,viagra,agra,viag,nagra,penis,enis
(and so on...)
With a list like that, the whole word "Cialis" gives a score of 2.4, because cialis, cial and alis match. A partly recognized word still gives 0.8 points.
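
The arithmetic can be checked with a short Python sketch. It is only an illustration, not the plugin's Perl code, and all function names are hypothetical. It assumes that each spamwords entry is matched as a case-insensitive substring of the extracted text and counts at most once per image; the text above only confirms the 0.8 points per match and that case doesn't matter.

```python
POINTS_PER_MATCH = 0.8  # from the text above: each matched word gives 0.8 points


def parse_spamwords(value):
    """Split the comma-separated value of a spamwords line into lowercase entries."""
    return [w.strip().lower() for w in value.split(",") if w.strip()]


def matched_entries(ocr_text, spamwords):
    """Return the entries found in the OCR text.

    Assumption: case-insensitive substring match, each entry counted once.
    """
    text = ocr_text.lower()
    return [w for w in spamwords if w in text]


def score(ocr_text, spamwords):
    """Score = number of matched entries times 0.8, rounded to one decimal."""
    return round(len(matched_entries(ocr_text, spamwords)) * POINTS_PER_MATCH, 1)


words = parse_spamwords("cialis,cial,alis,viagra,agra,viag,nagra,penis,enis")
print(matched_entries("Buy CIALIS now", words))  # ['cialis', 'cial', 'alis']
print(score("Buy CIALIS now", words))            # 2.4
print(score("Buy C1ALIS now", words))            # 0.8, only 'alis' matches
```

The last call shows the second half of the trick: when OCR misreads one letter, the full word no longer matches, but a fragment still does.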

Too high score:
You can easily influence the score by adding or removing words.
The original spamwords list (May 2009) was tuned for a high score. If you prefer a lower score, remove some words. Each matched word gives 0.8 points.

Typo in facileOCR.cf, version 0.3:
# you will find the log in /tmp/OCR.log
The logfile is /tmp/sa_OCR.log (not /tmp/OCR.log).

Changelog

Version 0.5

Version 0.4

Version 0.3

Extra Downloads

Here are two extra "Viagra Image Rules-of-the-Day" to download (8.12.2009):

is_viagra_img.cf
susp_jpg.cf

To use these as global rules, copy the files to /etc/mail/spamassassin/ and reload spamd (or amavis).