facileOCR
plugin for spamassassin
What is it?
facileOCR uses convert and ocrad to extract words from image spam and compares them to a configurable list of spamwords. Every matched word increments the score by 0.8 points. It handles PNG, JPEG and GIF (also animated) images, one or several, inline or attached to an email.
To keep resource usage as low as possible, the plugin only scans images between a minimum
(6kB) and a maximum file size. The maximum image size is configurable (10kB - 100kB).
In addition, image animations are limited to a maximum number of layers (9).
The convert + ocrad call has a timeout of 10 seconds, which should never be reached in practice.
Normally, running the spam tests with facileOCR takes 1 to 3 seconds longer per mail than without it.
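The limits and the timeout described above can be sketched in a few lines of Python. This is only an illustration, not the plugin's code (facileOCR itself is a Perl module). The function names should_scan and run_with_timeout are hypothetical, and the sketch assumes that 1 kB means 1024 bytes, that all limits are inclusive, that an out-of-range maximum is rejected, and that 100 is a sensible default maximum.

```python
import subprocess
import sys

MIN_IMG_BYTES = 6 * 1024    # 6 kB minimum from the text; 1 kB = 1024 bytes is an assumption
MAX_LAYERS = 9              # maximum number of animation layers from the text
OCR_TIMEOUT_SECONDS = 10    # timeout of the convert + ocrad call from the text


def should_scan(size_bytes, layers, max_img_size_kb=100):
    """Return True if an image is within the size and layer limits.

    The default of 100 kB and the inclusive comparisons are assumptions.
    """
    if not 10 <= max_img_size_kb <= 100:
        raise ValueError("max_img_size must be between 10 and 100 (kB)")
    within_size = MIN_IMG_BYTES <= size_bytes <= max_img_size_kb * 1024
    return within_size and layers <= MAX_LAYERS


def run_with_timeout(cmd, seconds=OCR_TIMEOUT_SECONDS):
    """Run an external command; return its stdout, or None if it timed out."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child process at this point
        return None
    return done.stdout


if __name__ == "__main__":
    print(should_scan(50 * 1024, layers=1))   # inside all limits
    print(should_scan(50 * 1024, layers=10))  # too many layers
    # A harmless stand-in command; the real convert + ocrad options are not shown here.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
```

In the plugin, the command passed to such a wrapper would be the convert + ocrad call. Its exact options are left out here because they depend on the chosen ocr_profile.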
facileOCR was tested on Linux with SpamAssassin version 3.2.4, running on Perl version 5.10.0,
with SpamAssassin version 3.2.3, running on Perl version 5.8.8 (with amavisd-new-2.4.3),
and with SpamAssassin version 3.2.5 running on Perl version 5.8.8 (with spamd).
Yet another OCR plugin?
Yes. None of the existing plugins was exactly what I needed, and some of them are no longer under active development. So I sat down and wrote my own, inspired by plugins like BayesOCR.
Dependencies
facileOCR needs
- Linux
- spamassassin
- convert and identify (ImageMagick)
- ocrad
License
This software is released under the Apache Software License (version 2.0).
Download and Installation
Download the files from facileOCR-[current].tar.gz
Unpack the archive and follow the instructions in the INSTALL file.
Contact
Send feedback, bug reports, feature requests... to shee2ne [at] yahoo.com.
Keep in mind that this software was made in my spare time. No commercial support is available.
Troubleshooting
Error message in maillog:
print() on closed filehandle FILE at /etc/mail/spamassassin/facileOCR.pm
==> remove or chown /tmp/sa_OCR.log
You tested as root, so the logfile is owned by root and the spamd user has no write permissions.
No score, no words found:
Turn on the debuglog and check what is actually extracted.
You can copy repeatedly recognized words over to your spamwords list.
A "word" in the spamwords list doesn't have to be a real word, it can also be
something like "notcllck,Fvorle,www,óS Wq óS,Favorl", upper-/lowercase doesn't matter.
If the text in most spam images is too obfuscated, try a different ocr_profile, e.g. 3.
Sometimes spammers use only a few words. To increase the score, you can use a
little trick in your spamwords list: repeat parts of the word.
spamwords cialis,cial,alis,viagra,agra,viag,nagra,penis,enis
(and so on...)
With a list like that, the whole word "Cialis" gives a score of 2.4, because
cialis, cial and alis match. A partly recognized word still gives 0.8
points.
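The effect of such a list can be illustrated with a short Python sketch. It is not the plugin's code (the plugin is a Perl module). It assumes that every list entry is matched case-insensitively as a substring of the extracted text and counts at most once. The names POINTS_PER_WORD, parse_spamwords and ocr_score are hypothetical.

```python
POINTS_PER_WORD = 0.8  # points per matched list entry, as stated above


def parse_spamwords(line):
    """Split a 'spamwords a,b,c' config line into lowercase entries."""
    _, _, words = line.partition(" ")
    return [w.strip().lower() for w in words.split(",") if w.strip()]


def ocr_score(extracted_text, spamwords):
    """Return (score, matched entries) for the text extracted from an image.

    Assumption: an entry matches if it occurs anywhere in the text,
    regardless of case, and each entry is counted only once.
    """
    text = extracted_text.lower()
    matched = [w for w in spamwords if w in text]
    return round(POINTS_PER_WORD * len(matched), 1), matched


words = parse_spamwords("spamwords cialis,cial,alis,viagra,agra,viag,nagra,penis,enis")

# Whole word recognized: cialis, cial and alis all match.
print(ocr_score("Buy Cialis now", words))  # (2.4, ['cialis', 'cial', 'alis'])

# Partly recognized word: only cial still matches.
print(ocr_score("Buy Cial1s now", words))  # (0.8, ['cial'])
```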
Too high score:
You can easily influence the score by adding or removing words.
The original spamwords list (May 2009) was tuned for a high score. If you
prefer a lower score, remove some words. Each matched word gives 0.8 points.
# you will find the log in /tmp/OCR.log
This comment is wrong: the logfile is /tmp/sa_OCR.log (not /tmp/OCR.log).
Changelog
Version 0.5
- new config option max_img_size
- rewrite of timeout to avoid use of "ps"
- more secure temporary files
- better checks on config options
- changed some debuglog output to avoid misconception
Version 0.4
- fixed some minor bugs and typos
- updated spamword list
- tweaked ocr profile no. 3
- new additional spamassassin rule (independent from plugin)
Version 0.3
- first official release
- moved settings to facileOCR.cf
- debuglog and ocr profiles