Tuesday, December 7, 2010

Spam and Machine Learning

A year or so ago I took a machine learning class. Going into the class I thought we were going to learn how to build AIs. Turns out that wasn’t the focus, but I did learn some other valuable things. Machine learning, at least as best I can describe it, is a field of study in which algorithms evolve based on empirical data. It is very interesting to watch that happen. It is actively applied in many different fields in and outside of computer science. Meteorology is a perfect example: tons of data are gathered and used to help algorithms evolve to model future weather patterns. Machine learning can also play a crucial role in spam filtering systems. Since spammers tend to get pretty creative in finding ways around the roadblocks set before them, filtering software needs to evolve along with them. Enter machine learning.

A paper I recently read outlines an application of machine learning to spam filtering. The authors claim that most blacklists fail to keep pace with spammers because they filter on identifiers assumed to be persistent (e.g. IP addresses) and they compartmentalize email-sending behavior to a single domain rather than analyzing behavior across domains. The paper proposes a behavioral blacklist approach. It introduces a new system called SpamTracker, which uses clustering, based on a principal components analysis, and classification algorithms to detect spammers before they show up on a blacklist. Their system does okay. It is meant to supplement existing systems rather than replace them, but at least it closes the gap a little more. As with all machine learning, the work is in tuning parameters, getting better data to train against, and deciding which features to “learn” from.
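
To give a feel for what "behavioral" means here, below is a minimal sketch of the general idea, not SpamTracker's actual algorithm. It describes each sender by how many messages it sent to each recipient domain, averages the patterns of known spammers into a centroid, and scores a new sender by cosine similarity to that centroid. The centroid-plus-cosine approach is a simple stand-in I chose for the paper's clustering and classification steps, and the data, the `spam_score` name, and the `THRESHOLD` value are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical data: each row counts the messages one sender IP sent to
# each of four recipient domains during some time window.
known_spammers = [
    [10, 9, 11, 10],
    [8, 10, 9, 12],
    [12, 11, 10, 9],
]
spam_pattern = centroid(known_spammers)

THRESHOLD = 0.9  # arbitrary cutoff chosen for this example only


def spam_score(sender_vector):
    """Similarity between a sender's cross-domain pattern and the spam centroid."""
    return cosine(sender_vector, spam_pattern)


# A sender spraying mail evenly across every domain looks like the known spammers;
# a sender that only mails one domain does not.
new_sender_spread = [9, 10, 10, 11]
new_sender_focused = [0, 25, 0, 0]

for name, vec in [("spread", new_sender_spread), ("focused", new_sender_focused)]:
    flagged = spam_score(vec) > THRESHOLD
    print(name, "flagged" if flagged else "not flagged")
```

The point of the sketch is that nothing in the score depends on the sender's IP address itself, only on how it behaves across domains, which is why this kind of approach can flag a sender no blacklist has seen yet.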
