Identifying Spammers in Outbound Email Systems

A minor thesis submitted in partial fulfilment of the requirements for the degree of Master of Computer Science

Martin François Foster
School of Computer Science and Information Technology
RMIT
Melbourne, Victoria, Australia
January 24, 2011

Declaration

This thesis contains work that has not been submitted previously, in whole or in part, for any other academic award and is solely my original research, except where acknowledged.

This work has been carried out since March 2010, under the supervision of Dr. Ibrahim Khalil.

Martin François Foster
School of Computer Science and Information Technology
RMIT
January 24, 2011

Acknowledgement

I would like to thank my supervisor, Dr. Ibrahim Khalil, for his feedback throughout the process, and particularly for his guidance in choosing a reasonable amount of data and metrics to analyse.

I thank my wife Nicole and daughter Stefanie for travelling to Canada in the final weeks of my write-up process, and our families for hosting them while there. Without this time to focus, this minor thesis would probably never have been completed.

List of Figures

List of Tables

Table 1: 8 sources with the highest ratio of undelivered messages

Abstract

Large-scale Service Provider email systems have always been targets for exploitation by spammers, because these systems are capable of delivering huge volumes of the spammers’ nefarious payload.

Today, most email systems decide whether to accept email from another system based on the sender system’s network reputation. Reputation provides Service Providers with the incentive to minimize spam originating from their network; allowing too much spam to be sent via their facilities is likely to result in a poor reputation, meaning other reputable service providers will reject their mail.

Compared to the volume of research targeted at detecting spam in inbound mail flows, relatively little work has been done on identifying spam in outbound mail flows. The research that has been done on outbound flows is difficult to apply to modern mail systems: it either discounts their complexity, relies too heavily on simple metrics such as volume, or provides mechanisms that are only suited to offline analysis, by which time it is too late to act.

Service Provider email system complexity must be accounted for in order to build a complete picture of the paths and transformations that affect messages on their way out of a mail system, which typically involves crossing multiple servers and multiple different software packages. Old metrics such as message sending volume are likely to penalize the wrong party: with the advent of stringent anti-spam laws, legitimate mailing lists tend to deliver a high volume of email to recipients who want this content. Spammers, by contrast, have moved away from sending high volumes from one host over a short amount of time, preferring to send low volumes from many hosts over a long period of time to net higher overall delivery and to avoid or delay detection by service providers.

This thesis presents a novel mechanism to detect spammers in the outbound flows of Service Provider email systems. It accounts for the complexity of such systems, can be used for near-real-time analysis, and uses the foreign destination system’s response as a measure of the probability that the sender is a spammer.
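As a rough illustration of the last point, the following minimal sketch computes, for each sender, the fraction of delivery attempts that the destination system refused. It is not the mechanism developed in this thesis. The function name `undelivered_ratio`, the `(sender, reply_code)` event format, and the choice to count only SMTP 5xx (permanent failure) replies as undelivered are all assumptions made for this example.

```python
from collections import defaultdict


def undelivered_ratio(events):
    """Return {sender: fraction of attempts refused by the destination}.

    `events` is an iterable of (sender, reply_code) pairs, where
    reply_code is the integer SMTP reply from the foreign system.
    Assumption for this sketch: only 5xx replies count as undelivered.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for sender, reply_code in events:
        totals[sender] += 1
        if 500 <= reply_code <= 599:
            refused[sender] += 1
    return {sender: refused[sender] / totals[sender] for sender in totals}


# Hypothetical delivery log: two senders with very different outcomes.
events = [
    ("alice", 250), ("alice", 250),
    ("bob", 550), ("bob", 250), ("bob", 554), ("bob", 550),
]
ratios = undelivered_ratio(events)
```

In this made-up log every attempt by `alice` is accepted, while three of the four attempts by `bob` are refused, so `bob` has the higher ratio even though neither sender's volume is large.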