Chapter 1

Introduction

Most research in the field of e-mail spam detection has focussed on inbound flows to an e-mail system. Inbound flows are defined as the messages a system under the control of one administrative group receives from other systems that are not under this group’s control.

Less focus has been placed on vetting the outbound flows from e-mail systems. Outbound flows are defined as the messages that originate from subscribers of an e-mail system under the control of one administrative group destined for receivers on other systems that are not under this group’s control.

Much of the prior work on detecting spammer activity in outbound flows is difficult to apply to large-scale service provider e-mail systems because it discounts their complexity. This thesis focuses on the large-scale use case; its methods and raw data are based on the author’s experience in designing and operating an Internet Service Provider (ISP) e-mail system serving approximately 100,000 subscriber mailboxes. The author believes the methods apply to any large-scale e-mail host, such as an ISP or an e-mail hosting service like Google, Microsoft, Yahoo or Fastmail. Unless otherwise noted, all references in this thesis to e-mail systems imply large scale.

Where does the spam in outbound flows originate?

Spam is defined as unsolicited commercial e-mail (UCE) or unsolicited bulk e-mail (UBE). Spam’s goal is usually financial. The gain may be direct, by generating revenue through the sale of a product or through fraud, or indirect, by stealing access to the target’s resources, usually by intercepting their system credentials or convincing the target to install malicious software on their computers.

The malicious software angle is important. Since the introduction of anti-spam legislation, the number of persons and organizations overtly spamming or providing Internet access services to spammers has fallen greatly: what was previously frowned upon is now illegal. Coupled with the widespread adoption of network reputation services, this means that almost no e-mail systems are operated to deliver spam directly anymore. Instead, most spam originates from computers running malicious software, usually as part of a larger network of similarly infected systems under the control of one organization: a botnet. The subscriber accounts compromised in this way are responsible for the bulk of the spam injected into e-mail service providers’ outbound flows.

Operators of e-mail systems are motivated to reduce spam in their outbound flows for reasons of deliverability, resource consumption, and reciprocal improvement.

Deliverability is the measure of how much of the legitimate e-mail sent by customers of a mail system is received by its intended recipients. On the receiving side, messages are first accepted into inbound flows based on the sender’s reputation. An indication of reputation is obtained by consulting one of the many Realtime Blocklists (RBLs), which indicate how likely a host on the Internet is to deliver spam based on past observed behaviour. If operators of an e-mail system do not manage and control spammers’ use of their systems, their reputation in these RBLs will degrade to the point that few other systems will accept their mail. Customers will quickly cease to use an e-mail system with low deliverability due to poor reputation.

The financial cost of sending an e-mail is small [1]. However, if the overall volume of spam sent is not controlled, service providers will incur considerable infrastructure and staff costs. If a 10-server system is overprovisioned by 25% in order to manage high utilization peaks caused by queuing spam for delivery, then 2 servers’ worth of infrastructure is not justified by customer use or revenues. Personnel are more expensive than infrastructure; excess outbound spam requires more staff to process abuse reports and to trace and manage compromised accounts.

The final motivator is reciprocal. Due to economies of scale, there is an ongoing trend for smaller companies and organizations to outsource their e-mail hosting needs to larger providers. The net effect is that a larger volume of mail moves between relatively fewer e-mail systems. If these systems in turn better detect spammers in their outbound flows, they should eventually face smaller inbound flows to filter for spam.

1.1. Objective

The objective of this work is to identify spam in the outbound mail flows of large-scale service provider e-mail systems. This would provide a means of reducing the volume of spam delivered to mailboxes, thus reducing the financial motive to spam.

1.2. Contribution

This thesis presents a novel mechanism to identify spammers in the outbound flows of service provider e-mail systems. It accounts for the complexity of such systems, can be used for near-real-time analysis, and uses the foreign destination system’s response as a measure of the probability that the sender is a spammer.
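
As a rough illustration of the idea of using the destination system’s response as a signal, the following Python sketch tallies, per sender, the fraction of delivery attempts that drew a permanent-failure (5xx) SMTP reply. It is not the algorithm developed in this thesis, which is presented in chapter three. The function names, the input format of (sender, reply code) pairs, and the sample addresses are all assumptions made for the example; the only fact relied upon is that the first digit of an SMTP reply code gives its class (2xx success, 4xx transient failure, 5xx permanent failure).

```python
from collections import defaultdict


def reply_class(code: int) -> str:
    """Map an SMTP reply code to its class, based on its first digit."""
    if 200 <= code <= 299:
        return "accepted"
    if 400 <= code <= 499:
        return "transient_failure"
    if 500 <= code <= 599:
        return "permanent_failure"
    return "other"


def rejection_rates(deliveries):
    """Return {sender: fraction of attempts answered with a 5xx reply}.

    `deliveries` is a hypothetical input: an iterable of
    (sender, smtp_reply_code) pairs, one per delivery attempt.
    """
    attempts = defaultdict(int)
    rejected = defaultdict(int)
    for sender, code in deliveries:
        attempts[sender] += 1
        if reply_class(code) == "permanent_failure":
            rejected[sender] += 1
    return {sender: rejected[sender] / attempts[sender] for sender in attempts}


# Made-up sample data, for illustration only.
sample = [
    ("alice@example.net", 250),
    ("alice@example.net", 250),
    ("alice@example.net", 550),
    ("bob@example.net", 550),
    ("bob@example.net", 554),
    ("bob@example.net", 250),
    ("bob@example.net", 451),
]
rates = rejection_rates(sample)
```

A sender whose rate is unusually high could then be flagged for closer inspection. Where to place such a threshold is not specified here and would have to be chosen from real data.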

1.3. Organization

The problem of identifying spammers in outbound e-mail system flows requires a good understanding of how electronic mail is routed from one system to another. Chapter two explains this, introduces the syslog data collection mechanism, and describes the intricacies of retracing the path that a message takes through a high-volume, multiple-server e-mail system.

The background chapter concludes with an overview of related works in determining e-mail sender reputation, assessing e-mail sender behaviour, and identifying outbound spam.

The third chapter presents the techniques and algorithms used first to retrace the paths taken by messages through the system, and then to analyse these paths for evidence that the source is spamming. The results demonstrate that the receiving system’s SMTP reply code alone is a good metric for identifying spammers; one that could be used to nearly halve the amount of outbound spam generated by an e-mail system.