Spam directly engages a very wide range of stakeholders that includes individual consumers, all organizations of whatever size in the private and public sectors that are Internet users, network operators and ISPs, suppliers of Internet security products and services, commercial e-mail marketers, entities and organizations that commission spamming campaigns, a variety of government policy departments, regulatory authorities and enforcement agencies at the national level, and various intergovernmental and other international organizations at the regional and global levels. Spam raises general concerns in all network and service environments:
Spam can be annoying or offensive to consumers and imposes various additional costs, especially on individuals who access the network through pay-per-use or low bandwidth connections, thereby hampering the development of Internet access.
Spam imposes significant costs on organizations in the private, public and not-for-profit sectors, whose employees may spend substantial amounts of work time sorting through e-mail messages to determine which are legitimately related to their work, and in deleting the rest.
Spam also imposes significant costs on Internet Service Providers (ISPs) and other network operators, since it requires investment in a range of tools that are needed to counter spam, including anti-spam technologies (e.g. filtering technologies), server and transmission capacity, human resources, and anti-spam information sharing, cooperation, and regulatory structures. This is a particularly important concern in developing countries.
Spam provides a cover for spreading viruses, worms, trojans, spyware, etc., which are typically sent as attachments to e-mail messages and may cause harm to individual consumers and user organizations, as well as to network operators and service providers.
As well as causing inconvenience and reducing the utility of the Internet for consumers and users, spam may violate national law – e.g. if it constitutes an invasion of privacy (e.g. spyware), leads to malicious attacks on users' personal property (e.g. viruses), or results in the unauthorized use of that property, possibly for illegal purposes (e.g. zombie networks).
Spam also provides a cover for other forms of cyber crime, such as identity theft through “phishing” and other forms of online fraud, which cause harm to individual consumers and impose costs on corporations (e.g. in the financial services sector), and government agencies (e.g. that issue licences).
Spam Characteristics
Spam characteristics appear in two parts of a message, the email headers and
the message content:
1. Email headers
Email headers show the route an email has taken in order to arrive at its
destination. They also contain other information about the email, such as the
sender and recipient, the message ID, date and time of transmission, subject and
several other email characteristics. Most spammers try to hide their identity by
forging email headers or by relaying mail to hide the real source of the message.
Since they need to send mails to a large number of recipients, spammers use
certain methods for mass mailing that can be classified as pure spam practices
and can therefore be identified in the email headers. Although newsletters and
legitimate mailings are also sent to a large number of recipients, these will
generally not display the same characteristics since the message source does not
need to be concealed.
Headers can also be used to trace back the origin of the spam message.
However, in this article we are mainly focusing on how to distinguish a spam
message from a legitimate message by looking at the email headers, rather than
actually tracing the sender of the spam message.
Typical email header characteristics in spam messages:
Recipient’s email address is not in the To: or Cc: fields: The reason
for this is that the recipient’s email address is hidden in the Bcc: field or
X-receiver field, along with a substantial number of other email addresses.
Spammers do this in order to conceal the fact that the mail was sent to a
large number of recipients, and presumably so as not to publish their
email list. Some people might add recipients to the Bcc: field for sending
out ‘legitimate’ mailings, but these will tend to be of a more personal
nature (which you might wish to block anyway) since most professional
companies do not use this method for sending newsletters or mailings.
Note however that if you do block emails without a local recipient in the
To: or Cc: field, you will be blocking all Bcc: messages.
Empty To: field: This is also typical for spam messages. Because
spammers send out bulk emails by entering all recipients in the Bcc: field
or X-receiver header, the To: field is often empty. RFC 822, Paragraph
A.3.1 (http://www.w3.org/Protocols/rfc822/Overview.html), the original
standard for the format of email messages, requires every message to carry
at least one destination field (To:, Cc: or Bcc:), and ordinary mail clients
normally put an address in the To: field. An empty To: field is therefore a
strong hint of ‘shady practices’, although not proof of them.
To: field contains invalid email address: Instead of being empty or
containing someone else’s email address, the To: field can also contain a
bogus email address, e.g. one without an @ sign or a non-existent one.
Missing To: field: Emails that have no To: field at all can very probably
be considered spam, since mail clients normally add this field and leaving
it out usually has to be done on purpose.
From: field is the same as the To: field: This is another common
practice. Instead of entering a bogus or empty To: field, the email address
in the From: field is also used in the To: field. Both email addresses are
most probably fake email addresses.
Missing From: field: Again the reasoning behind this is to disguise the
actual sender of the message.
Missing or malformed Message ID: Since the Message ID includes
information about where the message is coming from, it is often missing
or malformed (i.e. no @ sign or an empty string) in spam messages. The
Message ID is in the form of xxx@domain.com. The first part can be
anything and the second part is the name of the machine that assigned
the ID. Although Message IDs are not strictly required, one can safely
assume that they would only be missing or malformed if done deliberately
to disguise the source of the message.
More than 10 recipients in To: and/or Cc: fields: Many spam
messages contain more than 10 recipients in the To: and/or Cc: fields.
This can however also be used for ‘legitimate mailings’ but again these will
tend to be of a personal nature (which you might wish to block anyway)
since most professional companies do not use this method for sending
newsletters or mailings.
Bcc: header exists: In normal email messages, a Bcc: header does not
exist since this is stripped from the mail.
X-mailer field contains name of popular spamware: The X-mailer
field includes the name of the mailing software that was used to send the
mail. If this header contains the name of popular spam software, such as
Floodgate, Extractor, Fusion, Masse-mail, Quick Shot, NetMailer, Aristotle
Mail, Emailer Platinum, Mast Mailer, The Bat and Calypso, this could
indicate that it is a spam message. However, many spam mails do not
contain an X-mailer header, or name mail software that is widely used,
such as Microsoft Outlook or Eudora. Since you might also be blocking
legitimate mails if you do not filter on the right names, this header is
probably not worth filtering on.
X-Distribution = bulk: Spammers using Pegasus Mail will have the
X-header ‘X-Distribution: bulk’ added to their mail if it is addressed to a
large number of recipients. This header occurs quite rarely, so you will not
be able to catch large amounts of spam by filtering on this header.
X-UIDL header exists: Incoming messages should not have an X-UIDL
header, since this header is only intended to stop messages being
downloaded more than once, for instance when ‘leave messages on server’
is checked. It would normally be stripped when the message is received.
Spammers add an X-UIDL header to try to get the recipient’s mail server to
download multiple copies of their message and therefore increase the chance
that the message will be read.
Code and space sequence exists: Many spam mails include a certain
code for identification in the subject of the message. To hide the code from
the recipient, a large number of spaces are usually placed before it, so
that the recipient does not notice the code or so that the mail client does
not display it before the message is opened.
Illegal HTML exists: Some spam messages include a code for
identification in the text of the message. The text is entered outside the
HTML tags so as to hide the code from the recipient. There is no reason to
add text outside HTML tags, so the mere presence of illegal HTML can be
treated as suspicious.
Comment tags to avoid detection by email filters: Some spammers
try to circumvent content filters by placing lots of HTML comment tags
within the email body text. In this way, content filters will not recognize
the spam words since they are separated by comment tags. The recipient,
however, will not see the comment tags since these are not displayed
when viewing the message in HTML. Therefore it is important to use an
email filter that removes HTML tags and comments before checking the text.
HTML message without plain text body part: HTML messages usually
include a plain text version of the email so that recipients with email
clients that cannot read HTML can still view the message in plain text.
However, many spammers tend to send HTML messages without this plain
text body part, not only to save on size but also to force recipients to read
the HTML version. This enables spammers to embed links and unique IDs
in the HTML code. For instance, many spammers include an image link
that connects to a site when the message is opened. Since each message
contains a unique ID, the spammer will know exactly which recipient has
viewed the mail. In this way, spammers know how many people have
viewed their message and which email addresses are still 'live'. When
spammers know that your email address is 'live' this will entice them to
send you even more spam, so it is important to put a stop to these kinds
of spam messages by using a spam filter that is capable of checking this.
Legitimate newsletters are also often sent without a plain text body part,
so it is important to use a whitelist of allowed newsletters so as not to
produce false positives.
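Several of the header checks listed above can be sketched with the Python
standard library. This is only a minimal illustration: the function name, the
flag labels and the optional local_recipient parameter are invented for this
sketch, and a real filter would weigh the flags rather than treat any single
one as proof of spam.

```python
from email import message_from_string
from email.utils import getaddresses
from typing import List, Optional

MAX_VISIBLE_RECIPIENTS = 10  # threshold quoted in the text above


def header_flags(raw_message: str,
                 local_recipient: Optional[str] = None) -> List[str]:
    """Return labels (invented for this sketch) for each heuristic that fires."""
    msg = message_from_string(raw_message)
    flags: List[str] = []

    # Parse visible recipient and sender addresses, dropping empty results.
    to_addrs = [a.lower() for _, a in getaddresses(msg.get_all("To", [])) if a]
    cc_addrs = [a.lower() for _, a in getaddresses(msg.get_all("Cc", [])) if a]
    from_addrs = [a.lower() for _, a in getaddresses(msg.get_all("From", [])) if a]

    if msg["To"] is None:
        flags.append("missing-to")
    elif not to_addrs:
        flags.append("empty-to")
    elif any("@" not in a for a in to_addrs):
        flags.append("invalid-to")

    if msg["From"] is None:
        flags.append("missing-from")
    elif to_addrs and set(to_addrs) == set(from_addrs):
        flags.append("from-equals-to")

    # A Message-ID is expected to look like <something@host>.
    message_id = (msg["Message-ID"] or "").strip().strip("<>")
    if "@" not in message_id:
        flags.append("bad-message-id")

    if len(to_addrs) + len(cc_addrs) > MAX_VISIBLE_RECIPIENTS:
        flags.append("many-recipients")
    if local_recipient and local_recipient.lower() not in to_addrs + cc_addrs:
        flags.append("recipient-not-listed")
    if msg["Bcc"] is not None:
        flags.append("bcc-present")
    if msg["X-UIDL"] is not None:
        flags.append("x-uidl-present")
    if (msg["X-Distribution"] or "").strip().lower() == "bulk":
        flags.append("x-distribution-bulk")
    return flags
```

Calling header_flags(raw, local_recipient="user@example.org") on the raw text
of a message returns a list of labels. How many labels should trigger a block
or a quarantine is a policy decision that this sketch leaves open.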
2. Message contents
Apart from headers, spammers tend to use certain language in their emails that
companies can use to distinguish spam messages from others. Typical examples
are free, limited offer, click here, act now, risk free, lose weight, earn
money, get rich, and the (over)use of exclamation marks and capitals in the
text. Spam can be
blocked by checking for words in the email body and subject, but it is important
that you filter words accurately since otherwise you might be blocking legitimate
mails as well.
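Because spammers may split such words with HTML comment tags, as described
under the header characteristics, a word check on an HTML body only works if
the comments and tags are removed first. The following is a rough sketch of
that preprocessing step. It uses regular expressions rather than a real HTML
parser, and the function names are invented for illustration.

```python
import re

# Non-greedy match so that each comment is removed separately.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")


def visible_text(html_body: str) -> str:
    """Approximate the text a recipient would see in an HTML mail."""
    # Comments are replaced by nothing, so 'cl<!-- x -->ick' becomes 'click'.
    without_comments = HTML_COMMENT.sub("", html_body)
    # Tags are replaced by a space, so adjacent words do not run together.
    without_tags = HTML_TAG.sub(" ", without_comments)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(without_tags.split())


def contains_phrase(html_body: str, phrase: str) -> bool:
    """Case-insensitive phrase check on the visible text."""
    return phrase.lower() in visible_text(html_body).lower()
```

One limitation of this sketch is that a word split by empty tags instead of
comments (for example cl<b></b>ick) would still slip through, because tags are
replaced by a space.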
An email filtering system should filter out spam messages in the following
stages (in order of ‘spam certainty’):
1. Block spam at the gateway by checking domains in real-time black
hole lists: There are a number of 'black hole lists' that contain IP
addresses and domains of known spammers. By using these lists you
can filter out a large amount of spam. Not only will you stop a large
proportion of spam messages from reaching your users, you will also save
bandwidth, since the message is blocked at the gateway before the mail is
even downloaded. There are two types of lists: (a) lists of known spammers'
domains, for example the Spamhaus Block List (SBL), and (b) lists of mail
servers that are open to relaying and therefore allow spammers to send mail
via them. An example of this last kind of list is the Open Relay Database
(ORDB). Whilst lists of the first type (spammers' domains) should be fairly
accurate, lists of the second type, the open relay lists, can result in more
false positives. This is because genuine senders who wish to contact your
organization might not be aware that their mail server is being used for
relaying. Therefore, it is important to treat each spam list differently. For
instance, you could choose not to download any messages from domains listed
on the Spamhaus Block List, and to quarantine or delete (with the possibility
to undelete) mails from servers listed in the Open Relay Database.
2. Filter out spam based on email header characteristics: Most of the
email header characteristics mentioned above can safely be used to
classify a mail as spam. Therefore, you could decide to delete messages
that contain any or some of the above mentioned spam headers. Since
checking email headers is a fast process, it is good to check these before
checking the actual email message content.
3. Identify junk mail content: There will still be spam messages that get
through both filters mentioned above. The last way to distinguish these
mails is by checking for spam message content. Depending on the words
you select to filter on, this can be very accurate. For instance,
messages that contain phrases such as CLICK HERE, FREE!!, EARN
MONEY, FAST CASH, BUY NOW, $$$, fast bucks and huge savings are
almost certain to be spam. Then there are words that could also be used
in legitimate mails, such as money back, accept credit cards, credit
profile, cash back, FREE. Therefore it is important to
either perform different actions on the different sets of phrases, or to use
textual analysis software that can minimize the chance of catching
legitimate messages. For instance, by giving words or phrases a certain
word score and specifying a word score threshold per email, you are able
to specify quite precisely which messages should be blocked, and will
therefore decrease the number of wrongly blocked messages. It is also
important to apply case sensitivity to words, since spammers often use
capitals in their messages.
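The word score idea in step 3 can be illustrated with a short sketch. The
phrases come from the text above, but the individual scores, the threshold and
all names are hypothetical values chosen only for this example. A real
deployment would tune them against its own mail.

```python
# Hypothetical weights: phrases that are "almost certain" spam score high
# and are matched case-sensitively, since spammers often use capitals.
CASE_SENSITIVE_SCORES = {
    "CLICK HERE": 5,
    "EARN MONEY": 5,
    "FAST CASH": 5,
    "BUY NOW": 5,
    "$$$": 5,
}

# Phrases that also occur in legitimate mail score low, in any letter case.
CASE_INSENSITIVE_SCORES = {
    "money back": 2,
    "accept credit cards": 2,
    "credit profile": 2,
    "cash back": 2,
}

THRESHOLD = 8  # hypothetical per-email word score threshold


def spam_score(text: str) -> int:
    """Sum the scores of every occurrence of every listed phrase."""
    score = 0
    for phrase, weight in CASE_SENSITIVE_SCORES.items():
        score += weight * text.count(phrase)
    lowered = text.lower()
    for phrase, weight in CASE_INSENSITIVE_SCORES.items():
        score += weight * lowered.count(phrase)
    return score


def is_spam(text: str, threshold: int = THRESHOLD) -> bool:
    """Block only when the accumulated score reaches the threshold."""
    return spam_score(text) >= threshold
```

With these example values a single ambiguous phrase such as money back does
not block a message on its own, whereas two capitalised high-scoring phrases
in the same message do.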
Standard spam detection techniques are used to classify the e-mails into two
categories, namely, spam and non-spam. For each of the two resulting
workloads, as well as for the aggregate workload, we analyze a set of
parameters, based on the information available in the e-mail headers. We aim
at identifying the quantitative and qualitative characteristics that
significantly distinguish spam from non-spam traffic and assessing the impact
of spam on the aggregate traffic by evaluating how the latter deviates from
the non-spam traffic.
- Unlike traditional non-spam e-mail traffic, which exhibits clear weekly and
daily patterns, with load peaks during the day and on weekdays, the numbers
of spam e-mails, spam bytes, distinct active spammers and distinct spam
e-mail recipients are roughly insensitive to the period of measurement,
remaining mostly stable during the whole day, for all days analyzed.
- Spam and non-spam inter-arrival times are exponentially distributed.
However, whereas the spam arrival rates remain roughly stable across all
periods analyzed, the arrival rates of non-spam e-mails vary by as much as a
factor of five in the periods analyzed.
- E-mail sizes in the spam, non-spam and aggregate workloads follow Lognormal
distributions. However, in our workload the average size of a non-spam e-mail
is six to eight times larger than the average size of a spam. Moreover, the
coefficient of variation (CV) of the sizes of non-spam e-mails is around
three times higher than the CV of spam sizes. The impact of spam on the
aggregate traffic is a decrease in the average e-mail size but an increase in
the size variability.
- The distribution of the number of recipients per e-mail is more heavy-tailed
in the spam workload. Whereas only 5% of non-spam e-mails are addressed to
more than one user, 15% of spams have more than one recipient, in our
workload. In the aggregate workload, the distribution is heavily influenced by
the spam traffic, deviating significantly from the one observed in the
non-spam workload.
- Regarding daily popularity of e-mail senders and recipients, the main
distinction between spam and non-spam e-mail traffic comes up in the
distribution of the number of e-mails per recipient. Whereas in the non-spam
and aggregate workloads, this distribution is well modeled by a single
Zipf-like distribution plus a constant probability of a user receiving only
one e-mail per day, the distribution of the number of spams a user receives
per day is more accurately approximated by the concatenation of two Zipf-like
distributions, in addition to the constant single-message probability.
- There are two distinct and non-negligible sets of non-spam recipients: those
with very strong temporal locality and those who receive e-mails only
sporadically. These two sets are not clearly defined in the spam workload. In
fact, temporal locality is, on average, much weaker among spam recipients and
even weaker among recipients in the aggregate workload. Similar trends are
observed for the temporal locality among e-mail senders.
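The effect described for e-mail sizes, a lower average but a higher
variability in the aggregate, can be reproduced with a small numerical sketch.
The sizes below are invented toy numbers, not the measured workload. They were
picked only so that the non-spam mean and CV are several times larger than the
spam ones, as in the findings above.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = standard deviation divided by mean (dimensionless)."""
    return pstdev(values) / mean(values)


# Toy e-mail sizes in bytes (illustrative assumptions, not measurements):
# many small, uniform spams and fewer, larger, more variable non-spams.
spam_sizes = [3000, 4000, 5000] * 2
non_spam_sizes = [10000, 20000, 30000, 60000]
aggregate_sizes = spam_sizes + non_spam_sizes

for name, sizes in (("spam", spam_sizes),
                    ("non-spam", non_spam_sizes),
                    ("aggregate", aggregate_sizes)):
    print(f"{name:9s} mean={mean(sizes):8.0f} bytes  "
          f"CV={coefficient_of_variation(sizes):.2f}")
```

Mixing the two populations pulls the mean down towards the spam sizes, while
the gap between the two groups adds spread, so the aggregate CV ends up above
the non-spam CV in this toy data.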
Sender Popularity
The main findings with respect to e-mail sender and recipient popularity are:
- The distributions of the number of non-spam e-mails per sender and recipient
follow, mostly, a Zipf-like distribution. This result is consistent with
previous findings that the connections between e-mail senders and recipients
are established using a power law (e.g., a Zipf distribution) [28, 29].
- The distribution of the number of spams per recipient does not follow a true
power law, but rather presents a flat region over the most popular
recipients. This may be caused by large spam recipient lists and a large
number of recipients shared among spammers. The number of spams per sender is
reasonably well approximated with a Zipf-like distribution.
- In all three workloads, the number of bytes per recipient is most accurately
modeled by two Zipf-like distributions. In the case of the non-spam and
aggregate workloads, this is probably due to the high variability in e-mail
size. The distribution of the number of bytes per sender is well modeled by a
single Zipf-like distribution in all three workloads.
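A Zipf-like distribution means that the count for the item of rank r is
roughly proportional to 1/r^alpha, which shows up as a straight line of slope
-alpha on a log-log plot of count against rank. The sketch below estimates
alpha with an ordinary least-squares fit on the log-log values. This is a
common quick check rather than necessarily the fitting method used for the
results above, and the function name is invented for illustration.

```python
import math


def zipf_exponent(counts):
    """Estimate alpha in count(rank) ~ C / rank**alpha by log-log least squares."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying distribution; alpha is its
    # magnitude.
    return -sxy / sxx
```

A distribution that needs two Zipf-like segments, or that has a flat region
over the most popular recipients, would show a visible bend in the log-log
plot. Fitting the top ranks and the remaining ranks separately with the same
function is one simple way to look for that.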