Fake Account Detection

Sybil attacks, where attackers register a large number of fake accounts, are a fundamental threat to social systems. For instance, reports from August 2014 showed that 8.5% of Twitter's active users were fake, and it was reported in August 2012 that 9.2% of Facebook users were fake. Previous studies showed that fake accounts leverage their access to millions of benign users to disseminate scams, carry out phishing attacks, distribute malware, and harvest private user data.
I designed SybilBelief to detect fake accounts at scale using the social relationships between users. The intuition is that attackers can arbitrarily manipulate the fake accounts they create, but they find it hard to establish trust relationships between those fake accounts and benign users. From a machine learning perspective, detecting fake accounts is a binary classification problem with benign and fake as the two classes. Previous work relied on either random walks or community detection in social graphs. These were one-class classification approaches: they used either known benign accounts or known fake accounts, but not both, to learn their classification models.

SybilBelief leverages both known benign accounts and known fake accounts, together with the social connections among them and the unlabeled accounts, through a semi-supervised learning approach based on pairwise Markov Random Fields and Loopy Belief Propagation. In particular, SybilBelief models a social graph as a pairwise Markov Random Field, which defines a joint probability distribution over the states (i.e., benign or fake) of all accounts. Given some known benign and fake accounts, we use Loopy Belief Propagation to infer the posterior probability that each remaining account is fake, and we then use these posterior probabilities to classify the accounts.

SybilBelief substantially outperforms previous approaches, and I am applying it to detect fake accounts in a large-scale Twitter dataset with 21 million nodes and 265 million edges.
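
The inference step described above can be illustrated with a small, generic sketch of Loopy Belief Propagation on a pairwise Markov Random Field with two states per account. This is not the SybilBelief implementation: the function name `loopy_bp`, the parameters `prior_strength` and `coupling`, the specific node and edge potentials, and the toy graph are all assumptions chosen for illustration.

```python
from collections import defaultdict

BENIGN, FAKE = 0, 1


def loopy_bp(edges, known_benign, known_fake,
             prior_strength=0.9, coupling=0.9, iterations=10):
    """Approximate P(account is fake) for every node of an undirected graph.

    Hypothetical parameters (not taken from SybilBelief):
      prior_strength: prior confidence placed on a labeled account
      coupling: weight given to two linked accounts sharing the same state
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    nodes = list(neighbors)

    def node_potential(u):
        # Labeled accounts get a biased prior; unlabeled ones a uniform prior.
        if u in known_fake:
            return (1.0 - prior_strength, prior_strength)
        if u in known_benign:
            return (prior_strength, 1.0 - prior_strength)
        return (0.5, 0.5)

    def edge_potential(state_u, state_v):
        # Linked accounts are assumed more likely to share a state.
        return coupling if state_u == state_v else 1.0 - coupling

    # messages[(u, v)] is the message from u to v over v's two states.
    messages = {(u, v): (0.5, 0.5) for u in nodes for v in neighbors[u]}

    for _ in range(iterations):
        updated = {}
        for (u, v) in messages:
            phi = node_potential(u)
            out = []
            for state_v in (BENIGN, FAKE):
                total = 0.0
                for state_u in (BENIGN, FAKE):
                    term = phi[state_u] * edge_potential(state_u, state_v)
                    # Combine messages from all neighbors of u except v.
                    for w in neighbors[u]:
                        if w != v:
                            term *= messages[(w, u)][state_u]
                    total += term
                out.append(total)
            norm = out[BENIGN] + out[FAKE]
            updated[(u, v)] = (out[BENIGN] / norm, out[FAKE] / norm)
        messages = updated  # synchronous update of all messages

    posterior = {}
    for u in nodes:
        belief = list(node_potential(u))
        for w in neighbors[u]:
            belief[BENIGN] *= messages[(w, u)][BENIGN]
            belief[FAKE] *= messages[(w, u)][FAKE]
        posterior[u] = belief[FAKE] / (belief[BENIGN] + belief[FAKE])
    return posterior


# Toy graph: a benign triangle (a, b, c), a fake triangle (x, y, z),
# and a single attack edge c-x connecting the two regions.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "x"),
         ("x", "y"), ("x", "z"), ("y", "z")]
posterior = loopy_bp(edges, known_benign={"a"}, known_fake={"z"})

# Classify by thresholding the posterior probability of being fake.
predicted_fake = {u for u, p in posterior.items() if p > 0.5}
```

In this toy setting, one labeled account per region is enough for the unlabeled accounts to inherit the label of their own densely connected region, because the single attack edge carries comparatively little evidence across the boundary. A graph with tens of millions of nodes would need a far more memory-efficient message representation than the dictionaries used here.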