Algorithms can detect trolls from as few as 50 tweets
A linguist from Friedrich Schiller University in Jena, Germany, has presented a pair of algorithms which can detect troll tweets based on the distinctive use of word repetition and word pairs.
In the context of this study, trolling refers to messaging with the intent of achieving a specific (typically malicious) purpose while masking that purpose. For instance, in 2018, 13 Russian individuals were accused of having used fake Twitter accounts to attempt to influence the 2016 US presidential election.
“Troll internet messages, especially those posted on Twitter, have recently been recognised as a very powerful weapon in hybrid warfare,” the study says.
Since the efforts of Kremlin-backed trolls to influence democratic events have become common knowledge, scientists and engineers have been studying troll tweets in an effort to build detection tools. While many previous studies have focused on distinguishing features of these tweets – such as timing, hashtag use, and location – this study focused specifically on linguistic features of the tweets.
Linguist Sergei Monakhov based his project on the idea that trolls have a limited number of messages to convey, but must repeat these messages with enough diversity of wording and topics to pass as legitimate Twitter users in what Monakhov described as “[essentially] an imitation game”.
He used a library of more than 2.5 million tweets from Russian trolls and genuine tweets from US representatives, and found that the trolls – due to having to keep repeating a limited set of messages – showed a distinctive pattern of repeated words and word pairs, which differs from the patterns seen in tweets from legitimate Twitter users.
Monakhov then tested two algorithms which used these patterns to detect troll tweets, and found that a randomly selected sample of as few as 50 tweets was enough to correctly distinguish the trolls from the US representatives. The algorithms were also able to distinguish troll tweets from tweets by US President Donald Trump. While Trump’s tweets were provocative and “potentially misleading”, they are distinct from troll tweets (as defined by this study) because they do not try to mask their purpose.
“Though troll writing is usually thought of as being permeated with recurrent messages, its most characteristic trait is an anomalous distribution of repeated words and word pairs,” Monakhov said. “Using the ratio of their proportions as a quantitative measure, one needs as few as 50 tweets for identifying internet troll accounts.”
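The article does not give the exact formulas behind the measure, so the following Python sketch is only an illustration of the general idea rather than Monakhov's method. It assumes that a "repeated" word or word pair is one occurring more than once in the sample, that proportions are taken over distinct words and distinct adjacent pairs, and that the ratio divides the pair proportion by the word proportion. The function names `tokenize` and `repetition_ratio` are hypothetical.

```python
import random
import re
from collections import Counter


def tokenize(tweet):
    # Crude tokeniser (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", tweet.lower())


def repetition_ratio(tweets):
    """Return (repeated-word share, repeated-pair share, pair/word ratio).

    Definitions here are illustrative assumptions, not the study's own:
    a word or adjacent word pair counts as repeated if it occurs more
    than once across the whole sample.
    """
    words, pairs = [], []
    for tweet in tweets:
        tokens = tokenize(tweet)
        words.extend(tokens)
        # Adjacent word pairs, formed within a single tweet only.
        pairs.extend(zip(tokens, tokens[1:]))

    word_counts = Counter(words)
    pair_counts = Counter(pairs)

    repeated_words = sum(1 for c in word_counts.values() if c > 1)
    repeated_pairs = sum(1 for c in pair_counts.values() if c > 1)

    word_share = repeated_words / len(word_counts) if word_counts else 0.0
    pair_share = repeated_pairs / len(pair_counts) if pair_counts else 0.0
    ratio = pair_share / word_share if word_share else 0.0
    return word_share, pair_share, ratio


def sample_and_score(account_tweets, sample_size=50, seed=None):
    # Draw a random sample (50 tweets, as in the article) and score it.
    rng = random.Random(seed)
    k = min(sample_size, len(account_tweets))
    return repetition_ratio(rng.sample(account_tweets, k))
```

Under these assumptions, an account would be scored by passing its tweets to `sample_and_score` and comparing the resulting ratio with values seen for known genuine accounts; any threshold for that comparison would have to be fitted to data and is not given in the article.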
He hopes that these algorithms could contribute towards efforts to combat information warfare while preserving freedom of speech by avoiding automated censorship of legitimate messaging.