You won’t believe what happened when this AI read clickbait

Researchers based at Penn State and Arizona State University have brought together humans and computers to create an accurate machine learning model for detecting ‘clickbait’ headlines.

Clickbait refers to content presented in a way intended primarily to grab attention and maximise clicks. It is commonly associated with false advertising, hyperbole, poor journalism and native advertising.

In this study, researchers developed a machine learning model capable of detecting clickbait more effectively than other clickbait detectors, as well as differentiating between clickbait headlines written by humans and those generated by bots.

One of the major obstacles facing the researchers (and many other machine learning researchers and engineers) was a lack of labelled datasets. A large, labelled dataset is necessary for machine learning models to be ‘trained’ to form accurate associations.

“One of the things we realised when we started this project is that we don’t have many positive data points,” said Thai Le, a PhD student at Penn State’s College of Information Sciences and Technology. “In order to identify clickbait, we need to have humans label that training data. There is a need to increase the amount of positive data points so that, later on, we can train better models.”

This problem is further complicated by the fact that there are variations in clickbait headlines and content, including listicles and headlines phrased as questions; finding a sufficient quantity of each of these types of clickbait is a time-consuming challenge.

“Even though we all moan about the number of clickbaits around, when you get around to obtaining them and labelling them, there aren’t many of those datasets,” said Professor S. Shyam Sundar, co-director of Penn State’s Media Effects Research Laboratory.

In order to create a reliable and well-labelled dataset, the researchers asked 125 journalism students and 85 other recruits to write their own clickbait headlines for 500-word articles. They additionally used a variational autoencoder (VAE) generative model – which relies on probabilities to detect patterns in data – to generate artificial clickbait headlines. These headlines were combined to form a dataset used to train their algorithm to detect clickbait.
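The approach described above – pooling human-written and machine-generated clickbait into one labelled set, then training a text classifier on it – can be illustrated with a minimal sketch. This is not the researchers' code: the headlines below are invented placeholders, the study used a VAE generator and far more data, and the TF-IDF/logistic-regression detector here is a stand-in for their model.

```python
# Minimal sketch (hypothetical data, not the study's pipeline): combine
# human-written and machine-generated clickbait headlines with ordinary
# headlines into one labelled set, then fit a simple text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_clickbait = [
    "You won't believe what this dog did next",
    "10 secrets doctors don't want you to know",
]
generated_clickbait = [  # stand-ins for VAE-generated headlines
    "This one trick will change your life forever",
    "What happened next will shock everyone",
]
non_clickbait = [
    "Council approves budget for bridge repairs",
    "Quarterly inflation figures released by statistics office",
]

headlines = human_clickbait + generated_clickbait + non_clickbait
# Label 1 = clickbait (human or generated), 0 = not clickbait.
labels = [1] * (len(human_clickbait) + len(generated_clickbait)) + [0] * len(non_clickbait)

# TF-IDF word/bigram features feed a logistic-regression detector.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(headlines, labels)

print(model.predict(["You won't believe what happened when this AI read clickbait"])[0])
```

A real system would of course need thousands of labelled examples and a held-out test set; the sketch only shows the shape of the training step.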

According to the researchers, the algorithm is approximately 14.5 per cent more accurate than other systems for detecting clickbait, such as the high-performing systems from the Clickbait Challenge 2017, a clickbait detection competition.

The study also helped the researchers identify differences in how people and algorithms approach headline creation. For instance, headlines written by people tend to use more determiners such as “which” and “that”, while participants with some journalism training were most likely to use longer words and pronouns.

In addition to their algorithm being used to detect (and potentially filter) clickbait, the researchers hope to use their findings to guide investigations into a more robust “fake-news detection system”. The team also hope that this collaborative approach to machine learning may help improve machine learning performance more generally.

“This result is quite interesting as we successfully demonstrated that machine-generated clickbait training data can be fed back into the training pipeline to train a wide variety of machine learning models to have improved performance,” said Penn State’s Professor Dongwon Lee, who led the project. “This is the step toward addressing the fundamental bottleneck of supervised machine learning that requires a large amount of high-quality training data.”
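The feedback loop Professor Lee describes – synthetic examples flowing back into the training pool – can be sketched as a simple augmentation step. Everything here is hypothetical: `toy_generator` is a crude template sampler standing in for the study's VAE, and the function name is invented for illustration.

```python
# Hypothetical sketch of the feedback idea: synthetic headlines from a
# generator are appended to the human-labelled pool before retraining.
import random

def augment_training_set(human_examples, generator, n_synthetic):
    """Return the human-labelled clickbait plus n_synthetic generated ones."""
    synthetic = [generator() for _ in range(n_synthetic)]
    return human_examples + synthetic

templates = [
    "{n} things you never knew about {topic}",
    "This {topic} trick will amaze you",
]

def toy_generator():
    # Crude stand-in for a trained VAE: sample and fill a template.
    return random.choice(templates).format(n=random.randint(3, 15), topic="AI")

pool = augment_training_set(["You won't believe this"], toy_generator, 4)
print(len(pool))
```

In the study's terms, the enlarged pool would then be used to retrain the detector, which is the step the researchers report as improving performance across a range of models.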
