Wikipedia articles could get automated rewrites for greater accuracy
Researchers from the Massachusetts Institute of Technology (MIT) have developed a system that could be used to automatically correct factual inconsistencies in Wikipedia articles, reducing the time and effort spent by the human editors who currently do the task manually.
Wikipedia, the free online encyclopedia, comprises millions of articles that are in constant need of edits to reflect new information. This could involve article expansions, major rewrites, or more routine modifications such as updating numbers, dates, names and locations. At present, humans across the globe volunteer their time to make these edits.
In a paper on the subject, the researchers describe a text-generating system that pinpoints and replaces specific information in relevant Wikipedia sentences, while keeping the language similar to how humans write and edit.
For the system to work, humans would type an unstructured sentence into an interface with updated information, without needing to worry about style or grammar. The system would then search Wikipedia, locate the appropriate page and outdated sentence and rewrite it in a human-like fashion.
The team said that, in the future, it may be possible to build a fully automated system that identifies the latest information from around the web and uses it to rewrite the corresponding sentences in Wikipedia articles.
“There are so many updates constantly needed to Wikipedia articles. It would be beneficial to automatically modify exact portions of the articles, with little to no human intervention,” said Darsh Shah, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.
“Instead of hundreds of people working on modifying each Wikipedia article, then you’ll only need a few because the model is helping or doing it automatically,” he added. “That offers dramatic improvements in efficiency.”
Many bots already make automatic Wikipedia edits. According to Shah, these typically work on mitigating vandalism or dropping narrowly defined information into predefined templates.
He said that the researchers’ model solves a harder artificial intelligence (AI) problem: given a new piece of unstructured information, the model automatically modifies the sentence in a human-like fashion. “The other [bot] tasks are more rule-based, while this is a task requiring reasoning over contradictory parts in two sentences and generating a coherent piece of text,” he added.
According to CSAIL graduate student Tal Schuster, the system can also be used for other text-generating applications.
In the paper, the researchers also used the tool to automatically synthesise sentences for a popular fact-checking dataset, which helped reduce bias in the dataset without manually collecting additional data. “This way, the performance improves for automatic fact-verification models that train on the dataset for, say, fake news detection,” Schuster said.
According to the researchers, the system relies on a fair bit of text-generating ingenuity to identify contradictory information between two separate sentences and then fuse them together. It takes as input an “outdated” sentence from a Wikipedia article, plus a separate “claim” sentence that contains the updated and conflicting information.
The system must automatically delete and keep specific words in the outdated sentence, based on information in the claim, to update facts but maintain style and grammar. The team said this is an easy task for humans, but a novel one in machine learning.
MIT provides an example of this process. The outdated sentence might read: “Fund A considers 28 of their 42 minority stakeholdings in operationally active companies to be of particular significance to the group.” The claim sentence with updated information may read: “Fund A considers 23 of 43 minority stakeholdings significant.”
Here, the system would locate the relevant Wikipedia text for “Fund A” based on the claim, then automatically strip out the outdated numbers (28 and 42) and replace them with the new numbers (23 and 43), while keeping the rest of the sentence unchanged and grammatically correct.
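The swap in this example can be illustrated with a deliberately simple Python sketch. It is not the researchers' system, which uses learned text-generation models. It is a rule-based toy that only handles numbers and assumes both sentences mention the same quantities in the same order. The function name `update_numbers` is hypothetical.

```python
import re

# Matches integers and simple decimals such as "28" or "4.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def update_numbers(outdated: str, claim: str) -> str:
    """Replace the numbers in `outdated` with those from `claim`, in order.

    Toy assumption: both sentences list the same quantities in the same
    order. If the counts differ, the outdated sentence is returned unchanged
    rather than guessing which number maps to which.
    """
    old_values = NUMBER.findall(outdated)
    new_values = NUMBER.findall(claim)
    if len(old_values) != len(new_values):
        return outdated
    replacements = iter(new_values)
    # Each match in the outdated sentence is swapped for the next claim value.
    return NUMBER.sub(lambda match: next(replacements), outdated)


outdated = (
    "Fund A considers 28 of their 42 minority stakeholdings in operationally "
    "active companies to be of particular significance to the group."
)
claim = "Fund A considers 23 of 43 minority stakeholdings significant."

updated = update_numbers(outdated, claim)
print(updated)
```

Only the two numbers change; the wording and grammar of the original sentence are left alone. A rule like this breaks as soon as the claim rephrases or reorders the facts, which is the harder case the researchers say their model is designed for.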
The MIT study also showed that the system can be used to augment datasets to eliminate bias when training detectors of 'fake news'. Some of these fake news detectors train on datasets of agree-disagree sentence pairs to 'learn' to verify a claim by matching it to given evidence.
In these pairs, the claim either matches information in a supporting 'evidence' sentence from Wikipedia (agree) or has been modified by humans to include information that contradicts the evidence sentence (disagree).
The models are also trained to flag claims with refuting evidence as 'false', which can be used to help identify fake news. However, the team said that such datasets currently come with unintended biases.
“During training, models use some language of the human written claims as 'giveaway' phrases to mark them as false, without relying much on the corresponding evidence sentence,” Shah said. “This reduces the model’s accuracy when evaluating real-world examples, as it does not perform fact-checking.”
To address this issue, the researchers used the same deletion and fusion techniques from their Wikipedia project to balance the agree-disagree pairs in the datasets and help mitigate the bias.
For some 'disagree' pairs, the team used the modified sentence’s false information to regenerate a fake supporting 'evidence' sentence. Some of the giveaway phrases then exist in both the 'agree' and 'disagree' sentences, which forces models to analyse more features.
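The balancing step can be sketched in Python under some loud assumptions. The `Pair` class, the `augment` function and the toy `swap_numbers` rewriter are all hypothetical, and the example sentences are invented for illustration rather than taken from the dataset. In the study, the regenerated evidence comes from the researchers' model, not from a number-swapping rule.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass(frozen=True)
class Pair:
    claim: str
    evidence: str
    label: str  # "agree" or "disagree"


def swap_numbers(evidence: str, claim: str) -> str:
    """Toy stand-in for the rewriting model: copy the claim's numbers into the evidence."""
    old_values = NUMBER.findall(evidence)
    new_values = NUMBER.findall(claim)
    if len(old_values) != len(new_values):
        return evidence
    replacements = iter(new_values)
    return NUMBER.sub(lambda match: next(replacements), evidence)


def augment(pairs: List[Pair], rewrite: Callable[[str, str], str]) -> List[Pair]:
    """For each 'disagree' pair, add an 'agree' pair with regenerated evidence.

    The same claim then appears under both labels, so a phrase in the claim
    alone can no longer give the label away.
    """
    augmented = list(pairs)
    for pair in pairs:
        if pair.label == "disagree":
            fake_evidence = rewrite(pair.evidence, pair.claim)
            augmented.append(Pair(pair.claim, fake_evidence, "agree"))
    return augmented


# Invented example: "only" stands in for a giveaway phrase in a human-written claim.
pairs = [
    Pair(
        claim="The tower is only 250 metres tall.",
        evidence="The tower is 300 metres tall.",
        label="disagree",
    )
]

augmented = augment(pairs, swap_numbers)
for pair in augmented:
    print(pair.label, "|", pair.claim, "|", pair.evidence)
```

After augmentation the claim containing "only" is labelled both 'agree' and 'disagree', depending on the evidence it is paired with, so a model has to read the evidence sentence to tell the two apart.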
By using their augmented dataset, the researchers reduced the error rate of a popular fake news detector by 13 per cent.
“If you have a bias in your dataset and you’re fooling your model into just looking at one sentence in a disagree pair to make predictions, your model will not survive the real world,” Shah concluded. “We make models look at both sentences in all agree-disagree pairs.”