Students relying on text generated by ChatGPT risk plagiarism, scientists say


Students using chatbots like ChatGPT to complete essay assignments could be risking plagiarism due to the way the AI processes text, a study has found.

“Plagiarism comes in different flavours,” said Dongwon Lee, professor of information sciences at Penn State University. “We wanted to see if language models not only copy and paste but resort to more sophisticated forms of plagiarism without realising it.”

The researchers identified three forms of plagiarism: verbatim, or directly copying and pasting content; paraphrase, or rewording and restructuring content without citing the original source; and idea, or using the main idea from a text without proper attribution.

They constructed a pipeline for automated plagiarism detection and tested it against OpenAI’s GPT-2 because the language model’s training data is available online, allowing the researchers to compare generated texts to the eight million documents used to pre-train GPT-2.
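The paper does not publish its pipeline's internals here, but the distinction between verbatim and paraphrase plagiarism can be illustrated with a minimal sketch: flag a long shared word sequence as candidate verbatim copying, and heavy vocabulary overlap without such a sequence as candidate paraphrasing. The function names, the n-gram length and the similarity threshold below are all illustrative assumptions, not the researchers' actual method.

```python
# Illustrative sketch only (NOT the study's actual pipeline):
# - candidate "verbatim" plagiarism = a long exact word n-gram shared
#   between the generated text and a training document;
# - candidate "paraphrase" plagiarism = high bag-of-words cosine
#   similarity with no long shared n-gram.
from collections import Counter
import math

def ngrams(tokens, n):
    # All contiguous word sequences of length n, as a set of tuples.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def cosine(a, b):
    # Cosine similarity between two bag-of-words token lists.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def classify(generated, source, n=6, sim_threshold=0.8):
    # n and sim_threshold are arbitrary illustrative values.
    g, s = generated.lower().split(), source.lower().split()
    if ngrams(g, n) & ngrams(s, n):
        return "verbatim"      # long exact word sequence copied
    if cosine(g, s) >= sim_threshold:
        return "paraphrase"    # same vocabulary, reordered/reworded
    return "none"
```

A real detector would also need idea-level comparison (e.g. semantic embeddings of whole passages), which simple token overlap cannot capture.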

The scientists tested 210,000 generated texts for plagiarism, both from the pre-trained model and from three versions fine-tuned on scientific documents, scholarly articles related to Covid-19, and patent claims.

They found that the language models committed all three types of plagiarism, and that the larger the model's training dataset and parameter count, the more often plagiarism occurred.

They also noted that fine-tuned language models reduced verbatim plagiarism but increased instances of paraphrase and idea plagiarism.

“People pursue large language models because the larger the model gets, generation abilities increase,” said lead author Jooyoung Lee. “At the same time, they are jeopardising the originality and creativity of the content within the training corpus. This is an important finding.”

The study highlights the need for more research into text generators and the ethical and philosophical questions that they pose, according to the researchers.

“Even though the output may be appealing, and language models may be fun to use and seem productive for certain tasks, it doesn’t mean they are practical,” said Thai Le, assistant professor of information science at the University of Mississippi. “In practice, we need to take care of the ethical and copyright issues that text generators pose.”

Though the results of the study only apply to GPT-2, the automatic plagiarism detection process that the researchers established can be applied to newer language models like ChatGPT to determine if and how often these models plagiarise training content.

Testing for plagiarism, however, depends on the developers making the training data publicly accessible, said the researchers.

The current study can help AI researchers build more robust, reliable and responsible language models in future, according to the scientists. For now, they urge individuals to exercise caution when using text generators.

The plagiarism findings were not unexpected, Dongwon Lee added.

“We taught language models to mimic human writings without teaching them how not to plagiarise properly,” he said. “Now, it’s time to teach them to write more properly, and we have a long way to go.”
