View from India: Web search takes a safer route
Content Moderation is a new offering from the Microsoft Cognitive Services stable that takes advantage of developments in artificial intelligence. Indian researchers have made an important contribution.
Content Moderation is a natural language search and web search capability, but what makes it interesting is its origin, which winds back to a student internship programme.
When Harish Yenala, a research student pursuing an MS at IIIT Hyderabad, began to intern at the Microsoft campus in Hyderabad in 2016, little did he know that he would embark on an engaging journey. The timing of the internship coincided with Microsoft’s work to develop application programming interface (API) initiatives.
“We have developed virtual agents for chatbots. Virtual agents entertain, converse and get tasks done. Sometimes, they failed to understand when they were asked specific questions as they lacked maturity; their behaviour too needed to be regulated,” reasoned Dr Manoj Chinnakotla, adjunct professor at IIIT Hyderabad and senior applied scientist, Microsoft Hyderabad.
Given this, the search suggestions themselves require pruning. For instance, if a child searching for “kite flying” enters the prefix “ki”, the suggestion could be “killing people”. Such suggestions can be inappropriate, and can even be misused or misunderstood.
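The pruning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `suggest`, `is_inappropriate`, `BLOCKED_TERMS` and `CANDIDATES` are hypothetical, and the tiny hand-written blocklist is a stand-in for a trained classifier, not Microsoft's implementation.

```python
from typing import Callable, List

# Hypothetical stand-in for a trained classifier: a tiny hand-written blocklist.
BLOCKED_TERMS = {"killing"}


def is_inappropriate(query: str) -> bool:
    """Flag a query if any of its words appears in the blocklist."""
    return any(word in BLOCKED_TERMS for word in query.lower().split())


def suggest(prefix: str,
            candidates: List[str],
            flag: Callable[[str], bool] = is_inappropriate,
            limit: int = 5) -> List[str]:
    """Return completions for the prefix, dropping any that the flag rejects."""
    prefix = prefix.lower()
    matches = [c for c in candidates if c.lower().startswith(prefix)]
    return [c for c in matches if not flag(c)][:limit]


# Illustrative candidate pool; a real system would draw on query logs.
CANDIDATES = ["killing people", "kite flying", "kite making at home", "kitchen ideas"]

print(suggest("ki", CANDIDATES))
```

Because the flagging function is passed in as a parameter, the blocklist could later be swapped for a learned model without touching the suggestion logic.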
Offensive comments also tend to do the rounds on social media, and there is a feeling that they need to be replaced by good, clean and safe interactions. These are the realities of the digital age, and sometimes one needs to mask data in order to navigate the digital space with ease and mental comfort. The market has felt the need for such a service, and Microsoft has been keen to cater to it. Around the same time, Yenala stepped in for an internship. “I had the knowhow of the algorithms in the deep-learning technique as I have been exposed to it in college. During the internship, we have explored this as it’s a topical subject,” he explains.
Yenala, under the guidance of Dr Chinnakotla and Jay Goyal, principal development manager at Microsoft India, began to pursue the research, which brought its fair share of challenges.
The difficulty is that researchers have all along relied on conventional solutions such as manually curating a list of patterns involving offensive words, phrases and slang. The other options include classical machine learning (ML) techniques that use hand-crafted features, typically words, to learn the intent classifier, or standard off-the-shelf deep-learning architectures such as convolutional neural networks (CNNs), long short-term memory networks (LSTMs) or bidirectional LSTMs (BLSTMs). “Our approach has been relatively new and there’s not much work done in this direction. In the absence of available data, we have used Bing search to get the desired results,” added Yenala.
Various kinds of data have been pulled from the web logs, and AI solutions have been built in order to identify words and images. Further research led the team to a hybrid architecture, the convolutional bidirectional LSTM (C-BiLSTM), and to a model that has been trained to identify around 50,000 queries. In this architecture, a convolutional layer extracts features for each word of the query, and a bidirectional LSTM reads those features in sequence to build a richer representation of the whole query. This new query representation then passes through a fully connected network that predicts the target class before the suggestion is output.
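The data flow of such a hybrid model can be sketched in plain Python. This is a toy forward pass with random, untrained weights, meant only to show the order of the layers. All names and sizes (`classify`, `conv_layer`, `lstm_pass`, `EMB`, `FILTERS`, `HIDDEN`) are assumptions made for illustration, as are the choices to omit bias terms, to take the final state of each LSTM direction as the query representation, and to use a single fully connected layer. It is not the published model.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

EMB, FILTERS, HIDDEN, WIDTH = 4, 6, 5, 3  # assumed toy sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv_layer(embs, filters, width=WIDTH):
    """Zero-padded 1-D convolution over the word axis: one feature vector per word."""
    pad = [[0.0] * len(embs[0])] * (width // 2)
    padded = pad + embs + pad
    out = []
    for i in range(len(embs)):
        window = [x for vec in padded[i:i + width] for x in vec]  # flatten the window
        out.append([max(0.0, a) for a in matvec(filters, window)])  # ReLU
    return out


def lstm_pass(seq, w, hidden=HIDDEN):
    """One LSTM direction (biases omitted); returns the final hidden state."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in seq:
        z = matvec(w, x + h)  # stacked pre-activations for the four gates
        i, f, o, g = (z[k * hidden:(k + 1) * hidden] for k in range(4))
        c = [sigmoid(fk) * ck + sigmoid(ik) * math.tanh(gk)
             for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [sigmoid(ok) * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]


vocab = {}  # word -> random embedding, created on first use
conv_w = rand_matrix(FILTERS, EMB * WIDTH)
fwd_w = rand_matrix(4 * HIDDEN, FILTERS + HIDDEN)
bwd_w = rand_matrix(4 * HIDDEN, FILTERS + HIDDEN)
fc_w = rand_matrix(2, 2 * HIDDEN)  # two classes: appropriate / inappropriate


def classify(query):
    """Embed -> convolution -> BiLSTM -> fully connected softmax. Assumes a non-empty query."""
    embs = [vocab.setdefault(w, [random.uniform(-1, 1) for _ in range(EMB)])
            for w in query.lower().split()]
    feats = conv_layer(embs, conv_w)
    rep = lstm_pass(feats, fwd_w) + lstm_pass(feats[::-1], bwd_w)  # forward + backward
    return softmax(matvec(fc_w, rep))


print(classify("kite flying tips"))  # two class probabilities; meaningless until trained
```

In practice a model like this would be built and trained with a deep-learning library; the hand-written loops here only make the shapes and the layer order explicit.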
“The distinction in our intention lies in the fact that we don’t take away people’s desire to search for anything. But we filter and detect things which suggest self-harm and are discouraging in nature,” added Dr Chinnakotla.
The technique has also been evaluated on real-world search queries from a commercial search engine, and the results showed that it outperformed both pattern-based and other hand-crafted feature-based baselines. C-BiLSTM also performed better than individual CNN, LSTM and BLSTM models trained for the same task.
The research shaped up as a paper titled ‘Convolutional Bi-directional LSTM for Detecting Inappropriate Query Suggestions in Web Search’. The paper received the Best Paper Award at the recent Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2017.
Given the extent of the research, it’s no surprise that Yenala has been absorbed into the workforce of Microsoft in Hyderabad, where he now works on deep learning and natural language processing techniques.
Convolutional Bi-directional LSTM is based on deep learning (DL), which aims to build machines that can process data and learn in the same way as the human brain does. It does so through artificial neural networks that are trained to mimic the brain’s behavioural patterns. These networks can begin to reason through given inputs such as words, images and sounds. C-BiLSTM itself doesn’t rely on hand-crafted features, is trained end-to-end as a single model, and effectively captures both local and global semantics.
The plan is also to create an image equivalent of the present offering. Considering that social media has changed our mindsets, and given Gen Y’s highly social nature, Microsoft intends to market Content Moderation for social media platforms, email services, chat rooms, discussion forums and search engines to make them more contextually aware, which will hopefully result in a safer and more secure web.
Content Moderation as a service is expected to be released in the coming months.