HBKU research explores AI use in detecting fraudulent text messages

Published: 27 Aug 2022 - 08:35 am | Last Updated: 27 Aug 2022 - 08:46 am

Fazeena Saleem | The Peninsula

Hamad Bin Khalifa University’s College of Science and Engineering (CSE), which conducts world-class research and innovation in the area of conversational artificial intelligence, aims to design and develop practical tools with clear value to local industry. 

In line with this, the college recently conducted a research project on detecting fraudulent text messages with strong privacy protection, said Dr. David Yang, Associate Professor at CSE. 

The research project, funded by the Qatar National Research Fund (QNRF), was conducted in collaboration with Ooredoo.

“The project aims to design and develop privacy-preserving data analytics solutions on telecommunication data, with a focus on detecting fraudulent messages, protecting customers, and improving customer experience in general,” said Dr. Yang. 

“Among the outcomes of this project are novel natural language processing (NLP) models based on the Transformer architecture, which are necessary for parsing and classifying text messages, and graph analytics tools that analyse the relationships between customers in order to identify vulnerable customers who tend to become victims of fraud,” he added. 
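
To give a concrete picture of what Transformer-based message classification looks like in practice, the short Python sketch below uses the open-source Hugging Face transformers library. The checkpoint name, labels, and sample messages are placeholders for illustration only; this is not the model developed in the HBKU-Ooredoo project.

# Minimal sketch: classifying SMS messages with a Transformer text classifier.
# The checkpoint name below is a placeholder, not the project's actual model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="example-org/sms-fraud-detector",  # hypothetical fine-tuned checkpoint
)

messages = [
    "Your parcel is held at customs. Pay the release fee at http://example.test",
    "Your one-time password is 482913. Do not share it with anyone.",
]

# Each prediction carries a label (e.g. FRAUD / LEGITIMATE) and a confidence score.
for message, prediction in zip(messages, classifier(messages)):
    print(f"{prediction['label']:>10}  {prediction['score']:.2f}  {message}")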

The project was funded by QNRF under the National Priorities Research Program (NPRP), Cycle 10, and led to two prestigious academic awards. 

Conversational AI is a type of AI that allows computers and devices to understand and respond to human language. This type of AI is used in chatbots, digital assistants, and other applications that rely on natural language processing. 

Discussing the prospects of conversational artificial intelligence and its industry-wide applications, Dr. Yang said the AI algorithm truly understands the question and then creates its own answer, which is fluent, concise, and to the point. 

Explaining how natural language understanding (NLU) works, Dr. Yang said, “We do not exactly know how natural language understanding happens. What we do know is how to build an AI system for this purpose. Typically, this is done using a large-scale Transformer, which is a deep learning model trained with a large corpus of text obtained from the Internet.” 

According to Dr. Yang, as of August 2022, the Transformer architecture is well understood, and there is an abundant amount of text on the Internet that can be used for model training. 

“So, with sufficient computing resources, anyone can build an NLU model. However, we still do not have a good theoretical understanding of how the AI understands natural language,” he added. 
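
As a small illustration of what such a pre-trained Transformer picks up from Internet text, the Python sketch below queries the publicly available bert-base-uncased checkpoint (chosen purely as an example, and unrelated to this project) on a masked-word task, where the model predicts a missing word from its context.

# Sketch: probing a Transformer pre-trained on a large Internet text corpus.
# bert-base-uncased is a public checkpoint used here purely for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model ranks candidate words for the [MASK] slot based on context it
# absorbed during pre-training.
for candidate in fill_mask("Please verify your bank [MASK] immediately."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")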

Giving examples of NLU and the industries where it is most applied, Dr. Yang said many people regularly use Apple Siri (or Amazon Echo/Google Home); individuals often write emails with the help of text autocomplete, which is available in Outlook and Gmail; and many websites deploy chatbots to answer users’ questions in a customer service setting.

“A typical example of widely used technology in Qatar is the chatbot, which has AI capabilities provided by major cloud-computing platforms such as Google Cloud and IBM Watson. Machine translation between Arabic and English is another common use of NLP.” 

He said data is essential for training any AI model. “For NLU, unstructured data can be used to train a generic model, a process sometimes called ‘pre-training’. Then, structured data can be used to train a task-specific model, a process called ‘fine-tuning’. For example, we can pre-train a model for Arabic language understanding using unstructured data obtained from the Internet, and then fine-tune the model to interact with users in a chatbot for a specific domain such as telecom customer service, using structured data from this domain,” said Dr. Yang. 
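
The sketch below illustrates that two-step workflow in Python with the Hugging Face transformers library, assuming a placeholder pre-trained Arabic checkpoint and a toy set of labelled customer-service examples; it shows the general recipe rather than the project's own code.

# Sketch of the "pre-train, then fine-tune" recipe described above.
# The checkpoint name and the tiny labelled dataset are placeholders.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1) Start from a model already pre-trained on a large unstructured corpus
#    (here, a hypothetical Arabic-language checkpoint).
checkpoint = "example-org/arabic-pretrained-bert"  # placeholder name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2  # e.g. "billing question" vs. "fraud report"
)

# 2) Fine-tune on structured, labelled examples from the target domain
#    (telecom customer service).
examples = [
    ("I want to check my remaining data balance", 0),
    ("How do I report a suspicious SMS I received?", 1),
]
encodings = tokenizer([text for text, _ in examples],
                      truncation=True, padding=True)

class DomainDataset(torch.utils.data.Dataset):
    """Wraps the encoded domain examples in the format Trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="telecom-chatbot-model", num_train_epochs=3),
    train_dataset=DomainDataset(encodings, [label for _, label in examples]),
)
trainer.train()  # produces the task-specific (fine-tuned) model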

He said another important point is artificial intelligence ethics. For example, he said, a chatbot trained on unregulated Internet forums tends to respond with offensive language such as profanity and racial slurs. To avoid this, the training data need to be “sanitized” by removing such offensive language samples. 
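
A hedged sketch of what such sanitization can look like in Python is shown below; the blocklist and sample texts are placeholders, and real pipelines typically combine word lists with trained offensive-language classifiers.

# Sketch: dropping training samples that contain blocklisted terms.
# The blocklist and corpus below are placeholders for illustration only.
BLOCKLIST = {"offensiveword1", "offensiveword2"}  # stand-ins for real terms

def is_clean(sample: str) -> bool:
    """Return True if the sample contains no blocklisted term."""
    tokens = {token.strip(".,!?\"'").lower() for token in sample.split()}
    return tokens.isdisjoint(BLOCKLIST)

raw_corpus = [
    "Thanks, the new data plan works great.",
    "An offensiveword1 rant scraped from an unregulated forum.",
]

sanitized_corpus = [sample for sample in raw_corpus if is_clean(sample)]
print(f"kept {len(sanitized_corpus)} of {len(raw_corpus)} samples")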

“With a modern AI model architecture and sufficient data, the AI model eventually learns the language. The perceived difficulty of a language or its accents is actually not the main challenge; rather, the main challenge is that for some less popular languages we do not have a large amount of data publicly available on the Internet,” he said.

The project led to two prestigious academic awards: first place in the NLP4IF competition at the EMNLP 2019 conference and the Best Paper Award at the VLDB 2021 conference.