AI in practice: Generating video subtitles
In this practical example, we roll up our sleeves and put Scaleway's H100 Instances to use by leveraging a couple of open source ML models to optimize our internal communication workflows.
The year 2023 has been a turning point in the democratization of artificial intelligence, especially when it comes to the use of generative AI among consumers and businesses.
Now AI is no longer just limited to companies’ technical departments. It’s the focus of everyone's attention, including top management across industries, as witnessed by the buzz around the ai-PULSE event organized by Scaleway last month.
As a French scale-up specialized in voice services, Ringover was delighted to be one of the innovative exhibitors at ai-PULSE, notably as we were able to tell the story of our conversational analysis solution Empower, which we launched last spring.
Empower offers transcription and mood analysis functionalities which enhance understanding of end-customer needs and optimize managerial and operational processes.
So, how did Ringover build this conversational AI tool?
We began developing Empower in 2022, when we decided to incorporate artificial intelligence into our cloud telephony solution through an advanced transcription tool, powered by technology developed in-house.
To develop it, we first began a data collection phase.
We used this dataset to feed our neural networks using PyTorch tools.
Then, of course, the data had to be sorted, organized, and cleaned to guarantee a high-quality result, before being annotated. Annotation marks out the terrain for the model by labeling the data, and it ensures that data tokenization (segmentation into smaller sequences) and normalization run smoothly. This initial pre-processing phase goes a long way toward facilitating model learning.
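As an illustration, the normalization and tokenization steps might look like the minimal sketch below: a whitespace-level tokenizer and a hand-rolled vocabulary. Real pipelines typically use subword tokenizers, and none of this is Ringover's actual code.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text: str) -> list[str]:
    """Segment normalized text into smaller sequences (word-level tokens)."""
    return re.findall(r"[a-z0-9']+", normalize(text))

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign each token an integer id; ids 0 and 1 are reserved
    for padding and unknown tokens."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for utterance in corpus:
        for token in tokenize(utterance):
            vocab.setdefault(token, len(vocab))
    return vocab

tokens = tokenize("Allô, vous m'entendez ?")
# tokens → ["allo", "vous", "m'entendez"]
```

Turning text into integer ids this way is what lets the cleaned, annotated data feed the neural networks in the next step.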
We then defined the neural architecture of our model, a structure in which each artificial neuron plays a key role in understanding the input and generating the output. Next, we calculated the loss function, i.e. the margin of error between model predictions and actual values.
Finally, we began the back-propagation process, through which the model's internal parameters (its weights) are adjusted to reduce this margin of error as much as possible. The smaller the loss, the better the model's predictions are likely to be, and the better its performance.
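The loss-and-back-propagation cycle described above can be sketched with a minimal PyTorch training loop. The toy model, feature dimensions, and hyperparameters below are illustrative stand-ins, not Ringover's actual architecture.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A toy classifier standing in for the real model: 40 input features
# (e.g. one frame of audio features) mapped to 10 output classes.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()           # the "margin of error" described above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

features = torch.randn(128, 40)           # a dummy batch of inputs
labels = torch.randint(0, 10, (128,))     # dummy target labels

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)  # predictions vs. actual values
    loss.backward()                          # back-propagation
    optimizer.step()                         # adjust internal parameters
    losses.append(loss.item())
```

After a few iterations the recorded loss shrinks, which is exactly the "reduce this margin of error" dynamic the text describes.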
After a few months, we extended the capabilities of our transcription tool with some of the features described above to make it a full-fledged conversational analysis solution.
This first version took three months to develop, and we launched it internally in February 2023. A handful of Ringover customers participated in the testing phase.
The technical team took into account the feedback from all the testers to fine-tune the learning manually, i.e. to adapt it to the tasks we had defined for our model and, ultimately, to improve contextual understanding for each transcript.
Two months later, at the end of April, we officially launched Empower. We then changed the software architecture to microservices, a switch that simplified the development of the application and served two further objectives.
Since then, we've continued development, applying a few patches and continually adding new features.
Contextual understanding is one of the major challenges of conversational AI solutions. Having pre-trained models can save time, but in any case, the fine-tuning phase is essential to a high-quality solution.
To this end, we expanded our dataset, injecting no less than 2,000 hours of audio conversations in total to improve the quality of ASR (Automatic Speech Recognition). The other challenge common to all transcription engines, which we had to overcome, concerns proper nouns.
The "name game" can give even the best artificial intelligence models a hard time, and for good reason: systems are not always able to break down names and identify their structure, not to mention special cases such as surnames containing special characters. To limit these side effects, we are working on two areas of improvement.
The conversational analysis solution doesn't just use AI for transcription. The tool is also capable of performing other analytical and generative tasks, such as mood analysis, call summaries and translation.
Sentiment analysis for mood analysis
Mood analysis is possible thanks to sentiment analysis performed through text. For the time being, we are using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, because it offers better performance. However, we are already working on a model of our own, developed by our teams, with the aim of processing the speech signal in addition to the text signal. This will further improve the accuracy of mood analysis.
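A minimal sketch of text-based sentiment scoring with a pre-trained model from Hugging Face's `transformers` library is shown below. The 0.8 confidence thresholds and the use of the library's default sentiment checkpoint are our illustrative assumptions, not Ringover's production setup.

```python
def mood_label(prediction: dict) -> str:
    """Collapse a sentiment prediction into a coarse mood.
    The 0.8 threshold is illustrative, not Ringover's actual value."""
    if prediction["score"] > 0.8:
        return prediction["label"].lower()  # "positive" or "negative"
    return "neutral"

def analyze_mood(utterance: str) -> str:
    """Run a pre-trained sentiment model over one utterance.
    Requires `pip install transformers` (model downloads on first use)."""
    from transformers import pipeline
    classifier = pipeline("sentiment-analysis")  # default BERT-family checkpoint
    return mood_label(classifier(utterance)[0])
```

Calling `analyze_mood("Thanks a lot, that solved my problem!")` would run the pipeline end to end; low-confidence predictions fall back to "neutral" rather than guessing a mood.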
OpenAI for call summaries
We use in-house technology to extract highlights from the transcript text. These highlights can be accessed from Empower's "Recap" section.
Next, OpenAI comes into play. Our team issues a customized prompt to ChatGPT through the OpenAI API, so that it can generate a summary from the highlights extracted beforehand.
Example prompt:
_Summarize the text below, highlighting the most important points in the form of a bulleted list.
Text: """
{text input}
"""_
DeepL for translation
All Empower functions are available in three languages: French, Spanish, and English. Summaries, transcripts, and highlights can be translated at the click of a button, which is very useful in multilingual environments. The DeepL API supports this function.
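The translation hook can be sketched with the official `deepl` Python library. The mapping from Empower's three UI languages to DeepL target codes is our illustrative assumption.

```python
# Empower's three UI languages mapped to DeepL target-language codes
# (this mapping is our assumption, not Ringover's actual configuration).
UI_TO_DEEPL = {"french": "FR", "spanish": "ES", "english": "EN-US"}

def deepl_target(language: str) -> str:
    """Translate a UI language name into the code the DeepL API expects."""
    return UI_TO_DEEPL[language.lower()]

def translate(text: str, language: str) -> str:
    """Call the DeepL API.
    Requires `pip install deepl` and a DEEPL_AUTH_KEY in the environment."""
    import os
    import deepl
    translator = deepl.Translator(os.environ["DEEPL_AUTH_KEY"])
    return translator.translate_text(text, target_lang=deepl_target(language)).text
```

With this in place, translating a summary is a one-liner such as `translate(summary_text, "english")`.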
The choice of PyTorch as a framework was a natural one. The various team members had already worked with this user-friendly framework. Its Python-like syntax facilitates experimentation and debugging, giving our developers great flexibility and freedom in building models.
What's more, research into PyTorch is very active, so we benefit from regular updates. These are the reasons why we chose PyTorch over Google's TensorFlow framework.
In our day-to-day work with our sales and customer relations teams, we have encountered the problems inherent to these departments: from a management point of view, a lack of visibility and of time to listen back to calls and optimize coaching strategy; from the team or agent's point of view, difficulty obtaining quick and accurate feedback.
Solving these pain points is, in part, what guided us in the development of Empower.
The solution identifies key moments in each conversation through personalized word lists. In the course of a conversation, when a speaker mentions a dispute or a problem, the software is able to locate the words chosen by the user (through a text search) to identify these important Moments. This eliminates the need to listen to call excerpts.
Moment identification can also be used to spot stammers and other speech disfluencies that can sometimes interfere with delivery or indicate hesitation on the part of the agent.
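The keyword-based Moment detection described above can be sketched as a simple scan over timestamped transcript segments. The segment format and field names below are our assumptions, not Empower's actual data model.

```python
def find_moments(segments, watchlist):
    """Return the Moments where a user-chosen keyword appears in the transcript.

    segments:  list of (start_seconds, text) pairs from the transcription engine
    watchlist: lowercase keywords configured by the user (e.g. dispute-related terms)
    """
    moments = []
    for start, text in segments:
        hits = [word for word in watchlist if word in text.lower()]
        if hits:
            moments.append({"time": start, "keywords": hits, "excerpt": text})
    return moments

transcript = [
    (12.5, "Hello, how can I help you today?"),
    (48.0, "I would like to open a dispute about my last invoice."),
]
moments = find_moments(transcript, ["dispute", "refund"])
# moments → one Moment at 48.0 s, matched on "dispute"
```

Because each Moment carries its timestamp and excerpt, the user can jump straight to the relevant passage instead of listening back to the whole call.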
We move from raw audio data as input to enriched visual information as output, which informs users about the relevance of their speech and their performance. With this information, managers save time, find value in the tool, and fine-tune their coaching.
As for the agent, they know precisely what they need to work on, with the software displaying recommendations for each conversation.
The conversational analysis solution uses a transcription engine developed in-house. But what sets it apart is its ability to process large volumes of calls rapidly, without compromising on the quality of the content generated.
Thanks to its unique transcription engine, software architecture and powerful GPUs, Empower is able to process several calls simultaneously.
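Processing several calls simultaneously can be sketched with a worker pool, as below. The `transcribe` stub stands in for the real GPU-backed engine; this is our illustration, not Ringover's actual architecture.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe(call_id: str) -> str:
    """Stub for the in-house transcription engine; in production this would
    dispatch the call's audio to a GPU-backed model server."""
    return f"transcript of {call_id}"

def process_calls(call_ids: list[str], max_workers: int = 4) -> dict[str, str]:
    """Transcribe several calls concurrently and collect results by call id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(call_ids, pool.map(transcribe, call_ids)))

results = process_calls(["call-001", "call-002", "call-003"])
```

In this sketch the pool size caps how many calls are in flight at once, which is roughly the knob that GPU capacity sets in the real system.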
What's more, Ringover can count on the NVIDIA Tesla P100 (16 GB VRAM) GPUs offered by Scaleway, which provide the significant processing power we needed to make this solution work.
The exciting thing is that the possibilities offered by artificial intelligence are many and constantly evolving.
Our aim remains the same: to provide concrete answers to the new challenges facing businesses, and to solve problems that were, until now, only partially addressed.
We are already hard at work on new models and exploring alternative methods to improve existing ones and enrich the product with new functionalities.