Can we do better than ChatGPT for translation?
Testing out LLMs for Machine Translation & scoring them with COMET
Catching a break from new LLMs?
Forget about it!
Unsurprisingly, the team at Dubverse has been busy tracking and testing the endless stream of new and improved LLMs being released weekly by incredible researchers and open-source devs in the LLM community.
Our motive is to improve the current Machine Translation (referred to as MT subsequently) system at Dubverse while comparing it with the previously tested ChatGPT-powered translations (you can read about that here).
We tested two LLMs specifically finetuned for MT, IndicTrans2 by AI4Bharat, and NLLB by Meta AI, on English to Hindi translation (since it’s a very significant percentage of videos dubbed on Dubverse). We also tested a few other open-source LLMs not finetuned for MT. The primary metric used for scoring is COMET (referenceless), which takes as input the original text in English and the translated text in Hindi and outputs a score normalized using a z-score transformation. Human evaluation, on limited samples, was also done to verify these results.
So what did we stumble upon?
It’s still pretty difficult to rank MT systems with automatic metrics. ChatGPT-powered translations remain the clear winner for most translations, but there are enough caveats to draw some major learnings from this exercise.
First, understanding how COMET works is important to make sense of the scores obtained.
Understanding COMET in brief
COMET (Cross-lingual Optimized Metric for Evaluation of Translation) is a neural framework for training multilingual MT evaluation models.
The model architecture of COMET (referenceless) consists of two main components: a sentence encoder and a ranking model. The sentence encoder takes as input the source sentence and the translated sentence and encodes them into fixed-length representations. The ranking model is trained to distinguish between good and bad translations based on the encoded representations.
The COMET metric computes a similarity score between the encoded representations of the source and translated sentences. This score reflects the quality of translation in terms of fluency, adequacy and other linguistic aspects.
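To build intuition for what "a similarity score between encoded representations" means, here is a toy sketch using cosine similarity over made-up fixed-length vectors. This is purely illustrative: COMET's actual scorer is a trained neural model, not a raw cosine similarity, and the vectors below are invented for the example.

```python
# Illustrative sketch only: COMET's real scorer is a trained neural ranking
# model; this toy example just shows the core idea of comparing fixed-length
# sentence representations from an encoder.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two fixed-length sentence encodings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these are encoder outputs for a source sentence and two candidate
# translations (values are made up for illustration).
src        = np.array([0.9, 0.1, 0.4])
good_trans = np.array([0.8, 0.2, 0.5])   # stays close to the source meaning
bad_trans  = np.array([0.1, 0.9, 0.0])   # drifts away from the source

print(cosine_similarity(src, good_trans))  # higher
print(cosine_similarity(src, bad_trans))   # lower
```

A translation whose encoding sits close to the source encoding scores higher than one that drifts away, which is the behaviour the trained ranking model learns to produce for real sentences.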
Our results
Let’s look at COMET scores for a few videos we tested across four different models/services: GCP, ChatGPT, IndicTrans2 and NLLB.
The average sentence length increases from Project 1 to Project 2 to Project 3. This is worth calling out separately because average COMET scores tend to drop as input sentences get longer, irrespective of the model in consideration. So, keeping in mind that most models struggle to produce relevant, contextual translations as segment lengths increase, let’s dive deeper into the numbers obtained above.
A few examples to understand the same:
A small sentence sample:
Input text: Do you want to start your own salon services?
GPT: क्या आप अपनी सैलून सेवाएं शुरू करना चाहते हैं?
GCP: क्या आप अपनी खुद की सैलून सेवाएं शुरू करना चाहते हैं?
IndicTrans2: क्या आप अपनी खुद की सैलून सेवाएं शुरू करना चाहते हैं?
NLLB: क्या आप अपने स्वयं के सैलून सेवा शुरू करना चाहते हैं?
The scores obtained were:
GPT: 0.8696
GCP: 0.8696
IndicTrans2: 0.8707
NLLB: 0.8611
All of these scores are high and close to each other: the sentence is short, and the outputs are nearly identical, with little variability.
Longer sentence sample:
Input text: So you go sit down and her dog jumps all over you right away and she says, wow, she's usually nervous to meet new people, it must be a good sign.
GPT: तो आप वहाँ बैठ जाते हो और उसका डॉग तुरंत आप पर कूद पड़ता है और वह कहती है, वाह, वो आमतौर पर नए लोगों से मिलने में नर्वस होती है, ये तो अच्छा संकेत है।
GCP: तो तुम बैठ जाओ और उसका कुत्ता तुरंत तुम्हारे ऊपर कूदता है और वह कहती है, वाह, वह आमतौर पर नए लोगों से मिलने में घबराती है, यह एक अच्छा संकेत होना चाहिए।
IndicTrans2: तो आप बैठ जाते हैं और उसका कुत्ता तुरंत आपके चारों ओर कूद जाता है और वह कहती है, वाह, वह आमतौर पर नए लोगों से मिलने के लिए घबराती है, यह एक अच्छा संकेत होना चाहिए।
NLLB: तो आप बैठ जाते हैं और उसका कुत्ता तुरंत आप पर कूदता है और वह कहती है, वाह, वह आमतौर पर नए लोगों से मिलने के लिए परेशान होती है, यह एक अच्छा संकेत होना चाहिए।
The scores obtained were:
GPT: 0.1382598728
GCP: -0.02570376918
IndicTrans2: 0.07083564997
NLLB: -0.6904425025
These scores are low (owing to the longer sentence length), and the variability across models is much higher in this case. In particular, the score assigned to NLLB is negative, which seems unwarranted given that its translation is perfectly relevant in this instance.
These variances tend to get averaged out when the evaluation is performed for a collection of sentences, leading to more appropriate results on the whole.
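Since COMET scores are normalized with a z-score transformation, a negative value simply means the raw score fell below the mean of the reference distribution; it is not an absolute quality judgement. A minimal sketch of that normalization and of corpus-level averaging, with made-up mean/std values (we are not reproducing COMET's actual normalization constants):

```python
# Sketch of z-score normalization, which is why COMET scores can be negative:
# a raw score below the distribution mean maps to a negative z-score.
# The mean/std and raw scores below are illustrative, not COMET's real values.
def z_score(raw: float, mean: float, std: float) -> float:
    return (raw - mean) / std

mean, std = 0.55, 0.12                    # hypothetical distribution stats
raw_scores = [0.72, 0.54, 0.41]           # three candidate translations
normalized = [z_score(s, mean, std) for s in raw_scores]
# Raw scores below 0.55 come out negative after normalization.

# Averaging over a collection of sentences smooths per-sentence outliers:
corpus_score = sum(normalized) / len(normalized)
```

This is also why evaluating over a whole project gives more stable rankings than eyeballing any single sentence's score.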
In essence, our observations are: the COMET metric scores English loanwords written in Hindi highly (e.g. "nervous" transliterated as नर्वस scores higher), and more commonly used words are also ranked higher. Both traits are desirable for us, which makes COMET a more relevant evaluation metric than BLEU and chrF (the more widely used metrics), both of which score by matching words. COMET depends less on matching words and more on the meaning conveyed.
But since these observations don't always hold, as we saw with longer sentences, a certain degree of human evaluation remains unavoidable and necessary in MT.
Are general purpose Open Source LLMs suitable for MT?
We tried the following LLMs:
Robin, Falcon, Guanaco and StabilityLM don't appear to understand Indian languages much; they either output gibberish or simply repeat sample text from the prompt.
OpenAssistant and Vicuna show better results, but still don't follow the rules provided in the prompt well enough. For example, adding a rule to not provide any explanation for the translation still results in outputs containing explanations.
GPT-3.5/4, on the other hand, are really good at performing MT, as described in detail previously.
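To make the "rules in the prompt" idea concrete, here is a minimal sketch of the kind of rule-bearing translation prompt we mean. The exact prompts used at Dubverse are not shown here; the function name, rules, and wording below are illustrative assumptions (the model call itself is omitted).

```python
# Minimal sketch of a rule-bearing MT prompt of the kind discussed above.
# The rules and wording are illustrative, not Dubverse's actual prompts.
def build_mt_prompt(text: str, src_lang: str = "English", tgt_lang: str = "Hindi") -> str:
    rules = [
        f"Translate the text from {src_lang} to {tgt_lang}.",
        "Output only the translation.",
        # The rule weaker LLMs tend to ignore:
        "Do not provide any explanation for the translation.",
    ]
    return "\n".join(rules) + f"\n\nText: {text}\nTranslation:"

prompt = build_mt_prompt("Do you want to start your own salon services?")
```

GPT-3.5/4 reliably respect constraints like "output only the translation", whereas models such as OpenAssistant and Vicuna often append explanations anyway.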
Navigating the numerous different licenses of these LLMs is also an important factor when selecting an LLM to build upon.
IndicTrans2 and NLLB
Coming back to the original point of this blog, where do IndicTrans2 and NLLB stand in comparison to GCP and GPT-powered translations?
Considering COMET scores at their face value, IndicTrans2 and NLLB appear to be noticeably inferior to GCP and GPT-powered translations as the sentence length increases.
Remember the mention of caveats at the beginning of this blog?
Well, IndicTrans2 and NLLB are open source, so they can be tweaked and fine-tuned to our liking.
Something interesting is definitely brewing at Dubverse😉
Thank you for reading this! We hope you learned something new today.
Join our Discord community to get the scoop on the latest in Audio Generative AI!
Do visit our website and follow us on Twitter.
Until next time!
Ruchir