Machine Translation from Tunisian Dialect to English: Is It Possible?

Imene Ayari
Sep 19, 2021


Word cloud made by the authors: the most used words in Darija, rendered in the shape of a map of Tunis

Introduction

In the context of a final specialty project, we were given a month to develop an idea that demonstrates our newly acquired knowledge and techniques.

Choosing Machine Learning among all the available specialties was a risky step, but one we embarked on and, mostly, enjoyed.

Having covered its mathematical and statistical concepts, supervised learning, unsupervised learning, and reinforcement learning, we had a wide range of sub-fields to work on, but we weighed two factors: what the time allowed us to do and what area we felt most able to work in. That is how our choice landed on Natural Language Processing. But what idea exactly?

Last year we worked on a bedtime-stories application that collected Tunisian folkloric stories, and a member of the presentation jury asked us whether it was possible to implement a text-to-speech feature.

This feature was available for Modern Standard Arabic and English but not for the Tunisian Dialect, so we said we would look into it after taking our machine learning courses, and that is how the idea came together.

The first idea was to create a transcription and transliteration bot, but during the intensive data-digging phase we learnt that this would require an immense amount of audio data, so we moved to a more feasible idea: machine translation.

What is Machine Translation (MT)?

Machine Translation (MT), or automated translation, is a process in which computer software translates text from one language to another without human involvement.

An abstract representation of machine translation

The various types of Machine Translation are:

  • Statistical Machine Translation (SMT)
  • Rule-based Machine Translation (RBMT)
  • Hybrid Machine Translation (HMT)
  • Neural Machine Translation (NMT)

In our project we worked with Neural Machine Translation.

Data Preparation

As previously mentioned, searching for and collecting data opened our eyes to many constraints and realities. This explains why this fundamental step consumed most of the time we had: it must be well executed to guarantee the success of the remaining steps of the pipeline.

After reading tons of research papers and countless articles, we found a resource to scrape our data from, and for that step we used ParseHub as our tool.

Then it was time to build our corpus, so we used the well-known Pandas library to convert the data to CSV format, composed of two columns: the input (Derjja sentences or words) and the target (English sentences or words) translation.

Screenshot of the dataset
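As a rough illustration of this step (not our actual scraped data), here is a minimal Pandas sketch of how such a two-column parallel corpus can be assembled and saved to CSV; the sentence pairs, file name, and column names are placeholders.

```python
import pandas as pd

# Hypothetical parallel sentence pairs (placeholders, not the real scraped data)
pairs = [
    ("عسلامة", "Hello"),
    ("شنية أحوالك؟", "How are you?"),
    ("يعيشك", "Thank you"),
]

# Two columns: the Derjja input and the English target
corpus = pd.DataFrame(pairs, columns=["derja", "english"])

# Save the corpus to CSV so it can be reloaded later for training
corpus.to_csv("derja_english_corpus.csv", index=False, encoding="utf-8")

print(corpus.head())
```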

Along the way we found out we were working on a fresh idea that had rarely been developed and actually requires a massive dataset, but there was no stepping back. We were sure we wanted to move forward with our plan and the data we had gathered, believing we could enhance it through the process and extract good results.

The journey continued with creating a model that trains on our input to predict an accurate translation.

Models Implemented

Following our program guidelines, we trained our data on sequence-to-sequence models using recurrent neural networks, and then on Transformers.

The Seq2Seq model with Attention

Seq2Seq is a many-to-many recurrent neural network architecture where both the input and the output are sequences (and the two sequences may or may not have different lengths). This architecture is used in many applications such as machine translation (like ours), text summarization, question answering, etc.

We will not reproduce the full code implementation here; instead, we invite you to take a look at the source code and tutorials of the TensorFlow library.
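For readers who just want a concrete starting point, the following is a minimal Keras sketch of an encoder-decoder with additive (Bahdanau-style) attention. It is not the exact model from our notebooks; the vocabulary sizes and layer dimensions are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sizes; the real values depend on the corpus and tokenizer
src_vocab, tgt_vocab, embed_dim, units = 8000, 6000, 256, 512

# Encoder: embeds the Derjja tokens and runs them through a GRU
enc_inputs = layers.Input(shape=(None,), name="encoder_tokens")
enc_emb = layers.Embedding(src_vocab, embed_dim)(enc_inputs)
enc_outputs, enc_state = layers.GRU(units, return_sequences=True, return_state=True)(enc_emb)

# Decoder: embeds the shifted English tokens, initialised with the encoder state
dec_inputs = layers.Input(shape=(None,), name="decoder_tokens")
dec_emb = layers.Embedding(tgt_vocab, embed_dim)(dec_inputs)
dec_outputs = layers.GRU(units, return_sequences=True)(dec_emb, initial_state=enc_state)

# Additive (Bahdanau-style) attention over the encoder outputs
context = layers.AdditiveAttention()([dec_outputs, enc_outputs])
concat = layers.Concatenate()([dec_outputs, context])

# Project onto the English vocabulary
logits = layers.Dense(tgt_vocab)(concat)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```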

The Transformer model

The Transformer architecture excels at handling text data, which is inherently sequential. It takes a text sequence as input and produces another text sequence as output; in our case, it translates an input Tunisian dialect (Derjja) sentence into English.

This model, along with multi-head attention, revolutionized the world of NLP, which is why it was essential to train on our corpus with it, using the parameters we considered suitable.
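As a hedged illustration of the building block involved (not our full translation model), the sketch below wires up a single Transformer encoder block around Keras's MultiHeadAttention layer. The hyperparameters are placeholders, and positional encodings and the decoder are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical hyperparameters, not the ones from our final runs
d_model, num_heads, d_ff, dropout = 256, 8, 512, 0.1

def transformer_encoder_block(x, padding_mask=None):
    """One encoder block: multi-head self-attention + feed-forward,
    each followed by a residual connection and layer normalization."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)(
        query=x, value=x, key=x, attention_mask=padding_mask)
    x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(attn))
    ff = layers.Dense(d_ff, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(ff))

# Toy usage: embed a batch of Derjja token ids and run one encoder block
token_ids = layers.Input(shape=(None,), dtype="int32")
embedded = layers.Embedding(input_dim=8000, output_dim=d_model)(token_ids)
encoded = transformer_encoder_block(embedded)
encoder = tf.keras.Model(token_ids, encoded)
encoder.summary()
```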

Approaches

Some early results came out disappointing, so we had to act: first we fine-tuned the models and adjusted some parameters, then tried building up the data with Standard Arabic data and with data from close dialects, such as an Algerian Dialect translation dataset that we found available, and free, online.
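As a small illustration of the data-augmentation idea (the file names and column layout here are hypothetical), the extra Standard Arabic or Algerian parallel data can simply be appended to the Tunisian corpus before retraining:

```python
import pandas as pd

# Hypothetical file names; each CSV holds parallel source/target sentences
# with the same two columns as the Tunisian corpus
tunisian = pd.read_csv("derja_english_corpus.csv")
algerian = pd.read_csv("algerian_english_corpus.csv")
msa = pd.read_csv("msa_english_corpus.csv")

# Stack the corpora, drop exact duplicates, and shuffle before retraining
augmented = (pd.concat([tunisian, algerian, msa], ignore_index=True)
               .drop_duplicates()
               .sample(frac=1.0, random_state=42)
               .reset_index(drop=True))

augmented.to_csv("augmented_corpus.csv", index=False)
print(f"{len(augmented)} sentence pairs after augmentation")
```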

All of these approaches were a mix of hits and misses that taught us a lot:

We repeatedly investigated the data and asked ourselves what we could change. We went through the models we built layer by layer to pinpoint where to intervene for improvement.

We gained a deeper understanding of the technologies and tools we used (TensorFlow, NumPy, Pandas…).

The funniest thing is that we gained a new perspective on our dialect: we laughed multiple times at some forgotten, rarely used words and expressions, and we saw for ourselves that our dialect is evolving, changing over time, and adding to its corpus.

We eventually experienced first-hand how applying NLP can bring the world closer together.

We believe that in the near future we will be able to understand dialects and languages from all around the globe.

On the ethical side, we were not able to eliminate or change some hurtful and discriminatory expressions; needless to say, they are part of any dialect.

Summary

This project turned into a journey that gave us the chance to put our knowledge into practice and to witness how technological advancements are incorporated into the smallest details of our lives and can be very helpful in bringing the world closer together.

To check the progress of this portfolio project, please refer to the GitHub repository; the results aren't definitive, so any contribution or suggestion is welcome.

Team Members:

Khawla Jlassi

Imen Ayari
