Machine Translation from Tunisian Dialect to English: Is It Possible?

Imene Ayari
Sep 19, 2021


Word cloud made by the authors: the most used words in Darija, rendered in the shape of a map of Tunis

Introduction

In the context of a final specialty project, we were given a month to develop an idea that demonstrates our newly acquired knowledge and techniques.

Choosing Machine Learning among all the available specialties was a risky step, but one we embarked on and, mostly, enjoyed.

Having covered its mathematical and statistical concepts, supervised learning, unsupervised learning, and reinforcement learning, we had a wide range of sub-fields to work on, but we weighed two factors: what the time allowed us to do and what area we felt most able to work in. That is how our choice landed on Natural Language Processing. But what idea exactly?

Last year we worked on a bedtime-stories application that collected Tunisian folkloric stories, and a member of the presentation jury asked us whether it was possible to implement a text-to-speech feature.

This feature was available for Modern Standard Arabic and English but not for the Tunisian Dialect, so we said we would look into it after taking our machine learning courses, and that is how the idea came together.

The first idea was to create a transcription and transliteration bot, but during the intensive data-digging phase we learnt that this would require an immense amount of audio data, so we moved to a more feasible idea: machine translation.

What is Machine Translation (MT)?

Machine Translation (MT), or automated translation, is a process in which computer software translates text from one language to another without human involvement.

An abstract representation of machine translation

The various types of Machine Translation are:

  • Statistical Machine Translation (SMT)
  • Rule-based Machine Translation (RBMT)
  • Hybrid Machine Translation (HMT)
  • Neural Machine Translation (NMT)

In our project we worked with Neural Machine Translation.

Data Preparation

As previously mentioned, searching for and collecting data opened our eyes to many constraints and realities. This explains why this fundamental step consumed most of the time we had: it must be well executed to guarantee the success of the remaining steps of the pipeline.

After reading tons of research papers and countless articles, we found a resource to scrape our data from, and for that step we used ParseHub as our tool.

Then it was time to build our corpus, so we used the well-known Pandas library to convert the data to CSV format, composed of two columns: the input (Derjja sentences or words) and the target (English sentences or words) translation.

Screenshot of the dataset
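As a rough illustration of this step (not our actual scraped data), here is a minimal Pandas sketch of how such a two-column parallel corpus can be assembled and saved to CSV; the sentence pairs, file name, and column names are placeholders.

```python
import pandas as pd

# Hypothetical parallel sentence pairs (placeholders, not the real scraped data)
pairs = [
    ("عسلامة", "Hello"),
    ("شنية أحوالك؟", "How are you?"),
    ("يعيشك", "Thank you"),
]

# Two columns: the Derjja input and the English target
corpus = pd.DataFrame(pairs, columns=["derja", "english"])

# Save the corpus to CSV so it can be reloaded later for training
corpus.to_csv("derja_english_corpus.csv", index=False, encoding="utf-8")

print(corpus.head())
```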

Along the way we found out we were working on a fresh idea that had rarely been developed and actually requires a massive dataset, but there was no stepping back. We were sure we wanted to move forward with our plan and the data we had gathered, believing we could enhance it through the process and extract good results.

The journey continued with creating a model that trains on our input to predict an accurate translation.

Models Implemented

Following our program guidelines, we trained our data on sequence-to-sequence models using recurrent neural networks, and then on Transformers.

The Seq2Seq model with Attention

Seq2Seq is a many-to-many recurrent neural network architecture where both the input and the output are sequences (and the two sequences may or may not have different lengths). This architecture is used in many applications such as machine translation (like ours), text summarization, question answering, etc.

We will not reproduce the full code implementation here; instead, we invite you to take a look at the source code and tutorials of the TensorFlow library.
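For readers who just want a concrete starting point, the following is a minimal Keras sketch of an encoder-decoder with additive (Bahdanau-style) attention. It is not the exact model from our notebooks; the vocabulary sizes and layer dimensions are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sizes; the real values depend on the corpus and tokenizer
src_vocab, tgt_vocab, embed_dim, units = 8000, 6000, 256, 512

# Encoder: embeds the Derjja tokens and runs them through a GRU
enc_inputs = layers.Input(shape=(None,), name="encoder_tokens")
enc_emb = layers.Embedding(src_vocab, embed_dim)(enc_inputs)
enc_outputs, enc_state = layers.GRU(units, return_sequences=True, return_state=True)(enc_emb)

# Decoder: embeds the shifted English tokens, initialised with the encoder state
dec_inputs = layers.Input(shape=(None,), name="decoder_tokens")
dec_emb = layers.Embedding(tgt_vocab, embed_dim)(dec_inputs)
dec_outputs = layers.GRU(units, return_sequences=True)(dec_emb, initial_state=enc_state)

# Additive (Bahdanau-style) attention over the encoder outputs
context = layers.AdditiveAttention()([dec_outputs, enc_outputs])
concat = layers.Concatenate()([dec_outputs, context])

# Project onto the English vocabulary
logits = layers.Dense(tgt_vocab)(concat)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```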

The Transformer model

The Transformer architecture excels at handling text data, which is inherently sequential. It takes a text sequence as input and produces another text sequence as output; in our case, it translates an input Tunisian dialect (Derjja) sentence into English.

This model, along with multi-head attention, revolutionized the world of NLP, which is why it was essential to train on our corpus with it, using the parameters we considered suitable.
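As a hedged illustration of the building block involved (not our full translation model), the sketch below wires up a single Transformer encoder block around Keras's MultiHeadAttention layer. The hyperparameters are placeholders, and positional encodings and the decoder are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical hyperparameters, not the ones from our final runs
d_model, num_heads, d_ff, dropout = 256, 8, 512, 0.1

def transformer_encoder_block(x, padding_mask=None):
    """One encoder block: multi-head self-attention + feed-forward,
    each followed by a residual connection and layer normalization."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)(
        query=x, value=x, key=x, attention_mask=padding_mask)
    x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(attn))
    ff = layers.Dense(d_ff, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(ff))

# Toy usage: embed a batch of Derjja token ids and run one encoder block
token_ids = layers.Input(shape=(None,), dtype="int32")
embedded = layers.Embedding(input_dim=8000, output_dim=d_model)(token_ids)
encoded = transformer_encoder_block(embedded)
encoder = tf.keras.Model(token_ids, encoded)
encoder.summary()
```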

Approaches

Some early results came out disappointing, so we had to act: first we fine-tuned the models and adjusted some parameters, then tried building up the data with Standard Arabic data and with data from close dialects, such as an Algerian Dialect translation dataset that we found available, and free, online.
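As a small illustration of the data-augmentation idea (the file names and column layout here are hypothetical), the extra Standard Arabic or Algerian parallel data can simply be appended to the Tunisian corpus before retraining:

```python
import pandas as pd

# Hypothetical file names; each CSV holds parallel source/target sentences
# with the same two columns as the Tunisian corpus
tunisian = pd.read_csv("derja_english_corpus.csv")
algerian = pd.read_csv("algerian_english_corpus.csv")
msa = pd.read_csv("msa_english_corpus.csv")

# Stack the corpora, drop exact duplicates, and shuffle before retraining
augmented = (pd.concat([tunisian, algerian, msa], ignore_index=True)
               .drop_duplicates()
               .sample(frac=1.0, random_state=42)
               .reset_index(drop=True))

augmented.to_csv("augmented_corpus.csv", index=False)
print(f"{len(augmented)} sentence pairs after augmentation")
```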

All of these approaches were a mix of hits and misses that taught us a lot:

We repeatedly investigated the data and asked ourselves what we could change. We went through the models we built layer by layer to pinpoint where to intervene for improvement.

We gained a deeper understanding of the technologies and tools we used (TensorFlow, NumPy, Pandas…).

The funniest thing is that we gained a new perspective on our dialect: we laughed multiple times at some forgotten, rarely used words and expressions, and we saw for ourselves that our dialect is evolving, changing over time, and adding to its corpus.

We eventually experienced first-hand how applying NLP can bring the world closer together.

We believe that in the near future we will be able to understand dialects and languages from all around the globe.

On the ethical side, we were not able to eliminate or change some hurtful and discriminatory expressions; needless to say, they are part of any dialect.

Summary

This project turned into a journey that gave us the chance to put our knowledge into practice and to witness how technological advancements are incorporated into the smallest details of our lives and can be very helpful in bringing the world closer together.

To check the progress of this portfolio project, please refer to the GitHub repository; the results aren't definitive, so any contribution or suggestion is welcome.

Team Members:

Khawla Jlassi

Imen Ayari
