A Primer on Code Mixing & Code Switching!

When and where to use them?


I grew up speaking Telugu and English and have a fair command of Hindi. Even though I usually speak Telugu at home, I notice that other languages always tend to find their way into my conversations. I used to wonder why.

Linguists refer to this concurrent use of two or more languages in a conversation as code-switching. Thought to be a natural outgrowth of multilingual usage, code-switching is considered distinct from other linguistic practices such as language transfer and language borrowing. In addition to languages themselves, code-switching can also involve switching between dialects, styles of speech, gestures, body language, and vocal registers.

Researchers believe there are many reasons, conscious or unconscious, why people code-switch. One of them is a change from an informal to a formal situation, such as switching from one’s native language to a second language to ease interpersonal relationships. Personally, I often code-switch when I ‘lose’ a word or phrase in the language I am using and automatically switch languages to find an appropriate word. At times, I use phrases in another language because there is no equivalent way to express a word, phrase, or emotion in English. Many multilingual speakers find that some concepts are more easily expressed in one language and lose an important part of their meaning when described in a different language.

Let’s see what these terms actually refer to!

Code-Mixing refers to “the embedding of linguistic units such as phrases, words, and morphemes of one language into an utterance of another language.”

Here’s an example that illustrates the phenomenon of Code-Mixing:

Main kal movie dekhne jaa rahi thi and raaste me I met Sudha.

Translation: I was going for a movie yesterday and, on the way, I met Sudha.

Simply put, code-mixing is the mixing of two or more languages while communicating. It is quite common for a speaker who knows two or more languages to take one or more words from one language and introduce them while speaking another language.

If I know French as well as English, for example, there will be times when I mix some English words into my French sentences. That is, in fact, very common: languages have this kind of effect on one another. It is also quite rare for bilinguals to utter sentences that belong purely to one language.

Now, how is this different from Code Switching?

Code-Switching is simply a “juxtaposition within the same conversation of speech belonging to two different grammatical systems or sub-systems.”

Here’s an example that illustrates the phenomenon of Code-Switching:

I was going to a movie yesterday. Raaste men mujhe Sudha mil gayi.

Translation: I was going for a movie yesterday; I met Sudha on the way.

Note: Code-switching is often mistaken for code-mixing (in fact, many use the two terms interchangeably). As similar as they appear, since both refer to a combination of two languages, there is a small difference. In a single conversation, if a speaker who is speaking, for example, English switches to French (and perhaps back to English), that is code-switching. Here, the speaker is not mixing just a few words of one language into the other; they speak one language and then switch to another. One sentence is spoken in one language, the next in another, and so on.

Also, please note that I have taken English and French as the example, but code-mixing and code-switching are possible between any languages.

“Code-Switching (CS) is usually an inter-sentential phenomenon, while Code-Mixing (CM) is an intra-sentential one.”
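Assuming each token already carries a word-level language tag (in practice these come from a language-identification model; the tags below are hand-assigned for illustration, and the function name is mine), the inter- vs intra-sentential distinction can be sketched as a tiny heuristic:

```python
def mixing_type(sentences):
    """Classify a conversation given per-sentence lists of word-level
    language tags. Mixed tags inside a sentence -> code-mixing;
    monolingual sentences in different languages -> code-switching."""
    if any(len(set(tags)) > 1 for tags in sentences):
        return "code-mixing"       # intra-sentential mixing
    if len({tags[0] for tags in sentences if tags}) > 1:
        return "code-switching"    # inter-sentential switching
    return "monolingual"

# "Main kal movie dekhne jaa rahi thi and raaste me I met Sudha."
print(mixing_type([["hi", "hi", "en", "hi", "en", "en", "en", "en"]]))  # → code-mixing

# "I was going to a movie yesterday. Raaste men mujhe Sudha mil gayi."
print(mixing_type([["en"] * 8, ["hi"] * 6]))                            # → code-switching
```

A real system would of course need robust word-level language identification first, which is itself a hard problem for transliterated text like Hinglish.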

Below are a few datasets commonly used in the literature for code-mixing analysis of Spanglish:

  • Yo-Yo Boing! (YYB) is a 58,494-word novel by Puerto Rican author Giannina Braschi, consisting of alternating chapters of poetry in English, Spanish, and mixed Spanish-English.
  • Killer Cronicas: Bilingual Memories (KC) is a 40,469-word work by Jewish Chicana author Susana Chavez-Silverman that consists of email messages written entirely in ‘Spanglish’.
  • The Spanish in Texas dataset (SpinTX), compiled by Bullock & Toribio, consists of over 500,000 words of transcriptions from interviews with 97 heritage Spanish speakers across Texas.
  • DSTC2 restaurant reservation dataset (Henderson et al., 2014a)

Apart from India, such code-mixing is also prevalent in other multilingual regions of the world, for example Spanglish (Spanish-English), Frenglish (French-English), Porglish (Portuguese-English), and so on. To cater to such users, it is essential to create and annotate datasets containing code-mixed conversations and thus facilitate the development of code-mixed conversation systems.

The datasets below are used specifically for the analysis of code-mixed Indian languages, mainly Hinglish (Hindi-English):

How do we quantify the amount of code-mixing in a piece of text? Several metrics have been proposed in the literature; here are a few of the most widely used:

I. The Multilingual Index (M-index)

Developed from the Gini coefficient, it is a word-count-based measure that quantifies the inequality of the distribution of language tags in a corpus of at least two languages. The M-index is calculated as follows:

M-index = (1 − Σⱼ pⱼ²) / ((k − 1) · Σⱼ pⱼ²)


where k (> 1) is the total number of languages represented in the corpus, pⱼ is the proportion of words in language j (its word count divided by the total number of words in the corpus), and j ranges over the languages present in the corpus. The index equals 0 for a monolingual corpus and 1 when all k languages are equally represented.
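As a quick illustration, here is a minimal Python sketch of the M-index over a list of word-level language tags (the function name and tag labels are mine, chosen for the example):

```python
from collections import Counter

def m_index(tags):
    """M-index over word-level language tags: 0 for a monolingual
    corpus, 1 when all k languages are equally represented."""
    counts = Counter(tags)
    k = len(counts)
    if k < 2:
        return 0.0                 # monolingual corpus
    total = sum(counts.values())
    sum_p2 = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_p2) / ((k - 1) * sum_p2)

print(m_index(["hi"] * 3 + ["en"] * 3))  # → 1.0 (perfectly balanced)
print(m_index(["en"] * 10))              # → 0.0 (monolingual)
```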

II. The Code-Mixing Index (CMI)


At the utterance level, this amounts to finding the most frequent language in the utterance and then counting the frequency of the words belonging to all the other languages present:

CMI(x) = 100 × (N − max(tᵢ)) / N if N > 0, and 0 otherwise,

where tᵢ is the number of tokens of language Lᵢ in utterance x. If an utterance x contains only language-independent tokens, its code-mixing is zero; for other utterances, the level of mixing depends on the fraction of language-dependent tokens that belong to the matrix language (the most frequent language in the utterance) and on N, the number of tokens in x excluding the language-independent ones (i.e., counting all tokens that belong to some language Lᵢ).
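Here is a minimal utterance-level sketch, assuming a special "univ" tag for language-independent tokens (punctuation, named entities, and the like); the tag names are my own convention:

```python
from collections import Counter

def cmi(tags):
    """Utterance-level Code-Mixing Index. 'univ' marks
    language-independent tokens, which are excluded from N."""
    lang_tags = [t for t in tags if t != "univ"]
    n = len(lang_tags)
    if n == 0:
        return 0.0                  # only language-independent tokens
    matrix = Counter(lang_tags).most_common(1)[0][1]
    return 100 * (n - matrix) / n   # fraction outside the matrix language

print(cmi(["hi", "hi", "en", "hi", "univ", "en"]))  # → 40.0
print(cmi(["en", "en", "univ"]))                    # → 0.0 (no mixing)
```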

III. Language Entropy

The language entropy tells how many bits of information are needed to describe the distribution of language tags. Using the same notational conventions as for the M-index, language entropy is calculated as

LE = − Σⱼ pⱼ log₂ pⱼ


IV. I-Index (Integration-Index)

This metric describes the probability of switching within a text: it is simply the proportion of switch points relative to the number of language-dependent tokens in the corpus. In other words, it is the approximate probability that any given token in the corpus is a switch point. Given a corpus of tokens tagged by language {lⱼ}, where j ranges from 1 to n (the size of the corpus) and i = j − 1, the I-index is calculated by the expression

I-index = (1 / (n − 1)) · Σⱼ₌₂ⁿ S(lᵢ, lⱼ), where S(lᵢ, lⱼ) = 1 if lᵢ ≠ lⱼ and 0 otherwise.
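The switch-point counting can be sketched in a few lines (function name mine):

```python
def i_index(tags):
    """Proportion of adjacent token pairs whose language tags differ,
    i.e. the approximate probability that a token is a switch point."""
    if len(tags) < 2:
        return 0.0
    switches = sum(prev != cur for prev, cur in zip(tags, tags[1:]))
    return switches / (len(tags) - 1)

print(i_index(["en", "en", "hi", "hi", "en"]))  # → 0.5 (2 switches / 4 pairs)
```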


For now, I won’t dive deep into each of the above metrics; I will be writing another primer dedicated to these metrics and to methods for implementing code-switching soon :)

🌍 Resources

Disclaimer: This list is not intended to be exhaustive, nor to cover every single topic in Code-Mixing and Code-Switching. There are plenty of amazing resources available; this is rather a selection of the most impactful recent works from the past few years and months, mostly influenced by what I read. Here are a few picks for you:








In this article, we discussed a few important concepts in the trending research area of Code-Switching. Some are common, some lesser-known, but all of them could be a great addition to your linguistic data exploration toolkit. Hopefully, you will find some of them useful in your current and future projects. And if you have reached this section of the article, congratulations 🎉, you now have a decent understanding of how Code-Mixing and Code-Switching work!

This is just a tiny taste of what you can do with code-mixed languages. In future posts, we’ll talk about other applications of Code-Mixing and its computational modeling. But until then, please feel free to share your views!

Happy Exploring and Stay Safe!

NLP Enthusiast | Researcher | Final-year CS Student