Denise Müller


Semantic comparison of the "Word of the year" selected worldwide using word embeddings

Introduction

Every year, countries around the world choose a word that sums up the biggest events or trends of that year. It's like a little linguistic time capsule, capturing the essence of the past 12 months. This tradition reaches back well into the 20th century: the first German “Wort des Jahres” dates from 1971, the first American “Word of the Year” from 1990. These words give you a peek into what's been making headlines and sparking conversations in different parts of the globe. Take 1990, for instance. The American Dialect Society went with "bushlips". It's a playful spin on "bulls*it" and was inspired by George H. W. Bush's famous promise, "Read my lips: no new taxes." But, as it turned out, he didn't quite keep his word, and folks weren't too thrilled about it.

When it comes to the chosen words, there's a whole lot of variety. Some words bring a sense of joy or lightness, while others carry a more solemn tone, reflecting the events and challenges of the year. Also, it's not just about local stories – some words speak to global events, showing just how connected our world is. While some of these words may be officially recognized and included in dictionaries, others, like the previously introduced “bushlips”, are contemporary slang or neologisms that gained popularity within that particular year.

With all that in mind, the words of the year stand out as pretty intriguing linguistic creations. While collecting the words, I frequently found myself caught up in the events and circumstances that gave meaning to the particular word I was researching. However, what interested me was not the meaning of any single chosen word, but the relation between all the words selected worldwide - I wondered whether there was an objective way to uncover global topics and issues by revealing the similarity of the chosen words.

In this blog post, I will try to perform a semantic comparison of these words of the year using two different word embedding approaches.

Word embeddings are mathematical representations of words, where similar words have similar representations, allowing for analysis of words based on their contextual similarity.

Being new to all of this, I am using this playful research question as an opportunity to learn, reflect and improve. Wikipedia refers to the “Word of the Year” as “WOTY”, so for the sake of brevity I will do so too.

Data

Even though the tradition is common, not every country chooses a WOTY. In fact, it was quite a challenge to identify the countries that have this tradition and to gather word lists from around the world. Furthermore, working with a diverse set of languages posed a challenge from the outset. Most WOTYs have local relevance: they address local issues and are not easily translatable for foreigners. For instance, translating "bushlips" is difficult without the natural association with "bulls*it" and without knowing the historical context. This is why WOTYs are typically reported by local news outlets, which can be difficult to comprehend if the language is unfamiliar. And running every candidate news page through Google Translate just to check whether it contains the information I’m looking for is laborious.

However, moving from one source to another, I was able to gather a sample of word lists from 14 countries/regions. Namely:

  • Denmark
  • Germany
  • Netherlands
  • Ukraine
  • Spain
  • Portugal
  • Russia
  • Japan
  • Australia
  • USA
  • Great Britain
  • Latvia
  • Estonia
  • Norway

This list is not complete. Other countries, such as Lithuania, Luxembourg, Liechtenstein, Switzerland and Austria, also run such selections. European countries are unintentionally (and undesirably) overrepresented. However, this sample still fulfils the needs of my project.

If a country had more than one WOTY, I took the one listed first.

Method

There are two different methods worth considering to obtain these vector representations.

  1. Context-independent methods: These methods, such as Word2Vec and GloVe, produce one fixed embedding per word. Word2Vec learns embeddings by training a model to predict a word from its (narrow) context, the embeddings being the weights the model learns; GloVe derives them from global word co-occurrence statistics. Because such a model does not consider the context of the input at lookup time, it takes a single word as input and outputs a single vector representation of that word.
  2. Context-dependent methods: These methods, such as the transformer-based BERT and the LSTM-based ELMo, generate word embeddings that vary depending on the broader context in which a word is used. This allows them to capture more nuanced semantic relationships, including polysemy (where a word has multiple meanings).

Word embeddings can be extracted from pre-trained models of both kinds, context-independent and context-dependent. Since I lack the resources to train my own model, I rely on pre-trained models. Again: currently, this suffices for the project's needs.

Visualization

I used two methods to interpret the relations between the word embeddings:

  • Dimensionality reduction with t-SNE:
    Maps high-dimensional data into a two- or three-dimensional map, in such a way that similar objects are modeled by nearby points and dissimilar objects by distant points.
  • Cosine similarity between vectors:
    Measures the cosine of the angle between two vectors in a multidimensional space, indicating how similar their directions are, regardless of their magnitude. Cosine similarity can also be used as the distance metric for t-SNE.

I made scatter plots of the data reduced by t-SNE and heatmaps of the cosine similarities.
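As a rough sketch of what these two visualizations can look like in code (with placeholder labels and random vectors standing in for the real embeddings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder data - the real project has one embedding per WOTY.
words = ["AI", "professor", "gaslighting", "ChatGPT"]
vectors = np.random.rand(len(words), 300)

# t-SNE: map the high-dimensional vectors onto a 2D plane
# (perplexity must be smaller than the number of samples).
coords = TSNE(n_components=2, metric="cosine", perplexity=3,
              random_state=42).fit_transform(vectors)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, words):
    ax.annotate(label, (x, y))

# Cosine similarity: all pairwise similarities shown as a heatmap.
sims = cosine_similarity(vectors)
fig, ax = plt.subplots()
im = ax.imshow(sims)
ax.set_xticks(range(len(words)), labels=words, rotation=90)
ax.set_yticks(range(len(words)), labels=words)
fig.colorbar(im)
plt.show()
```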

Initial attempt - GloVe

I began by examining my sample and saw that, for the year 2023, all I had were 14 plain words. Given the limited information available, I gravitated towards context-independent methods, which appeared simpler and more straightforward (this, however, turns out to be insignificant later).

As noted, there are two prominent model types: Word2Vec and GloVe. I arbitrarily chose GloVe. As far as I understood, the choice between these models is insignificant for this project.

There is one feature of context-independent models like Word2Vec and GloVe that comes in pretty handy: when working with word embeddings from these pre-trained models, you can use their vector databases directly. You do not need the whole model, just a file with the vectors representing the words the model was trained on.

To begin the project, I imported Gensim, a library designed for working with models like GloVe that provides useful utilities for extracting vector embeddings. Extracting a word embedding from the model is easy: 1. load your file, 2. load the vector database, 3. look up the word embedding for your word in the vector database. Working at this level of abstraction, there isn’t really much magic to it. But, oops - suddenly this error message appears:
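The error is the KeyError gensim raises when a word is not in the vocabulary. Here is a minimal sketch of the three steps that reproduces it, using the "glove-wiki-gigaword-300" vectors from gensim's downloader as a stand-in for whatever GloVe vector file you have locally:

```python
import gensim.downloader as api

# 1./2. Load a pre-trained GloVe vector database (a gensim KeyedVectors object).
glove = api.load("glove-wiki-gigaword-300")

# 3. Look up the embedding for a word.
print(glove["professor"].shape)   # (300,) - "professor" is in the vocabulary

print(glove["chatgpt"].shape)     # raises KeyError - "chatgpt" is not in the vocabulary
```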

This makes sense. As previously mentioned, I was working with static embeddings extracted from a fixed vector database - the model’s “vocabulary”. A word can only be found in the model's vector database if it appeared in the training data often enough. Words like “ChatGPT” are new and therefore underrepresented in the training data, so the key “ChatGPT” is simply not present. This was a quick lesson right at the beginning, stressing how difficult it can be to work with all these distinct neologisms that have only been around for a few months (working with older WOTYs yields similar results).

Obviously, one big limitation of this approach is its fixed vocabulary. I figured that no model like GloVe would be flexible enough to process all the neologisms that make up a great part of the WOTYs. Luckily, however, there are other approaches.

Breaking things down - BERT

This is how I was led to BERT (Bidirectional Encoder Representations from Transformers), a transformer-based model designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context. This feature, however, is not the reason why I’m using BERT. The reason BERT is interesting to me is that it uses static subword embeddings. As “subword embedding” implies, BERT can describe a word as a combination of its subwords. For example, this is how BERT describes the word “ChatGPT”:
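A quick way to see this split is to call the tokenizer directly (the exact subword pieces depend on the model's WordPiece vocabulary, so treat the output below as illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits an unknown word into known subword pieces ("##" marks continuations).
print(tokenizer.tokenize("ChatGPT"))
# e.g. ['chat', '##gp', '##t']
```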

BERT’s vocabulary isn’t infinite, but the number of words it can describe is vast. If you present BERT with a word it does not recognize at all, it will break the word down to the character level if it must. This feature allows us to extract word embeddings for all our words, no matter how odd they may be.

Using the transformers library along with the pre-trained model “bert-base-uncased”, I fetched the results.
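Roughly, the extraction can look like this - a sketch rather than my exact script, with mean-pooling over the subword vectors as one common way to get a single vector per word:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(word: str) -> torch.Tensor:
    """Embed a single word by mean-pooling its subword token vectors."""
    inputs = tokenizer(word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        output = model(**inputs)
    # output.last_hidden_state has shape (1, num_subwords, 768)
    return output.last_hidden_state.mean(dim=1).squeeze(0)

# Cosine similarity between two of the words from the excerpt below.
sim = torch.nn.functional.cosine_similarity(
    word_embedding("professor"), word_embedding("AI"), dim=0
)
print(float(sim))
```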

But take a look at the excerpt of the cosine similarities:

{[..] 'professor vs. AI': 0.6547226568084288,
[..] 'искусственный интеллект vs. AI': 0.1590922438489104, [..]}

For our words, the resulting word embeddings are just as odd as the words themselves.
The words “professor” and “AI” end up having one of the highest cosine similarities, while “искусственный интеллект” (Russian for artificial intelligence) and “AI” have one of the lowest. The plot emphasizes these odd relations.

Obviously this is not working the way we want. Not being an expert on this, I pondered the possibility that there might be a challenge inherent to the concept of subword embeddings: while combining "recurring" and "##ly" creates the term 'recurringly', combining "gas" and "##lighting" does not convey the meaning of 'gaslighting'. Overall, BERT is a powerful model. Given context, it excels, thanks to its position embeddings and the transformer architecture. However, like any model, it depends on its training. It cannot understand the meaning of a word without knowing the context in which it is used.

Another change in approach was necessary.

Generating Sentences

I eventually realised I needed more information about each word. As previously mentioned, obtaining the words was not easy; I guessed that gathering information about them would be even harder.

Asking ChatGPT for the definition of a word also didn’t work out. It could guess at a word’s meaning, but struggled to link it to a cultural event.

After extracting articles from foreign online newspapers and letting ChatGPT write a word definition based on the given input, I ended up with a list of word meanings for the years 2022 and 2023.

Finally, this could be it - SentenceTransformers

SentenceTransformers is a Python framework for computing sentence embeddings. Sentence Transformer models are based on transformer models like BERT, but are fine-tuned in a siamese network setup so that the resulting sentence embeddings can be compared meaningfully. SentenceTransformers is especially useful when comparing large numbers of sentences, but it also provides practical tools for a small project like this one, so I decided to use it. I used the pre-trained model “paraphrase-multilingual-MiniLM-L12-v2”.

Sending the data through its pipeline is super easy and - finally - the results are satisfying.
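In code, the pipeline is just a few lines (the definitions below are placeholders for the ChatGPT-generated ones):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Placeholder definitions - the real input is one generated definition per WOTY.
sentences = [
    "AI, short for artificial intelligence, dominated public debate in 2023.",
    "Gaslighting describes manipulating someone into doubting their own perception.",
]

embeddings = model.encode(sentences)                  # shape: (n_sentences, 384)
similarities = util.cos_sim(embeddings, embeddings)   # pairwise cosine similarities
print(similarities)
```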

Here are the results:

1. The plotted data using t-SNE

2. The cosine similarity between each vector displayed as heatmap

Conclusion

While getting to know these word embedding approaches and learning about these words has been so much fun, it's important to be aware that none of these methods quite nails down the true essence of the relationships between the different WOTYs. Our visualizations still throw up some funky connections, perhaps due to similar sentence structures or some background noise. Even with meticulously crafted sentences for each word, understanding the significance of these yearly chosen WOTYs, which are deeply rooted in a country's culture, remains a complex task. It seems we'd need a substantial amount of data and a lot of digging to make sense of it all!

Published: 03/19/24

made with ♡ by denise müller