Diving Deep Into the World of Embeddings: Unveiling Lesser-Known Statistical Properties

Abhijit Gupta
3 min read · Jun 7


Opening the door to the world of Natural Language Processing (NLP), you’re greeted by a fascinating character: word embeddings. These mathematical maestros transform arbitrary words into rich, numerical vectors. Like expert tailors, they craft a custom ‘suit’ for each word, capturing its meaning within the cloth of a high-dimensional vector space.

Now, if you’re in the realm of machine learning or NLP, you’ve probably heard about the cool kids on the block: Word2Vec, GloVe, and FastText. You know they’re great at capturing semantic and syntactic relationships between words. But have you ever pondered what lies beneath these vector-fitted words, or the hidden statistical properties they hold? Let’s dive deeper.

The first stop on our deep-dive journey is a foundational principle: the distributional hypothesis. This gem posits that words appearing in similar contexts share semantic meaning. Word embeddings, particularly those trained with methods like Word2Vec, embody this principle. What’s less known, however, is that the high-dimensional vector spaces created by these embeddings are not uniformly distributed.

Instead, they form a beautiful structure that is often modeled as approximately multivariate Gaussian. This structure offers intriguing insights. It hints at the natural clusters of words in language and provides a statistical foundation for word similarity. In simpler terms, if two words often party together in similar sentences, their vector representations hang out together in the vector space.

The takeaway? This statistical property allows us to apply standard mathematical techniques to these embeddings, like calculating cosine similarity to measure word similarity, or even performing vector arithmetic to solve analogies. Think “King − Man + Woman ≈ Queen”. Amazing, isn’t it?
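To make this concrete, here is a minimal sketch of cosine similarity and analogy arithmetic. The four-dimensional vectors are invented purely for illustration (real embeddings from Word2Vec or GloVe have 50–300+ dimensions and would be loaded from a trained model):

```python
import math

# Toy 4-dimensional vectors, invented for illustration only.
embeddings = {
    "king":  [0.8, 0.7, 0.1, 0.9],
    "queen": [0.8, 0.1, 0.7, 0.9],
    "man":   [0.2, 0.8, 0.1, 0.3],
    "woman": [0.2, 0.2, 0.7, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via vector arithmetic: b - a + c."""
    target = [bi - ai + ci for ai, bi, ci
              in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Return the word (other than the inputs) closest to the target vector.
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(candidates[w], target))

print(analogy("man", "king", "woman"))  # → queen
```

With a real model you would use a library routine instead (e.g. gensim’s `most_similar`), but the arithmetic underneath is exactly this.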

Our next stop in this hidden world of embeddings brings us face-to-face with the intriguing concept of ‘Vector Offsetting.’ If you’ve dabbled with embeddings, you’re likely familiar with the idea of vector arithmetic. However, the underlying phenomenon here is vector offsetting, a property that is often overshadowed by its glitzy cousin, vector arithmetic.

In layman’s terms, vector offsetting is the consistent difference or ‘offset’ observed between analogous pairs of word vectors. For example, in a trained model, the vector difference between ‘King’ and ‘Queen’ is often quite similar to the difference between ‘Man’ and ‘Woman.’ This uncovers a profound idea: the ‘direction’ in the vector space holds semantic information, a property that is beautifully exploited to solve word analogies.

But why is this important, you ask? Understanding vector offsetting helps us comprehend how relationships between words are captured within the vector space, enabling us to fine-tune our models for better performance.
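A quick sketch shows what “consistent offset” means in practice. The toy vectors below are invented so that the two offsets are exactly parallel; in a real trained model they would only be approximately so:

```python
import math

# Toy vectors invented for illustration; real analogous offsets are
# only approximately parallel, never exact.
embeddings = {
    "king":  [0.8, 0.7, 0.1, 0.9],
    "queen": [0.8, 0.1, 0.7, 0.9],
    "man":   [0.2, 0.8, 0.1, 0.3],
    "woman": [0.2, 0.2, 0.7, 0.3],
}

def offset(a, b):
    """Element-wise difference between the vectors for words a and b."""
    return [x - y for x, y in zip(embeddings[a], embeddings[b])]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

# If 'direction' carries semantics, the royal offset and the gender
# offset should point (nearly) the same way.
royal = offset("king", "queen")
gender = offset("man", "woman")
print(cosine(royal, gender))  # → 1.0 with these toy vectors
```

A cosine near 1.0 between two offsets is exactly the signal analogy solvers exploit.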

Let’s move on to another less-explored aspect: dimensionality. Now, we know that word embeddings are high-dimensional, often ranging from 50 to 300 dimensions or more. But the question that rarely gets the limelight is: why? What’s so special about these dimensions that we need so many?

Each dimension in the word embedding space can be thought of as capturing some aspect of the word’s meaning. These dimensions allow embeddings to encapsulate a plethora of information, such as the word’s sentiment, its grammatical role, or its level of formality.

However, the really cool, often overlooked aspect is this: not all dimensions are created equal. Some dimensions end up carrying more meaningful information than others. Statistically, the ‘importance’ or ‘semantic richness’ of dimensions often follows a power-law distribution, much like the explained variance of principal components. This understanding can lead to more efficient embeddings by reducing dimensionality while preserving key semantic information.
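As a crude sketch of this idea, we can rank dimensions by their variance across the vocabulary and keep only the top-k. (In practice you would use PCA or SVD; this variance ranking is a simplified stand-in, and the vectors below are invented toy data constructed so two dimensions clearly dominate.)

```python
# Invented toy vectors: dimensions 0 and 2 vary a lot across words,
# dimensions 1 and 3 barely vary at all.
vectors = {
    "cat": [0.9, 0.01, 0.5, 0.02],
    "dog": [0.8, 0.02, 0.4, 0.01],
    "car": [0.1, 0.02, 0.9, 0.02],
    "bus": [0.2, 0.01, 0.8, 0.03],
}

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

dims = len(next(iter(vectors.values())))
per_dim_variance = [variance([vec[d] for vec in vectors.values()])
                    for d in range(dims)]

# Keep the k highest-variance dimensions; low-variance dimensions do
# little to distinguish words and can often be dropped.
k = 2
keep = sorted(range(dims), key=lambda d: per_dim_variance[d], reverse=True)[:k]
reduced = {w: [vec[d] for d in sorted(keep)] for w, vec in vectors.items()}
print(sorted(keep))  # → [0, 2]
```

The same intuition, applied via SVD on real embedding matrices, is why 300-dimensional vectors can often be compressed substantially with little loss.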

Finally, we touch on a crucial but often overlooked property of word embeddings: their dependence on the training corpus. It’s easy to think of these embeddings as static entities, fixed in their representation of words. However, the truth is that they’re highly dynamic and influenced by the data they’re trained on.

Consider different corpora – news articles, scientific journals, social media posts, or classical literature. Each has a unique linguistic style, vocabulary, and thematic focus. Consequently, the resulting embeddings will capture these nuances, leading to different representations even for the same word. This sensitivity to the training corpus underscores the importance of choosing a corpus aligned with your task. Embeddings trained on a mismatched corpus might not perform optimally, like using a map of Paris to navigate London!
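Here is a toy illustration of corpus sensitivity. The two embedding spaces below are entirely invented, hand-crafted so that ‘bank’ sits near ‘money’ in one and near ‘river’ in the other, the kind of shift you might see between a financial-news corpus and travel writing:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(x * x for x in b)))

def nearest_neighbor(word, space):
    """Return the other word in the space most similar to `word`."""
    vec = space[word]
    others = {w: v for w, v in space.items() if w != word}
    return max(others, key=lambda w: cosine(others[w], vec))

# Hypothetical 2-D embeddings from two different training corpora.
news_space = {            # e.g. trained on financial news
    "bank":  [0.9, 0.1],
    "money": [0.8, 0.2],
    "river": [0.1, 0.9],
}
travel_space = {          # e.g. trained on travel writing
    "bank":  [0.2, 0.9],
    "money": [0.9, 0.1],
    "river": [0.1, 0.95],
}

print(nearest_neighbor("bank", news_space))    # → money
print(nearest_neighbor("bank", travel_space))  # → river
```

Same word, same similarity metric, different neighborhoods, purely because the training data differed.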

And there we have it! Our journey through the lesser-known statistical properties of embeddings. Understanding these intricacies allows us to appreciate the complexities of these linguistic maestros and how to conduct them more effectively in the grand orchestra of NLP tasks. So the next time you work with embeddings, remember, there’s more than meets the eye in these mathematical marvels.



Abhijit Gupta

PhD, Machine Learning; Lead Data Scientist. I work on AI & Algorithms research and development. Link: https://www.linkedin.com/in/abhijit-gupta-phd-639568166