
Neural Language Models and Word Representation

Notes on how language models moved from counting words to learning dense vector representations, and why that shift became foundational for modern NLP.

Apr 2026·Study Note·NLP / Representation Learning

One of the most important shifts in NLP was the move from symbolic language processing to learned representation. Earlier systems often treated words as discrete units: a token was simply a token, with no internal notion of similarity, analogy, or neighborhood. In that world, the model could count frequency, memorize co-occurrence, or look up handcrafted rules, but it struggled to express a more continuous view of meaning.

Neural language models changed that. Instead of representing a word as an isolated symbol, they learned to place words inside a shared geometric space. Words that appear in similar contexts begin to occupy nearby regions. This seems simple now, but it was a conceptual breakthrough. A model could finally encode that king and queen are related, that dog and cat are closer than dog and democracy, and that syntax and semantics can emerge from statistical structure rather than being manually written down.

Why this topic matters

I like this topic because it feels like the bridge between classical NLP and modern deep learning. When people first learn transformers, it is easy to jump directly into attention and large models without pausing at the more basic question: how does a machine represent a word in the first place?

Word representation is where that story really begins. Before a model can reason over a sentence, predict the next token, or classify a document, it has to turn language into something computationally meaningful. That is why embeddings matter. They are not just a technical trick. They are a way of converting the discrete surface of language into a continuous structure a model can learn from.

Distributional hypothesis

The intuition behind word representation is often summarized by the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. This idea is older than neural networks, but neural methods made it operational in a much more scalable way.

Suppose a corpus repeatedly shows that doctor, nurse, and hospital appear near related contexts, while river, boat, and shore appear in another cluster. A good representation system should be able to reflect that statistical structure. The model does not need a dictionary definition first. It learns relational meaning from usage.
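That statistical structure is easy to make concrete with counts alone. The sketch below builds tiny co-occurrence vectors from a toy corpus of token sequences (the corpus, the `cooccurrence` helper, and its `window` parameter are all illustrative assumptions, not a method from the text):

```python
from collections import Counter

def cooccurrence(sentences, window=2):
    """For each word, count which words appear within a small window of it."""
    counts = {}
    for sent in sentences:
        tokens = sent.split()
        for i, w in enumerate(tokens):
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.setdefault(w, Counter()).update(ctx)
    return counts

# Toy token sequences; a real corpus would have millions of sentences.
corpus = [
    "doctor nurse hospital patient",
    "nurse doctor hospital ward",
    "river boat shore water",
]
counts = cooccurrence(corpus)
print(counts["doctor"].most_common(3))  # doctor co-occurs with nurse, hospital...
```

Even at this scale, doctor accumulates counts with nurse and hospital but none with river, which is exactly the relational signal neural methods later learn to compress.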

This is what makes embeddings powerful. Meaning is no longer stored as a symbolic entry in a table. It is distributed across dimensions in a vector space, and similarity can be measured numerically through distance or cosine similarity.
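As a minimal sketch of that numerical similarity, here is cosine similarity over hand-picked toy vectors (the three-dimensional values are invented for illustration, not taken from any trained model):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional embeddings (illustrative values only).
dog = np.array([0.9, 0.8, 0.1])
cat = np.array([0.8, 0.9, 0.2])
democracy = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(dog, cat))        # high: vectors point the same way
print(cosine_similarity(dog, democracy))  # low: vectors point apart
```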

From counts to neural models

Before neural embeddings became standard, NLP often relied on sparse representations such as one-hot vectors or count-based features. In a one-hot representation, each word is a vector whose length equals the vocabulary size, and only one position is active. This is mathematically convenient but semantically empty: every word is equally far from every other word.
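The "semantically empty" point can be verified directly: under one-hot encoding, every pair of distinct words sits at exactly the same distance. A small sketch with an assumed toy vocabulary:

```python
import numpy as np

# Toy vocabulary; in practice this would be tens of thousands of words.
vocab = ["dog", "cat", "democracy", "river"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Every pair of distinct words is exactly sqrt(2) apart, so the
# representation carries no notion of similarity at all.
d1 = np.linalg.norm(one_hot("dog") - one_hot("cat"))
d2 = np.linalg.norm(one_hot("dog") - one_hot("democracy"))
print(d1, d2)  # both sqrt(2)
```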

Neural language models improved on this by learning dense low-dimensional vectors jointly with a predictive objective. Instead of asking only how often a word appears, the model asks what words tend to appear around it, or what word is likely to come next given a context. Training then adjusts the embedding space so that useful statistical regularities are compressed into the vectors themselves.

Bengio and colleagues helped establish this idea by proposing a neural probabilistic language model in which words are mapped into continuous feature vectors and the model learns to predict the next word from previous context. That move toward continuous representations set the stage for later methods such as word2vec and, eventually, contextual language models.
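The shape of that model can be sketched as a single forward pass: look up continuous vectors for the context words, concatenate them, pass them through a hidden layer, and produce a distribution over the next word. All sizes and weights below are random placeholders, so this shows the architecture's data flow rather than anything Bengio et al. trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
V, d, n, h = 10, 4, 2, 8   # vocab size, embedding dim, context length, hidden units

C = rng.normal(size=(V, d))          # embedding table (the learned lookup)
H = rng.normal(size=(n * d, h))      # hidden layer weights
U = rng.normal(size=(h, V))          # output weights

def next_word_probs(context_ids):
    """Bengio-style forward pass: embed, concatenate, tanh, softmax."""
    x = C[context_ids].reshape(-1)   # concatenate the context embeddings
    z = np.tanh(x @ H) @ U           # hidden layer, then scores over the vocab
    e = np.exp(z - z.max())          # stable softmax
    return e / e.sum()

p = next_word_probs([3, 7])          # probability of each candidate next word
print(p.argmax(), p.sum())           # sums to 1
```

Training would backpropagate the prediction loss into `C` as well as `H` and `U`, which is exactly how the embedding space ends up encoding useful regularities.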

CBOW and Skip-gram

The word2vec family made neural word representation especially influential because it was simple, efficient, and empirically strong. The two classic variants are CBOW and Skip-gram.

CBOW, or Continuous Bag of Words, predicts the center word from its surrounding context. If the context is something like “A cute ___ is reading,” the model uses neighboring words to infer the missing token. Skip-gram does the reverse: given the center word, it predicts surrounding context words. These objectives are local and lightweight, but across a large corpus they force the model to organize vocabulary into a meaningful geometric structure.

Figure: The proxy training tasks behind word2vec. CBOW predicts a target word from neighboring context, while Skip-gram uses the center word to predict nearby tokens.
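The training data behind both objectives is just (center, context) pairs harvested from a sliding window. A minimal sketch (the `training_pairs` helper and window size are my own illustrative choices): read the tuples as Skip-gram examples, center predicting context; grouping all context words per center gives the CBOW view of the same data.

```python
def training_pairs(tokens, window=1):
    """Skip-gram pairs: each center word paired with its nearby context words."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "a cute dog is reading".split()
print(training_pairs(sentence))
# ("dog", "cute") and ("dog", "is") both become training examples
```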

What I like about this formulation is how modest it looks compared with modern language models. The task itself is not glamorous. It does not ask the model to summarize a document, answer a question, or generate an essay. It only asks the model to exploit local context. But that simple proxy objective turns out to be enough to learn representations that carry a surprising amount of syntactic and semantic information.

What the model learns

A useful way to think about embeddings is that they are compressed summaries of distributional behavior. During training, the model gradually adjusts vectors so that words which help solve the prediction task become arranged in a structured space. Similar words tend to cluster. Certain relational directions appear. In classic examples, vector arithmetic can reflect analogical relations such as king - man + woman ≈ queen.
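The analogy computation itself is just vector arithmetic followed by a nearest-neighbor search. The two-dimensional embeddings below are deliberately arranged so the "gender" direction is consistent; real word2vec vectors exhibit this structure only approximately:

```python
import numpy as np

# Toy embeddings with a consistent relational direction (illustrative only).
emb = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

def analogy(a, b, c):
    """Solve a - b + c ~= ? by nearest cosine neighbor, excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman"))  # -> queen in this toy space
```

Excluding the query words from the candidate set mirrors standard practice in analogy evaluation, since the nearest neighbor of the target is often one of the inputs themselves.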

Of course, these representations do not capture meaning perfectly. They are learned approximations shaped by corpus statistics and training objectives. But they were a major step forward because they replaced rigid discrete symbols with flexible distributed representations. That change made transfer, generalization, and semantic similarity far easier to model.

Why neural language models matter

Neural language models matter not only because they improved benchmark performance, but because they reframed the core NLP problem. Instead of building a pipeline of handcrafted linguistic features, they allowed the system to learn its own internal representation directly from text. That shift is one of the reasons modern NLP became so powerful.

In hindsight, methods like CBOW and Skip-gram may look small beside transformers, but they introduced a logic that remains central today: represent language in dense vector spaces, train those representations with predictive objectives, and let downstream performance emerge from the learned geometry. Even now, when we use contextual embeddings, we are still operating inside that broader representation learning paradigm.

What I find interesting

What I find most beautiful about this topic is that it makes meaning feel partly geometric. Language is messy, contextual, historical, and ambiguous, yet these models still manage to carve out useful structure in a vector space. There is something elegant in the idea that a machine can learn semantic neighborhoods just by trying to predict what appears nearby.

I also think this topic is pedagogically important. If someone understands one-hot vectors, dense embeddings, CBOW, and Skip-gram, they are already much better positioned to understand why transformers need token embeddings, positional information, and large-scale pretraining. In that sense, neural language models are not an old chapter that became irrelevant. They are part of the foundation.

Closing thought

For me, neural language models are exciting because they capture the moment NLP started becoming truly representation-driven. They show that before a model can say anything intelligent about language, it has to learn how words relate to one another in the first place. That insight still feels fundamental. It is one of the reasons I like coming back to this topic whenever I try to understand where modern NLP really began.

References

Focused references for the core readings behind neural language modeling and word embeddings.

  • Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155. jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arxiv.org/abs/1301.3781
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. arxiv.org/abs/1310.4546
  • Jurafsky, D., & Martin, J. H. Speech and Language Processing. Draft chapters on language modeling and word embeddings. web.stanford.edu/~jurafsky/slp3/