Information Retrieval, Search Engine, Artificial Intelligence

Kieu Tale

A Truyện Kiều search project about memory, retrieval, and how a machine learns to locate meaning inside text.

This project began with a simple question: if someone remembers only a fragment of a famous poem, how can a system help them find the right verse? That curiosity led me into information retrieval, vector space models, ranking systems, and the deeper science of searching.

Overview

TL;DR

Kieu Tale treats each verse of Truyện Kiều as a searchable document. By tokenizing the verses and converting them into TF-IDF vectors, the system can retrieve lines through token overlap, cosine similarity, and inverted-index lookup. The project is ultimately about one thing: how search systems represent text well enough to find it.

I enjoyed this project because it made search visible. Normally, we type words into a box and results appear instantly. But once you build retrieval yourself, you realize how many decisions sit underneath that apparent simplicity.

What counts as similarity? What deserves to rank first? How does a system help someone who remembers only half a line, or the mood of a phrase, rather than the exact wording? Those questions made this project far richer than I first expected.

The science of searching

Search is fundamentally a problem of representation. A machine cannot directly understand poetry as humans do. It cannot feel rhythm, memory, or atmosphere. Instead, the text must first be transformed into a structured form the system can compare.

In Kieu Tale, each verse becomes a document. That document is cleaned, tokenized, and encoded into a vector space. Once every line has coordinates, retrieval becomes possible. Search is then no longer magic. It becomes geometry, weighting, and ranking.

Core insight

Search quality depends less on the search bar itself and more on how documents are represented before the user ever types a query.

Preparing the text

Before retrieval begins, the corpus has to be cleaned. I normalized Unicode text, lowercased the verses, removed extra punctuation, and standardized spacing.
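A minimal sketch of that cleaning step, assuming only Python's standard library. The function names `clean_verse` and `tokenize` are illustrative, not the project's actual code, and splitting on whitespace is an assumption that yields syllable-level tokens for Vietnamese.

```python
import re
import unicodedata


def clean_verse(text: str) -> str:
    """Normalize one verse so that equal-looking strings compare equal."""
    # NFC composes base letters and combining marks into single code points,
    # which matters for Vietnamese diacritics typed in decomposed form.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split a cleaned verse on whitespace (an assumed, syllable-level tokenizer)."""
    return clean_verse(text).split()


sample = "  Trăm năm,  trong cõi người ta. "
tokens = tokenize(sample)  # six lowercase syllables, no punctuation
```

Without the NFC step, two strings that look identical on screen can fail to match because one stores a diacritic as a separate combining mark.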

This stage sounds minor, but it matters enormously. If the corpus is inconsistent, the system compares noise rather than meaning. Search quality often depends on these quiet details.

TF-IDF and vector space

Each verse is represented through TF-IDF features. A word's weight rises with how often it appears in a verse and falls with how many verses across the corpus contain it, so words gain importance not just by appearing, but by being distinctive.

Very common words become less useful, while informative words carry more weight. This is what allows the system to retrieve meaningful lines rather than simply rewarding repetition.

Once represented numerically, every verse occupies a position in space. Verses that share distinctive words point in similar directions. Search becomes a matter of measuring how closely a query's vector aligns with each verse's.
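To make the weighting and the geometry concrete, here is a small sketch written from scratch. It is one possible formulation, not necessarily the one Kieu Tale uses: the smoothed IDF formula, the sparse dict vectors, and the names `build_tfidf`, `cosine`, and `search` are all assumptions. The sample verses are the poem's opening lines, already cleaned, and serve only as illustrative data.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """Return one sparse TF-IDF vector per tokenized verse, plus the IDF table."""
    n_docs = len(docs)
    # Document frequency: in how many verses does each token occur?
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed IDF (an assumption): stays positive even for ubiquitous tokens.
    idf = {t: math.log((1 + n_docs) / (1 + count)) + 1 for t, count in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_tokens, vectors, idf, top_k=3):
    """Rank verse indices by cosine similarity to the query."""
    # Query tokens never seen in the corpus have no IDF and are dropped.
    q = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0]
    # Highest score first; ties broken by verse order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]


verses = [
    "trăm năm trong cõi người ta",
    "chữ tài chữ mệnh khéo là ghét nhau",
    "trải qua một cuộc bể dâu",
    "những điều trông thấy mà đau đớn lòng",
]
docs = [v.split() for v in verses]
vectors, idf = build_tfidf(docs)
results = search("cuộc bể dâu".split(), vectors, idf)  # only the third verse matches
```

Because cosine divides by both vector lengths, a three-word fragment can still score well against a full verse: what counts is the direction the weights point in, not how many words were typed.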

Ranking what matters

Retrieval is not only about finding matches. It is about ranking them. Which verse should appear first, second, or tenth? That ranking reflects assumptions about what users are trying to recover.

Overlap Search

Ranks verses by shared tokens with the query. Best for users who remember exact wording.

Cosine Similarity

Measures directional similarity between vectors. Better when the memory is partial or approximate.

Inverted Index

Maps each token to the verses that contain it, so candidates are found without scanning the whole corpus. Useful when speed and token presence matter most.

I liked comparing these approaches because they reveal that there is no single perfect search method. Different forms of remembering require different retrieval logic.
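The two token-based strategies above can be sketched together, since an inverted index is a natural way to find candidates for overlap ranking. This is an illustrative sketch under my own assumptions: the names `build_inverted_index` and `overlap_search`, the tie-breaking rule, and the choice to count distinct shared tokens are not taken from the project's code.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of verse indices that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            index[token].add(doc_id)
    return index


def overlap_search(query_tokens, docs, index, top_k=3):
    """Rank verses by how many distinct query tokens they share."""
    query = set(query_tokens)
    # The index narrows the candidates to verses sharing at least one token,
    # so verses with no overlap are never scored.
    candidates = set()
    for token in query:
        candidates |= index.get(token, set())
    scored = [(len(query & set(docs[i])), i) for i in candidates]
    # Most shared tokens first; ties broken by verse order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]


# Sample data only: the poem's opening lines, already cleaned and lowercased.
verses = [
    "trăm năm trong cõi người ta",
    "chữ tài chữ mệnh khéo là ghét nhau",
    "trải qua một cuộc bể dâu",
    "những điều trông thấy mà đau đớn lòng",
]
docs = [v.split() for v in verses]
index = build_inverted_index(docs)
results = overlap_search("chữ mệnh người".split(), docs, index)
```

For the query "chữ mệnh người", the second verse ranks first with two shared tokens and the first verse follows with one. Overlap counting treats every token as equally important, which is exactly the limitation TF-IDF weighting is meant to address.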

Search as memory assistance

Literary search is especially interesting because users rarely remember text perfectly. They may recall two words, a season, an image, or the emotional tone of a line.

That means retrieval is partly a cognitive problem. The system must bridge the gap between imperfect human memory and exact machine representation.

In that sense, a good search engine is not just indexing text. It is helping memory recover itself.

Why Truyện Kiều

Truyện Kiều felt like the right corpus because it is culturally significant, linguistically rich, and often remembered through fragments. Many people know pieces of it without holding the entire text in mind.

That makes it ideal for retrieval. The user often arrives not with a perfect query, but with memory traces. Search systems are most interesting when the query is incomplete.

Reflection

This project changed how I think about search engines. I used to see them as interfaces. Now I see them as layered systems of preprocessing, representation, ranking, and human intention.

If I continue this project, I will explore semantic embeddings, approximate nearest-neighbor retrieval, and richer Vietnamese tokenization methods. But I still appreciate beginning with classical TF-IDF because it makes the logic of search visible.

Kieu Tale may look like a poetry tool, but for me it was really an education in information retrieval.