Hypothesis Testing for Explainable Vietnamese Legal Relation Classification
A Vietnamese legal NLP project on PhoBERT, robustness under wording perturbations, and whether an AI model can learn meaningful legal patterns rather than memorizing surface phrasing.

This project sits at the intersection of two things I genuinely enjoy: NLP and Explainable AI. At the technical level, it fine-tunes PhoBERT to classify legal reference relations in Vietnamese documents. But for me, the more interesting question was never just whether the model scored well. It was whether the model was learning something meaningful, or whether it was only getting good at pattern matching on the surface.
Vietnamese legal documents are highly structured, densely intertextual, and difficult to process manually at scale. Legal texts routinely reference other laws, decrees, circulars, and amendments, creating a complex network of relations that is tedious for human readers to trace. A computational approach offers a more efficient and systematic way to organize and analyze these relationships, and this project is one attempt at building that.
Why Explainable AI stayed with me
I started teaching myself Explainable AI in Summer 2025, and I ended up liking it much more than I expected. What pulled me in was that it made machine learning feel less like a black box and more like something I could interrogate. It pushed me to ask harder questions: why did the model make this decision, what signal is it relying on, and should I trust that reasoning in the first place?
That was the moment I realized I do not only enjoy models that perform well. I also care a lot about models that can be inspected, challenged, and understood. That shift has shaped how I think about AI ever since, and this project is a direct product of it.
Research question
The main research question is: can an AI model accurately classify legal reference relations in Vietnamese legal documents, and do its predictions remain stable under legal wording perturbations?
The deeper motivation is a robustness hypothesis. The question is not simply whether the model predicts the right label, but whether that prediction holds when the wording of a legal sentence is modified without substantially changing its meaning. Was the model learning legal relation patterns, or was it only memorizing superficial phrasing? That distinction is what makes the project feel worth doing to me.
Type of research and approach
The project uses a quantitative and computational research design, with hypothesis testing as a central methodological component. The hypotheses are stated formally as follows:
H₀: The model's prediction stability under legal wording changes is not better than chance.
H₁: The model's prediction stability under legal wording changes is better than chance.
These are tested through a one-sided binomial test applied after the model is evaluated on perturbed and counterfactual examples. Hypothesis testing is not an optional add-on here. It is integral to the study's goal of determining whether the model behaves consistently when surface form changes but meaning does not.
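As a minimal sketch of what such a one-sided binomial test could look like with SciPy, the snippet below checks whether the observed stability rate exceeds a chance baseline. The counts (`n_perturbed`, `n_stable`) and the 1/6 chance rate (one label picked uniformly from six relation classes) are illustrative assumptions, not the project's actual figures.

```python
from scipy.stats import binomtest

# Hypothetical counts: of n perturbed/counterfactual examples,
# k kept the same predicted label as their unperturbed originals.
n_perturbed = 200   # assumed number of perturbed examples
n_stable = 172      # assumed number with unchanged predictions

# One-sided test against a chance-level stability rate. With six
# relation classes, a uniform baseline of 1/6 is one reasonable choice.
chance_rate = 1 / 6
result = binomtest(n_stable, n_perturbed, p=chance_rate, alternative="greater")

reject_h0 = result.pvalue < 0.05
print(f"p-value = {result.pvalue:.3e}, reject H0: {reject_h0}")
```

The choice of baseline matters: testing against 1/6 asks only whether stability beats random relabeling, while a stricter baseline (say, the majority-class rate) would make rejection harder to achieve.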
The design is quantitative because it relies on labeled data, numerical evaluation metrics, and statistical reasoning. It is computational because it uses natural language processing and deep learning to automatically classify Vietnamese legal text. More specifically, the project adopts a supervised classification approach using PhoBERT, a pretrained Vietnamese language model, fine-tuned on a legal dataset.
Data and method
The main data source is the VNLegalText dataset, a publicly available collection of Vietnamese legal documents annotated for legal references and relation labels. The dataset comprises 5,031 preprocessed Vietnamese legal documents with annotated reference entities and relations. Documents are processed through a script that identifies tagged legal references and reconstructs their surrounding sentence context, converting each example into a model-ready input consisting of a legal sentence paired with a target relation label.
After extraction, sparse labels are grouped into reduced categories to make the classification task more statistically feasible. The dataset is then split into training, validation, and test sets using stratified sampling to ensure balanced label distribution across all subsets.
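The stratified split described above can be sketched with scikit-learn's `train_test_split`. The sentences, label codes, and 70/10/20 proportions here are placeholder assumptions for illustration; only the stratification mechanism mirrors the project.

```python
from sklearn.model_selection import train_test_split

# Hypothetical (sentence, label) pairs after label grouping:
# 100 examples spread evenly over four reduced categories.
sentences = [f"câu {i}" for i in range(100)]
labels = [i % 4 for i in range(100)]

# First carve out the test set, then split the remainder into
# train/validation; stratify keeps label proportions in every subset.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    sentences, labels, test_size=0.2, stratify=labels, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.125, stratify=y_tmp, random_state=42
)
# Resulting split: 70% train, 10% validation, 20% test.
```

Stratifying both splits is what guarantees that even the rarer relation labels appear in the validation and test sets at roughly their original rates.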
The technical stack includes the Hugging Face Transformers library for loading and fine-tuning PhoBERT as a sequence classifier, PyTorch for model training, and scikit-learn for computing evaluation metrics. SciPy is used to conduct the binomial hypothesis test, while pandas and NumPy handle data processing and numerical computation.
Beyond the original dataset, the project generates additional evaluation inputs through perturbation and counterfactual rewriting. Selected legal expressions are replaced or reformatted while preserving the original meaning as closely as possible. This step transforms model evaluation from a purely performance-based assessment into a hypothesis-driven inquiry.
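A heavily simplified sketch of the perturbation idea is shown below: swap a legal expression for a near-synonymous one while leaving the rest of the sentence untouched. The substitution table is hypothetical (the paired phrases are assumed to be interchangeable in context), and in practice each rewritten example still needs a manual meaning-preservation check.

```python
# Hypothetical substitution table: each pair is assumed to be
# interchangeable in legal context without changing meaning.
SUBSTITUTIONS = {
    "quy định tại": "được nêu tại",  # "stipulated in" -> "stated in"
    "căn cứ": "dựa trên",            # "pursuant to" -> "based on"
}

def perturb(sentence: str) -> str:
    """Return a surface-level rewrite of a legal sentence.

    Applies the first matching substitution once; sentences with
    no matching expression are returned unchanged.
    """
    for original, replacement in SUBSTITUTIONS.items():
        if original in sentence:
            return sentence.replace(original, replacement, 1)
    return sentence

example = "Căn cứ quy định tại Điều 5 của Luật này"
print(perturb(example))
```

Even this toy version makes the evaluation question concrete: if the model's label flips on the rewritten sentence, it was leaning on phrasing rather than on the relation itself.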
Results
The results provide sufficient evidence to reject the null hypothesis. The fine-tuned PhoBERT model achieved an overall accuracy of 97.29% and a weighted F1 score of 0.9729 across 5,875 test samples, with consistent performance across all six relation classes. The macro F1 score of 0.9497 further indicates that this performance is not an artifact of class imbalance: strong results were observed even on less frequent labels such as HHL (F1 = 0.893) and BTT (F1 = 0.930).
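The reported accuracy and weighted/macro F1 scores follow the standard scikit-learn definitions, which the toy example below reproduces on made-up labels (the arrays here are illustrative, not the project's test set):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy gold and predicted labels standing in for the real test set.
y_true = [0, 0, 1, 1, 2, 2, 2, 3]
y_pred = [0, 0, 1, 2, 2, 2, 2, 3]

accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class support
macro_f1 = f1_score(y_true, y_pred, average="macro")        # treats every class equally
print(accuracy, weighted_f1, macro_f1)
```

Reporting macro F1 alongside weighted F1 is exactly what lets the rarer labels (like HHL and BTT) pull the score down if the model neglects them, which is why the two being close is meaningful.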

Figures 1 & 2: Model performance summary, confusion matrix across legal relation labels, and per-class precision, recall, and F1.
This outcome supports the alternative hypothesis: the model's predictions remain significantly more stable than chance when the surface wording of Vietnamese legal text is modified. The rejection of the null hypothesis suggests that PhoBERT, fine-tuned on the VNLegalText dataset, has learned meaningful representations of legal relation patterns rather than superficial textual features.
What I found interesting
What I liked most about this project is that it moved evaluation a little closer to hypothesis testing. Instead of stopping at accuracy, I could ask whether the model behaved more robustly than chance under wording changes. That felt intellectually much richer. It made the project less about scoreboard numbers and more about whether the model had actually learned something real.
The perturbation and counterfactual rewriting step was also more interesting to build than I expected. Having to think carefully about which parts of a legal sentence could be changed without altering its meaning forced me to read the legal texts much more closely than I would have otherwise.
Why this matters to me now
Looking back, this project reflects a pattern in what I naturally gravitate toward. I like technical work, but I am usually most interested in the layer underneath it: what a model is really doing, how we know, and what counts as trustworthy evidence. Explainable AI gave me an entry point into that kind of thinking, and this project became one of the places where that curiosity felt concrete.
It also sits within a growing body of work on Vietnamese legal AI. Previous work on reference extraction from Vietnamese legal documents provided the framing for this project, and more recent work like the LegalSLM Shared Task places it within the broader development of legal reasoning and domain-specific evaluation in Vietnamese NLP. I want to keep working in this space, because I think there is a lot of room for technically grounded, politically aware research on legal systems that are still underrepresented in the AI literature.
Closing thought
I still think Explainable AI is one of the most intellectually attractive areas in machine learning. It asks us not to be satisfied too quickly. It asks us to look again, to probe the model more carefully, and to think about whether performance is actually telling the whole story. That is probably why I liked it so much when I first learned it, and why I keep coming back to it.
Notebook
The full Python notebook with data processing, model training, perturbation experiments, and evaluation is available on Google Colab.
View notebook →
References
Focused references for works cited directly in this project and core methods used.
- MLA Lab. (n.d.). VNLegalText. GitHub. github.com/mlalab/VNLegalText
- Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037–1042). Association for Computational Linguistics. aclanthology.org/2020.findings-emnlp.92
- VinAI. (n.d.). vinai/phobert-base. Hugging Face. huggingface.co/vinai/phobert-base
- Bach, N. X., Thuy, N. T. T., Chien, D. B., Duy, T. K., Hien, T. M., & Phuong, T. M. (2020). Reference extraction from Vietnamese legal documents. In Proceedings of the 10th International Symposium on Information and Communication Technology. ACM. dl.acm.org/doi/pdf/10.1145/3368926.3369731
- Le, A.-C., Duong, T.-C., Nguyen, V.-H., & Le, T. V. Q. (2025). Overview of the LegalSLM Shared Task: Evaluating legal reasoning of Vietnamese small language models. In Proceedings of the 11th International Workshop on Vietnamese Language and Speech Processing (pp. 147–152). Association for Computational Linguistics. aclanthology.org/2025.vlsp-1.21