NLP · Explainable AI · Fake News Detection

coherenteyes

Freedom from unacceptable risk. A SHAP-based interpretable fake news detection system, framed through AI Safety.

A misinformation detection project evaluating classical ML, ensemble, and deep learning models on the WELFake dataset, with SHAP-based explainability analysis and an AI Safety lens throughout.

Dec 2025 · Project · AI Safety / NLP / Explainable AI

Where this started

TLDR

After Cooperative AI, I wanted to go deeper on the technical side of AI Safety. I wrote another independent study proposal, convinced a professor to supervise me, and two peers joined along the way. We chose fake news detection. What the project actually taught me: class imbalance is harder than it looks, data quality matters more than model choice, and logistic regression performing competitively against deep learning architectures was genuinely surprising every single time.

After finishing Cooperative AI in May 2025, I felt something shift. I had spent a semester thinking through multi-agent systems, incentive design, and the conditions under which AI systems could coordinate toward better outcomes. It was exactly the kind of work I loved. But finishing it made me realize there was a gap I wanted to close: I wanted to be stronger on the technical side of AI Safety, not just the conceptual side.

So I did what had worked before. I wrote a full proposal, built out a syllabus, and went back to persuade a professor to supervise me on another independent study. The curriculum was adapted from Boaz Barak's machine learning theory seminar, which gave the study a more rigorous mathematical spine: generalization bounds, statistical learning theory, and the kinds of foundations that make it possible to say something honest about why a model works rather than only that it does.

The independent study

The syllabus drew heavily from Boaz Barak's ML theory seminar, which anchors machine learning in the mathematics of generalization and learnability. I wanted to understand not just how models perform, but why they are allowed to perform the way they do: what guarantees exist, where those guarantees break down, and what that means for safety.

That theoretical grounding changed how I think about evaluation. When you understand generalization bounds, benchmarks stop feeling like endpoints and start feeling like instruments with known limitations. That reframing mattered a lot for how CoherentEyes eventually took shape.

Curriculum reference

Adapted from Boaz Barak's ML Theory Seminar ↗. The seminar covers statistical learning theory, generalization, and the mathematical foundations that underpin modern machine learning.

Two people, one project

Somewhere along the way, two peers heard about what I was working on and wanted in. We spent a while thinking through whether to join forces, how to divide the work, and whether a group project could preserve the intellectual seriousness I cared about in independent study. Eventually we decided it could; if anything, having to defend your thinking to other people who understood the material made the reasoning sharper.

We landed on fake news detection as our shared problem. It sat at exactly the intersection we all cared about: NLP, explainability, and AI Safety as a real-world concern rather than an abstract one. Misinformation is not a hypothetical harm. It has already reshaped elections, accelerated radicalization, and made it harder for people to coordinate around shared facts. That gave the project stakes.

The name CoherentEyes came from wanting something that felt both technical and intentional — the idea of a system that sees clearly and can explain what it sees. The website aesthetic, as it happens, was shaped by something much less deliberate: I was watching The Red Turtle while building it, and the warmth and stillness of that film seeped into every color choice I made. I am not sure that is a coincidence. There is something about a film with no dialogue that makes you think carefully about what communication actually requires.

CoherentEyes

CoherentEyes is a fake news detection system built around a core argument: accuracy is not enough. A model that achieves 98% on a benchmark but cannot explain its reasoning, cannot handle distribution shift, and cannot be audited for bias is not a safe model. It is a confident one, which is a different thing entirely.

The project evaluates a broad range of models (classical, ensemble, and deep learning) on the WELFake dataset. WELFake is a large, balanced corpus designed specifically to address the limitations of earlier fake news benchmarks: imbalanced classes, narrow sourcing, and shallow feature representations. For each category of model, we also applied SHAP to make the decision-making visible and auditable, rather than treating performance as the final word.

CoherentEyes website

The website built to make AI Safety and fake news detection accessible to a broader audience.

Methodology

We evaluated three families of models. Classical models — Logistic Regression, SVM, Decision Tree, KNN, Gaussian Naïve Bayes, Random Forest, XGBoost, and AdaBoost — were trained on TF-IDF representations to establish strong, interpretable baselines. Ensemble methods combined multiple classifiers through majority voting and stacked generalization. Deep learning architectures ranged from CNN and CNN-LSTM hybrids using static GloVe embeddings to BERT-based models that bring contextualized representations into play.
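As a minimal sketch of the classical baseline family, the shape of the pipeline looks like the following. The texts and labels are toy placeholders, not WELFake, and the default vectorizer and classifier settings are illustrative rather than the project's tuned configuration:

```python
# Toy sketch of a classical baseline: TF-IDF features feeding a
# logistic regression classifier. Texts and labels are made up for
# illustration; WELFake is far larger and more varied.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "reuters reports senate passes budget bill",
    "official statement confirms election results",
    "shocking secret they do not want you to know",
    "miracle cure doctors hate revealed today",
]
labels = [0, 0, 1, 1]  # 0 = real, 1 = fake (toy labels)

# One pipeline object: raw text in, class prediction out.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

pred = clf.predict(["senate budget bill passes"])[0]
```

Swapping `LogisticRegression` for any of the other classical models (SVM, Random Forest, XGBoost, and so on) keeps the same pipeline shape, which is part of what made the side-by-side comparison tractable.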

The architecture comparison was never meant to produce a single winner. It was designed to show where each model family's reasoning lives, what lexical or contextual signals it relies on, and where it becomes fragile. That framing shaped the whole project — we were not benchmarking, we were auditing.

Dataset

WELFake: a large, balanced corpus combining word embeddings with rich linguistic features across diverse topics, domains, and writing styles. Selected specifically because it was built to overcome the limitations of earlier fake news benchmarks.

Results

The headline result is that hybrid deep learning architectures outperformed classical models, but not by as much as you might expect. The CNN + LSTM with GloVe embeddings achieved the highest validation accuracy at 98.21%, followed closely by BERT-based models. But the strongest classical models — Random Forest and Logistic Regression — both reached 96.5%, which is a much smaller gap than the complexity difference between those approaches would suggest.

Model                   Accuracy   Note
CNN + LSTM (GloVe)      98.21%     Best overall
BERT + CNN              98.15%     Strong semantic baseline
BERT + CNN + BiLSTM     98.13%
CNN + PCA (GloVe)       98.05%     Efficient tradeoff
CNN (GloVe)             97.14%
Random Forest           96.5%      Best classical model
Logistic Regression     96.5%
AdaBoost                96.2%
What I found more interesting than the numbers was the stability pattern. Most deep learning architectures converged around 98% accuracy regardless of architectural choices. That consistency is either reassuring or suspicious depending on how you read it. It tells you something about the dataset's learnability, but it also raises questions about what these models would do under distribution shift — which is exactly the condition that matters most in real deployment.

SHAP and interpretability

When I started self-learning Explainable AI the summer before, I did not expect it to stick the way it did. What pulled me in was exactly what I described in my Vietnamese legal NLP project: the moment I realized I did not only want models that performed well, but models I could interrogate. SHAP gave me a concrete language for that. It made the question of why a model makes a decision feel as important as whether it gets the answer right.

Applying SHAP to CoherentEyes felt like a natural extension of that curiosity. SHAP (SHapley Additive exPlanations) decomposes each prediction into feature-level contributions grounded in cooperative game theory, where each feature's value reflects its average marginal contribution across all possible subsets of features. It is model-agnostic, which meant we could apply it consistently across every architecture we tested and compare not just performance numbers but the reasoning underneath them. That comparability was the whole point. I wanted to see whether different model families were picking up on the same signals or completely different ones, and what that implied for how much we should trust any of them.
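The subset-averaging definition can be made concrete with a brute-force computation on a toy linear model. All weights and data below are invented for illustration, and missing features are filled with their dataset mean; real SHAP implementations use much faster approximations than this exhaustive loop:

```python
# Brute-force Shapley values for a toy linear model, making the
# "average marginal contribution over all feature subsets" idea
# concrete. Features outside a subset are imputed with their mean.
import itertools
import math
import numpy as np

w = np.array([1.5, -2.0, 0.5])                      # linear model weights
X = np.array([[1., 0., 2.], [0., 1., 0.], [2., 2., 1.]])  # background data
x = np.array([2., 0., 1.])                          # instance to explain
mu = X.mean(axis=0)                                 # background means

def f(mask):
    """Model output with features outside `mask` set to their mean."""
    return float(w @ np.where(mask, x, mu))

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for r in range(n):
        for S in itertools.combinations(others, r):
            mask = np.zeros(n, dtype=bool)
            mask[list(S)] = True
            # Shapley kernel weight: |S|! (n - |S| - 1)! / n!
            weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
            with_i = mask.copy()
            with_i[i] = True
            phi[i] += weight * (f(with_i) - f(mask))

# Additivity: the contributions sum to f(x) minus the baseline output.
```

For a linear model with mean imputation, these brute-force values collapse to w_i(x_i − μ_i), which is exactly why Logistic Regression's SHAP values map so cleanly onto its global coefficients.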

For Logistic Regression, SHAP values mapped directly to the model's global coefficients. Tokens like "taxes," "capitol," and "terrorist" pushed predictions toward the real class, while terms like "terror" and "suspect" contributed toward fake. The pattern made intuitive sense: conventional political reporting uses specific terminology that hyper-partisan content tends to distort or sensationalize. But seeing it in SHAP values made the reasoning legible in a way that accuracy scores never could.

XGBoost told a more nuanced story. Because it is a tree ensemble, SHAP values are local and instance-specific rather than global. A token like "war" could push toward fake in one context and toward real in another, depending on what else surrounded it. That context-sensitivity is what makes XGBoost more expressive, but it also makes the model harder to audit systematically. The word "surveillance" dominated XGBoost's global importance ranking by a wide margin, which is itself a signal worth investigating.

For the deep learning models, SHAP revealed something structurally different: the CNN model placed enormous weight on source-indicating tokens like "reuters ) -" as a signal for real news and URL fragments like "com/" as markers of fake. That is an accuracy shortcut, not a generalizable pattern. A model that classifies news as real because it detects a Reuters byline is not detecting misinformation. It is detecting source identity, which is a brittle heuristic that fails the moment someone spoofs a credible source or strips the byline entirely.
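One way to probe that shortcut is to scrub source-identity markers out of the text before training and check whether accuracy survives. A hedged sketch of that kind of leakage check follows; the regex patterns are illustrative assumptions, not the project's actual preprocessing:

```python
# Strip source-identity markers (wire-service bylines, URL fragments)
# so a model cannot lean on "reuters ) -" style shortcuts. The
# patterns below are illustrative, not exhaustive.
import re

# e.g. "WASHINGTON (Reuters) - " at the start of an article
BYLINE = re.compile(r"^\s*\w+\s*\(\s*reuters\s*\)\s*-\s*", re.IGNORECASE)
# e.g. "example.com/story" fragments embedded in the body
URLISH = re.compile(r"\b\w+\.(com|org|net)\S*", re.IGNORECASE)

def scrub(text: str) -> str:
    text = BYLINE.sub("", text)
    return URLISH.sub(" ", text).strip()

clean = scrub("WASHINGTON (Reuters) - The senate passed the bill.")
```

If a model's accuracy drops sharply on scrubbed text, that is evidence it was classifying source identity rather than content, which is exactly the failure mode the SHAP analysis surfaced.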

SHAP is not a perfect tool. For BERT-based models, the cost of computing exact SHAP values is prohibitive, and the additive independence assumptions underlying SHAP are poorly suited to contextualized embeddings, where every token's contribution depends on every other token. We excluded BERT from the SHAP analysis for exactly those reasons, which is itself an important limitation to name clearly.

A note on data quality

One thing I kept coming back to while working on this project is that the dataset is doing a lot of quiet work that the accuracy numbers do not show. WELFake is better than earlier benchmarks, but fake news detection datasets are fundamentally hard to build well. The labeling process is either expensive and slow when done by experts, or noisy and inconsistent when crowdsourced. Binary labels flatten the actual spectrum of misinformation. And the biggest problem right now is that most existing datasets were built before LLMs could generate convincing fake content at scale, which means models trained on them are already becoming less relevant to the threat they were designed to address. High accuracy on WELFake is a meaningful result, but it is also a result about a particular distribution of content that is shifting under our feet.

AI Safety framing

The AI Safety dimension of this project is not decorative. It is the reason we structured the evaluation the way we did. Real-world fake news detection is a safety-critical problem. A system that confidently suppresses accurate information or allows misinformation to pass through because it learned the wrong heuristics is not just inaccurate — it is actively harmful.

Several safety concerns run through the project directly. Distribution shift is the most pressing: models trained on legacy datasets are already struggling against LLM-generated misinformation that mimics human writing far more convincingly than earlier fake content. Adversarial manipulation is a real threat — if a model's decisions are shaped by a handful of high-weight tokens, sophisticated actors can learn to avoid those tokens. Fairness and bias are harder to audit, but important: misinformation disproportionately targets marginalized communities, and a model that performs well on aggregate may still fail systematically on the cases that matter most.

What the project argues, in the end, is that interpretability is not a nice-to-have in this domain. It is a prerequisite for responsible deployment. A high-accuracy black box is not a safe system; it is just a confidently opaque one.

AI Safety dimensions addressed
Robustness under distribution shift
Interpretability and transparency
Fairness and bias mitigation
Adversarial and manipulation risks
Ethical governance and accountability
Balancing accuracy and safety

Closing thought

CoherentEyes started as an independent study project and became something I genuinely care about. The fake news detection problem is not solved by better models — it is shaped by the conditions under which information spreads, who controls the platforms it moves through, and whether the people affected by those systems have any meaningful way to understand or contest them.

That is why interpretability feels so important to me here. Not because SHAP plots are elegant, but because the alternative is a world where consequential decisions about what counts as true get made by systems no one can inspect. That connects back to the broader questions I think about in AI Safety and Governance: who gets to see inside the system, and who is subject to it without recourse.

I am still thinking about where this goes next. The dataset limitations are real, LLM-generated misinformation is already breaking older detection approaches, and the governance questions are nowhere near settled. But I think the project built something honest: a framework that treats trustworthiness as a design goal rather than an afterthought.

References

Focused references for works cited directly in this project and core methods used.

  • Alshuwaier, F. A., & Alsulaiman, F. A. (2025). Fake news detection using machine learning and deep learning algorithms: A comprehensive review and future perspectives. Computers, 14(9), 394. doi.org/10.3390/computers14090394
  • Bereska, L., & Gavves, E. (2024). Mechanistic interpretability for AI safety: A review. TMLR. arxiv.org/abs/2404.14082
  • Bubeck, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. ArXiv. doi.org/10.48550/arXiv.2303.12712
  • Carlsmith, J. (2021). Is power-seeking AI an existential risk? arxiv.org/pdf/2206.13353
  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. doi.org/10.1145/2939672.2939785
  • Hendrycks, D., et al. (2021). Aligning AI with shared human values. ICLR 2021. arxiv.org/pdf/2008.02275
  • Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. ArXiv. arxiv.org/abs/1705.07874
  • Lundberg, S. M., et al. (2019). Explainable AI for trees: From local explanations to global understanding. ArXiv. arxiv.org/abs/1905.04610
  • Machová, K., et al. (2025). Analysis of the effect of attention mechanism on the accuracy of deep learning models for fake news detection. Big Data and Cognitive Computing, 9(9), 230. doi.org/10.3390/bdcc9090230
  • Molnar, C. (2019). Interpretable machine learning: A guide for making black box models explainable.
  • Morris, J. X., et al. (2020). Reevaluating adversarial examples in natural language. Findings of the Association for Computational Linguistics: EMNLP 2020, 3829–3839. doi.org/10.18653/v1/2020.findings-emnlp.341
  • Mouratidis, D., et al. (2025). From misinformation to insight: Machine learning strategies for fake news detection. Information, 16(3), 189. doi.org/10.3390/info16030189
  • Perez, E., et al. (2022). Red teaming language models with language models. ArXiv. doi.org/10.48550/arXiv.2202.03286
  • Põldvere, N., et al. (2023). The PolitiFact-Oslo Corpus: A new dataset for fake news analysis and detection. Information, 14(12), 627. doi.org/10.3390/info14120627
  • Qazi, I. A., et al. (2025). Scaling truth: The confidence paradox in AI fact-checking. ArXiv. arxiv.org/abs/2509.08803
  • Ribeiro, M. T., et al. (2016). "Why should I trust you?": Explaining the predictions of any classifier. ArXiv. arxiv.org/abs/1602.04938
  • Rogers, A., et al. (2020). A primer in BERTology: What we know about how BERT works. ArXiv. doi.org/10.48550/arXiv.2002.12327
  • Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206–215. arxiv.org/abs/1811.10154
  • Sallami, D., et al. (2024). From deception to detection: The dual roles of large language models in fake news. ArXiv. arxiv.org/abs/2409.17416
  • Shu, K., et al. (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22–36. doi.org/10.1145/3137597.3137600
  • Shu, K., et al. (2020). Mining disinformation and fake news: Concepts, methods, and recent advancements. ArXiv. arxiv.org/abs/2001.00623
  • Tian, Y., et al. (2025). An empirical comparison of machine learning and deep learning models for automated fake news detection. Mathematics, 13(13), 2086. doi.org/10.3390/math13132086
  • Tong, Z., et al. (2025). Generate first, then sample: Enhancing fake news detection with LLM-augmented reinforced sampling. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 24276–24290. doi.org/10.18653/v1/2025.acl-long.1182
  • Verma, P. K., et al. (2021). WELFake: Word embedding over linguistic features for fake news detection. IEEE Transactions on Computational Social Systems, 8(4), 881–893. doi.org/10.1109/tcss.2021.3068519
  • Xu, X., et al. (2025). A hybrid attention framework for fake news detection with large language models. ArXiv. arxiv.org/abs/2501.11967
  • Zhou, X., & Zafarani, R. (2020). A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Computing Surveys, 53(5). doi.org/10.1145/3395046