Just as generative AI was exploding into the mainstream, I undertook a rigorous research project to benchmark how different deep learning architectures handle disinformation. The goal wasn't just to build a classifier, but to systematically evaluate how specialized architectures (like LSTMs) compared to emerging transformer models at detecting nuanced fake news.
I engineered an end-to-end evaluation pipeline, starting with a custom dataset curated from multiple open sources. I handled the full preprocessing stack, including lemmatization, tokenization, and bias removal, before training custom LSTM networks (with and without GloVe embeddings) and fine-tuning a DistilBERT transformer. To rigorously test generalization, I evaluated these models against a distinct out-of-distribution holdout set and compared them against a human control group.
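For readers curious what the transformer fine-tuning step roughly looked like, here's a minimal sketch using Hugging Face's `transformers` and `datasets` libraries. The file name, column names, and hyperparameters below are illustrative placeholders, not the exact ones from the project.

```python
# Minimal sketch: fine-tuning DistilBERT for binary fake/real news classification.
# "fake_news_train.csv", its columns, and the hyperparameters are hypothetical.
import pandas as pd
from datasets import Dataset
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Load a curated CSV with "text" and "label" columns (0 = real, 1 = fake).
df = pd.read_csv("fake_news_train.csv")
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate/pad articles to a fixed length so they can be batched.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="distilbert-fakenews",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```

The same held-out split logic extends naturally to the out-of-distribution evaluation: the key is that the holdout articles come from sources the model never saw during training.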
The results were telling: while my fine-tuned DistilBERT model achieved 99.43% accuracy on familiar data, it struggled with the domain shift in the holdout set, dropping to ~60%. However, I also benchmarked GPT-4 Turbo (a novelty at the time), which achieved 73.24% zero-shot accuracy on the holdout set, significantly outperforming human participants. This project was my first deep dive into the "reality gap" in ML, exposing the difference between high test/train accuracy and actual model robustness in the wild.
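The zero-shot benchmark itself is simple to reproduce in spirit. Below is an illustrative sketch using the OpenAI Python SDK; the model string, prompt wording, and holdout format are assumptions for demonstration, not the exact setup from the paper.

```python
# Illustrative zero-shot fake-news benchmark with an LLM.
# Model name, prompt, and data format are assumptions, not the originals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_zero_shot(article_text: str) -> str:
    """Ask the model for a single FAKE/REAL verdict, with no examples given."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Answer with exactly one word: FAKE or REAL."},
            {"role": "user", "content": f"Is the following news article fake or real?\n\n{article_text}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()

def holdout_accuracy(holdout):
    """Accuracy over an out-of-distribution holdout set of (text, label) pairs."""
    correct = sum(classify_zero_shot(text) == label for text, label in holdout)
    return correct / len(holdout)
```
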
To read the original paper, written for general readers with little CS or ML background, click the link below.