Projects.

Disinformation Detection via Various DL Models

This was my high school senior project and one of my first "real" ML projects. I benchmarked different ML architectures and models to test how well they could combat fake news and disinformation. My aim was to demonstrate how specialized AI, particularly deep learning models, could not only identify fake news but also potentially outperform human judgement in this critical task.

To achieve this, I developed and utilized several deep learning models. The first step was to prepare a large fake news dataset by combining several smaller datasets found on the internet. Then, I merged news titles and texts, removed words like 'Reuters' that could leak a source's identity and bias the model, and applied lemmatization to standardize words. This preprocessed text was then tokenized and padded to a uniform length, ensuring it was in a numerical format suitable for training. I then built custom deep learning architectures, specifically Long Short-Term Memory (LSTM) networks, both with and without GloVe embeddings, which are designed to process sequential data like text while retaining contextual memory. These models were trained with a binary cross-entropy loss function, with regularization techniques such as early stopping and dropout to prevent overfitting and enhance generalization. Additionally, I fine-tuned a DistilBERT model, a smaller yet powerful transformer-based language model, and utilized pre-trained large language models (LLMs) such as Llama-2 and GPT-4 Turbo to assess their out-of-the-box disinformation detection capabilities on a smaller but diverse sample.
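To give a concrete feel for the setup, here's a minimal sketch of the LSTM baseline, assuming a Keras implementation; the data, dimensions, and hyperparameters are illustrative rather than the project's exact values:

```python
# Minimal sketch of the LSTM baseline in Keras; all values are illustrative.
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["example real title and body"] * 5 + ["example fake title and body"] * 5
labels = np.array([0] * 5 + [1] * 5)  # 0 = real, 1 = fake

MAX_WORDS, MAX_LEN = 20000, 300
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN, padding="post")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(MAX_WORDS, 100),  # load GloVe weights here for the GloVe variant
    tf.keras.layers.LSTM(64, dropout=0.2),      # dropout as regularization
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(  # early stopping as regularization
    monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(X, labels, validation_split=0.2, epochs=20, callbacks=[early_stop])
```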

To establish a human benchmark for comparison, I conducted a series of surveys and quizzes among high school students, evaluating their accuracy in distinguishing real from fake news passages compiled from the same sources I used to train and test my models. After evaluating all AI models on an out-of-distribution (OOD) holdout set, the results indicated that AI generally achieved higher accuracy than human participants. Notably, GPT-4 Turbo performed best with 73.24% accuracy, demonstrating its advanced reasoning and extensive training data. While my custom-built models (LSTM, LSTM + GloVe, and DistilBERT) showed impressive accuracies on their respective test sets (e.g., DistilBERT at 99.43%), their performance dropped significantly on the distinct holdout set (ranging from 58% to 62% accuracy). This discrepancy, specifically the high number of false negatives where real news was incorrectly labeled as fake, suggests that these custom models may have overfitted to the specific characteristics of their training data and struggled to generalize to the newer, more colloquial, and more varied media found in the OOD holdout set.
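For reference, the per-model holdout scoring boiled down to something like the sketch below, where y_true and y_pred stand in for a model's holdout labels and predictions (with 1 = real news, so a false negative is real news flagged as fake):

```python
# Sketch of the holdout scoring; y_true/y_pred are placeholders, not real results.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # 1 = real news, 0 = fake
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 1])  # one model's holdout predictions

acc = accuracy_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"accuracy={acc:.2%}, false negatives (real flagged as fake)={fn}")
```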

This research, while promising for AI's role in combating disinformation, also highlighted key challenges and limitations. The scarcity and varied formats of high-quality, comprehensive disinformation datasets remain a significant obstacle to robust model training and generalization across different information styles (e.g., news articles, tweets, emails). Future work should focus on curating more diverse, multimodal datasets (including images, videos, and audio, where applicable) and continually integrating the latest advancements in generative AI models to build even more adaptable and accurate disinformation detection systems. A lot has changed since my original attempt at this project, and I'm sure the newer LLMs would blow GPT-4 Turbo out of the water.

To read my original paper, written for an audience of high school teachers with little CS or ML background, click the link below.

Read the full paper

Code

Saliency from the Sky

Annually, Triton UAS at UC San Diego participates in the Student UAS Competition, where we compete against over 70 teams globally to accomplish various autonomous missions in the sky, including detecting objects, mapping the area, and flying waypoints, all without the guidance of a living being (unless things go south).

Saliency detection on a generated synthetic image.

This year, I led the development and refactoring of our computer vision system, designed for efficient aerial image processing and target identification. A core component was the implementation and fine-tuning of YOLOv11, which was optimized for salient object detection in diverse top-down aerial datasets and ran via ONNX Runtime for fast inference on our Nvidia Jetson Orin Nano. Our custom preprocessing pipeline performed letterboxing and normalization, while post-processing handled un-letterboxing and confidence-based filtering to derive accurate bounding boxes.
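In Python terms (the production pipeline is C++, and the model path and 640x640 input size here are assumptions), the letterbox-plus-inference flow looks roughly like this:

```python
# Sketch of letterboxing + ONNX Runtime inference; file names are placeholders.
import cv2
import numpy as np
import onnxruntime as ort

def letterbox(img, size=640):
    """Resize while preserving aspect ratio, padding the remainder with gray."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(img, (int(w * scale), int(h * scale)))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)
    top, left = (size - resized.shape[0]) // 2, (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas, scale, (left, top)  # scale/offsets needed later to un-letterbox

session = ort.InferenceSession("model.onnx")  # hypothetical exported YOLO model
img, scale, (dx, dy) = letterbox(cv2.imread("aerial.jpg"))
blob = img[:, :, ::-1].transpose(2, 0, 1)[None].astype(np.float32) / 255.0  # BGR->RGB, HWC->NCHW, [0,1]
outputs = session.run(None, {session.get_inputs()[0].name: blob})
# Un-letterboxing maps each box back: x = (x_pred - dx) / scale, y = (y_pred - dy) / scale.
```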

This detection system was integrated into a comprehensive C++ pipeline. Incoming image data first passed through a custom preprocessor, which handled specific tasks like cropping sensor artifacts. After detection, the pipeline utilized a Ground Sample Distance (GSD) based localizer to convert 2D pixel coordinates of detected objects into precise real-world GPS coordinates, using available flight telemetry. The pipeline then generated annotated images with bounding boxes and metadata for each detected target, including class ID, confidence, and its geo-referenced position.
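The localizer's core math is simple enough to sketch, assuming a nadir-pointing camera and a heading-aligned image; the camera intrinsics below are placeholders, and the real localizer also rotates offsets by the aircraft's heading from telemetry:

```python
# Hedged sketch of GSD-based pixel-to-GPS conversion; intrinsics are placeholders.
import math

def pixel_to_gps(px, py, img_w, img_h, alt_m, lat, lon,
                 focal_mm=8.0, sensor_w_mm=6.17):
    gsd = (sensor_w_mm * alt_m) / (focal_mm * img_w)  # ground sample distance, meters/pixel
    east = (px - img_w / 2) * gsd                     # offset from image center, meters
    north = (img_h / 2 - py) * gsd                    # pixel y grows downward, hence the flip
    dlat = north / 111_320                            # ~meters per degree of latitude
    dlon = east / (111_320 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

print(pixel_to_gps(960, 540, 1920, 1080, alt_m=120.0, lat=32.88, lon=-117.23))
```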

To ensure real-time performance and robustness on our resource-constrained flight computer, an asynchronous aggregation layer was built. This layer managed multiple concurrent CV pipeline runs using a worker thread pool and an overflow queue, which prevented bottlenecks by processing images in batches while new ones arrived. Mutex-protected shared data structures also allowed thread-safe access to aggregated results, including annotated images and collections of detected target bounding boxes and GPS coordinates.
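A Python stand-in for that aggregation layer (ours is C++ with native threads and mutexes; the pipeline stub and pool size are placeholders):

```python
# Rough shape of the worker pool + overflow queue + mutex-protected results.
import queue
import threading

def run_cv_pipeline(frame):
    """Stand-in for detection + localization on one image."""
    return {"frame": frame, "detections": []}

results = []
results_lock = threading.Lock()  # mutex guarding the shared aggregate
overflow = queue.Queue()         # overflow queue: buffers images while workers are busy

def worker():
    while True:
        frame = overflow.get()       # blocks until an image arrives
        out = run_cv_pipeline(frame)
        with results_lock:           # thread-safe aggregation
            results.append(out)
        overflow.task_done()

for _ in range(4):                   # pool size is illustrative
    threading.Thread(target=worker, daemon=True).start()

# Producers simply call overflow.put(image) as new frames come in.
```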

Beyond individual object detection, I also implemented a robust image mapping module for large-scale panoramic stitching of collected images. It supported both a direct, single-pass stitching approach for smaller datasets and a more memory-efficient, multi-pass chunking strategy with overlap, designed for the larger coverage area the competition requires. The mapping process utilized OpenCV's Stitcher API, with custom preprocessing applied prior to stitching for optimal output quality.
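A rough outline of that two-level strategy, with illustrative chunk size and overlap (the real module also applies our custom preprocessing before each pass):

```python
# Sketch of single-pass vs. chunked stitching with OpenCV's Stitcher API.
import cv2

def stitch(images):
    stitcher = cv2.Stitcher.create(cv2.Stitcher_SCANS)  # SCANS mode suits flat, top-down imagery
    status, pano = stitcher.stitch(images)
    if status != cv2.Stitcher_OK:
        raise RuntimeError(f"stitching failed with status {status}")
    return pano

def stitch_chunked(images, chunk=10, overlap=2):
    # First pass: stitch overlapping chunks into partial panoramas;
    # second pass: stitch the partials into the final map.
    partials = [stitch(images[i:i + chunk])
                for i in range(0, len(images), chunk - overlap)]
    return partials[0] if len(partials) == 1 else stitch(partials)
```

Together, these components formed a powerful computer vision system capable of real-time object identification, precise geo-localization, and dynamic environmental mapping in the competition.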

Code

File Order Randomizer for Premiere Pro

While creating a video for a school assignment, I encountered an annoying bottleneck in Premiere Pro: the inability to randomize file order for quick cuts or just creative inspiration. This forced me to rely on cumbersome external methods, such as manually renaming files via the command line or using third-party scripts to pre-process metadata before importing the files into Premiere. These workarounds were inefficient, prone to errors, and severely disrupted the creative flow within Premiere Pro itself.

After some digging, I realized many others faced the same problem. To address this inefficiency for myself and the many others it annoyed, I developed a native Premiere Pro extension, directly accessible within the application's interface. My solution provides an intuitive and seamless way to randomly sort video and image assets, removing the need for external preprocessing and allowing users to instantly generate randomized sequences from selected project items or an entire project.

The core of the extension is built using ExtendScript, Adobe's JavaScript dialect for scripting their applications, and it is integrated via the Adobe Common Extensibility Platform (CEP), which allows for UI development with HTML/CSS/JavaScript and communication with the ExtendScript engine. I extensively utilized their APIs to access and manipulate ProjectItem objects, which represent individual media files, sequences, or project bins. One technical detail involved managing the ticks parameter when inserting clips: calculating precise end points and incorporating user-defined gaps to ensure proper sequential (though randomized) placement without overlaps. A significant portion of the logic also focused on handling project bins (folders) by recursively navigating nested structures to include all contained items, effectively flattening the project hierarchy for comprehensive randomization while preserving original media references. Polyfills were also implemented to ensure broader compatibility across various ExtendScript environments.
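The two core ideas, recursive bin flattening and non-overlapping placement in ticks, translate to a short sketch. I'll use Python here for readability since the real code is ExtendScript; `children` and `duration_ticks` are hypothetical stand-ins for the corresponding ProjectItem properties, while 254,016,000,000 is Premiere's ticks-per-second resolution:

```python
# Illustrative Python sketch of the extension's logic (real code is ExtendScript).
import random

TICKS_PER_SECOND = 254_016_000_000  # Premiere's tick resolution

def flatten(item, out):
    """Recursively collect leaf media items from nested bins, keeping references."""
    children = getattr(item, "children", None)  # stand-in for the bin's child items
    if children:
        for child in children:
            flatten(child, out)
    else:
        out.append(item)

def randomized_placements(root, gap_seconds=0.0):
    """Shuffle all media items, then compute non-overlapping start ticks."""
    items = []
    flatten(root, items)
    random.shuffle(items)  # Fisher-Yates shuffle under the hood
    cursor = 0
    for item in items:
        yield item, cursor  # (clip, start position in ticks)
        cursor += item.duration_ticks + int(gap_seconds * TICKS_PER_SECOND)
```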

The extension has been downloaded over 7,000 times and has a 4.9-star rating on the Adobe Exchange marketplace. Feedback has been positive, with several users, including an Adobe employee, noting how it simplifies what was previously a tedious process. It's helped make randomized editing more accessible without needing workarounds, even 3 years after its initial release.

Link