Xavier Thomas (Rohan)

👋 Hi! I’m currently a grad student at Boston University. I completed my undergrad at Manipal Institute of Technology, India, where I developed a keen interest in all things ML during my first year and gained research experience along the way. Prior to joining BU, I worked with the Content and User Understanding team at ShareChat, and was fortunate to work on projects with the Serre Lab (Brown University), the Human Dynamics Group (MIT Media Lab, Massachusetts Institute of Technology), ÉTS Montréal, and FOR.ai (now Cohere For AI).

At BU, I am fortunate to be advised by Prof. Deepti Ghadiyaram, and I’m currently exploring topics in computer vision, with a broad interest in representation learning and generative models.

CV / Email Me

Education

Ph.D. in Computer Science 2025 – Present
M.S. in Artificial Intelligence 2023 – 2025
Boston University
B.Tech. in Electronics and Instrumentation 2018 – 2022
Manipal Institute of Technology · Minor in Computational Intelligence

Research

Semantic Richness or Geometric Reasoning? The Fragility of VLM’s Visual Invariance

TLDR Humans can tell whether two shapes match under rotation or scaling, even without naming what they are. Leading MLLMs cannot. Accuracy holds on photos but collapses on sketches and rare scripts, where semantic cues become sparser.

Jason Qiu*, Zachary Meurer*, Xavier Thomas*, Deepti Ghadiyaram
Preprint 2026
Generative Action Tell-Tales: Assessing human motion in synthesized videos

TLDR Humans can spot the wrong action or implausible motion in a generated video. MLLMs and existing metrics cannot. We learn a human-centric representation of real movement and score how far a generated clip deviates from realistic action.

Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram
CVPR VGBE Workshop 2026 Oral
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

TLDR Ask people what they hear in a video, and they listen. Ask what they see, and they look. MLLMs ignore that distinction when audio, video, and captions disagree. We build benchmarks with deliberate conflicts and fine-tune models to answer from the modality being asked about.

Tianle Chen*, Chaitanya Chakka*, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram
CVPR Findings 2026 · CVPR Sight and Sound Workshop 2026 Oral
What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

TLDR We recognize a dog across photos, cartoons, sketches, and paintings, but classifiers often fail when visual style shifts. Diffusion latents already separate these style domains without labels, so we use them as pseudo-domain features to help classifiers generalize to unseen domains, even beating methods trained with ground-truth domain tags.

Xavier Thomas, Deepti Ghadiyaram
Revelio: Interpreting and leveraging semantic information in diffusion models

TLDR Generative models produce a rich world of images, but what do they encode internally, and how is that world represented? We uncover interpretable semantic features at specific layers and timesteps with sparse autoencoders, and show they transfer to classification and other vision tasks through lightweight probes.

Dahye Kim*, Xavier Thomas*, Deepti Ghadiyaram
ICCV 2025 · CVPR MIV Workshop 2025 Oral
Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

TLDR A painter sketches the scene first, then adds objects, then texture. Text-to-image models try to do all of that from one prompt. We break the prompt into the same coarse-to-fine steps and schedule them across denoising so the final image actually reflects what was asked for.

Ketan Suhaas Saichandran*, Xavier Thomas*, Prakhar Kaushik, Deepti Ghadiyaram
CVPR AI4CC Workshop 2025 Oral
Diversity vs. Recognizability: Human-like generalization in one-shot generative models

TLDR Given one handwritten character, people draw new examples that look like the same letter but vary in stroke. FID and likelihood miss this tradeoff, so we measure recognizability and diversity separately and compare one-shot models to human samples on Omniglot.

Victor Boutin, Lakshya Singhal, Xavier Thomas, Thomas Serre
NeurIPS 2022
MAViC: Multimodal Active Learning for Video Captioning

TLDR Video captioning requires a human-written caption for every clip. We rank unlabeled videos by multimodal uncertainty and caption semantics so annotators label the clips that most improve the model.

Gyanendra Das, Xavier Thomas, Anant Raj, Vikram Gupta
Preprint 2022
Adaptive Methods for Aggregated Domain Generalization

TLDR Training data often arrives mixed across domains with no domain tags. We cluster samples into pseudo-domains and train a classifier on both the image and its cluster, reaching performance competitive with domain-supervised methods.

Xavier Thomas, Dhruv Mahajan, Alex Pentland, Abhimanyu Dubey
Preprint 2021

For more see Google Scholar

Experience

Boston University
Graduate Researcher
Boston University
Jun 2024 – Present
  • Vision in Multimodal Large Language Models (MLLMs): Investigating limitations of visual understanding in MLLMs and developing methods to improve cross-modal alignment for robust multimodal reasoning.
  • Evaluation of Video Generation Models: Designing and implementing novel evaluation metrics to assess human action fidelity, temporal consistency, and motion coherence in generative video models.
  • Internal Representations of Diffusion Models: Analyzing diffusion models as representation learners by probing their intermediate states; demonstrating their effectiveness for downstream tasks such as classification, multimodal reasoning, and domain generalization.
ShareChat
Machine Learning Engineer
ShareChat | Content and User Understanding Team
Jul 2022 – Jun 2023
Integrated advanced computer vision pipelines into production, improving content classification and moderation capabilities on ShareChat (180M+ MAUs) and Moj (160M+ MAUs).
Brown University
Research Intern
Serre Lab, Brown University
Sep 2021 – May 2022
Developed a novel evaluation framework for one-shot generative models, introducing new metrics for recognizability (human interpretability) and diversity (concept coverage) to enable systematic comparisons. Benchmarked four representative generative architectures against human performance on the Omniglot dataset.
MIT Media Lab
Research Assistant
MIT Media Lab
Jan 2021 – Nov 2021
Created a novel algorithm for privacy-preserving domain generalization that recovers domain information by removing class-specific noise from latent features, enabling the training of robust, domain-adaptive classifiers. Outperformed state-of-the-art methods that require domain supervision on multiple benchmarks.
ÉTS Montréal
Mitacs Globalink Research Intern
École de technologie supérieure (ÉTS), Montréal
Jul 2021 – Sep 2021
Extended sub-category exploration methods for Weakly Supervised Semantic Segmentation by clustering image features to generate more accurate pseudo-labels. Designed novel constraint-based refinements to enhance object localization in Class Activation Maps (CAMs), improving mIoU scores on PASCAL VOC 2012.
Advisor: Dr. Jose Dolz
FOR.ai
Researcher
FOR.ai (now Cohere For AI)
Oct 2020 – Aug 2021
Contributed to a large-scale benchmarking study of Out-of-Distribution (OOD) detection in computer vision models, establishing baselines for evaluating robustness under distribution shifts. Collaborated with researchers from Google Brain, University of Oxford, and Vector Institute.
Advisor: Sheldon Huang