Xavier Thomas (Rohan)

👋 Hi! I’m currently a grad student at Boston University. I completed my undergrad at Manipal Institute of Technology, India, where I developed a keen interest in all things ML during my first year and gained research experience along the way. Prior to joining BU, I worked with the Content and User Understanding team at ShareChat, and I was fortunate to work on projects with the Serre Lab (Brown University), the Human Dynamics Group (MIT Media Lab), ÉTS Montréal, and FOR.ai (now Cohere For AI).

At BU, I am fortunate to be advised by Prof. Deepti Ghadiyaram, and I’m currently exploring topics in computer vision, with a broad interest in representation learning and generative models.

CV Email

Education

Ph.D. in Computer Science 2025 – Present

M.S. in Artificial Intelligence 2023 – 2025

Boston University

B.Tech. in Electronics and Instrumentation 2018 – 2022

Manipal Institute of Technology · Minor in Computational Intelligence

News

Jul 2026 Paper

Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance accepted to COLM 2026.

Jun 2026 Paper

Generative Action Tell-Tales: Assessing human motion in synthesized videos accepted to ECCV 2026.

Apr 2026 Talk

Presented Generative Action Tell-Tales: Assessing human motion in synthesized videos at the ML Collective DLCT reading group.

Mar 2026 Oral

Generative Action Tell-Tales: Assessing human motion in synthesized videos accepted as an oral at VGBE and PhysHuman workshops, CVPR 2026.

Feb 2026 Paper

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs accepted to CVPR 2026 Findings.

Nov 2025 Talk

Presented Generative Action Tell-Tales: Assessing human motion in synthesized videos at NECV 2025 (oral).

Sep 2025 PhD

Started my PhD at Boston University, advised by Prof. Deepti Ghadiyaram.

Jun 2025 Mentor

Mentoring a high school student in the BU RISE Research Internship Program.

Jun 2025 Paper

What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization and Revelio: Interpreting and leveraging semantic information in diffusion models were both accepted at ICCV 2025.

May 2025 Paper

What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization accepted at the VisCon Workshop, CVPR 2025.

Apr 2025 Oral

Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models accepted as an oral at the AI4CC Workshop, CVPR 2025.

Mar 2025 Oral

Revelio: Interpreting and leveraging semantic information in diffusion models accepted as an oral at the MIV Workshop, CVPR 2025.

Research

2026

Semantic Richness or Geometric Reasoning? The Fragility of VLM’s Visual Invariance

TLDR Humans can tell whether two shapes are the same under rotation or scale, even without recognizing what the object is. Leading vision language models cannot. They need a familiar visual context to reason spatially.

Jason Qiu*, Zachary Meurer*, Xavier Thomas*, Deepti Ghadiyaram

COLM 2026

Webpage Paper

Generative Action Tell-Tales: Assessing human motion in synthesized videos

TLDR Humans can immediately spot a generated video where the action is wrong or the movement looks off. AI metrics give it a perfect score anyway. We build a metric grounded in how real people move so that bad motion actually gets flagged.

Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram

ECCV 2026 · CVPR VGBE Workshop 2026 (Oral)

Webpage Paper

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

TLDR If you ask what you hear in a video, people listen. Ask what you see, and they look. Multimodal models often don't make that distinction. We put audio, video, and text in direct conflict and train models to answer from the sense the question actually asks about.

Tianle Chen*, Chaitanya Chakka*, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram

CVPR Findings 2026 · CVPR Sight and Sound Workshop 2026 (Oral)

Webpage Paper

2025

What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

TLDR We recognize a dog whether it is a photo, cartoon, sketch, or painting. Classifiers often fail when the visual style shifts. Diffusion model features naturally group images by style without any labels, and plugging them into a classifier improves generalization to new visual styles.

Xavier Thomas, Deepti Ghadiyaram

ICCV 2025

Webpage Paper Code

Revelio: Interpreting and leveraging semantic information in diffusion models

TLDR Generative models produce rich, detailed images, but what do they actually understand internally? We open them up, find that they encode meaningful visual concepts in specific layers, and show those internal representations are useful for other vision tasks.

Dahye Kim*, Xavier Thomas*, Deepti Ghadiyaram

ICCV 2025 · CVPR MIV Workshop 2025 (Oral)

Webpage Paper Code

Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

TLDR A painter sketches the scene first, then fills in objects, then texture. Text to image models try to do all of that from a single prompt at once. We split the prompt into those same steps and feed them in progressively during generation, so the output actually matches the description.

Ketan Suhaas Saichandran*, Xavier Thomas*, Prakhar Kaushik, Deepti Ghadiyaram

CVPR AI4CC Workshop 2025 Oral

Webpage Paper Code

2022

Diversity vs. Recognizability: Human-like generalization in one-shot generative models

TLDR Given one handwritten character, people draw new versions that are still recognizable but not identical. Existing metrics for generative models miss this balance entirely. We measure recognizability and diversity as separate axes and test how closely models match what humans actually produce.

Victor Boutin, Lakshya Singhal, Xavier Thomas, Thomas Serre

NeurIPS 2022

Paper Code

MAViC: Multimodal Active Learning for Video Captioning

TLDR Training a video captioning model requires a human written caption for every clip, which is expensive. We rank which unlabeled videos are most worth annotating, so the model learns faster with fewer labels.

Gyanendra Das, Xavier Thomas, Anant Raj, Vikram Gupta

Preprint 2022

Paper

2021

Adaptive Methods for Aggregated Domain Generalization

TLDR Training images often come from many different sources with no labels telling the model which is which. We automatically group them by visual style and train classifiers that adapt to each group, matching methods that had those source labels to begin with.

Xavier Thomas, Dhruv Mahajan, Alex Pentland, Abhimanyu Dubey

Preprint 2021

Paper Code

Full list on Google Scholar

Experience

Graduate Researcher

Boston University

Jun 2024 – Present

Vision in Multimodal Large Language Models (MLLMs): Investigating limitations of visual understanding in MLLMs and developing methods to improve cross-modal alignment for robust multimodal reasoning.
Evaluation of Video Generation Models: Designing and implementing novel evaluation metrics to assess human action fidelity, temporal consistency, and motion coherence in generative video models.
Internal Representations of Diffusion Models: Analyzing diffusion models as representation learners by probing their intermediate states; demonstrating their effectiveness for downstream tasks such as classification, multi-modal reasoning, and domain generalization.

Advisor: Prof. Deepti Ghadiyaram

Machine Learning Engineer

ShareChat | Content and User Understanding Team

Jul 2022 – Jun 2023

Integrated advanced computer vision pipelines into production, improving content classification and moderation capabilities on ShareChat (180M+ MAUs) and Moj (160M+ MAUs).

Research Intern

Serre Lab, Brown University

Sep 2021 – May 2022

Developed a novel evaluation framework for one-shot generative models, introducing new metrics for recognizability (human interpretability) and diversity (concept coverage) to enable systematic comparisons. Benchmarked 4 representative generative architectures against human performance on the Omniglot dataset.

Advisors: Dr. Victor Boutin, Prof. Thomas Serre

Research Assistant

MIT Media Lab

Jan 2021 – Nov 2021

Created a novel algorithm for privacy-preserving domain generalization that recovers domain information by removing class-specific noise from latent features, enabling the training of robust, domain-adaptive classifiers. Outperformed state-of-the-art methods that require domain supervision on multiple benchmarks.

Advisor: Dr. Abhimanyu Dubey

Mitacs Globalink Research Intern

École de technologie supérieure (ÉTS), Montréal

Jul 2021 – Sep 2021

Extended sub-category exploration methods for Weakly Supervised Semantic Segmentation by clustering image features to generate more accurate pseudo-labels. Designed novel constraint-based refinements to enhance object localization in Class Activation Maps (CAMs), improving mIoU scores on PASCAL VOC 2012.

Advisor: Dr. Jose Dolz

Researcher

FOR.ai (now Cohere For AI)

Oct 2020 – Aug 2021

Contributed to a large-scale benchmarking study of Out-of-Distribution (OOD) detection in computer vision models, establishing baselines for evaluating robustness under distribution shifts. Collaborated with researchers from Google Brain, University of Oxford, and Vector Institute.

Advisor: Sheldon Huang
