Manan Dey

At Salesforce, I lead DX Workspaces — a platform that provides isolated, sandboxed cloud environments for hosting untrusted workloads. My team builds the infrastructure powering Agentforce Vibes IDE, Codey Studio, and Vibes as a Service. Previously at Salesforce, I led the AI Data Seeding and Simple Deploy in DX Inspector initiatives.

I love contributing to open-source AI research. I contributed to the BigScience workshop, where I worked on BLOOM, and to initiatives such as BigCode, MTEB, and Google BIG-bench. I was a primary contributor to the Data Provenance Initiative, focusing on consent and provenance tracking in AI training data.

My research interests include large language models, multilingual NLP, bias evaluation, and data provenance.

Email / Google Scholar / Twitter / GitHub / LinkedIn

Research

I'm interested in machine learning, deep learning, and natural language processing. My work focuses on large language models, multilingual NLP, bias evaluation and mitigation, and prompt engineering.

MVEB: Massive Video Embedding Benchmark
AE Assadi, R Solomatin, I Chung, and many others including Manan Dey
arXiv, 2026
arXiv

Large-scale benchmark for evaluating video embedding models across diverse tasks and datasets.

MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen, Isaac Chung, and 60+ authors including Manan Dey
ICLR, 2025
arXiv

Comprehensive multilingual text embedding benchmark evaluating models across 168 datasets spanning 113 languages.

Bridging the Data Provenance Gap Across Text, Speech, and Video
Shayne Longpre, Nikhil Singh, and many others including Manan Dey
ICLR, 2025
arXiv

Investigates critical gaps in data provenance tracking across modalities and proposes frameworks for responsible AI data curation.

SHADES: Towards a Multilingual Assessment of Stereotypes in Large Language Models
Margaret Mitchell, Giuseppe Attanasio, and 30+ authors including Manan Dey
NAACL, 2025
paper

Multilingual framework for systematically assessing social stereotypes in LLMs across diverse languages and cultural contexts.

Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, and many others including Manan Dey
NeurIPS, 2024
arXiv

Documents the rapid decline of the AI data commons, with over 25% of high-quality sources restricting AI access since 2023.

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
Hailin Chen, Fangkai Jiao, and 10+ authors including Manan Dey
arXiv, 2024
arXiv

Benchmark for testing LLMs' ability to generate compositional structured outputs requiring multi-step reasoning.

StarCoder 2 and The Stack v2: The Next Generation
Anton Lozhkov, Raymond Li, and 50+ authors including Manan Dey
arXiv, 2024
arXiv

Second-generation code LLM trained on The Stack v2 with advanced filtering and deduplication techniques.

StarCoder: may the source be with you!
Raymond Li, Loubna Ben Allal, and 60+ authors including Manan Dey
Transactions on Machine Learning Research (TMLR), 2024
project page / arXiv / code / model / demo

Open-source 15.5B-parameter code generation model trained on 1 trillion tokens from 80+ programming languages with fill-in-the-middle capabilities.

SantaCoder: don't reach for the stars!
Loubna Ben Allal, Raymond Li, and 30+ authors including Manan Dey
ICLR Workshop on Deep Learning for Code, 2023
arXiv / models

Efficient 1.1B parameter code model demonstrating that aggressive data filtering enables smaller models to outperform larger ones.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Teven Le Scao, Angela Fan, and 380+ authors including Manan Dey
Journal of Machine Learning Research (JMLR), 2024
project page / arXiv / model

Open-access 176B-parameter multilingual LLM trained on the 1.6TB ROOTS corpus covering 46 natural and 13 programming languages.

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon, Lucile Saulnier, and 50+ authors including Manan Dey
NeurIPS Datasets and Benchmarks Track, 2022
paper

1.6TB multilingual dataset spanning 59 languages, built for training BLOOM, with comprehensive documentation, opt-out processes, and PII redaction.

How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts
Shanya Sharma, Manan Dey, Koustuv Sinha
Findings of EMNLP, 2022
arXiv

Zero-shot debiasing method for neural machine translation using contextual sentences during inference without fine-tuning.

You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings
Zeerak Talat and 15+ authors including Manan Dey
ACL Workshop on Challenges & Perspectives in Creating Large Language Models, 2022
paper

Examines bias evaluation methodologies in multilingual contexts and how existing frameworks fail to capture cultural nuances.

Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh, Albert Webson, and 30+ authors including Manan Dey
ICLR, 2022 (Spotlight)
paper / arXiv / model

Demonstrates that multitask prompted training enables zero-shot generalization; the T0 model shows that task diversity can compensate for model size.

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
Stephen H. Bach, Victor Sanh, and 20+ authors including Manan Dey
ACL Demo Track, 2022
arXiv / code / demo

Platform for collaborative creation and sharing of natural language prompts, hosting 2,000+ prompts across 170+ datasets.

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
Sabrina J. Mielke and 10+ authors including Manan Dey
arXiv, 2021
arXiv

Historical survey tracing tokenization evolution from pre-neural to modern deep learning approaches including BPE, WordPiece, and SentencePiece.

Evaluating Gender Bias in Natural Language Inference
Shanya Sharma, Manan Dey, Koustuv Sinha
NeurIPS Workshop on Dataset Curation and Security, 2020
arXiv

Evaluation methodology for measuring gender bias in NLI by pairing gender-neutral premises with gender-specific hypotheses.

Assessing Viewer's Mental Health by Detecting Depression in YouTube Videos
Shanya Sharma, Manan Dey
NeurIPS Workshop on AI for Social Good, 2019
arXiv

System for detecting depressive content in YouTube videos using transcript analysis and CES-D scores from viewer comments.

Open Source Projects

Data Provenance Initiative
A community-driven project tracking data provenance and consent in AI training datasets
project page / code / paper

Tracks consent, licensing, and attribution for AI training datasets across text, speech, and video modalities.

BigScience Metadata Project
Research on incorporating metadata during language model pretraining
code

Investigates how incorporating metadata during pretraining improves LLM zero-shot performance and structured generation.

Patents

Machine-learned architecture for structured synthetic data generation
Aditi Shreya, Manan Dey, Dharani Gopal Akkiraju, Joao Tiago Azevedo Belo, Hariharan Mani
US Patent App. 18/876,234, Filed 2026

ML-based system for generating structured synthetic data for testing and development.

Generation of diverse simulated data
Prashant Telkar, Dharithri Rai B, Meldon Malcolm Dcunha, Manan Dey
US Patent 12,153,742, Granted November 2025

Method for generating diverse simulated datasets for comprehensive software testing.

Framework for template-based test data
Prashant Telkar, Shwetha Kamath B M, Vishwas Agrawal, Manan Dey, Meldon Malcolm Dcunha
US Patent App. 18/748,923, Filed 2025

Template-driven framework for efficient test data generation and management.

System and method for test data healing
Prashant Telkar, Meldon Malcolm Dcunha, Manan Dey, Shwetha Kamath B M, Vishwas Agrawal, Anjali Mishra
US Patent App. 18/606,106, Filed 2025

Automated system for detecting and repairing corrupted or invalid test data.

Automatic update of user interface element identifiers for software artifact tests
Rohan B Sahu, Manan Dey, Manu Jose Philip, Archana Taneja, Naveen V
US Patent App. 18/404,753, Filed 2025

Self-healing test automation system that automatically updates UI element identifiers.

Template from Jon Barron's website