// Staff Data Scientist · Trust & Safety · Applied GenAI

Hi, I'm Varun Medappa

I build multimodal ML systems that run at internet scale.

Founding member of IAS's $150M trust & safety classification suite. Detecting low-prevalence unsafe content across 3T impressions/day. Patented.

Impact

What I've Driven

👥 Managed 8 FTE · IAS Social Classification team
🌱 Founding member, CV Research · Computer Vision team at IAS
🚀 Drove 0 → $150M product line · Brand Safety video classification
$150M
Annual Revenue Powered
Brand Safety suite across TikTok, YouTube, Meta, Twitter
3T+
Daily Impressions Processed
Video classification engine, 23 countries
34%
YoY Revenue Growth
Social classification product line
99.5%
Labeling Cost Reduction
Multimodal retrieval + AI labeling pipeline
📜
U.S. Patent Holder
Multimodal Classification
🏆
Best AI Project
Vista Hackathon, 63 companies
Education

Where I Studied

🏫
M.S. Operations Research
Columbia University
New York, NY
🎓
B.E. Industrial Engineering
R.V. College of Engineering
Bangalore, India
What I've Built

Things I Shipped

Real systems running in production at scale.

All Projects
Staff DS IAS 2025 – Now
Senior DS IAS 2023 – 2025
Data Scientist IAS 2021 – 2023
Associate DS IAS 2019 – 2021
Analyst Capillary 2016 – 2017
🎥
3T imps/day
Integral Ad Science · Trust & Safety · Founding project

Multimodal Video Classification Engine

Built the core Trust & Safety classification product from zero. Detects hard-to-catch, low-prevalence unsafe content across TikTok, YouTube, Meta, and Twitter. Processes 3 trillion impressions/day, with 34% YoY revenue growth. Deployed across 23 countries. Patented.

TensorFlow ECS Fargate Multimodal ML 23 Countries
🔍
99.5% cost cut
Integral Ad Science · Trust & Safety

Multimodal Video Retrieval & Labeling Pipeline

Built an internal Python package for end-to-end multimodal labeling. Vector search finds relevant content, LLM+VLM classifies it, human A/B testing validates, and DSPy optimizes prompts in a continuous RLAIF loop. Cut labeling cost by 99.5%.

DSPy RLAIF VLM / LLM FAISS
🤖
New product, 80% less compute
Integral Ad Science · Applied GenAI

Audio & Video Deepfake Detection

Led PEFT/LoRA adoption for deepfake detection models, slashing compute by ~80% and boosting experiment throughput 3x. No performance degradation.

LoRA PEFT HuggingFace A/V Models
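
A rough illustration of why LoRA reduces the training footprint: instead of updating a full d_out x d_in weight matrix, it trains two low-rank factors of shapes d_out x r and r x d_in. The layer size and rank below are hypothetical, and the resulting fraction counts trainable parameters for a single layer only. It is not the ~80% compute figure quoted above, which depends on the full model and training setup.

```python
def full_params(d_out, d_in):
    """Trainable weights when the whole matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable weights for the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 768, 768, 8

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full

print(f"full: {full}, lora: {lora}, trainable fraction: {fraction:.4f}")
```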
🛡
22M videos/day
Integral Ad Science · Trust & Safety

Misinformation Detection System

Directed end-to-end development using multimodal models and DistilBERT models extended to handle longer token sequences. Catches low-prevalence misinformation (content that evades simple filters) across 22M+ videos per day in production.

DistilBERT Multimodal NLP Classification
90%+ cost reduction
Integral Ad Science · Media Team

Pseudo-Labeling & Synthetic Data Generation

Built a pseudo-labeling pipeline using SigLIP and Vicuna that cut ML development costs by 90%+ across 49 video categories. Separately led synthetic data generation with GPT-Neo that boosted Brand Safety precision by 49%.

SigLIP Vicuna GPT-Neo Data Pipelines
🌐
67% savings, 75T chars
Integral Ad Science · Media Team

Machine Translation at Scale

Built custom translation models with OpenNMT for 42 language pairs. Translates 75 trillion characters annually while reducing costs by 67% for the core contextual classification pipeline.

OpenNMT 75T chars/year 42 Languages 67% Savings
📈
8B impressions/day
Integral Ad Science · Incrementality Team

Online Conversion Lift Pipeline

Built a causal inference pipeline that ingests 8 billion ad impressions per day and uses Bayesian methods to measure the causal effect of ad campaigns on conversion rates.

PyStan Bayesian Causal Inference ETL
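
A minimal sketch of a Bayesian conversion-lift estimate, assuming randomized exposed and control groups. It swaps in a conjugate Beta-Binomial model with Monte Carlo sampling from the standard library in place of PyStan, and the counts are invented for illustration. The function name and the uniform Beta(1, 1) prior are assumptions, not details of the production pipeline.

```python
import random


def prob_treatment_beats_control(conv_t, n_t, conv_c, n_c, draws=20000, seed=0):
    """Estimate P(rate_treatment > rate_control) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of each conversion rate is Beta(1 + successes, 1 + failures).
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        wins += p_t > p_c
    return wins / draws


# Invented counts: 130/1000 conversions when exposed, 100/1000 in control.
p = prob_treatment_beats_control(conv_t=130, n_t=1000, conv_c=100, n_c=1000)
print(f"P(exposed converts better than control) = {p:.3f}")
```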
💰
18% conversion
Capillary Technologies · Bangalore

Customer Churn Prediction

Built predictive models and churn strategies that identified customer behavior patterns and converted 18% of one-time buyers into repeat customers.

Python Predictive Modeling Churn
Featured

Project I Am Proud Of

★ Featured Project

Multimodal AI Labeling & Prompt Optimization Pipeline

Integral Ad Science · Staff DS · Trust & Safety

The Problem

We classify extremely low-prevalence, nuanced categories of unsafe content in social media video. Previously, building a new classifier meant weeks of work: sampling enough rare examples from production data that is overwhelmingly safe content, designing experiments to extract signal, then paying for expensive, time-consuming human annotations on subjective user-generated video. Each iteration was slow, costly, and hard to scale.

What I Built

An internal Python package that closes the entire loop. It finds relevant content in production via vector search, accepts custom prompts (versioned in GitHub), and processes any modality. For video, it extracts keyframes, deduplicates frames, then sends everything to a cost-optimized LLM + VLM system for multimodal fusion. The output includes multilabel classifications, topic/subtopic stratification for comprehensive dataset sampling, and artifacts ready for model training or human review.

Vector Search Keyframe Extraction LLM + VLM Fusion Multilabel Output Human A/B Testing DSPy Prompt Optimization
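
A minimal sketch of the retrieval and frame-deduplication steps described above, using plain cosine similarity over toy embeddings. A production system would use an approximate-nearest-neighbor index such as FAISS and real frame embeddings. The function names, the 0.98 threshold, and the vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the ids of the k items most similar to the query embedding."""
    ranked = sorted(index, key=lambda item_id: cosine(query, index[item_id]), reverse=True)
    return ranked[:k]


def dedupe_frames(frames, threshold=0.98):
    """Keep a frame only if it is not a near-duplicate of an already kept frame."""
    kept = []
    for frame in frames:
        if all(cosine(frame, other) < threshold for other in kept):
            kept.append(frame)
    return kept


# Toy embeddings standing in for video-level vectors.
index = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
candidates = top_k(query, index, k=2)

# Toy keyframe embeddings: the second frame nearly repeats the first.
frames = [[1.0, 0.0], [0.999, 0.001], [0.0, 1.0]]
unique_frames = dedupe_frames(frames)

print(candidates, len(unique_frames))
```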

The Feedback Loop

Human reviewers validate AI labels through a custom A/B testing UI. Their responses feed back into the package, where DSPy and GEPA optimize the original prompts automatically. This is essentially an RLAIF pipeline. The LLM-as-judge acts as a reward model, with engineered reward functions calibrated against human baselines to prevent reward hacking. The loop runs continuously: label, validate, optimize, retrain.

DSPy LLM-as-Judge RLAIF VLM / LLM FAISS Prompt Optimization Multimodal Fusion
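
A minimal sketch of the calibration idea behind the loop: score each prompt version by how often its judge labels agree with human labels, then keep the best one. This is a stand-in, not DSPy or GEPA, which search over prompt candidates rather than scoring a fixed set. The prompt names and labels are invented.

```python
def agreement(judge_labels, human_labels):
    """Fraction of items where the judge label matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def select_prompt(judge_outputs, human_labels):
    """Pick the prompt version whose judge labels agree best with humans."""
    scores = {name: agreement(labels, human_labels) for name, labels in judge_outputs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Invented human baseline and judge outputs for two prompt versions.
human = ["unsafe", "safe", "safe", "unsafe", "safe"]
judge_outputs = {
    "prompt_v1": ["unsafe", "unsafe", "safe", "safe", "safe"],
    "prompt_v2": ["unsafe", "safe", "safe", "unsafe", "unsafe"],
}

best, scores = select_prompt(judge_outputs, human)
print(best, scores)
```

Raw agreement is the simplest possible score. With rare positive classes, a chance-corrected or per-class metric would likely be a better calibration target.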
~ ✧ ~
Skills

Stack Map

Not a laundry list. Every skill tied to something I shipped.

PEFT / LoRA Deepfake Detection
DSPy / RLAIF Prompt Optimization Loop
FAISS & Vector Search Video Retrieval Pipeline
VLM / LLM Multimodal Labeling & AI Judge
TensorFlow + ECS Fargate Video Classification (3T/day)
SigLIP / CLIP Pseudo-Labeling Pipeline
OpenNMT 42-Language Translation
DistilBERT / NLP Misinformation Detection
PyStan / Bayesian Ad Attribution (8B imps/day)
Also: Python · PyTorch · AWS · Databricks · Docker · PySpark · SQL
~ ✧ ~
Get in Touch

Let's Connect

Open to conversations about ML, data science leadership, and impactful opportunities.