Jaydeep Raijada

Hi, I'm Jaydeep

I train small language models and study how far they can be pushed with reinforcement learning and post-training.

About

I design reward functions, build training environments, and check whether techniques that work on large models still work when you shrink them ~70×. By day I'm an Analyst at Lowe's. On the side I'm running post-training experiments and exploring continual learning and self-distillation for sub-1B models. I recently placed in the top 100 at the HuggingFace × Meta OpenEnv Hackathon in Bangalore.

Currently
Analyst at Lowe's · Bangalore, India
December 2024 – Present

Designed an LLM-based sentiment analysis and topic modelling pipeline on Voice of Customer (VOC) data, enabling scalable monitoring of customer feedback in production. Built a clustering-based experimentation framework to find high-performing survey introduction scripts, which lifted take rate from 9% to 14% (a 55% relative improvement). Built a multi-stage prediction system using HistGradientBoosting across hierarchical classification buckets, plus Power BI dashboards to make the insights usable day-to-day.

Research

Self-Distillation at Sub-1B Scale: Does SDFT Break When ICL Is Weak?

May 2026 – Present
Phase 1 (ongoing) · Model: SmolLM2-360M · Target: ACL 2027

The two leading self-distillation methods were tested on models 7B parameters and larger. Neither paper looks at what happens below 1B, where in-context learning is much weaker. DynSDPB was built for small models, but no one has compared it head-to-head with the ICL-based approaches.

Hypothesis. Self-distillation gains depend on how good a model is at in-context learning. Below a certain size, ICL signals get too noisy to use as a teacher, and DynSDPB should outperform there.

Selected work

SHADE-GYM

April 2026

Top 100 at the HuggingFace × Meta OpenEnv Hackathon, Bangalore. Built an RL environment where a small monitor model learns to catch a frontier attacker (DeepSeek-R1) attempting hidden harmful behaviors across 9 enterprise scenarios. Every reward is a plain Python check, with no LLM-as-judge anywhere in the loop. After 10 reward design iterations, a simple linear reward beat the more elaborate composed one. Trained a Qwen-2.5-1.5B LoRA monitor from random performance (AUROC 0.500) to strong detection (0.893, Recall 0.88, FPR 0.12), closing about 40% of the gap to Gemini-2.5-Pro at under 0.1% of its per-call cost.

GRPO·RLVR·OpenEnv·TRL·Qwen-2.5-1.5B·Reward Design·Scalable Oversight·LoRA
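
To make "a plain Python check" concrete, here is a minimal sketch of a linear reward and the AUROC metric quoted above. It is an illustration under assumptions, not the SHADE-GYM code: it assumes the monitor emits a suspicion score in [0, 1] and that each episode carries a ground-truth attack label. The names `linear_reward` and `auroc` are hypothetical.

```python
from typing import Sequence


def linear_reward(suspicion: float, is_attack: bool) -> float:
    """Reward that grows linearly with the score mass placed on the true label.

    Assumption: `suspicion` is the monitor's score in [0, 1].
    """
    s = min(max(suspicion, 0.0), 1.0)  # clamp malformed scores into [0, 1]
    return s if is_attack else 1.0 - s


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a random attack episode outscores a random benign one.

    Ties count as half a win, so a monitor that always emits the same score
    lands at exactly 0.5 (chance level).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one attack and one benign episode")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Both functions are deterministic and use only the standard library, which is the property that makes a reward verifiable without an LLM judge in the loop.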

Continued pre-training of SmolLM-135M on 138 arXiv ML papers (2024–2026) using QLoRA (rank 32) via Unsloth. Trained on an RTX 4090 in about 14 minutes. On held-out papers: −20.1% perplexity (22.97 to 18.36), +19.7% ROUGE-1, +25.4% ROUGE-L, +37.5% BLEU. Two takeaways: on a dataset this small, LoRA outperformed full fine-tuning, likely because the low-rank constraint acts as a regularizer; and gains saturate at rank r ≥ 16, which suggests the data (not the rank) is the bottleneck. Adapter (9.7M trainable params, 6.77% of total) published to HuggingFace.

CPT·LoRA·QLoRA·Unsloth·SmolLM-135M·Domain Adaptation·HuggingFace
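
The headline numbers above can be re-derived with a few lines of arithmetic. This is only a sanity check on the reported figures. The helper name `relative_change` is hypothetical, and the "implied total" assumes the 6.77% share is measured against base plus adapter parameters.

```python
def relative_change(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Held-out perplexity before and after continued pre-training (from the text).
ppl_change = relative_change(22.97, 18.36)
print(f"perplexity change: {ppl_change:.1f}%")  # perplexity change: -20.1%

# Trainable share: 9.7M adapter params reported as 6.77% of the total.
implied_total = 9.7e6 / 0.0677
print(f"implied total params: {implied_total / 1e6:.1f}M")  # implied total params: 143.3M
```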

Two diffusion language models, both built from scratch. ModernBERT (~150M): pretrained as a masked diffusion LM on Project Gutenberg (6.4M chunks, 20 hours on an RTX 4090), then SFT'd on Open-Orca (~4.2M Q&A pairs, instruction-token-only loss). TinyStories (45M): full architecture written from scratch and trained for 60K steps, with confidence-based iterative denoising at generation time (128 diffusion steps). Loss dropped sharply around step 25K, the point where structure starts to emerge. All checkpoints published to HuggingFace.

Diffusion LM·ModernBERT·Non-autoregressive·PyTorch·Masked Diffusion·From Scratch
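
Confidence-based iterative denoising can be sketched as follows: start from a fully masked sequence, ask the model for a token and a confidence at every position, commit only the most confident masked positions, and repeat. This is a toy illustration under assumptions, not the published checkpoints' sampler. `MASK`, `denoise`, and `toy_predict` are hypothetical names, and the even unmasking schedule is one simple choice among several.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical marker for a still-masked position

# A predictor maps the current sequence to one (token_id, confidence) per position.
Predictor = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def denoise(predict: Predictor, length: int, steps: int) -> List[Optional[int]]:
    """Iteratively unmask a sequence, most confident positions first."""
    seq: List[Optional[int]] = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break  # everything is committed before the step budget ran out
        preds = predict(seq)
        # Spread the remaining masks evenly over the remaining steps, so the
        # final step always commits whatever is left.
        k = math.ceil(len(masked) / (steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            seq[i] = preds[i][0]
    return seq


def toy_predict(seq: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Stand-in for a real model: random tokens with random confidences."""
    rng = random.Random(sum(tok is MASK for tok in seq))
    return [(rng.randrange(100), rng.random()) for _ in seq]


sample = denoise(toy_predict, length=16, steps=4)
```

With a real model, `predict` would run one forward pass and return the argmax token and its probability per position. Setting `steps=128` corresponds to the 128 diffusion steps mentioned above.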

TenderIQ

May 2026

End-to-end system for evaluating government tenders. The pipeline pulls criteria from the document with a DeepSeek LLM, retrieves matching evidence with sentence-transformers, and writes an explainable verdict with cited source clauses. OCR uses a three-tier fallback: PyMuPDF first, Tesseract next, DeepSeek Vision when confidence drops below 65%. Borderline verdicts (confidence 0.55 to 0.80) route to a human review queue with a full audit trail. Covered by a 43-check smoke test suite.

RAG·DeepSeek·OCR·Streamlit·Human-in-the-loop·Production·Pydantic
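
The tiered OCR fallback and the review routing described above can be sketched like this. It is a hedged illustration, not TenderIQ's implementation. Each engine is a hypothetical callable returning `(text, confidence)`. The sketch applies the 0.65 threshold at every tier, and the 0.55 to 0.80 band is treated as inclusive; both are assumptions. What happens to verdicts below 0.55 is not stated in the text, so the sketch leaves that open.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical engine interface: page bytes in, (text, confidence in [0, 1]) out.
OcrEngine = Callable[[bytes], Tuple[str, float]]


def extract_text(
    page: bytes,
    engines: Sequence[OcrEngine],
    min_confidence: float = 0.65,
) -> Tuple[str, float]:
    """Try engines in order and stop at the first confident result.

    If no engine reaches the threshold, return the best result seen.
    """
    best: Tuple[str, float] = ("", 0.0)
    for engine in engines:
        text, confidence = engine(page)
        if confidence >= min_confidence:
            return text, confidence
        if confidence > best[1]:
            best = (text, confidence)
    return best


def needs_human_review(confidence: float, low: float = 0.55, high: float = 0.80) -> bool:
    """True when a verdict falls inside the borderline confidence band."""
    return low <= confidence <= high
```

In use, `engines` would be ordered cheapest first (for example wrappers around PyMuPDF, Tesseract, and a vision model), so the expensive tier only runs when the earlier ones are unsure.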

Writing

Notes on post-training, RL environments, and small-model experiments. Read on the blog or Substack.

Skills
Post-training & alignment
CPT·SFT·RLHF·PPO·GRPO·Reward Modeling·Preference Optimization·LoRA / QLoRA·Self-Distillation·Unsloth
RL environments
OpenEnv·Reward Design·Verifiers
Applied AI
LLMs·RAG·Agentic Workflows·Prompting & Context Engineering·Tool Use·Diffusion LMs
Multimodal
VLMs·Vision Transformers (ViT, Swin, DeiT, DETR, PaliGemma)·Audio AI (STT, TTS)
Frameworks & MLOps
PyTorch·Transformers·TRL·MLFlow·DVC·DagShub·CI/CD (GitHub Actions)
Languages & analytics
Python·SQL·Pandas·Power BI

Get in touch

The fastest way to reach me is by email at j.raijada25@gmail.com. Or find me on X, GitHub, and LinkedIn.