IDEA-004 · NLP / Search Quality

Live Demo · Query Denoising

What if your typo doesn't just look wrong —
it means something completely different?

A denoising model that doesn't just correct spelling — it recovers the user's true search intent before the engine ever sees the query.

search.example.com
Without correction
Engine interpreted
"moth ball in my clozit"
Moth Ball trick shots — videos
youtube.com · cricket & snooker
Buy balls online — 43 results
amazon.com · sports
What is a moth ball?
Did you mean: mothball?
After denoising
moth ball → mothball · clozit → closet
Engine interpreted
"mothball in my closet"
How to use mothballs safely
thisoldhouse.com · pest control
Cedar vs mothballs — guide
consumerreports.org
Best mothball brands 2024
wirecutter.com · home storage
The Problem

When spelling errors change meaning, not just appearance

Standard spell-checkers operate at the character level: they take a word, find the closest dictionary entry, and swap it in. That works for simple typos like teh → the, but it fails when the error itself consists of real, valid words that belong to a totally different domain.
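To make the failure mode concrete, here is a minimal sketch of a character-level checker built on Python's standard `difflib`. The toy `DICTIONARY` and the `naive_correct` helper are illustrative assumptions, not part of this project. The checker repairs "clozit" but never questions "moth ball", because both tokens are already valid words.

```python
import difflib

# Toy dictionary (assumption); a real checker would load a full word list.
DICTIONARY = ["the", "moth", "ball", "mothball", "in", "my",
              "closet", "remove", "a", "mole"]

def naive_correct(query: str) -> str:
    """Replace each out-of-dictionary token with its closest dictionary entry."""
    fixed = []
    for token in query.lower().split():
        if token in DICTIONARY:
            # A valid word is never questioned, even if it is wrong in context.
            fixed.append(token)
            continue
        match = difflib.get_close_matches(token, DICTIONARY, n=1, cutoff=0.6)
        fixed.append(match[0] if match else token)
    return " ".join(fixed)

print(naive_correct("teh"))
print(naive_correct("moth ball in my clozit"))
```

The second call fixes the character-level typo yet leaves the two-word "moth ball" in place, so the query still points at the wrong domain.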

"moth ball" (two words, a cricket/snooker term) versus "mothball" (one word, a pest-control product) are both perfectly valid English — but they send a search engine into entirely different domains. A character-level correction model has no way to detect this. The query looks clean. The intent is broken.

This is the problem this project targets: semantic drift caused by spacing and compounding errors. Not character noise — meaning noise.

More Examples

Errors that change domain entirely

User typed: moth ball in my clozit → mothball in my closet
Cricket shot vs. pest repellent: completely different domains

User typed: how to re move a mole → how to remove a mole
Chemistry unit vs. skin lesion: intent is medical, not scientific

User typed: best current see → best curry recipe
Phonetic drift: ocean currents vs. Indian food
The Approach

Stage 1 — Detect

A sequence model (fine-tuned BERT) flags tokens likely to be noisy: misspelled, wrongly spaced, or phonetically drifted. Unlike rule-based systems, it uses surrounding context to decide whether a word is suspicious.
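A token-level detector needs per-token supervision. One possible way to derive it, assuming parallel (noisy, clean) query pairs are available, is to align the two token sequences and mark every noisy token that has no identical counterpart. The `noise_labels` helper below is a hypothetical sketch of that labeling step using the standard `difflib.SequenceMatcher`; it is not the model itself, and the project may label its data differently.

```python
import difflib

def noise_labels(noisy: str, clean: str) -> list[tuple[str, int]]:
    """Label each noisy token 1 (suspicious) unless it aligns to an identical clean token."""
    noisy_toks, clean_toks = noisy.split(), clean.split()
    labels = [1] * len(noisy_toks)  # assume noisy until an exact alignment is found
    matcher = difflib.SequenceMatcher(a=noisy_toks, b=clean_toks, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = 0  # token survives unchanged in the clean query
    return list(zip(noisy_toks, labels))

print(noise_labels("moth ball in my clozit", "mothball in my closet"))
```

Labels of this shape could then serve as targets for a token-classification head on top of BERT.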

Stage 2 — Correct

A seq2seq denoiser (T5-small) rewrites the flagged query. Crucially, it is trained on pairs where the semantic domain shifts — not just surface-level character edits. It learns to recover compound words and domain-correct alternatives.
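Training pairs of this kind can be generated synthetically. The sketch below shows one assumed recipe: take a clean query and split any token that breaks into two vocabulary words, producing a noisy input whose tokens are all valid on their own. The `VOCAB` set and both helper names are hypothetical and stand in for whatever lexicon and pipeline the project actually uses.

```python
# Toy vocabulary (assumption); a real pipeline would use a large lexicon.
VOCAB = {"moth", "ball", "mothball", "re", "move", "remove",
         "in", "my", "closet", "a", "mole", "how", "to"}

def split_variants(word: str, vocab: set[str]) -> list[str]:
    """Return every way to split `word` into two vocabulary words."""
    return [
        f"{word[:i]} {word[i:]}"
        for i in range(1, len(word))
        if word[:i] in vocab and word[i:] in vocab
    ]

def make_pairs(clean_query: str, vocab: set[str]) -> list[tuple[str, str]]:
    """Build (noisy, clean) pairs by splitting one compound at a time."""
    tokens = clean_query.split()
    pairs = []
    for idx, tok in enumerate(tokens):
        for variant in split_variants(tok, vocab):
            noisy = " ".join(tokens[:idx] + [variant] + tokens[idx + 1:])
            pairs.append((noisy, clean_query))
    return pairs

print(make_pairs("mothball in my closet", VOCAB))
print(make_pairs("how to remove a mole", VOCAB))
```

Each pair maps a spacing error back to its compound form, which is the input/target format a seq2seq model such as T5 is fine-tuned on.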

Stage 3 — Validate Intent

A lightweight intent classifier confirms the corrected query lands in the right domain before it reaches the ranking model. If intent is uncertain, both queries are run in parallel and results are merged.
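The routing decision can be sketched as follows. The source does not say how uncertainty is measured or how results are merged, so the confidence `threshold` of 0.8 and the round-robin interleaving with de-duplication are assumptions chosen for illustration, as are the function names.

```python
from itertools import zip_longest

def route(original: str, corrected: str, confidence: float,
          threshold: float = 0.8) -> list[str]:
    """Return the queries to execute: corrected only if confident, otherwise both."""
    if original == corrected or confidence >= threshold:
        return [corrected]
    return [corrected, original]

def merge_results(*ranked_lists: list[str]) -> list[str]:
    """Interleave ranked result lists round-robin, dropping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

print(route("moth ball in my clozit", "mothball in my closet", confidence=0.55))
print(merge_results(["a", "b", "c"], ["b", "d"]))
```

Listing the corrected query first gives its results priority at each rank while still surfacing results for the literal query when the classifier is unsure.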

Evaluation

Measured on a synthetic dataset of meaning-changing errors plus a hand-labelled sample of real search logs. Metrics: domain accuracy, BLEU, and ranking NDCG delta.
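Two of these metrics are simple enough to sketch directly. The code below assumes graded relevance judgments against the user's intended query, uses the linear-gain form of DCG, and reads "NDCG delta" as NDCG after denoising minus NDCG before; those choices and all function names are assumptions rather than the project's confirmed definitions.

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances: list[float], judged: list[float]) -> float:
    """Normalize by the best ordering of all judged documents, cut to the same depth."""
    ideal = dcg(sorted(judged, reverse=True)[:len(relevances)])
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def ndcg_delta(before: list[float], after: list[float], judged: list[float]) -> float:
    """Ranking gain from denoising: NDCG(after) - NDCG(before)."""
    return ndcg(after, judged) - ndcg(before, judged)

def domain_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of queries whose corrected form lands in the labelled domain."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) if gold else 0.0

judged = [3, 2, 1, 0, 0]   # relevance of every judged document for the intended query
before = [0, 0, 1]         # top-3 relevances for the raw query
after = [3, 2, 1]          # top-3 relevances for the denoised query
print(ndcg_delta(before, after, judged))
print(domain_accuracy(["pest", "medical", "sports"], ["pest", "medical", "food"]))
```

A positive delta means the denoised query ranks documents relevant to the intended meaning higher than the raw query does.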

Stack & Status

Tech Stack

Python · HuggingFace · BERT · T5-small · PyTorch · NLTK · Phonetics · Elasticsearch
Overall Progress: 30%