Live Demo · Query Denoising

What if your typo doesn't just look wrong —
it means something completely different?

A denoising model that doesn't just correct spelling — it recovers the user's true search intent before the engine ever sees the query.

Without correction

Engine interpreted

"moth ball in my clozit"

Moth Ball trick shots — videos
youtube.com · cricket & snooker

Buy balls online — 43 results
amazon.com · sports

What is a moth ball?
Did you mean: mothball?

After denoising

moth ball→mothball · clozit→closet

Engine interpreted

"mothball in my closet"

How to use mothballs safely
thisoldhouse.com · pest control

Cedar vs mothballs — guide
consumerreports.org

Best mothball brands 2024
wirecutter.com · home storage

The Problem

When spelling errors change meaning, not just appearance

Standard spell-checkers operate at the character level. They look at a word, find the closest dictionary entry, and swap it in. That works fine for simple typos like teh → the. But it fails completely when the misspelling produces a real, valid word in a totally different domain.

"moth ball" (two words, a cricket/snooker term) versus "mothball" (one word, a pest-control product) are both perfectly valid English — but they send a search engine into entirely different domains. A character-level correction model has no way to detect this. The query looks clean. The intent is broken.

This is the problem this project targets: semantic drift caused by spacing, compounding, and phonetic errors. The noise is in the meaning, not in the characters.
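The failure mode above can be reproduced with a few lines of standard-library Python. This is only a sketch of a generic character-level checker, not the project's code: the tiny DICTIONARY and the spellcheck helper are hypothetical, and difflib's similarity ratio stands in for whatever edit-distance measure a real spell-checker would use.

```python
import difflib

# Hypothetical toy dictionary; a real spell-checker would load a full word list.
DICTIONARY = ["the", "moth", "ball", "mothball", "in", "my", "closet"]


def spellcheck(query):
    """Swap each out-of-dictionary token for its closest dictionary entry."""
    corrected = []
    for token in query.lower().split():
        if token in DICTIONARY:
            # A valid word is never questioned, whatever the context says.
            corrected.append(token)
            continue
        match = difflib.get_close_matches(token, DICTIONARY, n=1, cutoff=0.6)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)


print(spellcheck("teh"))                     # the
print(spellcheck("moth ball in my clozit"))  # moth ball in my closet
```

The checker repairs clozit but leaves "moth ball" alone, because both tokens are valid dictionary words. The query now looks clean while still pointing at the wrong domain.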

More Examples

Errors that change domain entirely

User typed "moth ball in my clozit" → "mothball in my closet"

Cricket shot vs. pest repellent — completely different domains

User typed "how to re move a mole" → "how to remove a mole"

Chemistry unit vs. skin lesion — intent is medical, not scientific

User typed "best current see" → "best curry recipe"

Phonetic drift — ocean currents vs. Indian food

The Approach

Stage 1 — Detect

A sequence model (fine-tuned BERT) flags tokens that are likely to be noisy: misspelled, wrongly spaced, or phonetically drifted. Unlike rule-based systems, it uses surrounding context to decide whether a word is suspicious.
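One way to picture what the detector learns from is a per-token noisy/clean label for each query. The sketch below derives such labels by aligning a noisy query with its clean version using difflib. It assumes the detector is trained as a binary token tagger and that labels come from aligned query pairs; the source does not say how supervision is produced, and noise_labels is a hypothetical helper, not the fine-tuned BERT model itself.

```python
from difflib import SequenceMatcher


def noise_labels(noisy_query, clean_query):
    """Label each noisy token 1 if it differs from the clean query, else 0."""
    noisy = noisy_query.split()
    clean = clean_query.split()
    labels = [0] * len(noisy)
    # Align the two token lists; autojunk is disabled because queries are short.
    matcher = SequenceMatcher(a=noisy, b=clean, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" spans mark noisy-side tokens that need fixing.
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                labels[i] = 1
    return list(zip(noisy, labels))


print(noise_labels("moth ball in my clozit", "mothball in my closet"))
# [('moth', 1), ('ball', 1), ('in', 0), ('my', 0), ('clozit', 1)]
```

Both halves of the split compound are flagged together, which is the behaviour a context-aware tagger would need to reproduce at inference time without seeing the clean query.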

Stage 2 — Correct

A seq2seq denoiser (T5-small) rewrites the flagged query. Crucially, it is trained on pairs where the semantic domain shifts — not just surface-level character edits. It learns to recover compound words and domain-correct alternatives.
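Training pairs of this kind could be generated by corrupting clean queries in ways that change meaning. The sketch below is a minimal illustration under stated assumptions: the COMPOUND_SPLITS and PHONETIC_SWAPS lexicons, the make_noisy and make_pairs helpers, and the source/target field names are all hypothetical, and a real pipeline would draw its corruptions from far larger resources.

```python
# Hypothetical lexicons of meaning-changing corruptions.
COMPOUND_SPLITS = {"mothball": "moth ball", "remove": "re move"}
PHONETIC_SWAPS = {"closet": "clozit"}


def make_noisy(clean_query):
    """Corrupt a clean query by splitting compounds and swapping in phonetic spellings."""
    noisy_tokens = []
    for token in clean_query.split():
        if token in COMPOUND_SPLITS:
            noisy_tokens.append(COMPOUND_SPLITS[token])
        elif token in PHONETIC_SWAPS:
            noisy_tokens.append(PHONETIC_SWAPS[token])
        else:
            noisy_tokens.append(token)
    return " ".join(noisy_tokens)


def make_pairs(clean_queries):
    """Build (noisy source, clean target) pairs, skipping queries left unchanged."""
    pairs = []
    for clean in clean_queries:
        noisy = make_noisy(clean)
        if noisy != clean:
            pairs.append({"source": noisy, "target": clean})
    return pairs


for pair in make_pairs(["mothball in my closet", "how to remove a mole", "weather today"]):
    print(pair["source"], "->", pair["target"])
```

Each pair has the noisy query as input and the clean query as the target, which is the shape a seq2seq model such as T5-small is typically fine-tuned on.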

Stage 3 — Validate Intent

A lightweight intent classifier confirms the corrected query lands in the right domain before it reaches the ranking model. If intent is uncertain, both queries are run in parallel and results are merged.
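The fallback described above can be sketched as a small routing function. Everything here is an assumption made for illustration: the CONFIDENCE_THRESHOLD value, the route helper, the search callable, and the interleaving merge are hypothetical, since the source does not specify how results are merged. For simplicity the two searches run one after the other rather than in parallel.

```python
from itertools import zip_longest

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off for trusting the corrected intent


def route(original, corrected, intent_confidence, search):
    """Return results for the corrected query, or a merged list when intent is uncertain."""
    if intent_confidence >= CONFIDENCE_THRESHOLD:
        return search(corrected)
    # Uncertain intent: query both forms and interleave, corrected results first.
    merged, seen = [], set()
    for pair in zip_longest(search(corrected), search(original)):
        for doc in pair:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical stand-in for the search engine.
FAKE_INDEX = {
    "mothball in my closet": ["mothball-safety", "cedar-guide", "mothball-brands"],
    "moth ball in my clozit": ["trick-shots", "cedar-guide"],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


print(route("moth ball in my clozit", "mothball in my closet", 0.55, fake_search))
# ['mothball-safety', 'trick-shots', 'cedar-guide', 'mothball-brands']
```

Interleaving keeps the corrected query's top result in first place while still surfacing the original interpretation, so a wrong correction does not hide what the user literally typed.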

Evaluation

Evaluation uses a synthetic dataset of meaning-changing errors plus a hand-labelled sample of real search logs. Metrics: domain accuracy, BLEU, and the change in ranking NDCG with and without denoising.
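Two of these metrics are simple enough to sketch directly. The code below assumes domain accuracy means the share of corrected queries whose predicted domain matches the labelled one, and that NDCG is computed against a pool of graded relevance judgements per query. The function names, the judged_pool argument, and the example grades are hypothetical. BLEU is left out because it is normally taken from an existing library.

```python
import math


def domain_accuracy(predicted, expected):
    """Fraction of queries whose predicted domain matches the labelled domain."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances, judged_pool):
    """DCG normalised by the best ordering available in the judged pool."""
    ideal = dcg(sorted(judged_pool, reverse=True)[:len(relevances)])
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def ndcg_delta(before, after, judged_pool):
    """Change in NDCG when the engine receives the denoised query."""
    return ndcg(after, judged_pool) - ndcg(before, judged_pool)


pool = [3, 2, 1, 0, 0]  # hypothetical relevance grades for one query
print(domain_accuracy(["home", "medical", "sports"], ["home", "medical", "food"]))
print(ndcg_delta(before=[0, 0, 1], after=[3, 2, 1], judged_pool=pool))
```

A positive delta means the denoised query produced a better ranking than the raw query for the same set of judged documents.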

Stack & Status

Tech Stack

Python · HuggingFace · BERT T5-small · PyTorch · NLTK Phonetics · Elasticsearch

Overall Progress: 30%
