Live Demo · Query Denoising
A denoising model that doesn't just correct spelling — it recovers the user's true search intent before the engine ever sees the query.
Standard spell-checkers operate at the character level. They look at a word, find the closest dictionary entry, and swap it in. That works for simple typos like teh → the. But it fails when the error produces a real, valid word or phrase that belongs to a different domain, because the dictionary lookup finds nothing to correct.
"moth ball" (two words, a cricket/snooker term) versus "mothball" (one word, a pest-control product) are both perfectly valid English — but they send a search engine into entirely different domains. A character-level correction model has no way to detect this. The query looks clean. The intent is broken.
This is the problem this project targets: semantic drift caused by spacing and compounding errors. Not character noise — meaning noise.
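To make the failure mode concrete, here is a minimal sketch of a character-level corrector. The toy vocabulary and the function name char_level_correct are assumptions made for illustration, not part of the project. It fixes teh, but leaves "moth ball" untouched because every token is already a dictionary word.

```python
from difflib import get_close_matches

# Toy dictionary (illustrative only); a real checker would load a full word list.
VOCAB = {"the", "moth", "ball", "mothball", "for", "closet"}

def char_level_correct(query: str) -> str:
    """Correct each token independently against VOCAB, ignoring context."""
    fixed = []
    for token in query.lower().split():
        if token in VOCAB:
            # Valid dictionary word: a character-level checker leaves it alone.
            fixed.append(token)
        else:
            # Otherwise swap in the closest dictionary entry, if one is close enough.
            match = get_close_matches(token, sorted(VOCAB), n=1, cutoff=0.6)
            fixed.append(match[0] if match else token)
    return " ".join(fixed)

print(char_level_correct("teh"))                   # typo is repaired
print(char_level_correct("moth ball for closet"))  # spacing error passes through
```

The second query comes back unchanged: nothing at the character level marks it as wrong.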
A sequence model (fine-tuned BERT) flags tokens that are likely to be noisy: misspelled, wrongly spaced, or phonetically drifted. Unlike rule-based systems, it uses the surrounding context to decide whether a word is suspicious.
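A token-level detector of this kind needs per-token training labels. The source does not say how those labels are produced; one plausible approach, sketched here as an assumption, is to align each noisy query with its clean version and mark every noisy token that does not survive the alignment. The function name noise_labels is hypothetical.

```python
from difflib import SequenceMatcher

def noise_labels(noisy: str, clean: str) -> list[tuple[str, int]]:
    """Label each token of the noisy query: 1 = likely noisy, 0 = clean.

    Assumption: labels come from a token-level alignment between the noisy
    query and its known clean form.
    """
    noisy_toks = noisy.split()
    clean_toks = clean.split()
    labels = [0] * len(noisy_toks)
    matcher = SequenceMatcher(a=noisy_toks, b=clean_toks, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # Tokens that were replaced or deleted to reach the clean query are noisy.
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                labels[i] = 1
    return list(zip(noisy_toks, labels))

print(noise_labels("moth ball for closet", "mothball for closet"))
```

Both halves of the split compound are flagged, while the context words keep label 0. Labels in this shape could then be used to fine-tune a token classifier.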
A seq2seq denoiser (T5-small) rewrites the flagged query. Crucially, it is trained on query pairs in which the error shifts the semantic domain, not only on pairs that differ by surface-level character edits. It learns to recover compound words and domain-correct alternatives.
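Training pairs of this kind can be generated synthetically by injecting spacing errors into clean queries. The sketch below is an assumption about how that might look, not the project's actual data pipeline; COMPOUND_SPLITS is a hypothetical lexicon that holds only the example used in this page, and make_pairs is an illustrative name.

```python
import re

# Hypothetical lexicon: compound word -> spaced form that changes the meaning.
# Only the example from the text is listed; a real lexicon would be far larger.
COMPOUND_SPLITS = {"mothball": "moth ball"}

def make_pairs(clean_queries: list[str]) -> list[tuple[str, str]]:
    """Return (noisy_source, clean_target) pairs for seq2seq training."""
    pairs = []
    for clean in clean_queries:
        for compound, split in COMPOUND_SPLITS.items():
            pattern = rf"\b{re.escape(compound)}\b"
            if re.search(pattern, clean):
                # Inject the spacing error; the clean query stays as the target.
                noisy = re.sub(pattern, lambda _m: split, clean)
                pairs.append((noisy, clean))
    return pairs

print(make_pairs(["buy mothball online", "cheap flights"]))
```

Queries that contain no lexicon entry produce no pair, so the training set stays focused on meaning-changing errors.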
A lightweight intent classifier confirms that the corrected query lands in the right domain before it reaches the ranking model. If the intent is uncertain, both the original and the corrected query are run in parallel and their results are merged.
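The gating logic described above can be sketched as follows. The 0.8 confidence threshold, the round-robin merge, and the intent_confidence and search callables are all assumptions made for illustration; the source does not specify how uncertainty is measured or how results are merged.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest
from typing import Callable, Sequence

def interleave(a: Sequence[str], b: Sequence[str]) -> list[str]:
    """Round-robin merge of two ranked lists, dropping duplicate documents."""
    merged, seen = [], set()
    for pair in zip_longest(a, b):
        for doc in pair:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def gated_search(
    original: str,
    corrected: str,
    intent_confidence: Callable[[str], float],  # hypothetical classifier score in [0, 1]
    search: Callable[[str], Sequence[str]],     # hypothetical search backend
    threshold: float = 0.8,                     # assumed cut-off, not from the source
) -> list[str]:
    if intent_confidence(corrected) >= threshold:
        # Intent is confident: only the corrected query reaches the ranker.
        return list(search(corrected))
    # Intent is uncertain: run both queries in parallel and merge the results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        corrected_future = pool.submit(search, corrected)
        original_future = pool.submit(search, original)
        return interleave(corrected_future.result(), original_future.result())

# Usage with stub components standing in for the real classifier and engine.
index = {"mothball": ["d1", "d2"], "moth ball": ["d3", "d1"]}
stub_search = lambda q: index.get(q, [])
print(gated_search("moth ball", "mothball", lambda q: 0.5, stub_search))
```

Putting the corrected query's results first in each round is a design choice that favours the correction when ranks tie; a production system might weight the two lists by classifier confidence instead.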
The pipeline is evaluated on a synthetic dataset of meaning-changing errors plus a hand-labelled sample of real search logs. Metrics: domain accuracy, BLEU, and the change in ranking NDCG (NDCG delta).
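Two of these metrics are simple enough to sketch directly. The code below assumes that NDCG delta means NDCG with correction minus NDCG without it, uses linear gain, and takes the ideal ranking from all judged documents for the query; none of these choices is confirmed by the source. The domain labels in the usage lines are placeholders.

```python
import math
from typing import Sequence

def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked: Sequence[float], judged: Sequence[float], k: int = 10) -> float:
    """NDCG@k; the ideal ranking is built from all judged relevances for the query."""
    ideal = dcg(sorted(judged, reverse=True)[:k])
    return dcg(ranked[:k]) / ideal if ideal > 0 else 0.0

def ndcg_delta(before: Sequence[float], after: Sequence[float],
               judged: Sequence[float], k: int = 10) -> float:
    """Assumed definition: NDCG after correction minus NDCG before correction."""
    return ndcg(after, judged, k) - ndcg(before, judged, k)

def domain_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of queries whose predicted domain matches the gold label."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# One relevant document, ranked third before correction and first after it.
print(ndcg_delta(before=[0, 0, 1], after=[1, 0, 0], judged=[1, 0, 0]))
print(domain_accuracy(["pest_control", "sport"], ["pest_control", "pest_control"]))
```

BLEU is left out of the sketch because it is usually computed with an existing library rather than by hand.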