08 Feb 2011

Normalizing free-form text input

If you’re accepting free text from users that needs to make sense, you need to normalize it. For instance, if you’re accepting the name of a film actor, you need to understand that “Shahrukh Khan”, “Shah rukh Khan”, “SRK” or even “Sharukh Khan” all refer to the same actor.

You could use a dynamic drop-down to force a certain style of text-entry, but that gets a little intrusive.

Here I demonstrate a simple way to ensure consistency of this data without any intrusive UI elements. I’m using the Bing API, but any search API would do.

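For film actors, I’ll restrict the search to IMDB since it’s comprehensive and exclusive. Whatever spelling the user typed, the top result is usually the actor’s own page, and its title carries the canonical name.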

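import re
import urllib.parse
import urllib.request

APPID = 'YOUR_BING_APP_ID'  # placeholder: substitute your own Bing API application ID
val = 'Shahrukh Khan'

# Quick Bing search, restricted to imdb.com with the site: operator.
query = urllib.parse.quote_plus('site:imdb.com ' + val)
url = ('http://api.search.live.net/xml.aspx?Appid=' + APPID + '&query=' + query
       + '&sources=web&web.count=1&adult=off')
response = urllib.request.urlopen(url)

# A simple regex (IMDB has "<actor> - IMDB" as title) and we have the normalized name.
# The non-greedy (.*?) stops at the first " -" instead of running on to the last one.
normalizedName = re.findall('<web:Title>(.*?) -', response.read().decode('utf-8'))[0]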
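
One caveat: indexing the regex result with [0] raises an IndexError whenever the search comes back empty. Here is a minimal sketch, written for Python 3, of how the lookup might be split into two small functions so that the user's original text is kept when nothing matches. The names build_search_url, extract_normalized_name and TITLE_RE are my own, and the sample response is a made-up fragment for illustration, not real API output.

```python
import re
import urllib.parse

# Assumed title pattern: result titles look like "<name> - IMDb".
TITLE_RE = re.compile(r'<web:Title>(.*?) -')


def build_search_url(app_id, name):
    """Build the site-restricted search URL for the given name."""
    query = urllib.parse.quote_plus('site:imdb.com ' + name)
    return ('http://api.search.live.net/xml.aspx?Appid=' + app_id
            + '&query=' + query + '&sources=web&web.count=1&adult=off')


def extract_normalized_name(xml_text, fallback):
    """Return the name from the first result title, or fallback if none is found."""
    match = TITLE_RE.search(xml_text)
    return match.group(1).strip() if match else fallback


# Hypothetical response fragment, for illustration only.
sample = '<web:Title>Shah Rukh Khan - IMDb</web:Title>'
print(extract_normalized_name(sample, 'Sharukh Khan'))            # Shah Rukh Khan
print(extract_normalized_name('<web:Results/>', 'Sharukh Khan'))  # Sharukh Khan
```

Falling back to the raw input is only one option. Depending on the application, you might prefer to flag the entry for review instead.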