Date: Mar. 2nd, 2012 04:26 pm (UTC) From: logomancer
While Google's algorithms are Sooper Top Sekrit, my educated guess is that they use a modified version of Bayesian filtering with a nearest-neighbor search to detect simple letter transpositions. When someone types a word with few results into the search bar, Google checks whether there's a close word with more results, and asks whether that's what the person meant. If they click on the suggestion, that's a yes, and the probability of the two being linked together goes up. If enough of this happens, then the substitution happens automatically. Of course, if they click on a different link, that's a no, and the filter takes that into account. In the end, it's all statistical analysis and a massive ton of storage, which is how machine learning has progressed for years now.
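To make that guess concrete, here's a rough Python sketch of the feedback loop I'm describing. To be clear, this is my own toy illustration and not anything Google has published: the `Suggester` class, the `result_counts` table, the 0.9 threshold and the minimum of five votes are all assumptions I made up, and the "probability" is just a smoothed ratio of yes clicks to total clicks.

```python
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def neighbors(word):
    """All strings one edit away: transposition, deletion, substitution, insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    drops = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    adds = {a + c + b for a, b in splits for c in ALPHABET}
    return (swaps | drops | subs | adds) - {word}


class Suggester:
    """Toy 'did you mean' filter that learns from clicks (hypothetical design)."""

    def __init__(self, result_counts, auto_threshold=0.9, min_votes=5):
        self.result_counts = result_counts  # assumed table: query -> number of results
        self.yes = defaultdict(int)         # (typed, suggested) -> clicks on the suggestion
        self.no = defaultdict(int)          # (typed, suggested) -> clicks on something else
        self.auto_threshold = auto_threshold
        self.min_votes = min_votes

    def suggest(self, typed):
        """Return the nearby word with the most results, if it beats the typed word."""
        known = [w for w in neighbors(typed) if w in self.result_counts]
        best = max(known, key=self.result_counts.get, default=None)
        if best is None or self.result_counts[best] <= self.result_counts.get(typed, 0):
            return None
        return best

    def record(self, typed, suggested, accepted):
        """Count one click: accepted=True means the user took the suggestion."""
        table = self.yes if accepted else self.no
        table[(typed, suggested)] += 1

    def probability(self, typed, suggested):
        """Laplace-smoothed estimate that `typed` really means `suggested`."""
        y = self.yes[(typed, suggested)]
        n = self.no[(typed, suggested)]
        return (y + 1) / (y + n + 2)

    def should_auto_substitute(self, typed, suggested):
        """Substitute automatically once there are enough votes and they mostly say yes."""
        votes = self.yes[(typed, suggested)] + self.no[(typed, suggested)]
        return (votes >= self.min_votes
                and self.probability(typed, suggested) >= self.auto_threshold)
```

Feed it something like `Suggester({"recieve": 3, "receive": 5000})` and `suggest("recieve")` comes back with `"receive"`; every call to `record(...)` then nudges the estimate up or down. A real system would presumably work on whole queries and far richer signals, but the yes/no bookkeeping is the part I was getting at.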
Naming systems in computer databases are pretty damned inflexible, really, and not just for people who case their name differently; a lot of the computer systems assume you have a Western-style name, with a given name, a middle name, and a surname, and maybe a title and a suffix. Spanish/Portuguese-style names, for instance, with more than one given name and more than one surname, are not properly reflected in most database structures. Arabic names are similarly problematic, with one given name and one surname, but multiple patronymics. As database structures tend to endure unless there's something seriously wrong with them, I anticipate that this is a problem that will last for quite a while, sadly.
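For what it's worth, a more forgiving structure isn't hard to sketch. The Python below is only an illustration of the idea, under my own assumptions: the `PersonName` class and its field names are hypothetical, and the sample names are made up for the example. The point is to store the name exactly as the person writes it, and treat the broken-out parts as optional lists rather than fixed given/middle/surname slots.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PersonName:
    """Hypothetical name record that does not assume a Western-style layout."""

    full_name: str  # exactly as the person writes it, casing preserved
    given_names: List[str] = field(default_factory=list)   # zero or more
    surnames: List[str] = field(default_factory=list)      # zero or more
    patronymics: List[str] = field(default_factory=list)   # zero or more
    title: str = ""
    suffix: str = ""

    def display(self):
        """Always show the name the way its owner wrote it, never re-cased."""
        return self.full_name


# Illustrative, made-up records:
two_surnames = PersonName(
    full_name="María José Pérez García",
    given_names=["María", "José"],
    surnames=["Pérez", "García"],
)
with_patronymics = PersonName(
    full_name="Ahmad ibn Khalid ibn Yusuf al-Masri",
    given_names=["Ahmad"],
    patronymics=["ibn Khalid", "ibn Yusuf"],
    surnames=["al-Masri"],
)
lowercase_by_choice = PersonName(full_name="jane q. example")
```

Nothing here forces a middle name to exist or a surname to be singular, and `display()` never touches the casing. Retrofitting that onto an existing schema is the hard part, which is rather my point.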