When I'm creating words for a conlang's lexicon, I often want to know which existing words sound similar, so that I'm happy with any potential confusion or similarity. I used to do this by scanning my dictionary manually, but for any non-trivial number of words that quickly becomes impractical. You can look for words beginning with the same or similar sounds, of course, but spotting a potential near match involves more than just a shared initial sound.
What I've been doing recently is using a simple spreadsheet, built in OpenOffice Calc, that uses Levenshtein distances to find words with similar spellings. The Levenshtein distance between two written words is the minimum number of the following operations needed to turn one word into the other:
1. insertion of a single character
2. deletion of a single character
3. replacement of a single character
The Wikipedia article describes it in more detail: http://en.wikipedia.org/wiki/Levenshtein_distance
So, to borrow the example from the Wikipedia article, the Levenshtein distance from "kitten" to "sitting" is 3, because:
kitten -> kittin (replace e with i) -> sittin (replace k with s) -> sitting (insert g)
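For anyone curious what the algorithm looks like, the standard dynamic-programming version is short enough to sketch. This is just an equivalent illustration in Python, not the VBA code I actually used:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and replacements needed to turn string a into string b."""
    # prev[j] holds the distance from the first i-1 chars of a
    # to the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement (or match)
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # → 3
```

Only two rows of the table are kept at a time, which is also roughly how the spreadsheet implementations on the wikibooks page work.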
Obviously, in order for this to work, your orthography needs to be mostly phonemic and have a 1 letter = 1 phoneme correspondence. If that isn't true then the distance may not be a good measure of how similar two words sound. For example, Levenshtein distance would be a fairly poor way to measure similarity of sound for English words, because there are a number of ways of representing some sounds, many of which use more than one letter. It may also be more difficult to apply in very synthetic conlangs.
It also, of course, doesn't take into account that some sounds are more similar to each other than others; e.g. b and v are closer than b and s. One way to handle this would be to transform each word into a concise textual representation of its phoneme features, so that b -> v becomes a single change (stop to fricative) while b -> s becomes two (stop to fricative, plus bilabial to alveolar). Alternatively, the algorithm itself could be modified to penalise substitutions differently depending on exactly which letter turns into which. But I think that's a level of complexity that isn't needed as a first approximation.
In terms of implementation, there are a number of existing ones for different programming languages here: http://en.wikibooks.org/wiki/Algorithm_ ... n_distance
All I did was copy the Visual Basic for Applications implementation into an OpenOffice Calc spreadsheet (which, surprisingly, turned out to be 100% compatible without any modification, although it looks like someone's changed the code since then), and then write a few formulas to do the following:
1. compute the distance between a given word and a list of all other existing words
2. sort all the existing words (and their glosses) in order of increasing Levenshtein distance, so that the words with the most similar spellings float to the top of the list
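In Python terms, those two steps amount to something like the following. The lexicon, words, and glosses here are all invented for the example; the spreadsheet does the same thing with a distance column and a sort:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # replacement
        prev = curr
    return prev[len(b)]

# Toy lexicon: word -> gloss (both made up for this example).
lexicon = {"kanu": "water", "kano": "river", "tanu": "stone", "miri": "bird"}

def similar_words(new_word: str, lexicon: dict) -> list:
    """Return (word, gloss) pairs sorted by distance to new_word,
    so the most similar spellings float to the top."""
    return sorted(lexicon.items(),
                  key=lambda entry: levenshtein(new_word, entry[0]))

for word, gloss in similar_words("kanu", lexicon):
    print(word, gloss)  # closest spellings printed first
```

Checking a candidate word is then just a matter of eyeballing the top few rows of the sorted list.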
Once I've tidied it up, I can post my little spreadsheet here as an example. I'm not sure whether it will work unmodified if I resave it in Excel format, but I can test that on a Windows computer later. In any case, I don't think adapting it should be hard.
So, two questions:
1. Do other people use Levenshtein/edit distances or other similar measures to find similar words in their conlangs?
2. Is anyone interested in a simple spreadsheet that takes a list of words and a new word, and shows you which existing words have a similar spelling?