NLP Focus: Correcting Spelling for Better Insights

There are two main types of marketing research: quantitative and qualitative. The former allows for the quick and efficient analysis of thousands or even millions of responses but is limited to the pre-determined answer choices. The latter offers insights into product aspects that you, as a researcher, may not have even thought of, yet it requires a lot of time to analyze and is limited by the number of responses. What if you could leverage data science to get the best of both worlds?

Social media analysis is one type of qualitative research that is being streamlined by data science methods, and more specifically by natural language processing (NLP). As discussed in a previous post, many insights can be gleaned from social media feedback from customers. However, all of these insights are limited by the spelling errors in the input.

Spelling errors present one of the major challenges in NLP analyses, as they prevent data from being standardized and thus aggregated. Since text is unstructured data, cleaning it is often a challenging task. Thankfully, there are Python libraries that can help with text cleaning, such as SymSpell and pyspellchecker. These libraries can be used to correct text gathered from social media, for example, in order to improve the insight-gathering stage.
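
As a quick illustration, here is a minimal sketch using the pyspellchecker library; the sample sentence is invented, and the exact corrections depend on the library version and its built-in dictionary.

```python
# Minimal sketch using pyspellchecker (pip install pyspellchecker).
from spellchecker import SpellChecker

spell = SpellChecker()  # loads a built-in English word-frequency dictionary

text = "I realy love this prodct but the delivry was slow"
tokens = text.split()

# Words that are not found in the dictionary are flagged as misspelled
misspelled = spell.unknown(tokens)

# Replace each misspelled word with its most likely correction
corrected = [(spell.correction(t) or t) if t.lower() in misspelled else t
             for t in tokens]
print(" ".join(corrected))
# e.g. "I really love this product but the delivery was slow"
```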

The Spell Check Process

Regardless of the library, they all work in the same way. First, each word in the input text is compared against a pre-specified word dictionary. If the word is not present in the dictionary, the algorithm determines which dictionary words the input word is most similar to. Similarity is often based on the number of permutations (edit operations) needed to change the input word into the dictionary word, where a permutation is any single letter insertion, deletion, replacement, or transposition. For example, if you type “ber”, all of the following dictionary words have a distance of 1 from the input word: bar, beer, bear. The first candidate requires one replacement, while the last two each require one insertion.
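
To make this concrete, here is a self-contained sketch of how distance-1 candidates can be generated; the function name edits1 and the tiny dictionary are illustrative and not taken from any particular library.

```python
# Generate every string one edit away from a word (insertion, deletion,
# replacement, transposition), then keep only candidates that exist in
# the dictionary.
import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

dictionary = {"bar", "beer", "bear", "berry"}
candidates = edits1("ber") & dictionary
print(candidates)  # {'bar', 'beer', 'bear'} -- "berry" is two edits away
```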

Having gathered all of the dictionary words closest to the input word, the algorithm then picks the one with the highest frequency in the English language. The idea is that the word used most often in English is likely the word the user meant to type. The number of candidate dictionary words depends entirely on the maximum number of permutations the analyst allows per word. Generally, this maximum is set to either one or two, as allowing more permutations can completely change the meaning of the word. For example, at 2 permutations, “ber” could become “bay”, and at 3 permutations, “ber” could become “sky”. It all comes down to which candidate has the highest word frequency in the given language.
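
Continuing the sketch above, the frequency-based selection step can be as simple as the following; the counts are invented purely for illustration.

```python
# Pick the candidate with the highest frequency in a word-frequency table.
# The counts below are invented for illustration.
word_frequency = {"bar": 12_000, "beer": 18_000, "bear": 9_000}

candidates = {"bar", "beer", "bear"}  # distance-1 candidates for "ber"
best = max(candidates, key=lambda w: word_frequency.get(w, 0))
print(best)  # 'beer' -- the most frequent candidate wins, for better or worse
```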

Always Room for Improvement

One drawback of these algorithms is that they do not take context into account. This could mean that:

As Bob walked through the forest, he encountered a ber.

could easily turn into

As Bob walked through the forest, he encountered a beer.

instead of “…a bear”, as the writer most likely meant. While this may be easy for a human reader to guess, it is a much more difficult task for a machine. If you are analyzing a specific context, one way to get around this drawback is to create a word-frequency dictionary based on typical texts from that context. This would result in “bear” having a higher frequency in your model, even if “beer” is the more frequent word in general English.
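
Building on the earlier sketch, the workaround might look like this; the wildlife_corpus string stands in for whatever domain texts you have collected.

```python
# Build the word-frequency table from domain-specific text instead of
# general English. The corpus below is a made-up stand-in.
from collections import Counter
import re

wildlife_corpus = """
The bear crossed the river. A brown bear was spotted near the camp.
Hikers should store food away from any bear in the forest.
"""

# Count word occurrences in the domain corpus
word_frequency = Counter(re.findall(r"[a-z]+", wildlife_corpus.lower()))

candidates = {"bar", "beer", "bear"}  # distance-1 candidates for "ber"
best = max(candidates, key=lambda w: word_frequency.get(w, 0))
print(best)  # 'bear' -- the domain corpus makes "bear" the most frequent candidate
```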

While this is not a bad workaround, one way to improve these spell checkers would be to build a word-frequency matrix that estimates the likelihood that a particular dictionary word is the correct one based on the context of all of the other words in that sentence. There is always room for improvement 😉
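
A very rough sketch of that idea is shown below: each candidate is scored first by how often it follows a nearby context word in a bigram table, then by overall frequency. The bigram counts are invented, and a real implementation would estimate them from a large corpus or use a language model.

```python
# Score candidates by context (bigram counts) first, overall frequency second.
# All counts are invented for illustration.
bigram_counts = {
    ("encountered", "bear"): 350,
    ("encountered", "beer"): 20,
}
word_frequency = {"bar": 12_000, "beer": 18_000, "bear": 9_000}

def score(context_word, candidate):
    # Tuple comparison: context evidence wins; frequency breaks ties
    return (bigram_counts.get((context_word, candidate), 0),
            word_frequency.get(candidate, 0))

candidates = {"bar", "beer", "bear"}
context_word = "encountered"  # nearby content word in "...he encountered a ber."
best = max(candidates, key=lambda c: score(context_word, c))
print(best)  # 'bear' -- context outweighs general-English frequency
```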

Once all the words are corrected, different types of statistics and analyses, such as an entity sentiment analysis, can be run on the collection of documents to synthesize insights from customer feedback.

In what other situations could spell checking algorithms be useful?
