Text Mining in Excel
Count Words to Find Hidden Patterns in Text Data
A healthcare company's service reps entered notes about why insurance claims were denied. This text data, stored in a mainframe computer, was a rich source of information about the cause of denied claims, but extracting the data was considered too difficult because there were over 35,000 individual notes about individual cases. Every note was somewhat different depending on the rep who did the analysis. Analyzing it by hand would have taken weeks. This is where text mining comes to the rescue.
You're already familiar with various forms of Text Mining: search engines, spam filters, spelling and grammar checking and language translation. Text mining can also be useful for Six Sigma.
Companies use data mining to explore large amounts of data to discover meaningful patterns and rules. Text mining, a subset of data mining, explores the written word to find hidden patterns. By some estimate, stored text contains as much as 80 percent of the key information needed for business improvement. What are some sources of text data for mining?
- Service rep or help desk comments
- Customer complaints
- Transcriptions of recorded voice such as doctor's notes
- News stories
Text data, by its very nature, is unstructured and varies from person to person. So categorizing the text becomes difficult. In the healthcare denied claims example, reps spelled Medicare in various ways: MDCR, MedCR, Medicr, and so on. This variability makes text analysis more complex and challenging than plain numbers.
There are several types of text analysis:
- Root cause analysis
- Sentiment analysis to understand attitudes
- Clustering of documents (e.g., affinity diagram)
Root Cause Analysis
The first flavor of text mining attempts to look at the root cause of comments, a key method for Six Sigma. The goal is to summarize the various words and phrases to see if there's a Pareto pattern (80/20) to the text data. To do this, you will need software to parse the text into words and phrases.
QI Macros Word Count tool will handle this quite easily. It may take awhile if there's lots of records, but it's much faster than trying to do it by hand or using Excel's COUNTIF statement. Most help desk software allows an export to Excel, so it was easy to get the text file. Here's a subset of the 35,000 healthcare denied claims comments:
To use the Word Count tool, simply select the data then click on the QI Macros Menu and and choose Word Count from the Data Mining sub menu.:
The Word Count tool will parse the selected text into words and two-word phrases, then use Excel's PivotTable to summarize the frequency of phrases and sort them in descending order:
Word Count strips out common English words such as "the" or "and". What's left is a Pareto pattern.
Most common words: DUP and DOS (which stands for Date Of Service).
Most common phrases: DUP DOS, multiple visits, visits same day.
Ah Ha! Insurance companies were rejecting claims because they had an overlapping (duplicate) date of service. This led the healthcare company to examine these rejects and they discovered that the patients were admitted from the Emergency Department resulting in one claim for the ED and one for the nursing unit, rather than one claim for both. Procedures were changed and this type of denied claim vanished saving millions of dollars and endless hours of rework.
Text mining can use more complex natural language methods, but Word Count's "bag of words" approach will serve most improvement teams. Text mining is a powerful tool for Six Sigma and the QI Macros Word Count tool helps make it simple.