A data scientist can be defined as a “person employed to analyze and interpret complex digital data, especially in order to assist a business in its decision-making.”
Yet too often data scientists fail to make truly useful the most complex and valuable digital data of all: the words of your customers. Why is that?
Textual data such as focus group transcripts and open-ended survey questions have long been a staple in most companies’ market and product research processes. So you’d think the proliferation of textual data created by social media mentions, online reviews, and online surveys would be a huge boon for companies in regard to good decision-making.
At long last, a chance to listen to your customers at scale!
But think back…
When was the last time you or your company made an important decision based on a rigorous analysis of textual data?
How often do data scientists give you reports with a fancy analysis of every quantitative variable under the sun while leaving you to read through hundreds or even thousands of open-ended survey results or internet comments on your own? Even if you did glean a key insight from all that work, how did you justify your position without those fancy graphs backing you up?
Or, are you at a more cutting-edge company whose data scientists employ natural language processing (NLP)? In that case, you’ve probably experienced receiving a sentiment analysis of all that textual data showing you the portion of comments skewing positive vs negative. But was that really useful in figuring out why customers had a positive or negative impression and making decisions based on that information?
While there are promising developments in the works at the bleeding edge of natural language processing, the truth is that most organizations are not making full use of the textual data they collect, because most data scientists are neither experienced at nor trained in extracting the maximum possible value from textual data.
Because of this, those in the field of data science are liable to discount the value of textual data. We like things that are easily measured and quantified. If you can’t do significance testing on it, it might as well not exist to us.
But the real secret is that textual data can be measured, quantified, and put through significance testing, and when it is, it can be the most powerful source of counter-intuitive insights and blind-spot warnings.
The reason it’s not used this way more commonly is that it’s difficult to measure and quantify. It takes a data scientist who is talented not only in math and coding but also in quickly extracting essential meaning from words.
Imagine for a moment that you’re working at an organization growing their online presence and offerings. What would you need to know to maximize your chances of success?
You probably know the benefits your customers traditionally derive from your products. But what value do they derive when you combine the benefits of your products and services with the benefits of the internet? What should you prioritize when shifting online?
To start, you might want to figure out what different customer segments value generally from the internet so you can incorporate those home-run benefits into your online strategy from the beginning. This is the perfect opportunity for textual analysis.
As an example, let’s look at a light textual analysis of an internet attitudes survey from Pew Research. This data comes from the Internet Attitudes Survey conducted in January of 2018. Respondents were asked:
“Overall, when you add up all the advantages and disadvantages of the internet, would you say the internet has mostly been [ROTATE: (a GOOD thing) or (a BAD thing)] for society?”
If they answered that the internet has been mostly a good thing for society, they were asked “What is the main reason you think the internet has been a good thing for society?”
This analysis looks exclusively at the meaningful answers (don’t know/refused excluded) to this open-ended question in an attempt to get at the heart of why America values the internet.
At first glance, you might look at this analysis and assume so many insights could not possibly have come from the answers to one single question. But that is the power of qualitative analysis. We can extract near-endless amounts of value from a single question given hundreds to thousands of responses.
Python code used for this analysis: GitHub
Dataset: “Jan. 3-10, 2018 - Core Trends Survey.” Pew Research Center, Washington, D.C. (April 30, 2018) https://www.pewresearch.org/internet/dataset/jan-3-10-2018-core-trends-survey/
There were 1,322 meaningful responses to this question which were split up into four major themes: information, connection, economic, and tool.
Information includes any mentions of different types of information, actions taken with information, and the benefits the internet brings to those information types and actions.
Connection includes references to being connected with others, communicating with others, or using social media.
Economic refers to mentions of using the internet for work or entrepreneurial purposes, such as finding a job, accomplishing work tasks more efficiently, or selling goods or services online.
Tool includes specific references to practical tasks accomplished outside of work or entrepreneurial activities, such as using a GPS, making purchases online, or using translation tools.
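To make the coding step concrete, here is a minimal sketch of how responses might be tagged with the four themes using keyword matching. The keyword lists, the `tag_themes` helper, and the sample responses are all hypothetical and invented for illustration; they are not the codebook used in this analysis, and real coding usually needs manual review on top of any keyword pass.

```python
import re

# Hypothetical keyword stems per theme (illustrative only, not the actual codebook).
THEME_KEYWORDS = {
    "information": ["information", "knowledge", "learn", "research", "news", "google"],
    "connection": ["connect", "communicat", "family", "friends", "social media"],
    "economic": ["job", "work", "business", "sell"],
    "tool": ["gps", "shopping", "purchase", "translat", "banking"],
}


def tag_themes(response):
    """Return a 0/1 indicator per theme for one open-ended response."""
    text = response.lower()
    return {
        # A theme is flagged if any of its keyword stems starts a word in the text.
        theme: int(any(re.search(r"\b" + re.escape(kw), text) for kw in keywords))
        for theme, keywords in THEME_KEYWORDS.items()
    }


# Made-up responses, not taken from the survey data.
responses = [
    "Access to information at your fingertips",
    "I can stay connected with family and friends",
    "It helped me find a job and sell things online",
]
coded = [tag_themes(r) for r in responses]
for response, flags in zip(responses, coded):
    print(response, "->", [theme for theme, hit in flags.items() if hit])
```

Coding each response as a set of 0/1 indicators, rather than a single label, lets one answer count toward several themes and makes the later correlation and significance tests straightforward.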
If you’re a company with a specific customer segment, your first question might be, are there any significant differences in how various segments of society value these different aspects of the internet?
To find out, we can start with a heat map showing how strongly different themes correlate with various demographic characteristics. We might notice, for example, that language seems to have a relatively high correlation with the economic and information themes and explore that more specifically.
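As a rough sketch of what sits behind such a heat map, the snippet below computes Pearson correlations between 0/1 theme indicators and 0/1 demographic indicators. The column names (`spanish_interview`, `female`) and the toy values are assumptions made up for illustration, not figures from the survey.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


# Toy 0/1 indicators, one entry per respondent (invented for illustration).
themes = {
    "information": [1, 1, 0, 1, 0, 1],
    "economic": [0, 0, 1, 0, 1, 0],
}
demographics = {
    "spanish_interview": [0, 0, 1, 0, 1, 0],
    "female": [1, 0, 1, 0, 1, 0],
}

# Rows are themes, columns are demographics: the grid a heat map would color.
matrix = {
    theme: {demo: pearson(t_vals, d_vals) for demo, d_vals in demographics.items()}
    for theme, t_vals in themes.items()
}
for theme, row in matrix.items():
    print(theme, {demo: round(r, 2) for demo, r in row.items()})
```

The resulting grid can be handed to any plotting library for coloring. A correlation here is only a prompt for a closer look, which is why the next step is a formal test.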
When we take a deeper dive and compare the results of surveys conducted in English vs Spanish using t-tests, we do find statistically significant differences in the information, economic, and tool themes. It seems those who prefer taking surveys in English value information at a higher rate and those who prefer taking surveys in Spanish value economic and tool-based benefits at a higher rate.
If you are a B2B company that specializes in making work or entrepreneurial activities more efficient, you might consider making sure your products and services work as well in Spanish as they do in English in order to avoid missing out on this important customer segment.
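For readers who want to see the mechanics, here is a minimal sketch of a two-group comparison on a 0/1 theme indicator. It computes Welch's t statistic and, as a simplifying assumption, takes the p-value from the normal distribution, which is reasonable only for large samples; a statistics library would use the t distribution instead. The function name `welch_t` and the group values are invented for illustration and are not results from the survey.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic with a large-sample (normal) two-sided p-value."""
    n_a, n_b = len(group_a), len(group_b)
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    t = (mean(group_a) - mean(group_b)) / se
    # Assumption: with large groups the t distribution is close to normal.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Toy data: 1 = respondent mentioned the theme, 0 = did not (invented values).
english = [1] * 70 + [0] * 30
spanish = [1] * 40 + [0] * 60

t_stat, p_value = welch_t(english, spanish)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
```

With hundreds of responses per group the normal approximation changes little, but for small segments the exact t distribution is the safer choice.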
But let’s say you’re a company moving to the internet that specializes in catering to either men or women. By looking at the major themes, we’d conclude there are no statistically significant differences in why men and women value the internet. But is that the end of the story?
Maybe not. A main benefit of qualitative analysis with large datasets is that we can dive much deeper into the data to get a granular view. For example, look at how many sub-themes we can split the information theme into, all while still having a large sample size of 837 responses.
When we look at the heat map of these granular categories, we see there are indeed some stronger correlations with gender.
In fact, there are four sub-themes that have statistically significant differences according to their respective t-tests.
Men value: general information and sharing information (meaning broad distribution of information on a societal level, not the communication and sharing of information on the individual level included in the connection main theme).
Women value: educational content and research. General learning and search are included in separate categories in order to make formal educational content and academic research more distinctive categories. These are the categories disproportionately favored by women.
Demographic differences are not the only interesting differences we can find in our granular sub-themes. We can also group and compare similar sub-themes to see what Americans in general comparatively value at statistically significant levels.
The most obvious example of this is in information types. Respondents mentioned general information at a very high rate. Specific references to the word “knowledge,” educational content, and news came up at a lower rate, with no statistically significant differences found among those three.
When we look at the actions taken with information, things get more interesting. Though research and general learning have no statistically significant difference, there is a clear difference in preference among accessing information generally, searching for information (especially googling it), and researching or learning.
When it comes to the ways the internet adds benefit to these information types and actions, there were no statistically significant differences among ease, speed, and quantity of information available. There was, however, a statistically significant difference between those categories and the quality of information, which specifically includes the diversity of information available.
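Comparing two sub-themes among the same respondents is a slightly different setup from comparing two demographic groups, because each person contributes a value to both sub-themes. One way to handle that, assumed here purely for illustration since the exact test used is not spelled out above, is a paired t-test on the per-respondent difference between the two 0/1 indicators. The function name `paired_t`, the sub-theme labels, and the data are invented, and the p-value again relies on a large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_t(indicator_a, indicator_b):
    """Paired t statistic on per-respondent differences, with a normal-approximation p-value."""
    diffs = [a - b for a, b in zip(indicator_a, indicator_b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; the test is undefined")
    t = mean(diffs) / (spread / sqrt(len(diffs)))
    # Assumption: sample is large enough for the normal approximation.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Toy data for 100 respondents (invented): 30 mention ease, 10 mention quality.
ease = [1] * 30 + [0] * 70
quality = [0] * 90 + [1] * 10

t_stat, p_value = paired_t(ease, quality)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A positive t here means the first sub-theme is mentioned more often than the second among the same respondents.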
This has implications for any company moving online. If you specialize in quality, you’re likely courting a smaller segment of society. Whatever your business, it’s important to make sure customers can access your products and services easily and quickly. This sounds simple and obvious, but it is important to keep in mind when designing your online presence because it’s very easy to forget about the user experience. For example, you might prioritize what feels like quality by putting as many features as you can on your website, at the expense of simplicity and ease of use.
Hopefully, by being able to look at the cold, hard numbers and charts, you can keep what’s really most important to your customers at the forefront of your teams’ minds. That is the power of a good textual customer data analysis.