Bright future for the analysis of (big) textual information sources? Dr. Kristof Coussement interviews Tom H.C. Anderson.

Welcome everyone; my name is Dr. Kristof Coussement ([KC]), and I am a Professor in Marketing Analytics at IESEG School of Management (http://www.ieseg.fr) and Academic Director of the MSc in Big Data Analytics for Business (http://www.ieseg.fr/msc-big-data). Today I’m speaking with Tom H. C. Anderson ([THCA]), Managing Partner and Founder of Anderson Analytics, developers of the OdinText text analytics platform. In 2005 Tom’s firm became the first in the market research and consumer insights industry to leverage modern text analytics, and his firm has won several awards from trade organizations such as the American Marketing Association, Advertising Research Foundation and ESOMAR for its groundbreaking work in the text mining field. Anderson Analytics recently launched a new Next Generation Text Analytics platform called OdinText that has been featured in several high-profile industry case studies by clients from Disney to Shell Oil.

Let’s ask Tom’s opinion on several issues related to text analytics.

[KC] Hi Tom, great to have you with us today. Before we get into the details, I think text analytics can mean different things to different people. How would you define it?

[THCA] Good question, that certainly seems to be the case. The more I work with text analytics, the less distinction I make between text analytics specifically and any analytics in general. This is because we are often trying to solve exactly the same problems using almost exactly the same data.

In other words, a company wants to understand what drives customer satisfaction, return behavior or sales up or down, and by how much. An analyst uses available data to answer this question. The difference is that oftentimes there is also text or “unstructured” data available, and if you’re not using that data you will not be able to answer these questions nearly as well. So Text Analytics = Analytics, just as Data Mining = Text Mining. It’s all about pattern recognition, understanding the data and developing a model that can be used to drive a desired outcome.

[KC] Which industries have fully proven their dependence on text mining software, and which ones are on the rise?

[THCA] The industries that used it first are also the most secretive about how it is used, so it’s hard to judge exactly where they are. So here I’m talking to some degree about national security and military intelligence agencies in different countries, especially countries like the USA and China.

Secondly, investment banking/finance were also very early users, trying to predict stock prices. Now obviously if you build a good model to do this you’re probably not going to want to share it with anyone, so you certainly won’t be writing a paper about it or presenting how to do it at a conference.

These two examples are very different uses, of course. In the first case, on the one hand, it could be meta-analysis to help with more general things, like analyzing news to predict unrest, natural disasters etc., which is a fairly straightforward problem. The second use case for intelligence-type organizations, and a far more difficult problem, is to use text analytics to identify individual malicious activity: someone doing something bad and trying to hide it from you. That’s a very different problem, and one that’s a lot harder to solve.

So while these two were first, I would say everyone else, from medical/pharma research to legal discovery and consumer insights/market research (our area of strong expertise), has evolved somewhat separately, and I think to some degree that makes a lot of sense.

I can tell you that no one is at the level you see depicted in Hollywood films, though I’m sure the CIA would like the bad guys to believe that. I think the general public believes that we are at a higher level than we actually are. And many companies, IBM to name just one out of many, try to propagate this myth with products like Watson that are PR-heavy but very light on actual usefulness and not fully finished.

However, on the flip side of this, there are many tools that do an excellent job solving specifically the problem they were meant to solve, even if they are nowhere close to looking anything like Skynet in T2.

Our tool, OdinText, for instance, does a fantastic job doing what it was designed to do: helping marketers understand what drives consumer behavior.

[KC] What are the success factors of a text mining software implementation?

[THCA] So this is an interesting question, and unfortunately I see a lot of companies asking this very question first, rather than asking the questions that should be asked prior to even deciding to implement text analytics. The first questions to ask, ideally in order of importance, are: 1. What is it we need to know in order to make a decision? 2. What available data is likely to give us these answers? And only then, if the answer is that it is some amount of valuable unstructured/text data, would you continue to ask question 3. What software would be best in helping us get these answers from this data? At that point, you’ve answered the ROI question. If you answered questions #2 and #3 correctly, you will have gotten what you need for question #1, and if that’s the case you are probably driving revenue higher if you are a business, or saving lives if you are some other Gov/NGO type organization.

[KC] What are the sources of textual information that are frequently analyzed?

[THCA] So it’s a truism that we are collecting more information now than ever before. Basically, within the next 2 years we will collect more information than we have collected since the beginning of time. Secondly, another truism is that about 80% or more of this information is in unstructured/text format.

Now, not all of this information is necessarily super useful to anyone. Even within a specific data source of importance, it’s often about filtering out the useless info or noise, which is even more common in unstructured data than in structured data.

Now unless you’ve been living in a cave, you’re probably no stranger to social media monitoring. This is a very small portion of the unstructured data that gets A LOT of play in the media. Social media monitoring is mainly about Twitter, which at least in the US is used by only about one tenth of the population (the same one tenth that blogs, or at least with very heavy overlap), and it contains a lot of spam/noise.

There are of course far more data sources, of far more value to businesses, than social media.

Any news story from newspapers, TV programs, radio and the internet, for instance; not just current ones, but historical ones too.

Our software was initially designed for large-scale customer satisfaction surveys, the kind of surveys you get after buying a car, staying in a hotel or visiting a website. But it’s now also being used by companies to analyze emails and telephone calls they get from customers, and smaller surveys they run on an ad hoc basis. It is used for reputation management, to understand how companies are mentioned in the media vis-à-vis their competition, and yes, for social media monitoring as well.

But of course there are more applications for text analytics than I can possibly mention or think of now. Think about all those many areas where text information is generated. Doctor/patient notes for instance, your own emails. It’s endless really…

[KC] How much data is enough for text mining?

[THCA] A frequent question, and one that is hard to answer. The short answer is that text mining is about pattern recognition, so you typically need a certain amount of data for there to be any patterns to recognize. So having big data (assuming it’s valuable data, because not all big data necessarily is IMO) is nice.

That said, if you have very small data, assuming it is very important, text analytics can still offer some payback. For instance, during the last US Presidential campaign we analyzed the debates and compared what Obama said vs. what Romney said. Text analytics tends to quantify this information in a different way than a political commentator watching the debate would. We predicted that Obama did better based on this quantitative analysis.

But rarely would I recommend using text analytics to understand what just 2 people are saying. In terms of ROI, you are probably better off reading such data yourself.

[KC] How does the IT infrastructure need to be adapted to implement a text mining methodology?

[THCA] A lot of text analytics (TA) software, the vast majority actually, is SaaS, so in most cases there is not much technical implementation needed. Basically it comes down to a decision about whether the data should be uploaded manually or refreshed automatically via API calls.

Then there are other questions which are related to the data. Is there PII in the data? Does there need to be? In our case usually there is not, or at least there does not need to be. If there is no PII in the data, or better yet if it is already public information like social media data, then security is not a big concern.

There is also a question of whether or not it is ‘business critical information’. Again, if it is, then chances are you are looking at an internal application; if it is not, then again security and 100% guaranteed uptime are not really needed.

Most SaaS vendors, including our company, offer encryption between your user and our servers. Depending on your data, where these servers are located may also be something you need to decide on. If the data does have PII and you are an EU company, then your vendor should offer servers in Europe. If you are a US company that has government or other semi-sensitive information, you may prefer US-based SaaS servers.

For most data we see having to do with consumer insights/market research, as well as social media data, IT gives it a pass compared with other software that runs business-critical processes with PII data. So implementation is very easy. Nothing to install: an IT/risk manager’s delight.

Then there is the other part of implementation: making sure insights are socialized properly, there is buy-in among key stakeholders, and information is acted upon. This is a more important implementation question in my mind, and one that is case specific and thus harder to answer in a few minutes.

[KC] What are the most important text mining applications nowadays, and which ones are on the rise?

[THCA] Prediction is what we are working on. It’s easier to predict whether someone will buy something than predicting whether someone will blow something up or whether someone is lying about their rating on a site about a product because they are being paid to do so or because they work for that company.

Right now a lot of companies are looking at social media data, because it’s free and just out there, hoping they can use that to predict campaign success, brand equity, sales, stock price etc. The jury is still out on this IMO; we are working on it with a couple of companies right now.

[KC] What will be the role of visualization in the next generation text mining packages?

[THCA] Visualization should be viewed first as an iterative way to quickly explore data, and secondly as a way to help communicate findings. The former is more important for text analytics software. The latter requires customization, and can easily be done in tools like PPT and Excel anyway.

We have some fairly good visualization tools right now. We are hoping to continue to improve them. They are important, but they are not a solution in themselves. The worst thing I see is when a client buys software just because it has a pretty visualization rather than for what it can actually do for them. Then you know they will only be getting very limited value. No visualization will ever answer all your questions. You need an analyst who is thinking about the problem and how to solve it, and answering the question iteratively by trying different hypotheses. Visualizations play a part in this process, but it’s not the most important part.

[KC] How should we see text mining in the future of self-service analytics (software that makes analytics more independent from the IT department)?

[THCA] Certainly, IT people have no training in analysis. It makes little sense for them to be involved. Software belongs in the hands of whoever can use it best.

[KC] Do we need text mining analytics as a necessary complement or substitute in the analytics portfolio?

[THCA] Most definitely. I can tell you that again and again I see that unstructured text data is often the most important source for analytics in many problems, problems that couldn’t have been solved any other way. If you’re working with any decent amount (I’m talking in the thousands) of unstructured or mixed data, and you’re not using text analytics, then I question whether you should be calling yourself an analyst/researcher.

[KC] Thank you very much Tom for this interview!

[THCA] The pleasure is mine!