It may not be obvious, but social network (SN) data has numerous applications that go beyond simple socialization. Besides the voyeuristic and self-promoting aspects, SN data is brimming with fresh, cheap, and accurate targeting information: age, demographics, purchasing habits, buying power, education, brand loyalty, influence, and income, just to name a few.
This is pretty powerful stuff, as the insight that can be gleaned from millions of users posting in near real time could revolutionize the way products are launched and marketing decisions are made. It’s no longer necessary to guess what buzzwords will resonate with users in your next campaign – users are already using those words in their public conversations. There’s no longer a reason to take a spray-and-pray advertising approach in the hope that an ad will be seen by a fraction of the right buyers. Now, you can easily determine where your target population hangs out and pursue them directly.
So with such promise to disrupt the market, why haven’t the big software houses moved into this space yet? Where are products like the Microsoft Social Media Analytics Server or the IBM Social Network BI Aggregator? After all, large-scale data analytics has been around for quite some time. Over the past 20 years, giants like Microsoft, IBM, and Oracle have invested hundreds of millions of dollars in developing enterprise analytics and decision support solutions. Why not adapt their existing platforms to deliver SN analytics as a cloud solution?
The answer to these questions has a lot to do with the problem of low data quality and inconsistency. A close examination of blog, forum, Twitter, or Facebook data reveals a hodgepodge of tidbits of personal information, non-threaded conversations, and poorly typed, spelled, and formatted communications, which renders it virtually useless for structured or unstructured analytics engines.
You may argue that at least some of the SN analytics companies must be doing something right. That may be so, but there is no quantifiable way to gauge how much of their analytics is based on real math and how much is snake oil salesmanship and sleight of hand. Many of the SN analytics providers claim that they have developed patented technology to sort through volume, noise, and poor data quality. Others insist that their “secret sauce” algorithms allow them to calculate engagement, find patterns, and even accurately track memetic propagation. Most of these claims are dubious at best and can’t be verified because we don’t have ground-truth data. But even if you could verify the accuracy of the results, several other major factors come into play:
- Most SN analytics providers don’t harvest their own SN data, and those that do certainly don’t do so in real time. Rather, they subscribe to data scraping services like Compete, comScore, Hitwise, Nielsen, Quantcast, etc. The data harvesters only collect data from a small fraction of the relevant websites, blogs, or forums, and they do so on a schedule that can run as long as two weeks. Obviously, password-protected and membership-only sites are off limits. What you get, then, is a tiny sliver of a weighted sample population that could be weeks old.
- Companies that scrape platforms like Facebook or Twitter do so via the native platform API. Due to system performance concerns, the platforms throttle the amount of data they expose through these APIs. If you are counting on real-time monitoring of any of the social networks, be prepared for very large data gaps and timeouts in your dashboard.
- Algorithms for determining text sentiment, theme, and a writer’s gender, age, and education are only effective on large, well-formatted compositions. They were designed to work on structured essays around 1,000 words long. The likelihood of accurately determining any of these characteristics from a 140-character tweet or a blog post riddled with expressions like LOL or OMG is about as good as a coin toss.
- Even the largest data providers scrape less than 1 percent of relevant Internet data. The analytics you are viewing probably represent information found across no more than a handful of sites, blogs, or forums. Making multi-million dollar advertising decisions based on such low-quality and small data sets is risky.
- Due to the growing availability of automated tools for the creation of blogs, websites, and posts, we are starting to see a significant amount of machine-generated content designed to pump up SEO visibility for adware sites. Data scrapers are unable to distinguish between machine-generated and human-typed content, which can result in skewed analytics.
- Data feeds frequently go through secondary processing before they are presented to users. This additional refinement may include the removal of partial records (i.e., missing dates, user names, etc.) or offensive message content like cursing, pornography, or spam. All this data ‘massaging’ further reduces the population size and the accuracy of the results.
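The “massaging” described in that last item is easy to see in miniature. Here is a minimal sketch of how dropping partial and offensive records shrinks the usable population; the record fields, spam markers, and filter rules are hypothetical, not taken from any actual provider’s pipeline:

```python
# Toy illustration of secondary processing on a scraped SN feed.
# All records, field names, and filter rules below are made up.

raw_records = [
    {"user": "alice", "date": "2010-06-01", "text": "Loved the new phone!"},
    {"user": None,    "date": "2010-06-01", "text": "great deal here"},       # missing user name
    {"user": "bob",   "date": None,         "text": "OMG best ever LOL"},     # missing date
    {"user": "carol", "date": "2010-06-02", "text": "BUY CHEAP PILLS NOW"},   # spam-like
    {"user": "dave",  "date": "2010-06-03", "text": "Mine broke in a week."},
]

SPAM_MARKERS = {"buy", "cheap", "pills"}

def is_complete(record):
    """Reject partial records (missing user name or date)."""
    return record["user"] is not None and record["date"] is not None

def is_clean(record):
    """Reject records containing crude spam markers."""
    words = set(record["text"].lower().split())
    return not (words & SPAM_MARKERS)

usable = [r for r in raw_records if is_complete(r) and is_clean(r)]
retained = len(usable) / len(raw_records)
print(f"{len(usable)} of {len(raw_records)} records survive ({retained:.0%})")
```

Even these two trivial filters discard most of the toy feed, and a real pipeline stacks many more of them, which is exactly why the population behind a polished dashboard can be far smaller than it appears.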
So what is the moral of the story? If you are on a quest for the SN analytics holy grail, you won’t find it, because it all depends on how much YOU are willing to compromise in terms of data sample size, quality, and accuracy.
If you are in the market for an SN analytics tool, don’t take any chances by committing to one solution before doing your homework. Ask the vendor to explain to you, in plain 8th-grade English, how they address the six items mentioned above. Arrange for a trial period with at least three vendors and then compare their analytics against each other using a benchmark and a ground truth known to you. This should give you a sense of each tool’s accuracy and margin of error.
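Such a bake-off can be scored with very little machinery. Below is a hedged sketch: each vendor’s sentiment calls are compared against a ground-truth set you labeled yourself, yielding an accuracy and a rough margin of error. The vendor names, labels, and the choice of a normal-approximation 95% confidence interval are illustrative assumptions, not a prescribed methodology:

```python
# Score hypothetical vendor sentiment calls against your own ground truth.
import math

ground_truth = ["pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "pos"]

vendor_calls = {
    "VendorA": ["pos", "neg", "pos", "pos", "pos", "pos", "neg", "neg", "neg", "pos"],
    "VendorB": ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "pos"],
}

def score(calls, truth):
    """Return (accuracy, 95% margin of error) for one vendor's calls."""
    n = len(truth)
    hits = sum(c == t for c, t in zip(calls, truth))
    p = hits / n
    # Normal-approximation 95% interval on the accuracy estimate.
    moe = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, moe

for name, calls in vendor_calls.items():
    p, moe = score(calls, ground_truth)
    print(f"{name}: accuracy {p:.0%} +/- {moe:.0%}")
```

Note how wide the margin of error is on a ten-item benchmark; in practice you would want a ground-truth set of hundreds of items before trusting a ranking between vendors.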
© Copyright 2010 Yaacov Apelbaum All Rights Reserved