Data Miners: Panning For Statistical Gold

Tyler Graf

September 10, 2012

Every credit card swipe, Google query, and online vehicle registration renewal contributes to the terabytes of digital data zipping along fiber optic cables and soaring through the airwaves towards giant computer servers. Access to this stream of information is useless without the ability to analyze the data, identify patterns, and translate them into decipherable figures. This process is referred to as “data mining”—or “niche analytics”—and its prominence has increased over the past decade as business, finance and government sectors have all accumulated vast deposits of valuable but often unsorted raw data.

The largest such cache is a giant database occupying 23,000 computer servers in Conway, Arkansas, owned by the Acxiom Corporation—a company relatively unknown to the public, but a giant in the information storage game. Acxiom began as Demographics Inc, a data analysis partner of the Democratic Party. Acxiom now holds the world’s largest commercial database on consumers of all political persuasions, and sells that data to banks like Wells Fargo, financial services like E*Trade, and auto giants like Toyota and Ford.

Data miners have become invaluable players in the information economy by transforming reams of digital data held by Acxiom and similar storage companies into interpretable consumer and demographic information for marketers. Data miners are also putting their skills toward predictive analytics for insurance industry and academic clients.

Despite data miners’ increased profile, the modeling tools and statistical software behind the process have often attracted more media attention than the researchers themselves. The open source programming language R, for example, has received favorable coverage in The New York Times and Wired. The latter publication highlighted R’s use by an NYU political science grad student to analyze thousands of WikiLeaks documents. The analysis produced a remarkable visualization of increasing insurgent attacks in Afghanistan.

The introduction of consumer-friendly programs like Google Analytics may serve to introduce data mining and statistical analysis to a broader audience. But there are reasons to be cautious as well. Gerhard Pilcher, a senior scientist at Elder Research, warns that such tools might also fool people into thinking they’ve made a big find when the relationship they’ve identified isn’t actually significant. With more advanced programs, the need for cautious—and professional—analysis only increases.
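A small simulation can make the risk of a false "big find" concrete. The sketch below is not from Elder Research. It assumes a made-up data set of pure random noise (the 20 rows and 200 candidate features are arbitrary illustrative sizes, and `best_chance_find` is a hypothetical helper). It searches the noise for the feature most correlated with a noise target, then checks the same feature against fresh rows that played no part in the search.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_chance_find(n_rows=20, n_features=200, seed=0):
    """Search pure noise for the feature most correlated with a noise target.

    Returns (feature index, correlation on the searched rows,
    correlation on held-out rows that were not used in the search).
    """
    rng = random.Random(seed)
    # Generate twice n_rows: the first half is searched, the second held out.
    target = [rng.gauss(0, 1) for _ in range(2 * n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(2 * n_rows)]
                for _ in range(n_features)]

    searched = [pearson(f[:n_rows], target[:n_rows]) for f in features]
    best = max(range(n_features), key=lambda i: abs(searched[i]))
    held_out = pearson(features[best][n_rows:], target[n_rows:])
    return best, searched[best], held_out


best, searched_r, held_out_r = best_chance_find()
print(f"feature {best}: r = {searched_r:+.2f} when searched, "
      f"{held_out_r:+.2f} on held-out rows")
```

Because every column is noise, any strong correlation the search turns up is an accident of looking at 200 candidates. Trying other `seed` values should show the same pattern: the searched figure looks impressive, while the held-out figure will usually fall back toward zero.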

“The danger with some of the (data mining) tools is they do too much for you,” says Pilcher. “But if you don’t know what’s going on under the hood, then you could be fooled into thinking you have a good model.”

In other words: your high school calculus teacher was right to make you show the work instead of relying on a nifty graphing calculator. Results independent of carefully structured analysis have little to no value. Pilcher stresses that proper analysis should also employ different algorithmic models and test against different data samples. Not unlike the world of crowdsourcing, the “power of many gives you a better answer than a single model.”
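Pilcher's two recommendations, using several models and testing against different data samples, can be sketched in a few lines. This is an illustrative example under stated assumptions, not Elder Research's method: the data is synthetic (a noisy straight line), the three toy models and the names `fit_mean`, `fit_line`, `fit_nearest`, and `cross_validate` are hypothetical, and the "power of many" is approximated by simply averaging the models' predictions.

```python
import random


def fit_mean(train):
    """Baseline model: always predict the average training outcome."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg


def fit_line(train):
    """Least-squares straight line through the training points."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def fit_nearest(train):
    """Nearest neighbor: copy the outcome of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def mse(predict, rows):
    """Mean squared error of a prediction function on (x, y) rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


def cross_validate(rows, fitters, k=5):
    """Average test error over k folds for each model and for their average."""
    folds = [rows[i::k] for i in range(k)]
    totals = {name: 0.0 for name in fitters}
    totals["ensemble"] = 0.0
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        models = {name: fit(train) for name, fit in fitters.items()}

        def ensemble(x, models=models):
            return sum(m(x) for m in models.values()) / len(models)

        for name, model in models.items():
            totals[name] += mse(model, test) / k
        totals["ensemble"] += mse(ensemble, test) / k
    return totals


rng = random.Random(1)
# Synthetic data (an assumption for illustration): y = 2x plus noise.
rows = [(x, 2 * x + rng.gauss(0, 1))
        for x in (rng.uniform(0, 10) for _ in range(100))]
scores = cross_validate(
    rows, {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest})
for name, score in scores.items():
    print(f"{name:>8}: {score:.3f}")
```

Each model is scored only on rows it never saw during fitting, which is the point of testing against different samples. With squared error, the averaged prediction can never score worse than the average of the individual models' scores, though a single well-suited model (here, the line) may still beat it.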

Data analysts at Elder Research have compiled a top ten list of data mining mistakes common in business. Topping the list is “failure to define an objective,” which should be distinguished from failure to develop a hypothesis. Pilcher emphasizes that data miners shy away from hypothesizing about information in order to maximize the objectivity of the results.

Some other common mistakes include the following:

  • “Starting too big.” An objective should be narrow and easily defined.
  • Lack of support from the “keepers of the data.” Forging relationships with information owners is a necessity.
  • “Waiting for perfect data.” Sometimes the best data available is what’s in front of you. Make it work.
  • Believing you have “perfect data.”
  • Relying “too heavily” on software alone to analyze information.
  • Not including the “domain subject matter experts” in research and analysis.
  • Failing to understand the different levels and types of analytics.
  • Not planning for deployment. Information is only valuable if it can be used.
  • “Rushing the process.”

Elder has also compiled a more general list of data mining mistakes outlining common pitfalls.

Data mining can clearly be an invaluable tool for businesses. But any procedure heavily reliant on complicated mathematics won’t produce easy answers. Pilcher says he has a 95 percent success rate in finding technical answers and a 70 percent success rate in translating those to a beneficial outcome for customers. Data miners are ultimately only as effective as the data sets they analyze. As with many things, it comes down to the quality of the ingredients. The better the info, the better the results.
