Math is the Secret to Big Data

VentureBeat had an article the other day discussing how math is the real secret to unlocking “Big Data”. I agree, and in fact would go further and agree with the oft-quoted notions that math is the universal language and the ultimate truth. It’s hard to argue against a concept if it has a mathematical underpinning, particularly if a formal proof has been produced.

The intersection of math and “Big Data” lies in the algorithms: first, those used to explore the data looking for patterns (either as commonalities or as outliers) that come to represent something new we’ve learned about the data, a new insight; and second, those developed to take advantage of this new insight. In the ‘old days’, we relied on the intuition of domain experts to tell us where to look in the data to find these ‘insights’.

We’ve always had data, and it has always had the potential to be ‘big’. What’s different today is that we have the ability (physically and financially) to collect that data to the point that its potential to be big is realized, and that the technology and tools to process and query these large amounts of data are readily accessible to us. If we have the appropriate algorithms and appropriately large data sets against which to run them, the intuition of domain experts is less of a necessity.

The author of the article, Narinder Singh, recounts that a well-known professor once said to him, “Big data is misnamed in our (academic) world, because data sets have always been big. What is different is that we now have the technology to simply run every scenario. Before, intuition was critical as you could otherwise spend months chasing a concept. Now, set up correctly, we can just run or solve the model like an equation.” I’d like to meet this professor; she sounds quite smart.

Math is the universal language. It crosses all domains. If we have enough of the right data, have leveraged our access to domain expertise appropriately, and have used math to create algorithms that abstract the problems at hand away from the domain, then the available expertise can be redeployed: we now have a definition of our problem that no longer relies deeply on the domain or its expertise.

Two additional quotes from the article:

[…] Math is to data what abstraction layers are to software development. With it, we can translate physics and energy for the International Space Station, drug discovery, DNA search efficiency, finding the tomb of Genghis Khan, and nearly everything else into a common vocabulary for problem statements.

[…] Once the math is done, domain knowledge is not as important; it allows for skills to be leveraged across industries/domains, and empowers broader use of core technologies to help in the optimization and solving of key questions. It creates an API to our problem.

Big Data is Still Data

Yesterday I wrote a post entitled “Big Data’s Little Secret”. In that post, I noted my belief that the ‘fear of missing out’ has become concomitant with the ever-quickening pace of technology’s advance. I further opined that, because of this fear, there is increasing potential that we’ll lose sight of the fact (if we haven’t already) that “Big Data” is still Data.

We’ve accumulated many, many years of expertise and understanding in the domains of Data and Data Management. We understand backup and recovery and replication. Specialties and specialists have arisen from the sub-domains of data governance as well as security and auditing. We recognize the business’ need for data when we consider its availability and do disaster preparedness planning. There are so many facets of data and its management that I doubt I could list more than a very small percentage of them, yet I am hard pressed to come up with a single example which applies to what we classically call Data and not to “Big Data”.

It must have been serendipity, then, that I came across this Infographic at the IBM Big Data Hub.

From the Infographic:

There are certain things that cannot be overlooked when dealing with data. Best practices must be instituted for the care of big data just as they have long been in small data. […]

Enjoy …

Infographic: Taming Big Data - From IBM

Big Data’s Little Secret

In a recent Forbes magazine article, Howard Baldwin took an opportunity to whack the reader with an “obvious stick”. In this age of the ever-accelerating freight train that is Moore’s Law, it’s becoming easier every day to succumb to the fear of missing out. Nowhere is this more prevalent than with the topic of Big Data. It is my fear that once we forget that “Big Data” is, at its core (and center and surface for that matter), our long-familiar friend data, albeit with a new hairstyle, we face not the possibility, but rather the likelihood of repeating data disasters of the past. We suffered for too many years and endured too much pain and anguish to allow a bit of buzzword excitement to blind us to reality or the obvious. We cannot afford to forget the data lessons of decades gone by.

Lest I be misunderstood, I am by no means advocating putting on the brakes and calling for an international summit and standards organization to be formed around Big Data before we continue on our merry way. I am saying, though, that if we back-burner any of the controls we currently have in place for the sake of expediency in this brave new Big Data world, we should do so with our eyes wide open and with a clear understanding of what controls we’re relaxing and why.

From the article:

Big data doesn’t make data management easier. It makes it harder. Companies that have had a difficult time mastering structured data aren’t going to magically master unstructured data. There are little stumbling blocks such as taxonomies, consistency, hierarchies, and so on that have always made getting to a single source of truth a challenge. Is it a zip code or a postal code? Is it a car, a truck, or a vehicle?

Without applying some rules, you could end up being more confused, with data that’s less reliable and less trustworthy than before. My advice: don’t start tackling big data unless you’re really confident that you’ve mastered data of any size.

Lastly, to Baldwin’s point, if you’ve not mastered ‘small data’, or at the very least gone and gotten the T-shirt, and you are still heading down the ‘Big Data’ trail with guns-a-blazin’, I wish you luck and invite you to give me a call when you’re mid-tunnel and discover that the light you’ve been heading toward is in fact an oncoming locomotive and not daylight as you had hoped.

The article mentioned can be found at: Big Data’s Little Details – Forbes