Why Big Data 101?

While the following events did not occur verbatim, they did happen in spirit.


At a recent family event I had the opportunity to talk with my younger brother. He graduated from the University of Wisconsin in Milwaukee a couple of years ago with a degree in computer science. Since graduating he found a full-time paid internship and converted it into a full-time engineering position, which puts him at three or so years of practical work experience.

During the course of our conversation, I kept having this nagging feeling of being misunderstood. It felt as though what I was saying was missing a common frame of reference for the topics of the conversation; that I was, in essence, speaking gibberish. I understand that my brother is young, much, much younger than I. I also understand that we were taught our craft in two very different ways, but still, we’re both technical and should have common ground; something was very off.

I stopped the conversation and bluntly asked him, “Do you know…”

  • what is meant by the term “Big Data”?
  • what Hadoop is?
  • what the MapReduce paradigm is?
  • why a NoSQL database would be used vs. a traditional RDBMS?

To each of these questions, the answers were similar. He had heard all of the terms I asked about, but he had no actual experience with any of them. He went on to tell me that his way of learning things was to wait until he was either told by a manager he would need a new skill or knowledge-set for an upcoming project or he came to discover it himself. In other words, he would learn something new only if it was required; there was no exploring or being proactive.

After I quickly reorganized my thoughts following the shock and awe that was just dropped before me, I saw an opportunity to reach out and be a mentor. I explained to him the importance of taking control over his own destiny and to actively manage his learning and career direction. In the world today, waiting for someone to tell you what to do just doesn’t cut it.


Further, he had been telling me how he was being put more and more into a leadership and project management role. While he found he was good at it, and had been recognized as being good at it by his managers, it wasn’t what he wanted to be doing. Of course, I made the foolish mistake of asking him what it is that he wants to do, only to get back the answer I deserved for such a silly question: “Gee, I’m not sure, I just know I don’t want to do that.” [1]

I told him that I would write up a list of links and come up with a self-education task plan for him around the big data space if he were interested. And that brings us to where we are now. That is why over the next several weeks I’ll be going back to the basics and creating a “Big Data”, Hadoop and NoSQL primer baseline series of articles and tutorials.


  1. This is something that I’ve come to call the ‘Law of Hate’. It typically gets applied to customers, specifically in user interface design situations. In essence, it is impossible for a customer to tell you, the consultant, what they want. They can only tell you what they hate after you have shown them the two dozen mock-ups you have prepared.

Metacademy: Machine Learning and Probabilistic AI Learning Resources

I’ve recently come across a tremendous resource for the discovery of various machine learning and probabilistic artificial intelligence topics and associated educational materials. The site I’m describing is Metacademy.

Metacademy is a community-driven, open-source platform to facilitate the collaborative construction of a web of knowledge by domain experts meant to help individuals efficiently learn about any topic of interest (supported by Metacademy and the domain experts). The experts responsible for Metacademy are Roger Grosse and Colorado Reed. In addition to building the site, they organized roughly 350 machine learning and probabilistic artificial intelligence concepts along with related training and learning materials.

While Metacademy is currently focused on machine learning and probabilistic artificial intelligence topics, its eventual goal is to cover a much wider breadth of knowledge, e.g. mathematics, engineering, music, medicine, computer science, etc.

The premise of Metacademy is that a user will search for and click on a concept of interest. Metacademy then produces a “learning plan” which includes the prerequisite concepts that were identified in the web of knowledge previously created by the domain experts. This step of identifying the list of prerequisite concepts for the student is what sets Metacademy apart from other learning sites or course catalogs.

As posted at Metacademy:
… But try learning something of conceptual depth by sifting through Google search results … and you’re in for a lot of agony. Before you learn this concept, you need to learn its prerequisite concepts (sometimes you’re not entirely sure what these are), and the prerequisite concepts may have prerequisites themselves. Pretty soon, you’re deep in dependency hell, switching between twenty different tabs trying to understand the various [pre]prerequisite concepts in order to understand the tutorial article Google returned …

Metacademy’s learning experience revolves around two central components:

  • a “learning plan” in a tabular ‘list view’
  • a “graph view” of the learning plan which is meant to help explore relationships among concepts

Clicking on the check-mark next to the title of a concept in either the graph or list view marks that [prerequisite] concept as being understood. To hide the concepts that have been marked as understood, click the “hide” button in the upper right. Note that Metacademy will remember which concepts were marked as understood and hidden, and will automatically re-apply these selections on future visits.
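To make the idea of a “learning plan” a little more concrete, here is a minimal sketch in Python of how one could be derived from a prerequisite graph. This is only my own illustration, not Metacademy’s actual implementation: the `PREREQS` table, the concept names and edges in it, and the `learning_plan` function are all hypothetical. The sketch walks the prerequisites depth-first so that every concept appears after the concepts it depends on, and it leaves out anything passed in as `understood`, loosely mirroring the “hide” button.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Hypothetical prerequisite graph (made up for illustration):
# each concept maps to the concepts that should be learned first.
PREREQS: Dict[str, List[str]] = {
    "logistic regression": ["linear regression", "maximum likelihood"],
    "linear regression": ["linear algebra basics"],
    "maximum likelihood": ["probability basics"],
    "linear algebra basics": [],
    "probability basics": [],
}


def learning_plan(
    target: str,
    prereqs: Dict[str, List[str]],
    understood: Iterable[str] = (),
) -> List[str]:
    """Order the target and its prerequisites so each concept follows
    everything it depends on; concepts in `understood` are omitted."""
    hidden: FrozenSet[str] = frozenset(understood)
    plan: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(concept: str) -> None:
        if concept in done:
            return
        if concept in in_progress:
            # A concept that (indirectly) requires itself has no valid order.
            raise ValueError(f"cyclic prerequisite involving {concept!r}")
        in_progress.add(concept)
        for prerequisite in prereqs.get(concept, []):
            visit(prerequisite)
        in_progress.discard(concept)
        done.add(concept)
        if concept not in hidden:
            plan.append(concept)

    visit(target)
    return plan


if __name__ == "__main__":
    # Full plan, then the same plan with one concept marked as understood.
    print(learning_plan("logistic regression", PREREQS))
    print(learning_plan("logistic regression", PREREQS,
                        understood={"linear algebra basics"}))
```

In this sketch, a concept marked as understood is simply dropped from the output while its own prerequisites are still visited; whether the real site behaves that way is an assumption on my part.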

As Metacademy is a work in progress and limited in scope, please keep an open mind when visiting, but I think that you will find it an interesting, unique and valuable resource, particularly if you are, as I am, actively exploring the world of machine learning.

http://www.metacademy.org

Free Big Data Education Resources

Daniel Gutierrez wrote a series of three articles at Big Data Republic presenting a number of free education opportunities focused on Big Data. His first article focused on fairly high-level, non-technical resources aimed at upper- and mid-level managers.

Daniel suggests joining a local Meetup.com group oriented toward a shared learning experience, as well as looking into one of the many special interest groups with a big data focus hosted at LinkedIn.com. He points to the Big Data/Analytics/Strategy group on LinkedIn as being one of his favorites.

We’re reminded by Daniel that there is a wealth of knowledge shared and available to watch on YouTube. He calls out the “What is Big Data” part 1 and 2 videos by IBM as specific examples as well as “What is Big Data,” by Explainingcomputers.com.

White papers, webinars and blogs are also resources that we are reminded not to forget as great resources for learning. He offers “Big Data: Harnessing a Game Changing Asset” (registration required) at InformationWeek.com as a specific white paper example and then points us toward the Big Data Republic Webinars section. Noting that the blogosphere is alive with big data educational content, Daniel encourages us to check out an aggregator of big data blogs, Planet Big Data, to get a handle on what’s available.


The last resource we’re given in this article is the granddaddy of big data educational resources: Big Data University. It is pointed out that the “Big Data Analytics – Demos” course is a good choice for managers, as it provides scenarios and demos showing big data analytics at work.

The whole of this first article in the series can be read at: Free Big Data Education: A Management Perspective.


The second article in Daniel’s series is focused on technical topics in the realm of big data, with the intended target of serving the needs of IT personnel, managers with technical responsibilities, consultants, and developers who are new to this area. He points to IBM’s Big Data Hub as a particularly good big data portal and lists a number of its videos as a good place to start learning about the different components of the Hadoop platform.

It is also noted that many vendors of Big Data products produce training material and make it freely available. Cloudera’s video library is called out along with a video of Cloudera CEO Mike Olson entitled “What is Hadoop”.


Daniel goes on to reiterate the suggestion from his first article to seek out local Meetup.com groups before recommending a powerful learning resource called Coursera. Coursera is a growing collaboration of 33 well-known schools including Stanford, Caltech, Princeton, Duke, Brown, and Columbia. Specifically called out is the course Web Intelligence and Big Data. Also mentioned is another valuable free online course resource, Edx.org, a not-for-profit consortium among MIT, Harvard, and UC Berkeley.

The second article of Daniel’s series can be read at: Free Big Data Education: A Technical Perspective.


The third and final part of this Free Big Data Education series finishes up with the area of big data known as data science. Data science (and the driving force behind it, machine learning) is the process of deriving added value from data assets. Commerce and research are being transformed by data-driven discovery and prediction. The skills required for data analytics at massive scale span a variety of disciplines and are not easy to obtain through conventional curricula.

Daniel again recommends seeking out local Meetup.com groups. For insight into these groups, we’re directed to visit the Field Report category at his blog, Radical Data Science, where he’s written intimate accounts of various meetings he attended.

Now is the time to take advantage of the many free education resources available.

Using resources found within the Massive Open Online Course (MOOC) movement, Daniel has built a Data Science “pseudo degree program” to follow. These free courses (some offer certifications) provide an excellent path toward obtaining the requisite background for becoming a data scientist.

A selection of free data science books is also presented.

This final part of Daniel’s series can be read at: Free Big Data Education: A Data Science Perspective.