Data Science 101: Machine Learning – Probability

Srinath Sridhar, an engineer with BloomReach, has recorded a 5-part video series giving a fairly comprehensive introduction to the world of Machine Learning. If you’ve ever asked yourself, “What is Machine Learning?”, or wondered how it actually works, these videos are for you!

This first video goes over some of the fundamental definitions of statistics that form a necessary foundation for understanding and analyzing the machine learning algorithms examined in the following videos. The presentation defines random variable, sample space, probability, expectation, standard deviation and variance, and goes over examples of discrete and continuous probability distributions.


GigaOm: Let’s build a semantic web by creating a Wikipedia for relevancy

A Note About this Post:
For those who have been unable to get to the GigaOm content I commented on earlier, I’ve made it available in this post.

This article is a great discussion of natural language processing and semantic search. It is well worth the time to read and then re-read!


SUMMARY: The relevancy-defining, edge-weighting algorithms of Google’s Knowledge Graph, Facebook’s Open Graph and Gravity’s Interest Ontology are closely guarded company secrets. Imagine if that data was available to everyone — it would be as disruptive as Amazon Web Services. The internet would be a better place.


Everyone is always asking me how big our ontology is. How many nodes are in your ontology? How many edges do you have? Or the most common — how many terabytes of data do you have in your ontology?

We live in a world where over a decade of attempted human curation of a semantic web has borne very little fruit. It should be quite clear to everyone at this point that this is a job only machines can handle. Yet we are still asking the wrong questions and building the wrong datasets.

Understanding NLP

The exponential growth of data created on the web has naturally led to a desire to categorize that data. Facts, relationships, entities — that is how those of us who work in the semantic world refer to the structuring of data. It’s pretty simple actually. Because we are humans, it happens so quickly in our subconscious minds that it’s incredibly easy to take it for granted if you don’t work on teaching machines to do it.

It’s also not a new field; deconstructing human language into structured data (natural language processing) has been around for almost 40 years. NLP can take the sentence “Jim is writing an article about why people ask the wrong questions about ontologies” and structure it into:

NNP = Proper Noun, Singular
VBZ = Verb, 3rd Person Singular Present
VBG = Verb, Gerund or Present Participle
DT = Determiner
NN = Noun, Singular
IN = Preposition
WRB = Wh-Adverb
NNS = Noun, Plural
VBP = Verb, Non-3rd Person Singular Present
DT = Determiner
JJ = Adjective
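A real tagger uses a trained statistical model, but the lookup idea can be sketched with a toy tagger whose hand-built lexicon (a hypothetical one, covering only the example sentence) maps each word to the tags above:

```python
# Toy part-of-speech tagger: a hand-built lexicon covering only the
# example sentence. Real taggers (e.g. NLTK's or Stanford's) use trained
# statistical models instead of a fixed lookup table.
LEXICON = {
    "jim": "NNP", "is": "VBZ", "writing": "VBG", "an": "DT",
    "article": "NN", "about": "IN", "why": "WRB", "people": "NNS",
    "ask": "VBP", "the": "DT", "wrong": "JJ", "questions": "NNS",
    "ontologies": "NNS",
}

def tag(sentence):
    """Return (word, tag) pairs; unknown words fall back to 'NN'."""
    return [(w, LEXICON.get(w.lower(), "NN")) for w in sentence.split()]

tagged = tag("Jim is writing an article about why people ask "
             "the wrong questions about ontologies")
# e.g. tagged[0] is ('Jim', 'NNP') and tagged[2] is ('writing', 'VBG')
```

The lookup table is of course a stand-in; the point is only the shape of the output — unstructured language in, structured (word, tag) data out.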

That’s pretty impressive — a machine just did that. I bet you couldn’t label all of those (maybe your high school English teacher could). But you can understand what the sentence means less than a hundred milliseconds after reading it, and that’s what really matters. The machine has no understanding of the information the sentence conveys. Its job is to decompose unstructured language into structured data that another system might be able to understand.

That’s where semantics come in. Semantics try to understand the relationships between things (we call them entities, or nodes, if you really want to go down the rabbit hole).

Jim [PERSON] -> writes [ACTION] -> sentence [THING]. Seems like something a child could do, right? The human brain is amazing.

Semantic analysis isn’t easy

Try this one: “I paddled out today, and dude, I look like a lobster.”

What does that mean? We know someone is talking about himself because of the leading personal pronoun. NLP won’t help us with the rest, but with “today” most good entity extraction engines can tell us we’ve got a time period (maybe even future intent — exciting!). We can use publicly available ontology data from Freebase, Wikipedia or DBpedia (or many others) to determine paddle [disambiguates to CANOEING], dude [PERSON – TYPE OF GENDER] and lobster [COMMERCIAL CRUSTACEANS].

So we’ve got: paddled [CANOEING], today [TIME PERIOD], dude [PERSON – TYPE OF GENDER], lobster [COMMERCIAL CRUSTACEANS].


This is like an ad server’s dream! Whoever tweeted this needs to be pummeled and retargeted with Red Lobster ads for months. I have actually set up sites with this sentence and tasked many other IAB-focused ad systems with recognizing it — and it’s all Red Lobster all the time. I’ve enjoyed many half-off cheese biscuits in the last 12 months (R&D sometimes bears not only fruit but also cheesy biscuits).

But I wasn’t talking about canoeing or lobsters. When I’m not working I surf and, unfortunately, occasionally I do get sunburned — sometimes to the point of being told I look like a lobster. That’s what I was conveying in my tweet. Why is it so easy for us to understand but so hard for a machine to understand?

But maybe this is just a funny edge case. You can confuse any computer system if you try hard enough, right?

Unfortunately, this isn’t an edge case. Before Twitter, lexicons were considered different languages or colloquial terms specific to particular industries. This is no longer true: the 140-character limit has not just changed people’s tweets, it has changed how people talk on the web. More and more information is being communicated in smaller and smaller amounts of language, and this trend is only going to continue. #exponential

So why is there not a semantic web? Why can’t we solve this yet? Why can’t computers understand that “I’m a lobster” means you are sunburned and not that you want cheesy bread?

Not just connections, but connections that matter

I believe the reason that there are not hundreds of companies exploiting machine learning techniques to generate a truly semantic web is the lack of weighted edges in publicly available ontologies. “Lobster” and “sunscreen” are seven hops away from each other in DBpedia — way too many to draw any correlation between the two. (Any article in Wikipedia can be connected to any other article within about 14 hops, and that’s the extreme. Meanwhile, completely unrelated concepts are often just a few hops from each other.) But by analyzing massive amounts of both written and spoken English text from articles, books, Twitter and television, it is possible for a machine to automatically draw a correlation and create a weighted edge that effectively short-circuits the seven hops otherwise necessary.

Many organizations are dumping massive amounts of facts without weights into our repositories of total human knowledge because they are naively attempting to categorize everything without realizing that the repositories of human knowledge need to mimic how humans use knowledge.

For example: As of today, Kobe Bryant is categorized under 28 categories in Wikipedia, each of them with the same level of attachment (one hop in a breadth- or depth-first traversal).

But when you are at a coffee shop and overhear the person next to you mention Kobe Bryant, what are you able to infer they are talking about? “Basketball” or “American Roman Catholics”? How can the human brain infer that so quickly yet machines get so confused? It is not due to lack of technical processing power, Moore’s law slowing down or the thickness of our silicon wafers — it’s because of the data.

This is a small example of what someone who works with graph theory would come up with if he or she were to run a standard few-hop depth-first traversal from Kobe Bryant on Wikipedia and attempt to coalesce around a common category:

So when someone tweets about Kobe Bryant, are they talking about people born in 1978, Pennsylvania, Food & Drink, or Canadian Inventions? This is a common example of how confused a machine can become when the distance of unweighted edges between nodes is used as a scoring mechanism for relevancy.

But what happens if we weight our edges? The same Wikipedia nodes with path costs can be run through a traversal algorithm that calculates those costs and we get the following:

Our machine is starting to think like a human.
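The difference can be sketched in a few lines (the categories and weights below are invented for illustration, not Wikipedia’s actual data): with unweighted edges every category attached to the Kobe Bryant node is exactly one hop away and therefore indistinguishable, while ranking by edge weight separates them immediately.

```python
# Hypothetical weighted category edges for the "Kobe Bryant" node.
# Weights are invented for illustration; higher means more relevant
# in everyday usage of the entity.
edges = {
    "Basketball": 0.92,
    "Los Angeles Lakers": 0.88,
    "People born in 1978": 0.05,
    "American Roman Catholics": 0.03,
    "Pennsylvania": 0.04,
}

# Unweighted view: every category is one hop away, so there is no
# notion of relevance -- the best we can do is an arbitrary ordering.
unweighted_rank = sorted(edges)

# Weighted view: rank categories by edge weight.
weighted_rank = sorted(edges, key=edges.get, reverse=True)
# "Basketball" now comes out on top, matching human intuition
```

Every entry above is factually accurate; only the weights let the machine tell relevant from merely true.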

Algorithms and processors aren’t enough

Weighted path traversals are not new. Dijkstra’s algorithm was invented in 1956 (the answer has been around for a long time), but the processing power and memory necessary to leverage a traversal algorithm like Dijkstra’s — and score path costs and not just distance between nodes across massive data sets — has only in the last few years become available to the average startup. That’s a huge win for all of us, but the data and ontologies to actually do it are still not publicly available.
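The idea can be sketched with Dijkstra’s algorithm on a toy graph (all nodes and edge costs below are invented for illustration): one learned, low-cost edge between “lobster” and “sunburn” short-circuits the long chain of encyclopedic hops.

```python
import heapq

def dijkstra(graph, start, goal):
    """Return (cost, path) of the cheapest path from start to goal."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# Toy graph: a chain of encyclopedic hops, each costing 1.0, plus one
# learned edge (mined from usage in text) costing only 0.2.
graph = {
    "lobster": {"crustacean": 1.0, "sunburn": 0.2},  # learned edge
    "crustacean": {"seafood": 1.0},
    "seafood": {"beach": 1.0},
    "beach": {"sun": 1.0},
    "sun": {"uv radiation": 1.0},
    "uv radiation": {"sunburn": 1.0},
    "sunburn": {"sunscreen": 1.0},
}

cost, path = dijkstra(graph, "lobster", "sunscreen")
# the learned edge wins: lobster -> sunburn -> sunscreen, cost 1.2
```

Without the learned edge, the cheapest route is the full seven-hop chain; with it, the traversal lands on the humanly obvious connection.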

I propose that as an industry we begin to focus more on relevancy and less on factual accuracy. The above flawed traversal is actually 100 percent factually accurate. Kobe Bryant was born in 1978, he is (or was) sponsored by Gatorade, Gatorade is a drink and basketball was invented in Canada. But even now that you know all those facts, when you hear someone talk about Kobe Bryant tomorrow you will still know they are talking about basketball.

The only way we will actually get to a truly semantic web is when machines are able to think (or, more accurately, perform) as we do. The processing power, technology and algorithms to do that exist today. Unfortunately, said power is unleashed on inherently flawed datasets, and that is why we still see Red Lobster ads on sunscreen pages. We need to become much less focused on adding facts to Freebase, DBpedia and the other publicly available ontologies, and much more focused on weighting the edges between the facts that we are adding.

That is how we create a truly semantic web. The answer lies in the data, but not in the data available on a web page or in a set of thousands of web pages available to be recommended by a particular algorithm. Information retrieval and categorization techniques such as LSI, PLSI, and LDA are only aware of the context of the information in the datasets fed to them. These base algorithms (LDA, especially) are incredibly useful, but without the context of a global human knowledge base, you cannot build an interest graph, and you cannot build a semantic web.

Ontologies become absolutely necessary as we attempt to solve this problem. If you feed any of the above algorithms 10 articles a particular person read about snowboarding, they will successfully recommend other snowboarding articles, but they are unaware that snowboarding and surfing are two sports that go hand in hand. People who enjoy one usually enjoy the other. An ontology with weighted edges is necessary to make that relevant yet tangential connection, which is a crucial step to avoid the dreaded “filter bubble.”
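That tangential hop can be sketched as follows; the interest names, weights, articles, and threshold are all invented for illustration:

```python
# Hypothetical interest graph with weighted edges between topics.
# A real system would learn these weights from large-scale usage data.
interest_edges = {
    "snowboarding": {"surfing": 0.8, "skiing": 0.9, "lobster": 0.01},
}

articles = {
    "surfing": ["Best beginner surfboards"],
    "skiing": ["Powder skiing in Utah"],
    "lobster": ["Maine lobster recipes"],
}

def tangential_recs(topic, threshold=0.5):
    """Recommend articles from related topics whose edge weight clears
    the threshold -- the tangential hop a pure content model can't make."""
    recs = []
    for related, weight in interest_edges.get(topic, {}).items():
        if weight >= threshold:
            recs.extend(articles.get(related, []))
    return recs

recs = tangential_recs("snowboarding")
# surfing and skiing articles make the cut; lobster recipes do not
```

A content-similarity model trained only on the reader’s snowboarding articles could never produce the surfing recommendation; the weighted edge is what carries it across topics.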

Benedetto (center) at Structure:Data 2012


A Wikipedia for weighted edges

So to all of my semantic colleagues out there, maybe we should shift our thinking and begin to use a different yardstick to measure the quality of our knowledge repositories. For 99 percent of all use cases we have enough nodes. We have successfully deposited the majority of places, events, people, thoughts, and most other tangible and intangible things in the world into our data stores — and a good portion of the population has access to all of it from the smartphone in their pocket. That is an incredible feat. But it’s only half of the equation. We still have yet to map the data into a format that mimics how the human mind thinks.

The way to do this is to begin weighting the edges that interconnect the nodes and facts that we are adding every day. It requires us to raise the bar from factually accurate to actually relevant. Kobe Bryant -> Philadelphia is factually accurate, but Kobe Bryant -> Basketball is actually relevant. Today’s ontologies make no distinction between those two facts, and without that distinction a machine will never be able to create the semantic web we have been working towards for almost a decade.

Every fact in Wikipedia was added by a human. Weighting all of the edges between those facts sounds like a monumental task. But crowdsourcing the creation of a central repository of all human knowledge sounded impossible a little over a decade ago, and we’ve done a very good job of that.

It wasn’t too long ago that running an elastic cloud infrastructure was something that was available to only the largest internet companies in the world. Amazon changed that. Now, one smart engineer can turn an idea into a company for $50 a month. But there is still a large divide between one smart engineer and companies that can use machines to perform web-scale semantic analysis of content.

The relevancy-defining, edge-weighting algorithms of Google’s Knowledge Graph, Facebook’s Open Graph and Gravity’s Interest Ontology are closely guarded company secrets. Imagine if that data was available to everyone — it would be as disruptive as Amazon Web Services. The internet would be a better place.

At Gravity, we have combined many publicly available ontologies with our own internally generated facts and weights to create a large interest-based undirected graph that leverages many forms of edge weighting to solve the above problem. For many years, we built and protected this as a company secret. In the last year we have realized that our mission — building a web-scale personalization platform — takes a lot more than an ontology with weighted edges. It’s an iceberg problem that looks simple when you are designing collaborative filtering for an app or yield-optimizing by user for your site, but our mission is a platform for the entire web.

A relevancy-based ontology with weighted edges is absolutely necessary, but it is just the beginning. That’s why we are also formalizing a plan to develop an open, centralized place to allow human and machine curation of ontology edge weights for the community. We plan on contributing a significant amount of our data to get the project started. More on that to come.

Until then, as a community, I believe we should begin to focus more on relevancy and relationships, and less on the continued addition of facts to our publicly available semantic resources.

Jim Benedetto is co-founder and CTO of Gravity.



by Jim Benedetto

NOV. 24, 2013 – 1:30 PM PST


The original article in its entirety can be found and read at:

GigaOm: Let’s build a semantic web by creating a Wikipedia for relevancy

# # #

Metacademy: Machine Learning and Probabilistic AI Learning Resources

I’ve recently come across a tremendous resource for the discovery of various machine learning and probabilistic artificial intelligence topics and associated educational materials. The site I’m describing is Metacademy. Metacademy is a community-driven, open-source platform that facilitates the collaborative construction of a web of knowledge by domain experts, meant to help individuals efficiently learn about any topic of interest (supported by Metacademy and the domain experts). The experts responsible for Metacademy are Roger Grosse and Colorado Reed. In addition to building the site, they organized roughly 350 machine learning and probabilistic artificial intelligence concepts along with related training and learning materials.

While Metacademy is currently focused on machine learning and probabilistic artificial intelligence topics, its eventual goal is to cover a much wider breadth of knowledge, e.g. mathematics, engineering, music, medicine, computer science, etc.

The premise of Metacademy is that a user will search for and click on a concept of interest. Metacademy then produces a “learning plan” that includes the prerequisite concepts identified in the web of knowledge previously created by the domain experts. This component, identifying for the student the list of prerequisite concepts, is what sets Metacademy apart from other learning sites or course catalogs.

As posted at Metacademy:
… But try learning something of conceptual depth by sifting through Google search results … and you’re in for a lot of agony. Before you learn this concept, you need to learn its prerequisite concepts (sometimes you’re not entirely sure what these are), and the prerequisite concepts may have prerequisites themselves. Pretty soon, you’re deep in dependency hell, switching between twenty different tabs trying to understand the various [pre]prerequisite concepts in order to understand the tutorial article Google returned …

Metacademy’s learning experience revolves around two central components:

  • a “learning plan” in a tabular ‘list view’
    (screenshot: Metacademy logistic regression list view)
  • a “graph view” of the learning plan, meant to help explore relationships among concepts
    (screenshot: Metacademy logistic regression graph view)

Clicking on the check mark next to the title of a concept in either the graph or list view marks that [prerequisite] concept as understood. To hide the concepts that have been marked as understood, click the “hide” button in the upper right. Note that Metacademy will remember the concepts marked as understood and hidden and will automatically re-apply these selections on future visits.

As Metacademy is a work in progress and limited in scope, please keep an open mind when visiting, but I think that you will find it an interesting, unique and valuable resource, particularly if you are, as I am, actively exploring the world of machine learning.