Many ML projects fail because of this misunderstanding about ML

ML doesn’t add intelligence but simulates intelligence that a human feeds to ML in training data

Photo by Manik Roy on Unsplash

Calculator vs Human

Who is more intelligent: a calculator or a human? It depends on what operations one needs to perform.

Even a basic calculator can instantly multiply 6-digit numbers. The smartest humans can hardly compete with that.
On the other hand, any human can do some tasks that even the best supercomputers can’t do without a tremendous amount of resources, e.g., recognizing visual objects or understanding natural language.

So, the advantage of a calculator is that it can quickly perform huge numbers of mechanical (non-intelligent) operations, such as addition, subtraction, multiplication, and if-then-else.

ML has the very same strength: it can mechanically learn many “atomic conclusions” (e.g. if X, then Y), combine them into bigger ones (e.g. if Y and Z, then A), and eventually output a “final conclusion”, e.g. that object O belongs to class C. This is known as learning from training data. A computer running an ML algorithm can test thousands or even millions of hypotheses, and ways of combining them, to find the ones whose final conclusions are closest to the truth.

Any ML algorithm is just a chain of mechanical calculations

Let’s consider a simple example: we need to classify news titles into categories (Sports, Business, Health, Science, etc.). Let’s apply a basic ML algorithm that simply finds typical words for each category. To do that, the algorithm counts how often a given word appears in Sports, Business, Health, and the other categories. The typical words of a category are the words that occur mostly in that category but rarely in the others. Thus, a typical word in a title is a signal to assign the title to that word’s category. A human could do the very same ‘counting’ (a statistics-based approach) on their own, but using a computer for such mechanical work is, of course, far more convenient.
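The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six training titles, the function names, and the two thresholds (min_share, min_count) are invented here and do not come from any particular library.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (title, category) pairs invented for illustration.
TRAINING = [
    ("Local team wins soccer championship", "Sports"),
    ("Soccer star signs record contract", "Sports"),
    ("Stocks rally as market hits record", "Business"),
    ("Market slides after earnings report", "Business"),
    ("New study links sleep to heart health", "Health"),
    ("Doctors report progress in heart study", "Health"),
]


def count_words(examples):
    """Count how often each word appears in each category."""
    counts = defaultdict(Counter)  # word -> Counter({category: occurrences})
    for title, category in examples:
        for word in title.lower().split():
            counts[word][category] += 1
    return counts


def typical_words(counts, min_share=0.8, min_count=2):
    """Return {word: category} for words that occur mostly in one category.

    min_share and min_count are assumed thresholds, not values from the article.
    """
    typical = {}
    for word, per_category in counts.items():
        category, top = per_category.most_common(1)[0]
        total = sum(per_category.values())
        if top >= min_count and top / total >= min_share:
            typical[word] = category
    return typical


counts = count_words(TRAINING)
typical = typical_words(counts)
print(typical)
# {'soccer': 'Sports', 'market': 'Business', 'study': 'Health', 'heart': 'Health'}
```

Words such as ‘record’ and ‘report’ appear in two categories, so they are not reported as typical for either.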

In theory, this applies to absolutely any ML algorithm: a human could manually calculate even the thousands of parameters of a deep neural network for visual object recognition. That would not play to a human’s strengths, though: a computer does such work incomparably faster and more precisely, whereas a human needs no intermediate calculation to perceive an image. A more complex algorithm is still a statistical approach, but a human can no longer interpret what exactly it is counting; it is more of a black box that probes huge numbers of hypotheses in order to approximate the truth. So, a more complex algorithm is just a longer chain of calculations.
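To make the “chain of calculations” concrete, here is a deliberately tiny sketch of a neural network’s forward pass, written out as plain arithmetic. The weights are invented for illustration and were not learned from any data; a real network differs mainly in having vastly more of these multiply, add, and compare steps.

```python
def relu(x):
    """Keep positive values, replace negative ones with zero (a comparison)."""
    return max(0, x)


def tiny_network(x1, x2):
    """Forward pass of a hypothetical 2-input, 2-hidden-unit, 1-output network."""
    # Hidden layer: multiply, add, compare. All weights are made up.
    h1 = relu(2 * x1 - 1 * x2)
    h2 = relu(-1 * x1 + 3 * x2)
    # Output layer: multiply and add again.
    return 1 * h1 + 2 * h2 - 4


print(tiny_network(1, 2))  # h1 = 0, h2 = 5, output = 0 + 10 - 4 = 6
```

Every step here could be done with pen and paper; nothing in it requires understanding what the inputs mean.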

Don’t better ML algorithms matter?

They do, because they give a better approximation of the real-world truth. For example, the set of words in a title alone might not be enough to classify it: word order matters too, and negation inverts the meaning of a text.

With plenty of standard ML libraries available, trying all existing algorithms takes almost no effort. Nevertheless, don’t brute-force the problem by blindly trying every ML algorithm. Instead, try to understand the “essence” of the problem by looking into training examples and manually finding some insights first (e.g. notice that each news category has some typical words). In other words, first imagine what an ML algorithm is supposed to find (e.g. the number of occurrences of each word in each category, which reveals the typical words). Then your intelligence in finding relations and causes will be combined with the ML algorithm’s ‘superpower’: picking the most prominent relations and the most credible causes out of a vast number of hypotheses.

When designing an ML algorithm, solve the problem using only mechanical calculations

An ML algorithm is a human’s approach to a problem, converted into the form of mechanical calculations. If a human understands the language of the news titles to be classified, classifying a title is trivial. But a machine cannot easily understand the meaning of a title or a category the way a human does, so it cannot group titles by category on that basis. For a machine, the titles are just sequences of numbers. A machine can do only mechanical operations, e.g. split a title into words, compare two words, or count the occurrences of a word. In other words, a machine can do the operations available to a human who doesn’t understand the language of the titles. How would such a human solve this problem? Right, the human would try to find some correlation between a title’s category and the words in it, i.e. would notice that some words appear in one category more often than in others. As you can see, we have converted a human’s approach into mechanical calculations. A solution built only from operations that a machine can perform can be converted into an ML algorithm.
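The three mechanical operations named above (split, compare, count) are enough to classify a title once a table of typical words exists. The sketch below assumes such a table; its contents and the function name classify are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical lookup table of typical words (invented for illustration).
TYPICAL = {
    "soccer": "Sports",
    "goal": "Sports",
    "market": "Business",
    "stocks": "Business",
    "heart": "Health",
}


def classify(title, typical=TYPICAL):
    """Pick the category whose typical words occur most often in the title."""
    votes = Counter()
    for word in title.lower().split():   # split the title into words
        if word in typical:              # compare each word with known ones
            votes[typical[word]] += 1    # count occurrences per category
    if not votes:
        return None                      # no typical word found: no decision
    return votes.most_common(1)[0][0]


print(classify("Soccer club scores late goal"))  # Sports
print(classify("Stocks fall as market cools"))   # Business
print(classify("Weather turns cold"))            # None
```

Nothing in classify depends on what the words mean; the same code would work on titles in a language its author cannot read. Ties between categories are resolved arbitrarily in this sketch.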

The human could gain further insights, e.g.:

Some words help in classification only if they are combined with other specific words.
Word order matters.

Insights like these lead to more advanced ML algorithms (e.g. an N-gram model or an LSTM), which are closer to how a human understands a text.
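As a small illustration of why word order adds information, the sketch below compares two titles that contain exactly the same words. The helper name ngrams and the example titles are assumptions made for this illustration; a real N-gram model would additionally count these pairs per category.

```python
def ngrams(title, n=2):
    """Return the list of n consecutive-word tuples in a title."""
    words = title.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


a = "dog bites man"
b = "man bites dog"

# As bags of words, the two titles are indistinguishable.
print(sorted(a.split()) == sorted(b.split()))  # True

# As bigrams (pairs of adjacent words), they differ.
print(ngrams(a))  # [('dog', 'bites'), ('bites', 'man')]
print(ngrams(b))  # [('man', 'bites'), ('bites', 'dog')]
```

The same mechanism captures simple negation: the bigram ('not', 'good') is a different feature from the single word 'good'.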

Assessing a training dataset’s size

Understanding what your ML algorithm is supposed to find would also help to assess how much data you need. In the example above, you may need to ask yourself the following questions (all numbers mentioned below are approximate):

How many typical words does each category have? As news titles can be quite diverse, there should be at least a couple of thousand.
How often must a word occur before you trust it? To be confident, you may want to take a word as typical only when it occurs a minimum number of times, say five.

So, we expect to need about 10 thousand occurrences of typical words per category (roughly 2,000 words, five occurrences each). Say each title contains two or three words that are typical for its category. Then we need about 3–5 thousand training examples per category.
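The estimate can be written out as a short calculation. The inputs are the approximate figures from the text (a couple of thousand typical words, at least five occurrences each, two or three typical words per title); the variable names are invented for this sketch.

```python
# Rough inputs taken from the text; all of them are approximations.
typical_words_per_category = 2_000
min_occurrences_per_word = 5
typical_words_per_title = (2, 3)  # low and high estimates

# Total occurrences of typical words that the training set must contain.
occurrences_needed = typical_words_per_category * min_occurrences_per_word

# Each title contributes 2-3 such occurrences, so divide (rounding up).
titles_high = -(-occurrences_needed // typical_words_per_title[0])
titles_low = -(-occurrences_needed // typical_words_per_title[1])

print(occurrences_needed)       # 10000
print(titles_low, titles_high)  # 3334 5000
```

This assumes that occurrences are spread evenly across words, which real text does not do: frequent words are seen far more than five times while rare ones are missed, so the result is better read as a lower bound.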

An ML algorithm doesn’t really get rid of noise in training data

An ML algorithm doesn’t really get rid of noise in training data; it replicates everything found there, in proportion to how often a given mistake occurs. For example, if the word ‘soccer’ (a typical word for the Sports category) occurs a number of times in titles labeled Fashion, it may mislead the algorithm into classifying a title containing ‘soccer’ as a Fashion title.

We may assume that the training data contains only a small number of such mistakes, i.e. that good examples prevail, which is normally a valid assumption. Then one can simply ignore counts below some threshold and thereby ignore the mistaken examples. Note, however, that ignoring small counts may also prune some true but rare occurrences that could help in classification. One cannot cut off only the mistaken examples; some good examples will be cut off too. The more bad examples vanish, the more good examples vanish as well. Most likely, though, the area being cut off (the area of rare occurrences) contains more bad examples than good ones, which is the desired result.

Thus, ignoring rare occurrences sacrifices coverage (some titles containing true but rare words will not be classified) in favor of precision (the mistaken examples, which are assumed to be rare and are therefore cut off, will not affect classification). This is also why a larger dataset may not increase precision: if the mistaken examples multiply along with the good ones, an ML algorithm will draw conclusions from them too.

An ML algorithm, like a calculator, doesn’t make mistakes in its calculations, but it cannot fix mistakes in the training data, just as a calculator cannot fix wrong numbers entered by a human.
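The trade-off can be seen on a handful of made-up counts. In the sketch below, the table COUNTS, the function prune, and the threshold of 3 are all hypothetical; ‘soccer’ under Fashion stands for a labeling mistake and ‘curling’ for a true but rare Sports word.

```python
# Hypothetical word counts per category, invented for illustration.
COUNTS = {
    "soccer": {"Sports": 40, "Fashion": 2},  # the Fashion count is a mistake
    "runway": {"Fashion": 30},
    "curling": {"Sports": 2},                # correct, but rare
}


def prune(counts, threshold):
    """Drop every (word, category) count that is below the threshold."""
    pruned = {}
    for word, per_category in counts.items():
        kept = {c: n for c, n in per_category.items() if n >= threshold}
        if kept:
            pruned[word] = kept
    return pruned


pruned = prune(COUNTS, threshold=3)
print(pruned)
# {'soccer': {'Sports': 40}, 'runway': {'Fashion': 30}}
```

The mistaken Fashion count for ‘soccer’ is gone (better precision), but so is ‘curling’, so a title whose only signal is that word can no longer be classified (worse coverage).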

Summary

An ML algorithm is a human’s approach to a problem converted into a long chain of mechanical operations.
Before applying any ML algorithm, imagine what the algorithm is supposed to find. Could it be expressed in mechanical operations? Or is it something that you cannot even explain to yourself, e.g. how you recognize visual objects?
A better ML algorithm is one that more closely approximates the way a human solves a problem.
An ML algorithm cannot make the training data more precise on its own, but a human who collects training data may cut off the parts of it where misleading examples prevail.