Entropy Properties of a Decision Rule Class in Connection with machine learning abilities
Many methods of Machine Learning are based on the idea of empirical risk minimisation. It is to find a decision rule or a model from some set which most perfectly fits the data presented in the training set. This idea is based on the large number law: empirical risk converges to real risk, if the training set is large enough. But if the class of decision rules or models is too large (in some sense) one meets the problem of oferfitting, the model perfectly corresponds to the data presented in the training set, but shows large errors on new data. It is due to the fact that only uniform convergence of empirical risk to the real risk guarantees closeness of the optimal model behaviour on the training set and on the new data. We introduce the notion of entropy of a decision rule class over a fixed sample sequence as log of the number of possible classifications of the sequence by the rules of the class. Maximum entropy over sequences of a fixed length l determines sufficient condition of the uniform convergence and corresponding estimates. But only average entropy H(l) behaviour determine necessary and sufficient condition of the uniform convergence. The condition is that H(l) / l (average entropy per symbol) should go to zero when the sequence length goes to infinity. If the condition does not hold then there exists a set of objects with non zero probability measure, such that almost all sequences of arbitrary finite length from this set may be divided in all possible ways by the rules of the class. One can easily see, that in this case overfitting is inevitable. Similar results are found for real dependencies instead of decision rules.
Author: Alexey Chervonenkis, University Of London