On the Beauty of a Neuron Expressivity Matrix
A biological neuron is a very powerful computational device (read here to learn just how powerful). Current mathematical models, such as the threshold neuron, ReLU and other artificial neurons, are much simpler than the real one. Still, we do not fully understand even these simple models. It took many years to realize that ReLU works better than sigmoid in practice. Where else are we mistaken?
Maybe, before making neural networks bigger and bigger, we should better understand the first principles?
The common view is that a neuron is just a parametric function y = f(x, w), and by varying its parameters (the weights w) we change this function. How can we say that one model (parameterization) is better than another? The expressivity of a model is one of the criteria.
The expressivity of a neuron is the number of distinct functions it can realize. Imagine these functions as a table where the first column lists every possible input and each following column encodes one of the functions. Here is an example for a binary input x of size N and a binary output.
Let’s find this expressivity for a simple case: the threshold neuron with binary inputs and weights.
Try to work out the analytical expression yourself. Here is the code on Google Colab to test your answers. Then read the solution below.
Let’s answer an easier question first: how many such functions are there at all? In how many ways can we map binary vectors of size N to a binary number?
Our input is a binary vector of size N, so it can take 2^N (two to the power N) possible states. The output has only two states, so the number of possible functions is 2^(2^N). In general, if you want to map X elements to Y elements, there are Y^X possible mappings (functions).
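As a quick sanity check, here is a tiny snippet (a simple illustration of the counting, not the Colab code itself):

```python
# For N = 3: 2^3 = 8 possible input vectors and 2^8 = 256 possible functions.
N = 3
num_inputs = 2 ** N                # X: the number of binary vectors of size N
num_functions = 2 ** num_inputs    # Y^X with Y = 2 output states
print(num_inputs, num_functions)   # 8 256
```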
Now, the solution for a neuron.
1. First, let’s count the number of functions with fixed weights but a varying threshold. Select weights w with exactly q ones (picture below) and enumerate all thresholds (t = 0, 1, 2, …, N). We get q + 2 functions. Plus 2 because there are two additional functions: for t > q the neuron is never active (y = 0), and for t = 0 it is always active (y = 1).
2. Next, vary the weights. The number of ways to choose q ones among N positions is C(q,N) (the number of combinations of q from N).
Thus, for a fixed number of ones q we have C(q,N)*q + 2 distinct functions. We do not multiply the combinations by the “+2” because those two functions are the same for every choice of weights (y = 0 or y = 1).
3. Finally, the weights can have any number of ones. Thus, we sum over all q and get the answer: the sum of q*C(q,N) over q = 0, 1, …, N, plus 2.
Later I accidentally noticed that it can be simplified to N*2^(N-1) + 2.
You can try to prove this equality. Note that it is smaller than the total number of parameter combinations ((N+1)*2^N), which means that different parameter combinations realize the same functions.
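If you prefer to check the formula numerically rather than prove it, here is a brute-force sketch of my own (assuming thresholds t = 0, 1, …, N as above; not necessarily the same code as in the Colab notebook) that enumerates all weight/threshold combinations and counts the distinct functions:

```python
from itertools import product

def count_threshold_functions(N):
    """Count distinct functions of a threshold neuron with binary inputs,
    binary weights and an integer threshold t in 0..N."""
    inputs = list(product([0, 1], repeat=N))
    functions = set()
    for w in product([0, 1], repeat=N):
        for t in range(N + 1):
            # A function is fully described by its outputs on all 2^N inputs.
            f = tuple(int(sum(wi * xi for wi, xi in zip(w, x)) >= t) for x in inputs)
            functions.add(f)
    return len(functions)

for N in range(1, 8):
    print(N, count_threshold_functions(N), N * 2 ** (N - 1) + 2)
```

If the derivation above is correct, the two printed counts should coincide for every N.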
Also, we see that the expressivity is much lower than the total number of functions 2^(2^N). That is why we need to assemble many neurons into a network: together they achieve greater expressivity.
And now to the beauty. Let’s plot these functions. Here is the picture for N=10.
The rows show all inputs x, enumerated from 000…0 to 111…1; the columns show all functions that correspond to some parameters w, t. Blue marks an active neuron (y = 1), white an inactive one (y = 0).
Here is the animation for various N with a fixed threshold (t = 3). Note that the weights w are enumerated from 000…0 to 111…1, starting from the top left corner, similarly to x.
These pictures create beautiful patterns, something similar to fractals. Indeed, the expressivity matrix can be generated recursively: the matrix of size N depends on the matrix of size N-1.
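For those who want to reproduce the pictures, here is a minimal sketch of how such a matrix can be rendered (my own reconstruction, not necessarily how the figures above were generated): rows are all inputs x, columns are all parameter combinations (w, t), and a cell is filled when the neuron is active.

```python
from itertools import product

import matplotlib.pyplot as plt
import numpy as np

N = 6  # N = 10 as in the picture works too, the matrix just gets much wider
inputs = list(product([0, 1], repeat=N))              # all x from 00...0 to 11...1
params = [(w, t)                                      # all weight vectors ...
          for w in product([0, 1], repeat=N)
          for t in range(N + 1)]                      # ... combined with all thresholds

# matrix[i, j] = 1 when input i activates the neuron with parameters j = (w, t)
matrix = np.array([[int(sum(wi * xi for wi, xi in zip(w, x)) >= t)
                    for (w, t) in params]
                   for x in inputs])

plt.figure(figsize=(12, 4))
plt.imshow(matrix, cmap="Blues", aspect="auto", interpolation="nearest")
plt.xlabel("parameters (w, t)")
plt.ylabel("inputs x")
plt.show()
```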
Most likely, the expressivity of a threshold neuron with binary inputs and weights was found many decades ago, and the solution is archived somewhere in old papers. But these pictures are hard to get without a computer. I have not seen them before, and perhaps no one has. You will be among the first :) If someone saw them earlier, let me know in the comments.
We can easily almost double the expressivity. Just allow the weights to be plus or minus one (w = -1 or w = 1) with the threshold in the range [-N, N].
The most beautiful case is when both the inputs and the weights are from {-1, 1} and the threshold is set to zero (t = 0). Here the expressivity equals the number of possible input vectors, 2^N, so we get a square matrix. The case for N = 12 is shown at the top of the article. For other N, see the animation:
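Here is a small check of this claim (again a sketch of my own, assuming the neuron is active when the weighted sum is greater than or equal to zero): for inputs and weights in {-1, 1} and t = 0, the matrix comes out square and all of its columns are distinct.

```python
from itertools import product

import numpy as np

N = 8
inputs = list(product([-1, 1], repeat=N))
weights = list(product([-1, 1], repeat=N))

# Rows: all inputs x; columns: all weight vectors w; the threshold is fixed at t = 0.
matrix = np.array([[int(np.dot(w, x) >= 0) for w in weights] for x in inputs])

print(matrix.shape)                           # (256, 256): a square 2^N x 2^N matrix
print(len({tuple(col) for col in matrix.T}))  # 256 distinct columns, i.e. 2^N functions
```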
If we allow integer weights w = 1, 2, 3, …, it appears there is an upper bound beyond which finer discretization does not give new functions. For a binary input this bound is [-N, N], so it makes sense to take only 2N + 1 weight values. Real-valued weights will not improve the expressivity further. That is the maximum we can squeeze out of this model.
Motivation
Why should we care about the expressivity of a neuron, and about such a simple threshold model with discrete parameters?
First of all, it gives a better intuitive understanding of what learning is. In classification with a teacher (supervised learning), there is some unknown function f*. The teacher shows examples (x, y) generated from this function, y = f*(x). The goal is to find f*, or at least a function as similar to it as possible. The neuron (or a neural network) provides the possible candidates (our expressivity) for what this function might look like. If the teacher gave all 2^N examples (for a binary vector), that would be the same as giving the function f* itself; in this case, our goal is to find the parameters w, t that realize the function most similar to f* (remember, such a function is just a column in the expressivity matrix). The greater the expressivity, as in neural networks, the more similar a function we can find. However, there is never enough data, so we need to find a function that is most similar to what we know (the in-sample, or training, error) and at the same time close to what we do not know (the out-of-sample, generalization, or test error). Interestingly, larger expressivity is not necessarily better: it can harm generalization. There is a balance, called the bias-variance tradeoff, that lies at the heart of statistical learning theory.
For unsupervised learning, the target function f* is not set by a teacher; we need to define it ourselves. For example, for data compression we would choose the function that activates the neuron 50% of the time on average. In this case it has the maximum entropy and preserves the most information about the input x.
Lastly, models with discrete parameters are more efficient to implement in hardware. The trend of moving deep learning models onto portable devices is on the rise, yet most methods still rely on backpropagation for learning. Understanding the basic principles of discrete parametric functions may lead to more efficient models of learning. It may even give hints about how real neurons learn. Hopefully :)
Also, we need to remind ourselves that we might be asking the wrong questions. It is commonly assumed that learning is the change of the parameters of a function according to some criterion. Mostly, people try to improve learning algorithms and complicate architectures (parameterizations). What if there is something deeper and more fundamental?
Maybe we first need to understand how to write information into distributed systems like neural networks? The next articles of this blog will elaborate on this question.