October 09, 2020

About the Aika Neural Network

Aika (Artificial Intelligence for Knowledge Acquisition) is a new type of artificial neural network designed to more closely mimic the behavior of a biological brain and to bridge the gap to classical AI. A key design decision in the Aika network is to conceptually separate the activations from their neurons, which results in two separate graphs: one consisting of neurons and synapses, representing the knowledge the network has already acquired, and another consisting of activations and links, describing the information the network was able to infer about a concrete input data set. There is a one-to-many relation between neurons and activations. For example, there might be a neuron representing a word or a specific meaning of a word, but there might be several activations of this neuron, each representing an occurrence of this word within the input data set. A consequence of this decision is that we have to give up on the idea of a fixed layered topology for the network, since the sequence in which the activations are fired depends on the input data set. Within the activation network, each activation is grounded in the input data set, even if there are several activations in between. This means links between activations serve two purposes: they are used to sum up the synapse weights, and they propagate the identity to higher-level activations.
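The separation into two graphs can be illustrated with a minimal Python sketch. It is not Aika's actual API: the class names, their fields, and the helper activate_occurrences are hypothetical and chosen only to show the one-to-many relation between a neuron and its activations, with each activation grounded in a character range of the input.

```python
from dataclasses import dataclass, field


@dataclass
class Neuron:
    """Node of the knowledge graph (hypothetical structure)."""
    label: str
    bias: float = 0.0


@dataclass
class Synapse:
    """Weighted edge of the knowledge graph."""
    source: Neuron
    target: Neuron
    weight: float


@dataclass
class Activation:
    """Node of the activation graph: one firing of a neuron,
    grounded in the character range [begin, end) of the input."""
    neuron: Neuron
    begin: int
    end: int
    links: list = field(default_factory=list)  # incoming Link objects


@dataclass
class Link:
    """Edge of the activation graph: an instance of a synapse
    between two concrete activations."""
    synapse: Synapse
    source: Activation
    target: Activation


def activate_occurrences(neuron, text):
    """Create one activation per occurrence of neuron.label in text."""
    activations = []
    start = text.find(neuron.label)
    while start != -1:
        activations.append(Activation(neuron, start, start + len(neuron.label)))
        start = text.find(neuron.label, start + 1)
    return activations


word = Neuron("the")
acts = activate_occurrences(word, "the cat saw the dog")
for act in acts:
    print(act.neuron.label, act.begin, act.end)
```

Both activations refer to the same Neuron object but cover different ranges of the input, which is the sense in which knowledge (neurons, synapses) and inference (activations, links) live in separate graphs.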

The Aika network can be characterized by the following key features:

  • A linking process ensures that only links consistent with the identity of the output activation are added to the network. In other words, for an input link to be added to an activation, it must be grounded in the same input data as at least one of the activation's other input links.
  • There are different types of neurons that serve different roles within the network. Pattern neurons are conjunctive neurons that are activated if enough input features are present to infer the presence of a given pattern. Pattern part neurons are activated if a certain input feature occurs as part of an overall pattern. Pattern part neurons are also conjunctive. One of their tasks is to ensure that the input features occur in the correct relation to each other. Pattern neurons and pattern part neurons are connected to each other in a positive feedback loop. Patterns formed by a pattern neuron and its associated group of pattern part neurons can be stacked on top of each other.
  • Inhibitory neurons are another type of neuron. They are disjunctive in nature and are connected to pattern part neurons of different patterns. Their output synapses form negative feedback loops with the pattern part neurons, so that the pattern part neurons of different patterns can suppress each other. Negatively weighted synapses, however, are special in that they require the introduction of mutually exclusive branches within the activation network. These branches are isolated from each other, and each represents a certain interpretation of parts of the input data set. A branch can, for example, be associated with a certain interpretation of the parse structure of a sentence or an image. Good examples of where these branches become recognizable in human perception are input data sets that offer several equally likely interpretations, as in the sentence "rice flies like sand" or in hidden face pictures.
  • Since this type of network contains a lot of cycles, the usual backpropagation algorithm will not work very well here. Also, relying on handcrafted labels that are applied to the output of the network can be highly error-prone and can create a large distance between the training signal and the weights that we would like to adjust. This is one reason classical neural networks require such a huge number of training examples. Hence we would like to train the network more locally, from the patterns that occur in the input data, without relying on supervised training labels. This is where Shannon entropy comes in handy. Take, for example, a word whose input features are its individual letters. Here we can measure the amount of information carried by each letter by calculating its Shannon entropy. We can then look at the word pattern neuron as a way of compressing the information given by the individual letter neurons: the word neuron requires a lot less information to communicate the same message than the individual letters combined. This compression, or information gain, can be formalized using the mutual information, which can then be used to derive an objective function for the training algorithm.
  • A consequence of using entropy as a source for the training signal is that we need to know what the underlying probability distribution is for each neuron and for each synapse. That is, we need to count how often each neuron is fired. But determining this statistic involves some challenges as well:
    • Not all the neurons exist right from the start. Therefore we need to keep track of an offset for each neuron, so that we can compute how many training examples this neuron has already seen.
    • Not all activations cover the same space within the input data set. For example, there are a lot more letters than there are words in a given text. This has to be taken into account when determining the event space for the probability.
    • Right after a new neuron is introduced, there are very few training instances for this statistic, yet we still want to adjust the weights of this neuron. Hence we need a conservative estimate of what we already know for certain about the probabilities. Entropy is based on the surprisal (-log(p)), which gets large when the probability gets close to zero, so underestimating a probability would overstate the information content. We can therefore use the cumulative beta distribution to determine the maximum probability that can be explained by the observed frequencies.
    • Another problem is concept drift. Because we use the distributions to adjust the synapse weights, the activation patterns change, which in turn changes the distributions again.
  • Initially, the network starts out empty. New neurons and synapses are then added during the training process. There is an underlying set of rules that determines when certain types of neurons or synapses are induced.
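
The conservative probability estimate mentioned in the list above can be sketched as follows. This is only an illustration under stated assumptions, not Aika's implementation: it assumes a uniform prior, so that a neuron that fired `fired` times in `seen` training instances has a Beta(fired + 1, seen - fired + 1) distribution over its firing probability, and it takes the 95% quantile of that distribution as the "maximum probability". The function names, the confidence level, and the way the creation offset is subtracted are all hypothetical choices.

```python
import math


def beta_cdf(x, a, b):
    """Cumulative beta distribution I_x(a, b) for positive integers a, b
    and 0 < x < 1, via its binomial-sum identity, evaluated in log space."""
    n = a + b - 1
    log_x, log_1mx = math.log(x), math.log1p(-x)
    total = 0.0
    for j in range(a, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1)
                    - math.lgamma(n - j + 1)
                    + j * log_x + (n - j) * log_1mx)
        total += math.exp(log_term)
    return total


def upper_probability(fired, seen, confidence=0.95):
    """Largest firing probability still plausible given the counts:
    the `confidence` quantile of Beta(fired + 1, seen - fired + 1),
    found by bisection. The uniform prior is an assumption."""
    a, b = fired + 1, seen - fired + 1
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if beta_cdf(mid, a, b) < confidence:
            lo = mid
        else:
            hi = mid
    return hi


def conservative_surprisal(fired, seen, confidence=0.95):
    """Surprisal -log(p) computed from the upper bound on p, so that
    sparse statistics never overstate the information content."""
    return -math.log(upper_probability(fired, seen, confidence))


# A neuron created after 990 of 1000 training instances has only seen 10.
total_instances, creation_offset = 1000, 990
seen = total_instances - creation_offset
print(upper_probability(1, seen))       # bound from 10 observations
print(upper_probability(1, 1000))       # much tighter bound from 1000
print(conservative_surprisal(1, seen))  # lower than -log(1/10)
```

With few observations the bound stays well above the raw frequency, which keeps the surprisal of a newly induced neuron low; as more instances are counted, the bound tightens toward the observed frequency.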