Fractal Brain: Learn More About Lifelong Learning AI Models
Published 10.09.2024
FractalBrain (FB) is a new connectionist AGI algorithm. FB leverages the key strengths of artificial neural networks (ANNs) while attempting to correct for the ways in which they diverge from a biological brain, towards circumventing the critical roadblocks on the ANN roadmap to AGI. What sets FB apart is that: (i) It grows its parametric model (in both training and execution mode) rather than assuming a fixed number of parameters; (ii) It employs local, Hebbian-type learning, and thus works with non-differentiable domains and is sparse by design; (iii) It learns a continuous, temporal model of the world, rather than a discontinuous set of world batches; and (iv) It interacts with the environment via an agent-controlled, active sensing module that feeds the agent selective snippets of the environment's multi-modal observations, at selective time points. All these features conspire towards FractalBrain being a first-of-its-kind AGI system that is explainable, aligned and continually learning.
Motivation
Following a period of initial excitement (mid-1970s) and later subdued interest, recent years (2010s) have again seen a rise of interest in artificial neural networks. A subclass of gradient-based, supervised machine learning methods, ANNs have been known for decades to be trainable from labeled examples, using an application of the chain rule of differentiation to fit those examples to a given parametric model. What has recently made ANNs attractive is that they have benefited greatly from the abundance of training data (e.g. LLM pre-training corpora) as well as from the development of semi- and fully-dedicated hardware to accelerate the aforementioned differentiation process. And the replacement of the standard sigmoid activation function with the rectified linear unit (ReLU) has allowed for easier gradient propagation through the network, allowing deeper ANN architectures to be trained end-to-end. These fundamental advances in training ANNs have then conspired towards a more efficient training of approximate dynamic programming (ADP) reinforcement learning agents that use ANNs to approximate functions over their underlying state spaces. Of the modern methods that employ this principle, the policy-gradient methods such as the Asynchronous Advantage Actor-Critic (A3C) and the value-based methods such as Deep Q-Networks (DQN) have gained the most traction, demonstrating their soundness on domains ranging from Atari 2600 games and the ancient game of Go to the fine-tuning of LLMs using human feedback.
Despite the remarkable success in applying ANN-equipped ADP algorithms to a range of video games and LLMs, it is starting to appear that their applicability to solving arbitrary problems out of the box (as necessitated by AGI) is questionable. For example, even an individual visual-processing CNN module of the kind that often accompanies ANN-based agents can easily be fooled by changes in scene hue or saturation, by rotations, or even by single-pixel alterations. It is similarly remarkably simple to provide an adversarial prompt to an ADP-fine-tuned LLM to steer said model towards unintended prompt completions. And while this is obviously disappointing, it is also potentially risky and detrimental to the rate of progress at which the scientific community pursues AGI: as more and more compute and human resources are pooled together and devoted to solely one approach to AGI, alternative approaches inevitably suffer from the deprivation of these resources. To better illustrate that risk and motivate our development of FractalBrain, the following sections list the top ten limitations of ANN-based ADP algorithms that, we believe, jointly conspire to render ANN-based approaches to AGI impractical.
Problem 1: Task Ignorant Model Architecture
This report on the critical limitations of ANNs (on the roadmap to AGI) begins with a focus on arguably the first thing that comes to mind when talking about ANNs: the network architecture. Of late, great efforts have been devoted to dissecting the AGI problem and proposing fancy architectures (for isolated AGI sub-problems, e.g. vision, control, etc.) that excel at given benchmark problems, with the overall consensus being that there is no “one-size-fits-all” design when it comes to architecture.
In fact, there has been a long-standing dilemma as to whether to employ rather general-purpose architectures (e.g. fully connected ResNets, wherein the trained network itself can choose an optimal subset of connections or layers to use; or Transformer architectures with over 100 global attention layers that let the network itself pick the appropriate number of layers to effectively use) that unfortunately lead to slow convergence, or problem-specific architectures (e.g. ConvNets or RNNs that employ (local) reusable receptive fields over a chosen number of layers, optimal for a given problem at hand) that are much easier to train. Because it is likely that a future AGI agent will need to tackle new tasks that it has not seen before (e.g. tasks that are very different from the tasks in the tiny set of tasks the agent has seen during training), it is unfortunately also likely that the agent's architecture will not be optimal for said new task. That is, even if the agent were allowed to retrain itself (adjust its ANN weights) at evaluation time, it is almost certain that its pre-existing, problem-agnostic architecture would not be the best fit for the new problem at hand.
Because the ANN model architecture (as well as the initial values of its parameters) needs to be chosen and fixed prior to seeing the training data, said network architecture will rarely be optimal for a given problem at hand. This is in direct contrast to FractalBrain, which starts with a blank, FPGA-like compute fabric and actually grows its network of relevant connections, towards an optimal encoding of the data that the algorithm is presented with. Remarkably, the brain cortex appears to employ a somewhat similar learning principle: the number of cortical connections of a newborn is not fixed, but instead undergoes rapid growth until a baby is around 3 years old.
Problem 2: Task Interference through Model Inflexibility
The expectation of researchers who pursue ANN-based AGI is that an agent trained on a sufficiently large number of diverse tasks will then be able to generalize to essentially cover all the remaining tasks that the agent can encounter. However, if it turns out that the agent does encounter new tasks that it struggles to solve, or if it needs to adapt to changes in non-stationary domains, the agent would essentially need to be retrained or fine-tuned. That is, the ANN controlling the agent would need to be presented with the new training data and the learning rate would need to be increased accordingly. And while this may indeed allow the agent to learn how to handle the new task, the agent would run the risk of forgetting how to handle the old tasks, as the optimized ANN that used to encode how to handle the old tasks would now have to share its memory with the encoding of how to handle the new task. In essence, the new agent knowledge encoded in the ANN would interfere with the old knowledge, impacting the agent's overall performance. For example, an agent may first be trained to gain the skill of driving a manual-transmission car in the US (where the stick is operated with the right hand), and later (re)trained to gain the skill of driving a manual-transmission car in the UK (where the stick is operated with the left hand). These two learned skills may then start critically interfering with each other, resulting in a non-zero probability of the agent wanting to simultaneously use both hands to operate the stick.
As an alternative strategy for the agent to handle the new task without forgetting how to handle the old tasks, the agent could be instructed to freeze its old ANN and glue it to an ANN constructed for the new task at hand. (Note that such gluing requires adding not only the parameters of a new ANN but also, potentially, inter-ANN parameters, from the neurons of the old ANN to the neurons of the new ANN, in hopes of reusing the already-optimized filters of the old ANN during training of the new ANN.) However, given the rapid increase in the number of ANN parameters with each new task to be learned and the vast number of novel real-world problems that an AGI agent could potentially encounter, such an expansionist strategy is unlikely to be scalable. (The strategy would also be in conflict with learning at a meta-level, as an AGI agent should itself be capable of discovering if and when to expand its network to handle a new task properly.)
One recent strategy to learn new tasks at test time without interfering with the old tasks relies on so-called “in-context learning”. In essence, a Transformer architecture trained auto-regressively has been shown capable of approximating the functionality of induction heads, which in turn allows it to perform the associative recall (AR) tasks that give rise to in-context learning. However, we do not consider AR to be learning per se, because it is not persistent: as soon as the prompt window moves outside of the region where the AR data is located, the Transformer can no longer perform the AR task / in-context learning on said region of data.
These two critical limitations of ANN-based AGI agents (interfering or inefficient learning of novel tasks after the agent has been deployed) are a direct result of the inflexibility of ANN models. That is, being a monolithic black box of a fixed size, an ANN cannot be slightly extended to handle novel tasks in a stable and scalable fashion. Granted, multiple ANNs can be composed together to form a bigger ANN (e.g. the Mixture-of-Experts LLM approach), yet they cannot be extended in a lightweight fashion, to handle only the unique features of a new task at hand. (A good analogy here is with object-oriented programming languages, wherein new classes can be created both via composition of already existing classes and via lightweight extension / inheritance from existing classes. The latter approach is especially efficient as it results in a new subclass that shares the common properties of the super-class to which it applies its differentiating set of features or patches.)
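To make the analogy concrete, below is a minimal Python sketch (our own illustration, reusing the earlier driving example; none of these classes are part of FractalBrain) contrasting composition of complete modules with lightweight extension, in which the subclass re-encodes only the features that differ from its parent.

```python
# Illustrative only (not FractalBrain code): contrasting composition of complete
# modules with lightweight extension, mirroring the gluing-vs-patching argument above.

class DrivingSkillUS:
    """A complete skill module: driving a manual-transmission car in the US."""
    def use_clutch(self):
        return "press the clutch with the left foot"    # behaviour shared across tasks
    def shift_gear(self):
        return "operate the stick with the right hand"

class DrivingSkillUKStandalone:
    """Composition-style growth: a second, fully self-contained module that
    re-implements even the behaviour it shares with the US skill."""
    def use_clutch(self):
        return "press the clutch with the left foot"    # duplicated
    def shift_gear(self):
        return "operate the stick with the left hand"

class DrivingSkillUK(DrivingSkillUS):
    """Lightweight extension: only the differentiating feature is re-encoded;
    the shared behaviour is inherited unchanged from the super-class."""
    def shift_gear(self):
        return "operate the stick with the left hand"

uk = DrivingSkillUK()
assert uk.use_clutch() == DrivingSkillUS().use_clutch()  # common properties are shared
print(uk.shift_gear())                                   # only the "patch" differs
```

In the composition-style variant, every shared behaviour is duplicated in each new module; in the inheritance-style variant, only the differentiating patch is added, which is precisely the kind of lightweight extension the paragraph above argues monolithic ANNs lack.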
FractalBrain assumes an elastic rather than an inflexible network architecture and thus has the ability to slightly expand it when needed (e.g. increment the number of network layers or the number of compute units / neurons assigned to it) as well as to contract it whenever possible (to release the rarely used compute units / neurons for later reuse). In a sense, the compute fabric of FractalBrain acts similarly to an FPGA or a biological brain cortex, wherein the corresponding programmable logic blocks or cortical microcolumns, respectively, have the ability to be assigned and reassigned to the continuously expanding and contracting model. FractalBrain continuously recycles the unused parts of its compute fabric to later use them to produce patches to its existing model, to account for the new tasks / changes in the old tasks that its existing models fail to properly address.
Problem 3: Unidirectional Pass Through the Model
What further impairs ANN extendability is the fact that the major information flow through an ANN is diametrically different from the information flow through a brain cortex. That is, whereas in an ANN the information first enters the input layer and then flows unidirectionally through the network (possibly with recurrent cycles) until it reaches the output layer at the other extreme of the network, the information in the brain cortex flows bidirectionally; it starts and terminates in the very same input / output layer at the one and only extreme end of the network (the other extreme end is unspecified / open).
Specifically, in a brain cortex, the raw sensory information enters the bottom layer of the cortical hierarchy of layers, gets processed and integrated in that layer's compute units (groupings of neurons referred to as minicolumns), and is sent to a higher layer for further processing, if necessary. This operation is then repeated in higher layers in the hierarchy until the information reaches some desired, high-enough layer, e.g. in the prefrontal cortex, where abstract information integration and higher-level planning occur. The information flow responsible for the execution of plans then descends the cortical hierarchy of layers and gradually becomes less abstract, such that the information leaving the cortex is a stream of low-level motor commands that trigger the corresponding actuators.
There are two direct implications of such an open-ended, bidirectional information flow strategy employed by the brain cortex that are of critical importance for continually learning AGI agents. Firstly, because the information flow does not always have to pass through all the prespecified layers in the hierarchy (unlike in an ANN), but only to ascend to, and then descend from, a desired, task-specific level, a continually learning agent does not have to worry about its network being too shallow or too deep for the variety of tasks that it will encounter. And secondly, the agent can always stack extra layers on top of its existing network, towards potentially increasing the agent's performance, or even remove some layers from the top of the network, towards reducing the network's memory footprint while impairing the agent's performance only gradually.
Towards building a continually learning AGI agent, the major information flow strategy in FractalBrain follows the above-described information flow strategy of the brain cortex. It is likewise bidirectional and starts and terminates in the same network I/O layer at the bottom of the hierarchy of layers. As such, the agent can likewise expand or contract its network, towards maximizing the agent's performance simultaneously on a variety of tasks, while maintaining a desired memory footprint.
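To make this information flow concrete, the sketch below is a conceptual Python illustration of our own (the class names and the stopping criterion are invented and do not reflect FractalBrain's actual implementation): information enters and leaves through the same bottom I/O layer, ascends only as high as a given task requires, and extra layers can be stacked onto, or removed from, the open top end.

```python
# Conceptual sketch only (not FractalBrain's implementation): a hierarchy whose
# single I/O end is the bottom layer; information ascends only as far as needed,
# then descends back to the bottom as motor commands.

class Layer:
    def __init__(self, level):
        self.level = level
    def integrate_up(self, info):
        # Abstract the incoming information one step further.
        return f"abstraction(level={self.level}, of={info})"
    def elaborate_down(self, plan):
        # Make the descending plan one step more concrete.
        return f"commands(level={self.level}, from={plan})"
    def sufficient_for(self, task_difficulty):
        # Toy criterion: higher layers are consulted only for harder tasks.
        return self.level >= task_difficulty

class Hierarchy:
    def __init__(self, n_layers):
        self.layers = [Layer(i) for i in range(n_layers)]
    def add_layer(self):            # stack an extra layer on the open top end
        self.layers.append(Layer(len(self.layers)))
    def remove_top_layer(self):     # shrink the memory footprint
        self.layers.pop()
    def process(self, observation, task_difficulty):
        info, used = observation, []
        for layer in self.layers:            # ascend from the bottom I/O layer...
            info = layer.integrate_up(info)
            used.append(layer)
            if layer.sufficient_for(task_difficulty):
                break                        # ...only as high as the task requires
        plan = info
        for layer in reversed(used):         # descend back to the same I/O layer
            plan = layer.elaborate_down(plan)
        return plan                          # low-level motor commands leave at the bottom

h = Hierarchy(n_layers=4)
print(h.process("raw sensory stream", task_difficulty=2))
h.add_layer()                                # more abstraction available if ever needed
```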
Problem 4: Bounded Temporal Credit Assignment Window
Of particular importance for AGI agents is the ability to make decisions in the context of their observations from a potentially arbitrary past. This is especially problematic for ANN-based agents, as ANNs have long been known to suffer from the temporal credit assignment problem. And the problem is relevant not only to feed-forward ANNs but also to recurrent ANNs (or state-space models) that come equipped with memory cells meant to overcome it.
Feed-forward ANNs (e.g. Transformers) cannot remember observations from the arbitrarily distant past. The network output is conditioned solely on the network input, which itself is of only limited size, prescribed by a given network architecture. Consequently, only a fixed number of observations can make it into the network input layer, resulting in an underlying temporal window that either has a fixed temporal span or has ad-hoc temporal gaps. And although the size of this temporal window can be big (e.g. over 32k tokens for the GPT-4 family of models) or grow exponentially with the network depth (as demonstrated in the WaveNet ANN architecture or, more recently, in the long-convolution SSM architectures), in practice, the amount of memory that such ANN architectures prescribe to encode particular temporal dependencies is fixed and likely greatly inadequate for an arbitrary temporal credit assignment problem at hand (as recently demonstrated in the Hyena / Long Conv / RWKV or H3 models).
The reason why recurrent ANNs (e.g. LSTMs) may be unable to remember the relevant information from the past is more subtle. Recall that a recurrent ANN is likewise trainable using an application of the chain rule of differentiation, and as such, it too requires input vectors of a given fixed size. What happens is that a recurrent ANN during training needs to be unrolled through T time steps, forming a feed-forward network of T modules with shared weights, fed with T consecutive chunks of the input vector data, corresponding to T consecutive time steps. And what the network then learns is essentially how to best optimize its parameter space given independent training examples, each of which spans no more than T consecutive time steps. The result of such a learning strategy is that, if there is some temporal correlation between two observations separated by more than T time steps, these observations will never jointly be part of any training example. And consequently, there will be no reason (or opportunity) for the optimization process to encode said correlation in the parameter space of the trained model. (While this problem can sometimes be alleviated by initializing the RNN neurons with the result of pre-processing a few additional initial data points of the underlying time series, in practice this strategy works only for very short temporal dependencies, due to the problem of vanishing gradients over time.) For example, if an ANN is trained to predict the outside temperature in London in one-hour intervals and T = 24, then the model will potentially learn that there is a day-and-night temperature cycle but will have no opportunity to learn that temperatures generally tend to be higher in the summer than in the winter.
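The London temperature example can be made concrete with a short sketch (the sinusoidal temperature series below is made up purely for illustration): once hourly data is chunked into training windows of T = 24 steps, no single training example ever contains two observations more than 23 hours apart, so the day-and-night cycle is observable within an example, while the seasonal dependency never is.

```python
import numpy as np

T = 24                       # truncation window: 24 hourly steps
hours = np.arange(24 * 365)  # one year of hourly observations

# Toy temperature series with a daily and a yearly (seasonal) component.
temperature = (10 * np.sin(2 * np.pi * hours / (24 * 365))   # seasonal cycle
               + 3 * np.sin(2 * np.pi * hours / 24))         # day/night cycle

# Training examples for an RNN unrolled through T steps: windows of length T.
windows = [temperature[t:t + T] for t in range(0, len(temperature) - T, T)]

# Within a single window, the largest temporal separation between two observations
# is T - 1 hours: the day/night correlation is observable inside one example,
# but no window ever pairs a summer observation with a winter one.
print(len(windows), "training examples, max separation within one example:",
      T - 1, "hours")
```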
The immediate consequence of the ANN's general inability to learn to always remember the relevant information from the past is that ANN-based AGI agents such as A3C / DQN / ChatGPT are often (relatively) memoryless. Their notion of what is relevant is predicated on what was relevant within the narrow temporal windows used during agent training. And consequently, the agents may fail to condition the expected utility of their actions on all the relevant observations from the past. To see the consequence of this, consider an A3C/DQN agent playing, e.g., the “Montezuma's Revenge” game. Herein, the agent often finds itself in a room wherein the optimal policy is to either go one way (if the agent does not have the key) or another (if the agent has the key). Yet, an agent whose policy is conditioned only on a few recent observations (down-sampled to a resolution that no longer permits the agent to see if it has said key) can only follow a policy that is ignorant of the agent's possession of the key. Likewise, a recurrent ANN-based agent would first need to be trained on a curriculum of toy-level, short-duration “collect the key then open the door” tasks to improve the chances that its network registers the events when the agent collects the keys.
The brain's approach to remedying the temporal credit assignment problem is to circumvent the problem of fixed temporal windows altogether: rather than relying on global back-propagation, the brain employs a temporally delayed version of a localized Hebbian learning rule. Likewise, FractalBrain employs a local, Hebbian-type learning algorithm rather than global gradient descent.
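For illustration, the sketch below implements a generic, temporally delayed (three-factor) Hebbian update with an eligibility trace; this is a textbook construction assumed for the example, not necessarily FractalBrain's exact rule. Each weight change depends only on locally available pre- and post-synaptic activity, accumulated in a decaying trace, and is gated by a delayed modulatory signal rather than by a globally back-propagated gradient.

```python
import numpy as np

# Generic three-factor Hebbian rule with eligibility traces (illustrative only;
# not FractalBrain's actual learning rule).
rng = np.random.default_rng(0)

n_pre, n_post = 8, 4
w = 0.01 * rng.standard_normal((n_post, n_pre))   # local synaptic weights
trace = np.zeros_like(w)                          # eligibility trace per synapse

eta = 0.05      # learning rate
decay = 0.9     # how quickly the eligibility trace fades

for t in range(100):
    pre = (rng.random(n_pre) < 0.2).astype(float)   # sparse pre-synaptic activity
    post = (w @ pre > 0.02).astype(float)           # simple local activation rule

    # Hebbian co-activity is recorded in a decaying trace (local information only).
    trace = decay * trace + np.outer(post, pre)

    # A delayed, scalar modulatory signal (e.g. a reward arriving several steps
    # later) converts the accumulated trace into an actual weight change.
    modulator = 1.0 if t % 10 == 9 else 0.0
    w += eta * modulator * trace

print("final weight norm:", np.linalg.norm(w))
```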
Problem 5: Bottom-up Learning of Final Representations
Surprisingly, the most notable feature of ANNs, the compressed representations of domain signals that they find, is actually neither biologically accurate nor attractive enough for lifelong-learning, robust AGI agents. That is, the ANN-learned representations are final (versus refineable over the course of an agent's life), hard to transfer to new domains (as they are anchored to specific, fine-grained sensory signals from an existing domain) and non-robust (and hence easy to fool), warranting an entirely different approach.
To illustrate the biological inaccuracy of current ANN representation learning, it is worth noting that a biological brain cannot possibly be employing the bottom-up strategy of first learning the low-level, fine-grained representations, then using them to learn higher-level representations, and so on, because the brain simply does not perceive the fine-grained signals at all times. That is, the raw signals (visual, auditory, somatosensory) first pass through dedicated signal aggregators and amplifiers (e.g. eye fixations on parts of the visual scene or the amplification of frequency bands in the cochlea) and as such, the vast majority of the information that the brain receives from the body's sensors is actually of quite coarse resolution. And as long as this coarse-resolution information is sufficient for the agent to achieve its objectives, no further signal amplifications / refinements are warranted. For example, even though the spots on a car windshield are visible at all times, they are most of the time imperceptible to a trained driver, who rarely chooses to amplify the visual signal at such short focal lengths.
Only once the signals perceived at the coarse resolution are no longer sufficient to achieve a given agent objective will the agent make an effort to perceive signals at a finer resolution: to this end, the agent will issue the respective commands to its sensory signal amplification mechanisms to explore and magnify the chosen parts of the sensory spectrum, in hopes of encountering snippets of higher-resolution signals that will provide it with useful information for the given task at hand. In other words, the agent will build its internal representations of the environment incrementally, in a top-down fashion (as opposed to ANNs, which build their representations in a bottom-up fashion, anchored at the bottom to high-resolution raw sensory signals), starting from blurry, low-resolution signal approximations and gradually refining them with more detail, if necessary. Consequently, domain objects that appear to be in uniform resolution will actually end up being represented internally in the agent's brain with variable resolution, depending on the required sensory complexity of the tasks that they have been involved with.
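A conceptual sketch of such top-down, active sensing is shown below (our own illustration; the downsampling factor and the "sufficiency" test are arbitrary stand-ins, not FractalBrain's mechanism): the agent operates on a heavily downsampled view by default and requests a full-resolution crop of one chosen region only when the coarse view does not suffice for its current objective.

```python
import numpy as np

def coarse_view(image, factor=8):
    """Downsample by block-averaging: the 'default' low-resolution percept."""
    h, w = image.shape
    return image[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def magnify(image, top, left, size):
    """Actively request a full-resolution crop of one chosen region."""
    return image[top:top + size, left:left + size]

rng = np.random.default_rng(1)
scene = rng.random((64, 64))              # stand-in for a raw, high-resolution scene

percept = coarse_view(scene)              # most of the time this is all the agent sees
confident_enough = percept.std() > 0.5    # toy proxy for 'the coarse view suffices'

if not confident_enough:
    # Top-down refinement: zoom into the most 'interesting' coarse cell only.
    iy, ix = np.unravel_index(np.argmax(percept), percept.shape)
    detail = magnify(scene, iy * 8, ix * 8, size=8)
    print("refined one region to full resolution:", detail.shape)
else:
    print("coarse percept was sufficient; no refinement requested")
```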
A direct opposite of the ANN representation learning strategy, the above-described brain strategy overcomes the issues with refineability, transferability and exploitability that plague ANN-learned representations. To begin with, notice how the ANN-learned representations are non-refineable. That is, once ANN learning concludes, the network parameters are final (optimized for a given task and a given resolution at which the task provides input signals to the ANN). Consequently, if the resolution at which the task signals are entered into the network later changes, the network will likely no longer perform as intended. For example, an image classification ANN would not be able to work out of the box if the resolution at which the images are presented to it is doubled. (This is in direct opposition to, e.g., the family of pseudo-Hilbert curves that allow for a progressive refinement of signal representation and are not derailed when the signal resolution increases.) And because of this non-refineability of ANN representations, the network will likely need to be retrained, which unfortunately may no longer be possible if an AGI agent is already deployed in the field.
The transfer of learned representations, equally important for AGI agents, has also been problematic for ANN-based agents. Partially responsible for this is that it is much harder to transfer high-resolution object representations across domains than their coarse-grained approximations. For example, in the Atari 2600 learning environment, it is harder to transfer the high-resolution car sprites from the Enduro game to the Pac-Man game, where car sprites are gone and replaced with high-resolution sprites of creatures. If, however, these two distinct objects are represented using a refinable-resolution representation (as we conjecture may be the case in the brain cortex), the transfer may actually succeed. For example, an agent that learned to avoid bumping into cars in Enduro, represented in low resolution as moving white blobs, may perform reasonably well when attempting to avoid being eaten by creatures in Pac-Man, also represented in low resolution as moving white blobs.
Last but not least, the fact that the ANN-learned representations include fine-grained signal filters in the lowest layers of the network exposes the network to adversarial attacks that are often imperceptible to a human observer. A cleverly designed attack exploits the amplification of the errors of ANN filter activations as the information propagates up the network: it involves an almost imperceptible (involving as little as just one pixel!), targeted perturbation of the fine-grained input signal that causes the lowest-layer network filters to mis-categorize their input, which in turn fools the higher-layer filters, and so on. This often results in a stunning error, for example, where a pair of images that to a human observer appear to be identical (remember that a human observer perceives these images in low resolution unless she chooses to magnify some parts of them) are recognized with close to 100% confidence as belonging to entirely different categories. In contrast, because the human visual system first perceives the entire scene in low resolution and only later chooses (if at all) to recursively magnify parts of it, it cannot easily be fooled by imperceptible input signal perturbations.
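To make the attack mechanism concrete, below is a minimal sketch of a gradient-sign (FGSM-style) perturbation applied to a toy logistic classifier rather than a deep CNN (the weights and the "image" are random stand-ins): the per-pixel changes are bounded by a tiny epsilon, yet they point in exactly the direction that most increases the model's error, so the prediction can swing wildly while the input barely changes.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy flattened 'image' and a fixed, stand-in 'trained' linear classifier.
x = rng.random(28 * 28)              # pixel intensities in [0, 1]
w = rng.standard_normal(28 * 28)     # stand-in for trained weights
y = 1.0                              # true label

p = sigmoid(w @ x)                   # prediction on the clean input

# Gradient of the cross-entropy loss w.r.t. the *input* is (p - y) * w for this model.
grad_x = (p - y) * w

# FGSM-style perturbation: a tiny step in the sign of that gradient.
epsilon = 0.02                       # imperceptibly small per-pixel change
x_adv = np.clip(x + epsilon * np.sign(grad_x), 0.0, 1.0)

print("clean prediction:      ", round(float(p), 3))
print("adversarial prediction:", round(float(sigmoid(w @ x_adv)), 3))
```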
A viable representation learning system of an AGI agent ought to produce representations that are continually refinable with improving signal resolution, easily transferable to new domains, and resilient to adversarial attacks. The bottom-up process of learning final ANN representations has failed to produce representations that satisfy these three critical requirements. FractalBrain remedies this situation using an active sensing, dynamic resolution approach to learning representations.
Problem 6: Episodic Learning
Of the learning paradigms embraced by ANNs that are particularly unrealistic for either biological or AGI agents, episodic learning is especially notable. Originally proposed to facilitate agent learning through the decomposition of a given agent domain into smaller, self-contained domains, episodic learning introduces two critical limitations for future AGI agents.
The first of these limitations is straightforward to understand: because it is ultimately the role of the human task designers to decide how to distill from an agent's domain its smaller chunks, referred to as episodes, the partitioning itself is unavoidably ad-hoc. As such, there is a risk that the isolated episodes will not contain all the relevant information that an agent needs in order to learn all the skills required for its success in the greater domain. For example, if in some episodes an agent encounters a seemingly useless object of type A (e.g. carbon dioxide) whereas in other episodes the agent encounters another seemingly useless object, of type B (e.g. hydrogen), then the agent will not have an opportunity to learn to combine these two objects to produce a potentially useful object of type C (methane, in our example). A continually learning AGI agent may, in contrast, have a greater chance of encountering both of these seemingly useless objects (A and B) over the course of its life and experimenting with combining them, to reveal that they are in fact critical components for manufacturing a useful object of type C.
The second limitation that episodic learning entails for AGI agents is that it causes the underlying learning process to appear non-stationary. To understand the reason for that, it is important to first recall that the agent's world is meant to always be reset prior to the beginning of each episode. This (arguably unrealistic) experiment design choice is deliberate and very convenient for ANN-based agents, as they no longer have to remember what happened in past episodes. (Note that ANNs have been known to perform poorly in long-term memory tasks outside of their training domain, as explained earlier.) However, from the perspective of a continually learning agent, if the world state (which, e.g., includes the changes the agent has made in said world) is silently reset in between the episodes, the entire learning domain appears non-stationary and inherently non-learnable. For example, in a StarCraft video game, a continually learning agent may remember that it has already harvested all the crystals from a given part of the world and, without being explicitly told that the episode has been restarted, never choose to send a harvester to that part of the world again.
Though it facilitates ANN-based agent training, the ad-hoc human distillation of the agent domain into much shorter, seemingly self-contained episodes is an artifact of non-AGI research. Not only does episodic learning potentially deprive agents of skills whose learning requires exposure to a contiguous set of episodes, but it also introduces an artificial domain non-stationarity that a continually learning AGI agent would struggle to model. FractalBrain does not employ the notion of episodes at either training or testing time. In contrast, it learns from a continuous, lifelong stream of agent observations.
Problem 7: Data-Driven Learning on How to Reason
Strong AGI agents ought to be able not only to recognize input signal patterns and act on them reactively, using the responses retrieved from their already-trained ANNs, but also to engage in proactive reasoning, via planning for potential futures as well as retrospective analysis of counterfactuals from the past. And while in the brain there likely are dedicated processes responsible for such strategic planning or counterfactual reasoning, none of these processes are explicitly encoded in the parameters of an ANN.
Indeed, the only way for an ANN to approximate such reasoning (and perhaps only within the narrow scope of a given training domain) is to have this reasoning be inferable from the data the ANN is trained on. Exciting ANN architectures have thus been proposed to facilitate such implicit reasoning, with a somewhat similar overarching theme: they provide the ANN with an external memory and a controller (e.g. a recurrent ANN) and allow the latter to tinker with the contents of the former, to achieve a desired objective on a given training task. Trained end-to-end, the distilled controller network can then be interpreted as engaging in some sort of reasoning over the contents of the external memory. And what is hoped for is that this implicit, narrow-domain reasoning will later generalize to other domains, eventually covering the entire space of reasoning processes if only a sufficient number of training tasks is involved. To date, the most notable success of this approach has been achieved using the GPT family of ANN architectures, trained on Internet-scale text data; given the vast amount of data, these architectures were observed to, e.g., learn to approximate the functionality of the induction heads required for the associative recall that underlies in-context learning.
Notwithstanding the soundness of this implicit ANN reasoning strategy, it is arguably not scalable for future AGI agents, as they would potentially have to be trained or fine-tuned in advance on a prohibitively large number of rare reasoning tasks to know how to handle them in the future. For example, an agent may need to reason about such uncommon tasks as “how to disentangle a pantyhose from the spokes of a flying bike”. Tasks such as this will likely never be encountered by an average human, let alone be added in sufficient scale and variety to the training set of an ANN agent, and the ANN agent will at best hallucinate how to perform them. (An in-depth discussion of the fallacy of purely data-driven reasoning mechanisms can be found in Judea Pearl's “The Book of Why”.)
A much more efficient and scalable reasoning strategy, employed by FractalBrain, is to explicitly bake into the agent's brain general-purpose reasoning mechanisms such as induction, associative recall, counterfactual replay, planning, etc., and postpone the encoding of the actual reasoning processes until required by a task. In essence, given a task at hand, FractalBrain employs its “temporary” memory to run, and encode the results of, its available “what-if” replay and preplay tests applied to said task. Only the relevant results of these tests are later copied to the agent's “permanent” memory, subsequently expanding the agent's knowledge base with additional “recipes” that the agent can then readily (without the need for conscious reasoning) apply to said task in the future.
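As a purely hypothetical illustration of this two-stage memory idea (the function names and interfaces below are our own invention and are not FractalBrain's actual API), such a mechanism might be sketched as follows: "what-if" rollouts are run against an internal world model in a temporary scratch buffer, and only the results deemed relevant are distilled into permanent, reusable recipes.

```python
# Hypothetical sketch only: the names and interfaces below are invented for
# illustration and are not FractalBrain's actual API.

def what_if_rollout(world_model, situation, candidate_action):
    """Simulate ('preplay') one candidate action against an internal world model."""
    return world_model(situation, candidate_action)

def run_reasoning_episode(world_model, situation, candidate_actions, relevance_threshold=0.5):
    temporary_memory = []                      # scratch space for replay/preplay results
    for action in candidate_actions:
        outcome, relevance = what_if_rollout(world_model, situation, action)
        temporary_memory.append((action, outcome, relevance))

    # Only the relevant results survive as reusable 'recipes'.
    permanent_memory = {
        (situation, action): outcome
        for action, outcome, relevance in temporary_memory
        if relevance >= relevance_threshold
    }
    return permanent_memory                    # applied reactively next time

# Toy world model: pushing the cart is both successful and relevant here.
def toy_world_model(situation, action):
    return ("cart moves", 0.9) if action == "push" else ("nothing happens", 0.1)

recipes = run_reasoning_episode(toy_world_model, "cart on flat ground", ["push", "wait"])
print(recipes)   # {('cart on flat ground', 'push'): 'cart moves'}
```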
Problem 8: Unrealistic Agent Objectives
The objective of an ANN-based ADP agent is to maximize the total sum of discounted, expected rewards collected over the course of some episode during agent training. As such, an agent that follows such a strategy is simply myopic, for the following reason: although in theory the Q-value layer of a trained A3C/DQN does integrate all the rewards that the agent will collect when following the currently learned policy, in practice this is not the case. In infinite-horizon planning problems, because the value of each Q-value layer neuron needs to be bounded (to allow the gradients back-propagated through the network to be bounded), each subsequent reward that an agent expects to collect is discounted by a fixed discount factor smaller than 1 (with the discount effect exponentially compounding for subsequent actions, to produce a converging, and hence bounded, geometric series of rewards). Consequently, the A3C/DQN agent undervalues the rewards expected to be encountered in the distant future by a rather ad-hoc factor, with an often disastrous impact on the agent's performance. For example, if in the “Montezuma's Revenge” game the agent is lured with a tiny, positive, immediate reward for entering a room that after a sufficiently long delay becomes a trap (which will cost the agent a huge reward penalty), the agent will likely re-enter said room on the next occasion (because said long delay will have resulted in such severe discounting of the huge penalty that it will be outweighed by said tiny, positive, immediate reward). Tree search techniques (e.g. DeepMind's “AlphaGo” or OpenAI's project “Strawberry”) alleviate this problem, but do not remedy it, since the tree search is performed only for relatively short observation sequences. In essence, though mathematically convenient, the ad-hoc discounting of later-in-time events is simply not something that humans or AGI agents should resort to. (Humans indeed do discount the rewards from situations that are less likely to occur, but that does not automatically correspond to situations that occur later in time.)
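The delayed-trap argument can be checked with simple arithmetic (the numbers below are illustrative): with a typical discount factor, even an enormous penalty shrinks below a tiny immediate reward once it is pushed far enough into the future.

```python
gamma = 0.99             # a typical discount factor
immediate_reward = 0.1   # tiny lure for entering the room now
penalty = -1000.0        # huge penalty from the trap
delay = 1500             # ...but it only springs after this many time steps

discounted_penalty = (gamma ** delay) * penalty
total = immediate_reward + discounted_penalty

print(f"discounted penalty: {discounted_penalty:.4f}")   # roughly -0.0003
print(f"net value of entering the room: {total:.4f}")    # positive, so the agent re-enters
```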
The summation of the rewards that an agent expects to receive is yet another impractical objective of an ANN-based AGI agent. For one, it is simply impractical for agents to attempt to plan their actions by taking into account all the rewards that will be collected over the entire course of their lifetimes: not only would that require them to reason about an exponential-in-planning-horizon number of plausible future reward trajectories, but it would also result in unbounded sums of rewards for infinite planning horizons. And while the latter problem may be somewhat mitigated by employing a discount factor (which would lead to agent myopia, as just discussed) or by employing an average-reward objective function (which in turn would produce far-sighted agents), the former problem will still persist, especially for the long planning trajectories of real-world planning problems.
The above-discussed objective of the current ANN-based AGI agents is impractical and biologically inaccurate. Specifically, an average biological actor is primarily interested in aversively avoiding the most painful or life-threatening experiences, while simultaneously opportunistically pursuing the most pleasurable experiences over the entire duration of its life. And it certainly does not employ an additive reward aggregation: for example, for an average human, the reward for eating an apple on a given day does not simply stack up with more and more apples eaten, but is rather a function of the unique activations of the taste pleasure receptors and the deactivations of the hunger pain receptors. In essence, ignoring the number of times a given reward is encountered on a given trajectory naturally leads a biological agent to attribute disproportionately greater importance to rare, yet more extreme, rewards on said trajectory. That is, unlike a typical A3C/DQN agent that adds up all the rewards it encounters and hence dilutes the rewards from rare but often crucial events with a plethora of minor rewards (e.g. incremental costs for agent movements), a biological agent effectively avoids this undesired dilution. And the perceived inability of a biological agent to properly differentiate the utility of reward trajectories that contain a different number of activations of the same reward stimulus can largely be mitigated by making the reasonable assumption that, in real-world biological systems, such repeated activations of the same stimuli have a higher chance of triggering an activation of some other (stronger) stimulus. (For example, the repeated activation of a sugar taste receptor has a higher chance of triggering the activation of a stomach pain receptor.) The result of this is that the agent only implicitly prefers trajectories with a greater number of similar, positive rewards (or avoids trajectories with similar, negative rewards), as it is the agent's model that implies that such repetitive rewards are likely to be followed by other (stronger) types of rewards.
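A toy comparison makes the dilution argument concrete (this illustrates the argument only and is not FractalBrain's actual objective): under plain summation, a long stream of minor gains can mask a single catastrophic event that an extreme-sensitive criterion would never overlook.

```python
# Two reward trajectories describing comparable stretches of agent life.
risky = [0.05] * 200 + [-5.0]    # many minor gains, then one catastrophic event
safe  = [0.02] * 201             # modest gains throughout, no catastrophe

# Additive aggregation (the standard RL objective): the catastrophe is diluted
# by the plethora of minor rewards, so the risky trajectory looks better.
print("sum:    risky =", round(sum(risky), 2), " safe =", round(sum(safe), 2))

# An extreme-sensitive criterion (here simply 'avoid the worst event') is not
# fooled: the rare, severe penalty dominates the comparison.
print("worst:  risky =", min(risky), " safe =", min(safe))
```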
For planning how to reach its objectives, FractalBrain is not reliant on ANN gradient descent and as such, does not have to employ the discount factors that result in ANN agents acting myopically.
Problem 9: Back-propagation of Reward Signals
The use of ANNs as function approximation components of reinforcement learning algorithms, as in the value network of DQN or the critic network of A3C, introduces an additional set of problems. To understand these problems, recall that DQN / A3C still belong to the class of supervised-learning algorithms trainable using back-propagation. That is, they still require a supervision signal, computed in their case as the difference between the network's prediction of the discounted expected reward of an agent executing an action and a more accurate prediction of said value, calculated using the observed immediate reward(s) for executing said action. As such, they are directly exposed to a new set of problems, of which the following two are the most pronounced.
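Concretely, the supervision signal referred to above is the standard temporal-difference (TD) error used by DQN-style agents; the sketch below spells it out with made-up numbers.

```python
gamma = 0.99

# Quantities obtained from one environment transition (illustrative values only).
q_prediction = 1.50      # network's estimate of Q(s, a)
observed_reward = 0.0    # immediate reward r for executing a in s
q_next_best = 1.45       # max over a' of the network's Q(s', a')

# One-step bootstrapped target and the TD error that drives back-propagation.
td_target = observed_reward + gamma * q_next_best
td_error = td_target - q_prediction

print("TD target:", round(td_target, 4))
print("TD error (supervision signal):", round(td_error, 4))
# If rewards are absent and the value estimates are mutually consistent, this
# error, and hence the back-propagated gradient, vanishes and little is learned.
```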
Firstly, in the absence of immediate rewards provided to the agent by the simulator, the agent does not learn anything: that is, the supervision signal is (effectively) zero and so is the corresponding gradient back-propagated through the layers of the agent's network. For example, in a maze domain wherein the agent is only rewarded once it finds its way out of the maze, barring the use of any intrinsic-motivation auxiliary rewards, the agent will not have encountered any reward for prolonged periods of time and hence will not have learned anything (not updated the parameters of its ANN). And while this may be a tolerable problem for an agent having the comfort of living indefinitely in a simulated training domain, the problem would certainly become unacceptable for AGI agents operating in real-world environments. What is essentially advocated for here is that an AGI agent should employ other forms of learning (than back-propagation of expected reward signals) to efficiently learn to act in a given sparse-reward domain. In other words, agent learning of the underlying world model should occur even in the absence of any particular reward signals, with the role of the perceived reward limited to modulating the learning process, to bias the statistics of the underlying learning process towards more meaningful events.
Secondly, the inclusion of agent rewards in the gradients back-propagated through the network has a severe impact on the later transferability of the agent's knowledge to a new domain. Specifically, the gradient that carries the partial reward signals will at some point unavoidably start infusing the filter parameters with said rewards coming from the given training task at hand. And while this is the right thing to do towards optimizing the agent's performance on that very task, it will result in the network filter parameters being permanently attuned to the training task at hand. That is, some input patterns (of potentially critical importance for future tasks) that were meaningless for the agent's performance on the old task will end up being filtered out by the network. Consequently, future attempts to transfer the knowledge of the original network (e.g. bottom-layer convolutional filters) and reuse it in a new network built for a new set of tasks may simply fail. For example, an agent trained to drive a car will likely have optimized its parameter space in such a way that it infers the rewards of its actions based on what it perceives on the road and in its surroundings, but not necessarily on what is in the sky. As such, if this agent were to be transferred to a new domain wherein it is asked to predict the chance of rain based on what it currently sees, it would likely filter out all the relevant visual cues (e.g. the color and shapes of the clouds) plainly visible in the sky.
FractalBrain possesses the ability to learn in sparse-reward or even zero-reward domains, as well as to restrain itself from encoding domain rewards in its model in a way that would inhibit future transfer of agent knowledge to new domains. This is accomplished by using a unique model-based planning approach that need not resort to the back-propagation of reward signals.
Problem 10: Shallow Deep Reinforcement Learning
A rather inconspicuous misconception that characterizes the A3C/DQN deep reinforcement learning family of algorithms is that they in fact use only shallow reinforcement learning, albeit sitting on top of a deep signal-processing ANN architecture. And this somewhat misleading terminology would not be much of an issue if not for the limitations that the shallow RL component of A3C/DQN implies. Listed below are some of those limitations that are of particular concern for AGI agents.
Firstly, a shallow RL agent explicitly builds a joint plan: an atomic, non-decomposable plan that, although it may implicitly involve multiple sub-tasks, prescribes a carefully chosen (joint) action, executed at each and every time step, to make a direct contribution to the fulfillment of the plan. And since the plan is not explicitly decomposable, a (model-based) shallow RL agent that aims to accomplish a given task often has to plan up-front for a prohibitively large number of intermediate steps. That is, it needs to plan for all the steps of the auxiliary sub-tasks that will have been interspersed in between the actions of the agent's primary task at hand. For example, an agent that schedules its work meetings during the day needs to additionally plan up-front for such arguably unrelated sub-tasks as which restrooms to use and what salad dressing to choose at lunch time.
An arguably more efficient strategy, implemented by FractalBrain and conjectured to be employed by a biological brain, is to: (i) First explicitly discover and simultaneously keep track of multiple disjoint trajectories of actionable events and then (ii) Multitask between them (akin to a CPU multitasking between concurrent processes), towards constructing relatively short, individual plans for each separate trajectory.
Another limitation of shallow RL agents has to do with their inability to (i) automatically discover and form plans of a varying degree of abstraction (to operate on more manageable planning spaces and facilitate plan transfer to new domains), as well as to (ii) seamlessly switch between these plan abstractions, to maximize the agent's performance on a given domain. To wit, as already mentioned, some form of plan abstraction is readily available to a biological agent, who already perceives the world's observations in variable resolutions (because of the aforementioned selective signal amplifiers). And once the agent manages to distill the trajectories of its coarse-grained observations (of potentially different modalities), they may indeed constitute solid foundations for abstract plan formation. These coarse-resolution, abstract plans could then be supplemented (at the agent's whim) with more fine-grained plans, formed from higher-resolution agent observations, allowing the agent's planning mechanism to effectively switch back and forth between (or even fuse) plan trajectories of varying levels of abstraction. For example, an agent playing a capture-the-flag video game may form an abstract plan (from low-resolution observations) of how to navigate the underlying environments to find a flag or home base. The very same agent may also automatically discover a more specific, concurrent plan (from high-resolution observations) of how to look at the subtle, small features of the other player characters in the game, to distinguish its teammates from the players of the opponent team. Not only will these plans be relatively short and less complex, but they will also be much easier to transfer to slightly different domains with either different map layouts or different opponent types.
The automated discovery of plan decompositions and plan abstractions is baked into FractalBrain. In contrast, DRL agents lack this functionality, resulting in plans that have short horizons, transfer poorly across domains, and are close to impossible to interpret and explain.