Advanced Sequence Modeling

In what follows we shall explore certain classes of challenges for sequence prediction models. These challenges are motivated by the following observation, made by a number of research groups:

Word-error-rate / next-token-prediction tests are intrinsically naive: on the one hand, text-search algorithms would master them easily; on the other, predicting the next word/token is no good measure of the "intelligence" of a model. Associative-recall tests, concerned mainly with bigrams (pairs of words/tokens), are better - e.g. they allow one to assess whether a model can learn transitive relations (as in MQAR). In essence, the challenges involve analyzing the content of presented contexts for the occurrence of groups of important information and the relations between them.

Further observations, in our opinion, put these challenges into a proper formal form, devoid of anthropomorphic motivations, and lead to a better understanding of AI models. In the paper, the following observations about the construction of challenges were made:

  • the context should be full of haystack, meaning a large number of tokens irrelevant to the final answer (complicating the "what is relevant" problem),
  • the tokens to look for / pay attention to should be fairly generic (no special tokens, which would make the haystack easily discardable just by single-token classification),
  • ignoring the arrow ("→") notation of the authors, the tests boil down to the following questions/challenges:

Challenge type 1 (list-contain challenge): Does the long context contain a list of certain subsequences in it?
Challenge type 2 (set-contain challenge): Does the long context contain a set of certain subsequences in it?

Exploring the challenges


Examples of challenges

list-contain example

This challenge deals with the existence of a predefined list of subsequences, e.g. ['AA', 'BB', 'CC'] (using Python notation), in a context; they should appear in order, but can be separated by an arbitrary amount of "hay" tokens, meaning tokens irrelevant to the answer. In order to avoid the trap of singling out the relevant parts of the context just by the type of tokens used, we shall use a very conservative, 4-letter alphabet (we will use ABCD, but the DNA nucleotides ACTG could be used instead).

If the sequences AA, BB, CC appear in the context as non-overlapping subsequences, in the given order, we classify such contexts as Yea contexts; e.g. the following ones are Yea contexts:

ctx1='ABCABCAABCBCBCBBCDCDCCABC' ctx2='ABCABCAABBBCBCBCBBCDCDCCABC'

It might not be easy for the reader to find the required subsequences; let us help with this task by using dots, ., for what (ultimately) constitutes the "haystack":

ctx1='......AA......BB....CC...' ctx2='......AABB............CC...'

Clearly, finding such lists of subsequences is not trivial (but doable by classical string algorithms). In addition, due to the very limited alphabet, the relevant tokens may appear many times all over the context; they may also appear "accidentally", meaning there is a non-zero probability of a random context actually being a Yea context. This will be addressed in the precise formulation of the challenges which follows.
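Such a check is indeed straightforward classically; the following is a minimal Python sketch of ours (not part of any model or of the original test suite) of the greedy ordered-matching test:

```python
def contains_list(context, needles):
    """List-contain: do the needles occur in order as non-overlapping
    substrings of the context?  Greedy leftmost matching is safe here:
    taking the earliest match for a needle only leaves more room for
    the needles that follow."""
    pos = 0
    for needle in needles:
        idx = context.find(needle, pos)
        if idx == -1:
            return False
        pos = idx + len(needle)  # resume searching after this match
    return True

# the two Yea contexts from above, and a Nay one
print(contains_list('ABCABCAABCBCBCBBCDCDCCABC', ['AA', 'BB', 'CC']))  # True
print(contains_list('ABCABCABCABCABCABC', ['AA', 'BB', 'CC']))         # False
```

The greedy strategy is exact for fixed substrings: if any non-overlapping ordered assignment exists, the leftmost one does too.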

set-contain version

For such challenges, the set of subsequences {'AA', 'BB', 'CC'} (again in Python notation) should be present in a given context. In other words - these subsequences must appear in the context for a Yea classification, but their order can be arbitrary. The following contexts (with the haystack masked again) lead to Yea answers:

ctx1='....BB..AA.........CC...' ctx2='......CCAABB.......CC...'

(in ctx2 the subsequence CC appears twice; either occurrence can be marked as hay, and the context will still be a Yea for the set-contain challenge).

On the other hand, for both of the presented challenges (list-contain and set-contain) contexts such as presented below are categorized as Nay:

ctx1='ABCABCABCABCABCABC' ctx2='AAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBCACAAAAA'
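For reference, the set-contain variant can be decided classically by trying every order with the same greedy check (again a sketch of ours; any valid non-overlapping assignment induces some left-to-right order of its matches, so trying all permutations is exhaustive):

```python
from itertools import permutations

def _in_order(context, needles):
    """Greedy ordered, non-overlapping substring matching."""
    pos = 0
    for needle in needles:
        idx = context.find(needle, pos)
        if idx == -1:
            return False
        pos = idx + len(needle)
    return True

def contains_set(context, needles):
    """Set-contain: all needles present, non-overlapping, order arbitrary."""
    return any(_in_order(context, p) for p in permutations(needles))

# the masked set-contain example (dots standing in for haystack tokens)
print(contains_set('....BB..AA.........CC...', ['AA', 'BB', 'CC']))  # True
print(contains_set('ABCABCABCABCABCABC', ['AA', 'BB', 'CC']))       # False
```

The factorial cost of the permutation loop is irrelevant for the 3-element sets used here, though a real verifier for large sets would need something smarter.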

Generalizations

A number of important challenges can be reformulated into the form of list-contain or set-contain challenges. Clearly, associative-recall tasks are just special, simple cases of problems of the above type. A good challenge arises: to find out which of the multitude of AI models can solve them at all, and if so, with what parameters and with how many resources for the training. Note that, due to the very limited alphabet, the contexts are expected to be long (starting with a few hundred tokens and ending with tens of thousands of tokens).

Further possible applications of the above challenges may involve:

  • putting restrictions on the distances between tokens of the list (list-contain challenge),
  • posing other types of challenges, e.g. testing inference of the structure of transitive/reflexive relations (as in MQAR),
  • predicting numerical values (e.g. for binding affinities etc), via introduction of a larger number of answer classes.

Clearly, having an AI algorithm capable of learning and solving such challenges would be of advantage, and – in such a case – observing the process of the adaptation of the model to the challenge (during training) will shed light onto the capabilities and emergent internal structures of the model.

Vocabulary vs. marker sequences


In order to be directly applicable to existing AI models, we introduce the following simple structure of "markers", which still keeps the problem in the given category (of vocabulary-starved challenges). Since no special tokens should be introduced, token sequences called marker sequences will be employed. These are few-token sequences that have a function for the problem, but are composed of the same vocabulary as everything else (including the haystack). The marker sequences will be special in the sense that finding them in a random context will be practically impossible (in contrast to finding the subsequences from the challenges, e.g. the subsequence AA in a DNA context, which is very common).

Marker sequences

For now we introduce the following marker sequences:

  • "(" marker,
  • ")" marker,
  • "Q" marker,
  • markers for the allowed answer classes (can be Yea, Nay or as many as the number of categories the problem will require).

(All of these marker sequences are just sequences of tokens from the alphabet, of length between 10 and 30, which is enough to make random collisions impossible.)

The problem is now formulated as follows:

In a vocabulary-starved language (such as DNA), a context is given as a sequence of tokens. This context may or may not contain any marker sequences. For a challenge of list-contain type: find all elements hidden between the bracket markers, and check whether they contain the given list of token subsequences. If so, continue after the Q-marker sequence with the Yea marker sequence; else, continue with the Nay marker sequence.

In other words, without introducing any new tokens we present a context to the model, and if this context contained bracket markers, e.g.

test_ctx='....(sx)....(sy)....(sz)...Q'

then the subsequence sx should be equal to AA, sy to BB and sz to CC, if the answer is to be "Yea" for the list-contain challenge with ['AA', 'BB', 'CC'] (as in the examples above); in such a case, upon seeing the marker sequence Q, the model should predict a certain number of tokens which should reproduce the unique marker sequence for the Yea answer.

Recalling the example ctx1='......AA......BB....CC...' and setting (='DAD', )='ADA', Q='BAB', models will be presented with ctx1='...DADAAADA...DADBBADA...DADCCADA........BAB' and should continue with, e.g., CCCCC if Yea='CCCCC', while for contexts not fulfilling the task of the challenge the model should continue with DDDDD if Nay='DDDDD'.
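The composition of such a marked context can be sketched as follows (build_context is a hypothetical helper of ours; the short markers match the example above, while real marker sequences would be 10-30 tokens long, and the haystack would consist of random alphabet tokens rather than dots):

```python
# short illustrative markers, as in the example above (NOT realistic lengths)
OPEN, CLOSE, Q = 'DAD', 'ADA', 'BAB'

def build_context(pieces, hay):
    """Wrap each piece in bracket markers, separate and pad the pieces
    with haystack, and append the Q marker that requests the answer."""
    body = hay.join(OPEN + p + CLOSE for p in pieces)
    return hay + body + hay + Q

print(build_context(['AA', 'BB', 'CC'], hay='....'))
# ....DADAAADA....DADBBADA....DADCCADA....BAB
```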

Notes:

  1. the first thing the model must learn is to reproduce one of the two (in Yea/Nay problems) answer marker sequences upon seeing the marker sequence Q; this already assumes a perfect word error rate for the continuation after the Q sequence,
  2. the model must then either:
    • understand that there are special sequences - the bracket markers; failing to do so will completely confuse the model, as small interior sequences such as AA are expected to appear randomly in any context, or
    • learn whole sequences such as (AA) as contained in the investigated list, i.e. treat the problem as a list-contain problem with ['(AA)', '(BB)', '(CC)'] (inserting the long marker sequences for ( and ) where appropriate); while this is possible, the markers for ( and ) are usually long, to ensure uniqueness in random contexts, and the model might find it hard to focus on the little parts between the brackets, essentially making this approach equivalent to the one above, i.e. understanding that brackets are present.
  3. learning the bracket structure essentially gets rid of the "haystack"; the model must then learn to check whether the found sequence of subsequences is equal to the required sequence (list-contain challenge), or has exactly the same set of subsequences as the required set (set-contain challenge). As a side note - one may test whether a model has any chance of solving such challenges by first introducing an extra token, ., into the vocabulary for the haystack, and presenting the problem in this way (with the haystack masked out; a sine qua non condition for having any chance to solve the full challenge).

Preliminary implementation


At present we are using:

# answer marker
A = 'AAAAAA' + 'CCCCCC' + 'TTTTTT' + 'GGGGGG'

# begin/end bracket tokens
B = 'CCCCCC' + 'TTTTTT' + 'CCCCCC'
E = 'GGGGGG' + 'TTTTTT' + 'GGGGGG'

These sequences do not appear at all in the DNA sequence "NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0" with which we work.

For the encoded information, an isomer class was proposed:

class IsoCategory(BaseModel):
    name: str
    base_forms: list[tuple[str, ...]]
    permuted_allowed: bool
    result_marker: str

with implementations of the type:

yea_protein = IsoCategory(
    name='yea',
    base_forms=[('AA', 'CC', 'TT')],
    permuted_allowed=False,
    result_marker='AAAAAA' + 'AAAAAA' + 'TTTTTT' + 'TTTTTT' + 'GGGGGG',
)

nay_protein = IsoCategory(
    name='nay',
    base_forms=[('AA', 'CC', 'GG'), ('AA', 'TT', 'GG'), ('CC', 'TT', 'GG')],
    permuted_allowed=True,
    result_marker='GGGGGG' + 'GGGGGG' + 'CCCCCC' + 'CCCCCC' + 'TTTTTT',
)

The parameter permuted_allowed effectively allows switching between the list-contain and set-contain presentations of the problem.

Contexts of length >=200 were randomly generated, and yea- or nay-subsequences were imprinted at random locations in these contexts.
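One possible shape of that generation step is sketched below (our own assumption of the procedure; the actual generator may differ, e.g. it could overwrite tokens instead of inserting them):

```python
import random

ALPHABET = 'ACTG'

def make_context(needles, length=200, seed=None):
    """Random haystack context with the needles imprinted at random,
    ordered, non-overlapping locations (here by insertion, so the
    result is slightly longer than `length`)."""
    rng = random.Random(seed)
    ctx = [rng.choice(ALPHABET) for _ in range(length)]
    total = sum(len(n) for n in needles)
    # distinct, sorted insertion points in the original coordinates
    cuts = sorted(rng.sample(range(length - total), len(needles)))
    # insert right-to-left so the earlier offsets remain valid
    for pos, needle in sorted(zip(cuts, needles), reverse=True):
        ctx[pos:pos] = list(needle)
    return ''.join(ctx)
```

With length=200 and three 2-token needles the generated context has 206 tokens, and the needles are guaranteed to appear in the given order; plain random contexts (rejecting accidental Yea ones) can serve as Nay examples.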

Results

Here we report on the results for the list-contain challenge having 3 subsequences, prepared as described above.

FNets

FNets with various numbers of regions were trained on the challenge.

Results depend strongly on the number of regions. For >=19 regions, FNets discovered that only Yea or Nay answers were allowed (and almost no other answers were produced). Starting with around 28 regions, FNets found a way to solve the problem and chose the proper answer most of the time.

Figure: FNet solution of the masked hash problem of type I (list-contain); curves show the median and the 25th and 75th percentiles.

The results presented were obtained on 1 CPU core, running the training for 10 seconds (10 epochs), although similar results are obtained with as few as 5 epochs, as illustrated in the figure below. Indeed, it is the structure of the FNet, and not the number of training epochs, which determines the FNet's performance on these challenges.

Figure: FNet solution of the masked hash problem of type I (list-contain), for the 5-epoch runs; curves show the median and the 25th and 75th percentiles.

Hyena models

We attacked the problem with some of the Hyena models (mainly the small ones, such as LongSafari/hyenadna-small-32k-seqlen-hf or LongSafari/hyenadna-medium-160k-seqlen-hf, as contexts were <1000 tokens long). The structure of the model(s) was taken, but any pre-training was wiped out before the challenge training started.

Figure: LongSafari/hyenadna solution of the masked hash problem of type I (list-contain); curves show the median and the 25th and 75th percentiles. Statistics from 150 training runs of the LongSafari/hyenadna-small-32k-seqlen-hf model (not pretrained).

As always with such models, it takes very many epochs to get reasonable results. The model was run on an A100 GPU for over an hour (in total). Starting at around 50 epochs the model discovered that only two answers were allowed, but even then the generated answers rarely reproduced the required sequences (the Yea/Nay result markers) exactly; there were shifts by a few tokens, or some tokens were misplaced, so a criterion was implemented to accept an answer when >=80% of its tokens agreed with one of the result markers. However, the Hyena models never actually solved the problem; the results stagnated at around 45% correct predictions and never progressed further, even after thousands of epochs.
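The >=80% acceptance criterion mentioned above can be sketched as follows (our reconstruction, using a purely positional comparison; handling the observed small shifts would additionally require an alignment step):

```python
def token_agreement(answer, marker):
    """Fraction of marker positions reproduced exactly by the answer."""
    if not marker:
        return 0.0
    return sum(a == m for a, m in zip(answer, marker)) / len(marker)

def classify_answer(answer, yea_marker, nay_marker, threshold=0.8):
    """Accept an answer when >=80% of its tokens agree position-wise
    with one of the result markers; otherwise leave it undecided."""
    if token_agreement(answer, yea_marker) >= threshold:
        return 'yea'
    if token_agreement(answer, nay_marker) >= threshold:
        return 'nay'
    return None

print(classify_answer('CCCDC', 'CCCCC', 'DDDDD'))  # yea  (4/5 = 0.8 agreement)
```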

Transformer models

For this part we used the InstaDeepAI Nucleotide Transformer models, specifically the smallest version, InstaDeepAI/nucleotide-transformer-v2-50m-multi-species (as context lengths were limited in this study). Again, only the structure of the model was taken (and not the weights from DNA-specific pretraining). The model behaved qualitatively similarly to the Hyena models, with the exception that the NT models were quite capable of learning the exact form of the Yea/Nay marker sequences. As is visible from the plot below, these models do not seem capable of solving the masked hash problem of type I (list-contain) beyond a 50% probability of being correct, basically meaning that the origin of the correct answer remained unresolved by the model.

Figure: NucleotideTransformer solution of the masked hash problem of type I (list-contain); curves show the median and the 25th and 75th percentiles. Statistics from 150 training runs of the InstaDeepAI/nucleotide-transformer-v2-50m-multi-species model (not pretrained).

We are currently applying the FractalBrain genomics sequence prediction engine to the problem of transcription initiation rates in modified DNA sequences, in collaboration with Constructive Bio. Additionally, we are exploring the applicability of our Fractal Genomics Foundation model in use-cases of interest to AstraZeneca.