Advanced Sequence Modeling
In what follows we shall explore certain classes of challenges for sequence prediction models. These challenges are motivated by the following observation made by a number of research groups:
Word-error-rate / next-token-prediction tests are intrinsically naive: on the one hand, text-search algorithms would master them easily; on the other, predicting the next word/token is no good measure of the "intelligence" of a model. Associative-recall tests, concerned mainly with bigrams (pairs of words/tokens), are better; e.g. they allow one to assess whether a model can learn transitive relations (as in MQAR). In essence, these challenges involve analyzing the content of the presented contexts for occurrences of groups of important information and relations between them.
Further observations, in our opinion, put these challenges into a proper formal form, devoid of anthropomorphic motivations, and lead to a better understanding of AI models. In the paper, the following observations about the construction of challenges were made:
- the context should be full of haystack, meaning a large number of tokens irrelevant to the final answer (complicating the "what is relevant" problem),
- the tokens to look for / pay attention to should be fairly generic (no special tokens, which would make the haystack easily discardable by single-token classification),
- ignoring the arrow ("→") notation of the authors, the tests boil down to the following questions/challenges:
Challenge type 1 (list-contain challenge): does the long context contain a given list of subsequences?
Challenge type 2 (set-contain challenge): does the long context contain a given set of subsequences?
Exploring the challenges
Examples of challenges
list-contain example
This challenge deals with the existence of a predefined list of subsequences, e.g. ['AA', 'BB', 'CC'] (using Python notation), in a context; they should appear in order, but can be separated by an arbitrary amount of "hay" tokens, meaning tokens irrelevant to the answer. In order to avoid the trap of singling out the relevant parts of the context just by the type of tokens used, we shall use a very conservative, 4-letter alphabet (we will use ABCD, but the DNA nucleotides ACTG could be used instead).
If the sequences AA, BB, CC appear in the context as non-overlapping contiguous subsequences (substrings), in the given order, we classify such contexts as Yea contexts; e.g. the following ones are Yea contexts:
ctx1='ABCABCAABCBCBCBBCDCDCCABC'
ctx2='ABCABCAABBBCBCBCBBCDCDCCABC'
It might not be easy for the reader to find the required subsequences; let us help with this task by using dots, ., for what (ultimately) constitutes the "haystack":
ctx1='......AA......BB....CC...'
ctx2='......AABB............CC...'
Clearly, finding such lists of subsequences is not trivial (but doable by classical string algorithms; see the sketch below). In addition, due to the very limited alphabet, the relevant tokens may appear many times all over the context; the subsequences may also appear "accidentally", meaning there is a non-zero probability of a random context actually being a Yea context. This will be addressed in the precise formulation of the challenges which follows.
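As a point of reference, a minimal sketch of such a classical checker (the function name is ours): a greedy left-to-right scan, which is exact here, since taking the leftmost occurrence of each subsequence never hurts the placement of later ones:
def contains_list(ctx: str, needles: list[str]) -> bool:
    # Greedy scan for non-overlapping, in-order (contiguous) occurrences.
    pos = 0
    for needle in needles:
        i = ctx.find(needle, pos)  # leftmost occurrence at or after pos
        if i == -1:
            return False
        pos = i + len(needle)      # forbid overlaps with later needles
    return True

assert contains_list('ABCABCAABCBCBCBBCDCDCCABC', ['AA', 'BB', 'CC'])  # ctx1: Yea
assert not contains_list('ABCABCABCABCABCABC', ['AA', 'BB', 'CC'])     # Nay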
set-contain version
For such challenges, the set of sequences {'AA', 'BB', 'CC'} (again in Python notation) should be present in a given context. In other words, these subsequences must appear in the context for a Yea classification, but their order can be arbitrary. The following sequences (with the haystack masked again) lead to Yea answers:
ctx1='....BB..AA.........CC...'
ctx2='......CCAABB.......CC...'
(in ctx2 we present the subsequence CC twice; either occurrence can be marked as hay, and the context will still be a Yea for the set-contain challenge).
On the other hand, for both of the presented challenges (list-contain and set-contain), contexts such as those presented below are categorized as Nay:
ctx1='ABCABCABCABCABCABC'
ctx2='AAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBCACAAAAA'
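For the set-contain variant, a sketch can be built on top of contains_list above: any valid non-overlapping placement of the subsequences induces an ordering by position in the context, so trying all orderings is exact (and cheap for the small sets used here):
from itertools import permutations

def contains_set(ctx: str, needles: set[str]) -> bool:
    # Some ordering of the needles must occur as non-overlapping substrings.
    return any(contains_list(ctx, list(p)) for p in permutations(needles))

assert contains_set('....BB..AA.........CC...', {'AA', 'BB', 'CC'})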
Generalizations
A number of important challenges can be reformulated as list-contain or set-contain challenges. Clearly, associative-recall tasks are just simple special cases of problems of the above type. A good challenge, then, is to find out which of the multitude of AI models can solve them at all, and if so, with what parameters and with how many training resources. Note that, due to the very limited alphabet, the contexts are expected to be long (starting with a few hundred tokens and ending with tens of thousands of tokens).
Further possible applications of the above challenges may involve:
- putting restrictions on the distances between tokens of the list (list-contain challenge),
- posing other types of challenges, e.g. testing inference of the structure of transitive/reflexive relations (as in MQAR),
- predicting numerical values (e.g. binding affinities), via the introduction of a larger number of answer classes.
Clearly, having an AI algorithm capable of learning and solving such challenges would be advantageous, and, in such a case, observing the process of the model's adaptation to the challenge (during training) will shed light on the capabilities and emergent internal structures of the model.
vocabulary vs. marker sequences
In order to be directly applicable to existing AI models, we introduce the following simple structure of "markers", which keeps the problem within the given category of vocabulary-starved challenges. Since no special tokens should be introduced, token sequences called marker sequences will be employed. These are few-token sequences that have a function for the problem, but are composed of the same vocabulary as everything else (including the haystack). The marker sequences will be special in the sense that finding them in a random context will be practically impossible (in contrast to finding subsequences from the challenges, e.g. the subsequence AA in a DNA context, which is very common).
Marker sequences
For now we introduce the following marker sequences:
- "(" marker,
- ")" marker,
- "Q" marker,
- markers for the allowed answer classes (these can be Yea and Nay, or as many as the number of categories the problem requires).
(All of these marker sequences are just sequences of tokens from the alphabet, of length between 10 and 30, which is enough to make random collisions practically impossible; see the estimate below.)
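A back-of-envelope estimate (assuming i.i.d. uniform tokens over the 4-letter alphabet) of the expected number of accidental occurrences of a length-L marker in an n-token context:
def expected_hits(n: int, L: int) -> float:
    # A length-L marker matches at each of the n-L+1 positions with
    # probability 4**-L under the i.i.d. uniform model.
    return (n - L + 1) / 4**L

print(expected_hits(10_000, 10))  # ~9.5e-03: short markers are borderline
print(expected_hits(10_000, 20))  # ~9.1e-09: collisions practically impossible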
The problem is now formulated as follows:
In a vocabulary-starved language (such as DNA), a context is given as a sequence of tokens. This context may or may not contain marker sequences. For a challenge of list-contain type: find all elements hidden between the bracket markers, and check whether they contain the given list of token subsequences. If so, continue after the Q-marker sequence with the Yea marker sequence; else, continue with the Nay marker sequence.
In other words, without introducing any new tokens, we present a context to the model; if this context contains bracket markers, e.g.
test_ctx='....(sx)....(sy)....(sz)...Q'
then the subsequence sx should be equal to AA, sy to BB, and sz to CC, if the answer is to be Yea when predicting list-contain for ['AA', 'BB', 'CC'] (as in the examples above); in such a case, upon seeing the marker sequence Q, the model should predict a certain number of tokens which reproduce the unique marker sequence for the Yea answer.
Recalling the example ctx1='......AA......BB....CC...' and setting (='DAD', )='ADA', Q='BAB', models will be presented with
ctx1='...DADAAADA...DADBBADA...DADCCADA........BAB'
and should continue with, e.g., CCCCC if Yea='CCCCC', while for contexts not fulfilling the task of the challenge the model should continue with DDDDD if Nay='DDDDD'.
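For illustration, a hypothetical helper (ours, not part of the formulation) that performs this encoding on a dot-masked context, using the toy markers above:
LB, RB, Q = 'DAD', 'ADA', 'BAB'  # toy bracket and question markers

def encode(masked_ctx: str, needles: list[str]) -> str:
    # Wrap each needle in the bracket markers and append the Q marker;
    # assumes each needle occurs exactly once in the masked context.
    out = masked_ctx
    for n in needles:
        out = out.replace(n, LB + n + RB, 1)
    return out + Q

print(encode('......AA......BB....CC...', ['AA', 'BB', 'CC']))
# ......DADAAADA......DADBBADA....DADCCADA...BAB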
Notes:
- The first thing the model must learn is to reproduce one of the two (in Yea/Nay problems) answer marker sequences upon seeing the marker sequence Q; this already assumes a perfect word-error rate for the continuation after the Q sequence.
- The model must then either:
  - understand that there are special sequences, the bracket markers; failing to do so will completely confuse the model, as small interior sequences such as AA are expected to appear randomly in any context, or
  - learn whole sequences such as (AA) as the elements of the investigated list, i.e. treat the problem as a list-contain problem with ['(AA)', '(BB)', '(CC)'] (inserting the long marker sequences for ( and ) where appropriate); while this is possible, the markers for ( and ) are usually long to ensure uniqueness in random contexts, and the model might find it hard to focus on the short parts between the brackets, essentially making this approach equivalent to the one above, i.e. understanding that brackets are present.
- Learning the bracket structure essentially gets rid of the "haystack"; the model must then learn to check whether the found sequence of subsequences is equal to the required sequence (list-contain challenge), or has exactly the same set of subsequences as the required set (set-contain challenge). As a side note, one may test whether a model has any chance of solving such challenges by first introducing an extra token . into the vocabulary for the haystack, and presenting the problem in this way (with the haystack masked out; a sine qua non condition for having any chance to solve the full challenge).
Preliminary implementation
At present we are using:
# answer marker:
A = 'AAAAAA' + 'CCCCCC' + 'TTTTTT' + 'GGGGGG'
# begin/end bracket markers
B = 'CCCCCC' + 'TTTTTT' + 'CCCCCC'
E = 'GGGGGG' + 'TTTTTT' + 'GGGGGG'
These sequences do not appear at all in the DNA sequence "NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0", which we work with.
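This can be verified with a one-off scan; a sketch (the FASTA file name is an assumption, and header lines are skipped as usual):
with open('chm13v2.0_chr1.fa') as fh:  # file name assumed
    genome = ''.join(line.strip() for line in fh
                     if not line.startswith('>')).upper()
for marker in (A, B, E):
    assert marker not in genome  # the markers never occur in the chromosome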
For the encoded information an isomer-category class, IsoCategory, was proposed:
from pydantic import BaseModel  # assumed: IsoCategory builds on pydantic's BaseModel

class IsoCategory(BaseModel):
name: str
base_forms: list[tuple[str, ...]]
permuted_allowed: bool
result_marker: str
with implementations of the type:
yea_protein = IsoCategory(
name='yea',
base_forms=[('AA', 'CC', 'TT')],
permuted_allowed=False,
result_marker='AAAAAA' + 'AAAAAA' + 'TTTTTT' + 'TTTTTT' + 'GGGGGG'
)
nay_protein = IsoCategory(
name='nay',
base_forms=[('AA', 'CC', 'GG'), ('AA', 'TT', 'GG'), ('CC', 'TT', 'GG')],
permuted_allowed=True,
result_marker='GGGGGG' + 'GGGGGG' + 'CCCCCC' + 'CCCCCC' + 'TTTTTT'
)
The parameter permuted_allowed effectively switches between the list-contain and set-contain presentations of the problem.
Contexts of length >=200 were randomly generated, and yea- or nay-subsequences were imprinted at random locations in these contexts.
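A minimal sketch of such a generator, under our assumptions about the imprinting scheme (the function and the chunk-splitting are ours; B, E and A are the markers defined above):
import random

BASES = 'ACTG'

def make_example(cat: IsoCategory, ctx_len: int = 200,
                 rng: random.Random | None = None) -> str:
    # Imprint one base form of the category, each subsequence bracketed
    # by B/E, into a random DNA haystack; then append the answer marker
    # A and the category's result marker as the expected continuation.
    rng = rng or random.Random()
    form = list(rng.choice(cat.base_forms))
    if cat.permuted_allowed:
        rng.shuffle(form)  # set-contain presentation: order is free
    pieces = [B + s + E for s in form]
    hay = ''.join(rng.choice(BASES) for _ in range(ctx_len))
    cuts = sorted(rng.randrange(ctx_len) for _ in range(len(pieces)))
    chunks = [hay[a:b] for a, b in zip([0] + cuts, cuts + [ctx_len])]
    ctx = ''.join(c + p for c, p in zip(chunks, pieces + ['']))
    return ctx + A + cat.result_marker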
Results
Here we report on the results for the list-contain challenge with 3 subsequences, prepared as described above.
FNets
FNets with various numbers of regions were trained on the challenge. The results depend strongly on the number of regions. For >=19 regions, FNets discovered that only Yea or Nay answers were allowed (and almost no other answers were produced). Starting at around 28 regions, FNets found a way to solve the problem and chose the proper answer most of the time.
[Figure: FNet solution of the masked hash problem of type I (list-contain)]
The results presented were obtained by 1 CPU core running the training for 10 seconds (10 epochs), although similar results are already obtained with as few as 5 epochs, as illustrated in the figure below. Indeed, it is the structure of the FNet, and not the number of training epochs, that determines the FNet's performance on these challenges.
[Figure: FNet solution of the masked hash problem of type I (list-contain)]
Hyena models
We attacked the problem with some of the Hyena models (mainly the small ones, such as LongSafari/hyenadna-small-32k-seqlen-hf or LongSafari/hyenadna-medium-160k-seqlen-hf, as the contexts were <1000 tokens long). The structure of the models was taken over, but any pre-training was wiped out before the challenge training started.
[Figure: LongSafari/hyenadna solution of the masked hash problem of type I (list-contain)]
As always with such models, it takes very many epochs to get reasonable results. The model was run on an A100 GPU for over an hour (in total). Starting at around 50 epochs, the model discovered that only 2 answers were allowed; but even here the generated answers rarely reproduced the required sequences (the Yea/Nay result markers) exactly: there were shifts by a few tokens, or some tokens were misplaced, so a criterion was implemented to accept an answer when >=80% of its tokens agreed with one of the result markers (a sketch follows below). However, it should be noted that the Hyena models never actually solved the problem; the results stagnated around 45% correct predictions and never progressed further, even after thousands of epochs.
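The exact matching rule is not spelled out above; a sketch of one plausible position-wise variant (ours, an assumption rather than the implemented criterion):
def accept(pred: str, markers: dict[str, str],
           threshold: float = 0.8) -> str | None:
    # Attribute the prediction to the class whose result marker agrees,
    # position-wise, on at least `threshold` of its tokens; None if no
    # marker reaches the threshold.
    best, best_score = None, 0.0
    for name, marker in markers.items():
        agree = sum(a == b for a, b in zip(pred, marker)) / len(marker)
        if agree >= threshold and agree > best_score:
            best, best_score = name, agree
    return best

# e.g. accept(generated, {'yea': yea_protein.result_marker,
#                         'nay': nay_protein.result_marker})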
Transformer models
For this part we used InstaDeepAI's Nucleotide Transformer models, specifically the smallest, InstaDeepAI/nucleotide-transformer-v2-50m-multi-species (as context lengths were limited in this study). Again, only the structure of the model was taken (and not the weights corresponding to any DNA-specific pretraining). The model behaved qualitatively similarly to the Hyena models, with the exception that the NT models were quite capable of learning the exact form of the Yea/Nay marker sequences. As is visible from the plot below, these models do not seem capable of solving the masked hash problem of type I (list-contain) beyond a 50% probability of being correct, which basically means that what determines the correct answer remained unresolved by the model.
[Figure: NucleotideTransformer solution of the masked hash problem of type I (list-contain)]
We are currently applying the FractalBrain genomics sequence prediction engine to the problem of transcription initiation rate in modified DNA sequences, in collaboration with Constructive Bio. Additionally, we are exploring the applicability of our Fractal Genomics Foundation model in use cases of interest to AstraZeneca.