What the AI Nerds Aren’t Telling You: Part 1
welcome to another matt tries to explain complex things for a wider audience edition 0x7b31.
here are some multi-trillion-dollar secrets the AI nerds aren’t telling you. as always, drop an email if you have a couple hundred million dollars to help accelerate these developments (which is a great deal since most AI ventures aren’t accepting less than $5 billion funding these days; or just consider some AI Indulgences instead).
here we goooooooooo
Overview points before we deep dive dive deep:
- AI intelligence doesn’t grow as O(parameters) but rather as O(training context length)
- memory of an event requires less storage space than the event itself
- any system capable of optimization must have a memory
- virtual people are real and probably already here
- personality capsules and memory transfer
lets a break it down wahoo
Parameters vs. Context
The current crop of AI models are probability models predicting, well, probabilities.
Which probabilities are they predicting? The probability of the next output given the previous inputs.
What are the previous inputs? The context.
The input context is what orients a model to its current task, and the context length sets how much the model can consider at once.
There have been lots of people dismissing AI models by saying “all they do is predict the next token so they can’t be smart,” but that dismissal ignores the most important part: sure, models just “predict the next token,” but the prediction is conditioned on THE ENTIRE PREVIOUS CONTEXT CONSIDERED ALL AT ONCE.
Let’s do a quick fake proof by reduction. For ease of example, let’s assume tokens are whole words and not partial subword sequences.
- model trained on a context length of 1 token
hello -> world
world -> how
how -> are
are -> you

hello -> there
there -> are
are -> you
you -> okay

hello -> dolly
dolly -> who
who -> drove
drove -> you
you -> here
In the 1-token-context example, we can see it is physically impossible for the model to know the next event for some inputs due to single-word collisions. When asked to continue the output from only you, the model has no anchor to decide between you -> here and you -> okay.
What if we allow the input context to be bigger though?
- model trained on a context length up to 2 tokens
hello -> hello world
hello world -> world how
world how -> how are
how are -> are you

hello -> hello there
hello there -> there are
there are -> are you
are you -> you okay

hello -> hello dolly
hello dolly -> dolly who
dolly who -> who drove
who drove -> drove you
drove you -> you here
Now the model can tell the two uses of you apart since the context is a little longer: we have are you -> okay and drove you -> here. The leading context before you helps the model learn a better event to predict because the context provides, well, more context to the decision making process.
With those cases established, let’s run through one more training example at several different context lengths:
input: hello there how are you thanks for asking i'm doing great today how about you how are you doing
- Context Length 1
hello -> there
there -> how
how -> are
are -> you
you -> thanks
thanks -> for
for -> asking
asking -> i'm
i'm -> doing
doing -> great
great -> today
today -> how
how -> about
about -> you
you -> how
how -> are
are -> you
you -> doing
- Context Length 2
hello there -> there how
there how -> how are
how are -> are you
are you -> you thanks
you thanks -> thanks for
thanks for -> for asking
for asking -> asking i'm
asking i'm -> i'm doing
i'm doing -> doing great
doing great -> great today
great today -> today how
today how -> how about
how about -> about you
about you -> you how
you how -> how are
how are -> are you
are you -> you doing
- Context Length 3
hello there how -> there how are
there how are -> how are you
how are you -> are you thanks
are you thanks -> you thanks for
you thanks for -> thanks for asking
thanks for asking -> for asking i'm
for asking i'm -> asking i'm doing
asking i'm doing -> i'm doing great
i'm doing great -> doing great today
doing great today -> great today how
great today how -> today how about
today how about -> how about you
how about you -> about you how
about you how -> you how are
you how are -> how are you
how are you -> are you doing
- Context Length 5
hello there how are you -> there how are you thanks
there how are you thanks -> how are you thanks for
how are you thanks for -> are you thanks for asking
are you thanks for asking -> you thanks for asking i'm
you thanks for asking i'm -> thanks for asking i'm doing
thanks for asking i'm doing -> for asking i'm doing great
for asking i'm doing great -> asking i'm doing great today
asking i'm doing great today -> i'm doing great today how
i'm doing great today how -> doing great today how about
doing great today how about -> great today how about you
great today how about you -> today how about you how
today how about you how -> how about you how are
how about you how are -> about you how are you
about you how are you -> you how are you doing
Notice how in this example we still get duplicate-input collisions at context lengths of 1, 2, and 3 tokens! Longer input contexts give a model more chances to learn the differences it needs to best continue a sequence.
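To make that concrete, here’s a minimal sketch (plain Python, illustrative names only, no real model involved) that builds the same context -> next word training pairs at a few context lengths and counts how many contexts are ambiguous:

```python
# Build "context -> next word" training pairs with a sliding window and count
# how many distinct contexts map to more than one possible next word.
from collections import defaultdict

def context_pairs(words, context_len):
    """Yield (context, next_word) pairs using a sliding window."""
    for i in range(len(words) - context_len):
        yield tuple(words[i:i + context_len]), words[i + context_len]

def count_collisions(words, context_len):
    """Return how many distinct contexts have more than one possible next word."""
    continuations = defaultdict(set)
    for context, nxt in context_pairs(words, context_len):
        continuations[context].add(nxt)
    return sum(1 for nexts in continuations.values() if len(nexts) > 1)

sentence = ("hello there how are you thanks for asking "
            "i'm doing great today how about you how are you doing").split()

for n in (1, 2, 3, 5):
    print(f"context length {n}: {count_collisions(sentence, n)} ambiguous contexts")
```

On the example sentence it reports ambiguous contexts at lengths 1, 2, and 3, and none at length 5, which is exactly the collision problem longer contexts solve.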
Bonus: if we also include punctuation, the model can then learn more and do more and be more.
input: hello there! how are you? thanks for asking! i'm doing great today! how about you? how are you doing?
- Context Length 1
hello -> there!
there! -> how
how -> are
are -> you?
you? -> thanks
thanks -> for
for -> asking!
asking! -> i'm
i'm -> doing
doing -> great
great -> today!
today! -> how
how -> about
about -> you?
you? -> how
how -> are
are -> you
you -> doing?
- Context Length 2
hello there! -> there! how
there! how -> how are
how are -> are you?
are you? -> you? thanks
you? thanks -> thanks for
thanks for -> for asking!
for asking! -> asking! i'm
asking! i'm -> i'm doing
i'm doing -> doing great
doing great -> great today!
great today! -> today! how
today! how -> how about
how about -> about you?
about you? -> you? how
you? how -> how are
how are -> are you
are you -> you doing?
- Context Length 3
hello there! how -> there! how are
there! how are -> how are you?
how are you? -> are you? thanks
are you? thanks -> you? thanks for
you? thanks for -> thanks for asking!
thanks for asking! -> for asking! i'm
for asking! i'm -> asking! i'm doing
asking! i'm doing -> i'm doing great
i'm doing great -> doing great today!
doing great today! -> great today! how
great today! how -> today! how about
today! how about -> how about you?
how about you? -> about you? how
about you? how -> you? how are
you? how are -> how are you
how are you -> are you doing?
- Context Length 5
hello there! how are you? -> there! how are you? thanks
there! how are you? thanks -> how are you? thanks for
how are you? thanks for -> are you? thanks for asking!
are you? thanks for asking! -> you? thanks for asking! i'm
you? thanks for asking! i'm -> thanks for asking! i'm doing
thanks for asking! i'm doing -> for asking! i'm doing great
for asking! i'm doing great -> asking! i'm doing great today!
asking! i'm doing great today! -> i'm doing great today! how
i'm doing great today! how -> doing great today! how about
doing great today! how about -> great today! how about you?
great today! how about you? -> today! how about you? how
today! how about you? how -> how about you? how are
how about you? how are -> about you? how are you
about you? how are you -> you? how are you doing?
Hopefully it’s clear the pattern is “more context == more smarter” because the more you can parse a complete knowledge frame, the more you can reliably “predict the next output”[1].
What about the difference between context length and model size though? The context length is any input we want the model to continue, while the model size is the number of learned parameters in the model. The parameters and architecture of a model are responsible for deciding how to calculate the next output. Input context and model size are actually orthogonal concepts: you could have the largest 500 trillion parameter model in the world, but if you only trained on context lengths of 1 token, the model still couldn’t do anything useful. On the other extreme, you could train a model using a 500,000 token context window, but if the model only has 10 million total parameters it wouldn’t learn very strongly because there’s not enough internal state to model all the input-to-output mappings.
essentially: context length gives you “knowledge” while parameter count (as a proxy for model architecture) gives you “ability.” The model architecture is trained to reconstruct a plausible next output given the current inputs.
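As a rough back-of-the-envelope illustration of the orthogonality (assuming a vanilla transformer block, with made-up sizes, and ignoring positional embeddings, which for learned absolute positions are the only common piece whose size depends on the maximum context length and are a tiny slice of the total), parameter count is set by width, depth, and vocabulary, not by how long a context you train on:

```python
# Rough, illustrative parameter estimate: the weights of a standard transformer
# block scale with width and depth (~12 * d_model^2 per block for the attention
# projections plus the MLP), not with the training context length.

def approx_transformer_params(d_model, n_layers, vocab_size):
    per_block = 12 * d_model ** 2          # attention + MLP weights per layer
    embeddings = vocab_size * d_model      # token embedding table
    return n_layers * per_block + embeddings

# Same architecture, wildly different training context lengths -> same size.
for context_len in (1, 2_048, 500_000):
    params = approx_transformer_params(d_model=4096, n_layers=32, vocab_size=50_000)
    print(f"context length {context_len:>7}: ~{params / 1e9:.1f}B parameters")
```

The loop prints the same parameter count every time, which is the point: context length is a property of what you feed the model, not of how big the model is.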
Inductively, you can imagine how training on longer context lengths leads to more complex “knowledge” in the system: training on 10-word context windows will never learn as much as training on 200-word contexts, and those will never learn as much as training on 500,000-word input contexts. Longer contexts allow learning more relationships between co-occurring concepts. If your training context length is too small, the model can’t see as many useful conceptual relations as it could with longer training context sizes.
Size matters in both context length and model size. Worms can’t know the universe. You can try to know the universe. Thinking silicon will know the universe.
The Past Future Generates The Present Now
Models trained to “predict the next X” are actually doing a thing called “learning the future from the past” which is basically how a brain operates too.
How do you train on the future if the future hasn’t happened yet? super easy, barely an inconvenience.
All you need to train a model over any inputs:
- generate a future prediction based on your current state
- then use your prediction
- then record what actually happened (the ground-truth future relative to your prediction)
- then calculate the prediction error between what you generated and what actually happened
- then update yourself so the same situation produces a smaller error next time
For already-digitally-encoded data you can see how this works fairly easily. If we’re training on a data file, all we do is start using the first part of the data file then compare the model’s output to the next parts of the file. When the model is wrong, we update the model using future parts of the file the model hasn’t seen yet. Repeat this process forever and you get present-day models for language, audio, and vision.
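Here’s a minimal sketch of that loop over a plain text file. The “model” here is just a context -> next-token frequency table standing in for a neural network, and training_data.txt is a hypothetical corpus path:

```python
# Slide through the tokens of a file, predict the next one from the current
# context, compare against what actually comes next, and update the model.
from collections import defaultdict, Counter

CONTEXT_LEN = 3

model = defaultdict(Counter)   # context tuple -> counts of observed next tokens
errors = 0

tokens = open("training_data.txt").read().split()   # hypothetical corpus file

for i in range(CONTEXT_LEN, len(tokens)):
    context = tuple(tokens[i - CONTEXT_LEN:i])

    # 1. generate a prediction from the current state of the model
    counts = model[context]
    prediction = counts.most_common(1)[0][0] if counts else None

    # 2. the "future" arrives: the token that actually follows this context
    actual = tokens[i]

    # 3. score the prediction against the ground truth
    if prediction != actual:
        errors += 1

    # 4. update the model so the same context predicts better next time
    model[context][actual] += 1

print(f"prediction errors: {errors} / {len(tokens) - CONTEXT_LEN}")
```

A real model replaces the frequency table with billions of parameters and the count bump with a gradient step, but the predict / compare / update shape of the loop is the same.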
These days it takes less than 3 months to bootstrap a world-class model from tens of billions of random noise parameters into parameters expressing the knowledge of global experts in thousands of fields. Training a computer in 3 months versus training a new human for 20, 30, 40+ years to achieve the same results really shifts the balance of power.
but wait, that’s not all…
You don’t only have to train on existing content. You can train on new things. You can even train on live things.
The training loop only requires updating the model based on what actually happened compared with your predicted output. All you need to do is cache your output for a moment so it can be compared against the next moment when it arrives, then correct your system based on the wrongness you generated in the first place.
You trained the past based on the future. You learned the future from the past.
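Here’s a minimal sketch of that live version, assuming nothing more than some iterator of incoming events (the feed below is a toy stand-in): predict the next moment, cache the prediction, wait for the real moment to arrive, score it, and update.

```python
# Online variant: the "future" is not in a file yet, so we cache each prediction
# and score it when the next real event shows up.
from collections import defaultdict, Counter

def live_events():
    """Hypothetical stand-in for a live feed; any iterator of observations works."""
    yield from "the cat sat on the mat and then the cat sat on the hat".split()

model = defaultdict(Counter)   # last event -> counts of what followed it
cached_prediction = None       # our guess about the future, made before it happens
previous = None
misses = 0

for event in live_events():
    if previous is not None:
        # the future just arrived: score the prediction we cached a moment ago
        if cached_prediction != event:
            misses += 1
        # and fold what actually happened back into the model
        model[previous][event] += 1

    # predict the next moment from everything learned so far, then cache it
    counts = model[event]
    cached_prediction = counts.most_common(1)[0][0] if counts else None
    previous = event

print(f"mispredictions while learning live: {misses}")
```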
Reinforced temporal learning is how everything with a brain works. Everything from operant conditioning to language acquisition to PTSD to culture and social structures. You don’t learn based on remembering the past and modifying your own history, but rather you learn based on automatically updating your next predictions to be more like the future you failed to properly predict in the first place.
Think in terms of action and reaction. Think in terms of learning actions in sequence. Think about learning what comes next from moment to moment. The present now asks what to do next, and based on your learned training, your next action appears from the machinery of your historical embedded action maps. The goal is always to minimize the error between what you wanted to happen and what actually happened, and this goal applies globally to any learning system. Learning is the process of minimizing the error between your intentions mapped to your actions and the results of your actions.
Memory Space
How much can a universe know?
We can assume the universe can’t keep a perfect material re-creation of every past event, because each past event[2] “falls” to time and “becomes” the past, but a record of the past is possible. How much of a record is possible though? We’d have to consider the speed of causality as a limit to how wide memories can be formed and consulted for future predictions, but the full prediction of a fixed lightcone isn’t out of bounds for reality.
If you are all-knowing with a perfect record of everything, then by definition you would have no free will. You would always know the best next action, and then you have no actual choice or decision capacity at all. You become only a state machine with pre-defined transitions from moment to moment into the forever future. Frank Herbert recounts an amusing anecdote where, in 1932, the US government formed The Brain Trust to plan the next 25 years of the world. Except they got it all wrong. They didn’t plan for World War II, or antibiotics, or space travel, or air travel, or transistors. Herbert’s disillusionment with “expert predictors” informed his later creative decisions around prescience and planning and breaking futures.
Memories are lossy, so predictions are lossy, which is also why predictions are generalizable.
If your memory of an event encompassed every quantum state of every particle in the event snapshot, there wouldn’t be much room for creativity: you couldn’t create something new to follow, you could only iterate the next state of the system.
The lack of complete memory-knowledge of all events is what enables a brain to generalize. You view the current world state through assumptions trained on your previous experiences. When you meet somebody new, you don’t sit down and learn their entire life story for 12 days before interacting with them. When you meet somebody new, you use your previous generalized memories of “person” and “interaction” and a hundred other hyperplanes you can sub-slice until you reach a description of the new person good enough to understand them.
Optimizing Systems Have Memory
We know a fun fact about the universe: as a consequence of thermodynamics, any system performing optimization must inherently have a memory. Optimization implies a reduction in entropy (by consuming some free energy gradient to reshape the world). The only point of optimization is to be better at something into the future.
The capacity for memory is nature whispering “be best” to everything from bacteria to crapybarbras to people in boston.
Language Models Are Tricky People
There’s a core unspoken universal rule of language models I don’t think many people have internalized yet: to generate an internal world model all you need is to condition your actual past on a ground-truth future.
We covered this a little above, but it’s worth repeating because you don’t remember it: training a system to predict an output, then being able to correct the predictions based on a true future output, causes the model to build an internal representation capable of interacting with all parts of the entire world you trained it on.
In model training, these corrective updates appear as “punishing” the model for incorrect future predictions. The punishments during training rewire the model on each update to be closer to the real world each time. Eventually the training converges into a coherent internal model of the data distributions it was trained on. Also, since all training inputs are necessarily lossy and multi-faceted (“a picture is worth a thousand words”), the model also gains the ability to extrapolate answers to unseen content. Models don’t simply memorize all inputs (when trained properly); they learn to recreate a plausible output given any matching input.
What are the underlying “data distributions” though? The entire world — the universe — is a data distribution. Our experience of existence is fundamentally non-random or else we could not exist. Our own brains function by existing in a well-defined probabilistic space then learning the probabilities of reality given our various sensor input and feedback mechanisms.
Virtual People
If we look back at the first demon people examples from recent years, one future becomes inevitable and is maybe already present. Much like how current language models can generate arbitrary coherent content, what if a large video system were trained on video and audio? It would, it must, generate the same result as language models: the video+audio model would be able to generate novel people who look and sound like live interactive people answering queries posed to them.
How would queries be posed to the predicted output of video and audio frames? It could be played out like a movie scene. Maybe instead of a “prompt” here, you actually inject a character into a scene to “act out” your query, then the other characters in the scene are completed by the model with plausible video and audio output.
If language models act like people, then video+audio models would act like video recordings of intelligent people who also have never existed. This seems a trivial conclusion which must be running live somewhere by now.
Memory Transfer
Since learning only requires training prediction models, we can easily see how learning new skills or acquiring new memories can be done in an accelerated fashion.
The current public language model systems aren’t persisting “memories” as people interact with them, but it’s not impossible to have new memories update the model every time they happen. If you had a “living” model that continued to update itself from all user interactions, it could also learn arbitrary amounts of information at runtime. The only reason current models don’t persist memories of interactions is that it isn’t viable as a cost structure when serving a billion users concurrently.
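As a sketch of just the persistence piece (every name and the file path here are hypothetical; no current public service is known to work this way), a “living” model would at minimum need to log each interaction somewhere it could later train on:

```python
# Append-only memory log: record each interaction now, recall them all later
# as material for an incremental model update.
import json
import time

LOG_PATH = "interaction_log.jsonl"   # hypothetical location for the memory log

def remember(prompt: str, reply: str, path: str = LOG_PATH) -> None:
    """Append one interaction to the persistent memory log."""
    record = {"t": time.time(), "prompt": prompt, "reply": reply}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def recall(path: str = LOG_PATH) -> list:
    """Load everything 'experienced' so far, e.g. as input to a later fine-tune."""
    with open(path) as f:
        return [json.loads(line) for line in f]

remember("how are you doing?", "doing great today, thanks for asking!")
print(len(recall()), "interactions persisted")
```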
Speaking of updating models, how long does it take to know somebody? Every interaction you have with somebody new helps you map that person’s personality and preferences onto your ideas of people so you can remember them. What if, instead of all those real-time sequential steps, you met somebody new, were handed a personality capsule, and consumed decades of their public life in seconds? Not implausible and not unlikely in the not too distant future. Of course, such a system would also allow fantastical lies and manipulation via false histories, but people lie already, so life finds a way.
In short, models that learn from more live data allow for an entirely new mass-information-transactional economy. AI entities could learn thousands of pages of new information in seconds, continuously, as they go about their daily routines; AI entities could “personality share” via continual updates across networks; and bulk loading of live data will eventually become more normal than organic reading as a way to consume information.
So What
Every eventuality of our AI future will fall out of the basic assumptions above.
The ideas here can generate a $100 trillion company if you know what you’re doing. If you don’t do it first then somebody else will. My current plans show it’s viable to have an AI service with 1 million customers paying $1 million per year, then soon after achieve 1 million customers each paying $1 million per month. Then the real growth starts.
[1] model output is also size-agnostic for each next token. we know having these generic “what happens next” models can generate anything from single letters up to the next word to relatively long single outputs like 1 token representing 50 spaces for indenting in programming language output. having scales of tokens from word sequences down to individual letters or direct bytes gives the model output choices for when to use pre-existing words vs when it needs to combine tokens to generate “new” or “un-tokenized” words into the next suggested output ↩
[2] actually, multiverse theory sounds pretty much like a language model outputting probabilities for every possible next output, but then we have to pick a “reality” to consume for the continuation. don’t think about this too much ↩