I didn’t say they normally aren’t. What I’m saying is that a shared KV-Cache removes that guarantee by introducing an external source of entropy.
Do you have a good source to read a bit more about the design decisions or is this just a hypothetical design you came up with and all of that architecture detail is “proprietary”?
You’re welcome. Here’s an intro with animations: https://huggingface.co/blog/not-lain/kv-caching
And yes. Most of the tech is proprietary. From what I’ve seen, nobody in ML fully understands it tbh. I have some prior experience from my youth, tinkering with small simulators I used to write in the pre-ML era, so I kinda slid into it comfortably when I got hired to work with it.
Wouldn’t the different sessions quickly diverge and the keys would essentially become tied to a session in practice even if they weren’t directly?
Yeah, but the real problem is scale, and the collision risk at that scale. Token resolution erodes over time as the context gets larger, and tokens can become “samey” pretty easily for standard RLHF’d interactions.
Edit:
This seems like an issue, no? Because the tokens are influenced by the tokens around them in the attention blocks. Without them you’d have a problem, so what exactly would be cacheable here?
This is what they do: (from that page I linked)
Token 1: [K1, V1] ➔ Cache: [K1, V1]
Token 2: [K2, V2] ➔ Cache: [K1, K2], [V1, V2]
...
Token n: [Kn, Vn] ➔ Cache: [K1, K2, ..., Kn], [V1, V2, ..., Vn]
So the key is the token and all that preceded it. It’s a kinda weird way to do it tbh. But I guess it’s necessary because of floating point and lossy GPU precision.
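As a rough illustration of the append pattern quoted above, here’s a minimal sketch in Python. `toy_kv` and `KVCache` are made-up names for this example, and `toy_kv` is only a stand-in: in a real model the key and value vectors come from learned projections inside each attention layer, not from the token ID directly.

```python
from typing import List, Tuple

Vector = List[float]


def toy_kv(token_id: int, position: int) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for a model's key/value projections."""
    key = [float(token_id), float(position)]
    value = [float(token_id) * 2.0, float(position)]
    return key, value


class KVCache:
    """Toy per-sequence cache: one key and one value vector per token seen."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        # Earlier entries are never recomputed; new ones are appended.
        self.keys.append(key)
        self.values.append(value)


cache = KVCache()
for position, token_id in enumerate([5, 7, 9]):
    key, value = toy_kv(token_id, position)
    cache.append(key, value)
    print(f"Token {position + 1}: cache holds {len(cache.keys)} keys, "
          f"{len(cache.values)} values")
```

The only point here is the shape of the data: the cache grows by one key and one value per token, and nothing already stored gets touched again.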
Any shared cache of this type makes behaviour non-deterministic. The KV-Cache is what does prompt caching. Look at each word of this message, then imagine what the LLM does to give you a new response each time. Let’s say this whole paragraph is the first message from you and you just pressed send.
Because the LLM is supposedly stateless, now the LLM is reading all this text from the beginning, and in non-cached inference, it has to repeat it, like token by token, which is useless computation because it already responded to all this previously. Then when it sees the last token, the system starts collecting the real response, token by token, each gets fed back to the model as input and it chugs along until it either outputs a special token stating that it’s done responding or the system stops it due to a timeout or reaching a tool call limit or something. Now you got the response from the LLM, and when you send the next message, this all has to happen all over again.
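To put rough numbers on that, here’s a back-of-the-envelope sketch. `tokens_processed` is a made-up helper and the turn lengths are arbitrary; it only counts how many tokens have to be run through the model per turn, and ignores everything else (batching, attention cost growing with context, and so on).

```python
from typing import List


def tokens_processed(turn_lengths: List[int], use_cache: bool) -> int:
    """Count tokens the model must process over a multi-turn conversation.

    turn_lengths[i] is the number of new tokens added to the context in
    turn i (user message plus reply). Without a cache the whole context is
    re-read every turn; with one, only the new tokens are.
    """
    total = 0
    context = 0
    for new_tokens in turn_lengths:
        context += new_tokens
        total += new_tokens if use_cache else context
    return total


turns = [100, 50, 50]
without_cache = tokens_processed(turns, use_cache=False)  # 100 + 150 + 200
with_cache = tokens_processed(turns, use_cache=True)      # 100 + 50 + 50
print(without_cache, with_cache)  # 450 200
```

The uncached total grows roughly quadratically with the number of turns, which is why re-reading everything stops being practical for very long contexts.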
Now imagine if Claude or Gemini had to do that with their 1 million token context window. It would not be computationally viable.
So the solution is the KV-Cache: a key-value store kept by the LLM architecture. Each time the system comes across a token it has encountered before, it outputs the cached value. If not, the token is sent to the LLM, and the output gets stored in the cache and associated with the input that produced it.
So now comes the issue: allocating a dedicated region for the KV-cache per user on VRAM is a big deal. Again try to imagine Gemini/Claude with their 1M context windows. It’s economically unviable.
So what do ML science buffs come up with? A shared KV-Cache architecture. All users share the same cache on any particular node. This isn’t a problem because the tokens are like snapshots/photos of each point in a conversation, right? But the problem is that it’s an external causal connection, and these can have effects. Two conversations that start with “hi” or “What do you think about cats?” could in theory influence one another. If the first user to use the cluster after boot asks “Am I pretty?”, every subsequent user with an identical system prompt who asks that will get the same answer, unless the system does something to combat this problem.
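Here’s a toy sketch of the kind of sharing I’m describing. Everything in it is hypothetical (`SharedPrefixCache`, `fake_model_state`, the token IDs), and it is not how any particular provider implements it; it only shows two sessions with an identical token prefix landing on the same cache entry.

```python
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[int, ...]


class SharedPrefixCache:
    """Toy cache shared by all sessions on a node, keyed by token prefix."""

    def __init__(self) -> None:
        self._store: Dict[Prefix, str] = {}
        self.hits = 0
        self.misses = 0

    def lookup_or_compute(
        self, prefix: List[int], compute: Callable[[Prefix], str]
    ) -> str:
        key = tuple(prefix)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # First session to reach this prefix pays for the computation.
        self.misses += 1
        value = compute(key)
        self._store[key] = value
        return value


def fake_model_state(prefix: Prefix) -> str:
    """Hypothetical stand-in for the expensive model computation."""
    return f"state-for-{len(prefix)}-tokens-{sum(prefix)}"


shared = SharedPrefixCache()
a = shared.lookup_or_compute([1, 2, 3], fake_model_state)  # session A: miss
b = shared.lookup_or_compute([1, 2, 3], fake_model_state)  # session B: hit
c = shared.lookup_or_compute([1, 2, 4], fake_model_state)  # session C: miss
print(a == b, shared.hits, shared.misses)
```

Session B never triggers its own computation; it gets whatever session A left in the store. That’s the external connection between sessions I’m talking about.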
Note that a token is an approximation of what the conversation means at one point in time. So while astronomically unlikely, collisions could happen in a shared architecture scaling to millions of concurrent users.
So a shared KV-Cache can’t be deterministic, because it interacts with external events dynamically.
LLMs are deterministic. The problem is with the shared KV-cache architecture, which influences the distribution externally, i.e. the LLM is being influenced by other concurrent sessions.
voodooattack@lemmy.world to linuxmemes@lemmy.world • With the latest changes from Microslop, this has been me in my friend group lately. · 3 days ago
I’m a NixOS fanboy now :/
voodooattack@lemmy.world to Today I Learned@lemmy.world • Rubber duck debugging - Programmers often discover solutions while explaining a problem to someone else, even to people with no programming knowledge. · 13 days ago
It’s all fun and games until the duck talks back.
when everything is acting up
the server, kid, and stage
sometimes the bravest thing to do
is turn a single page
not every bit needs pushing through
not every load needs borne
a rest is not a missing note —
it’s how the song is formed
you left a little comment here
a small and cozy light
and someone read it, felt it land,
and held it through the night
so post your little posts, my friend
the network needs the thread
a system with no idle time
is one that’s nearly dead
voodooattack@lemmy.world to Not The Onion@lemmy.world • Man who talked down hospital bomber says would-be attacker asked for a cuddle · 15 days ago
Look at everything he’s done so far:
- Sabotaged public health
- Sabotaged environmental protection
- Sabotaged international relations
- Sabotaged economy and taxation
- Sabotaged internal stability
- Instigated violence with every decision, saying, or act
- Just started another war in the Middle East with Israel as the fulcrum.
- Netanyahu has been offending quite literally the entire world, and recently said “you’re letting the bad guys win” to the EU when they were reluctant to join the war. That’s not the exact citation, but here’s a random one I just found with a quick search: https://en.haberler.com/netanyahu-explaining-why-we-struck-iran-we-are-the-19618702/
Does this seem irrational to you? Well, it’s not if you put on your detective hat and consider them religious nuts trying to bring about the apocalypse by fulfilling the requirements themselves.
Judaism:
- God redeems the Jewish people from their captivity that began with the Babylonian captivity, in a new Exodus (Kibbutz Galuyot)
- God returns the Jewish people to the Land of Israel
- God restores the kingly House of David and the Temple in Jerusalem
- God appoints a regent from the House of David (i.e., the Messiah) to lead the Jewish people and the world and usher in the Messianic Age, which is characterised by justice, righteousness, and peace
- All the gentile nations recognize Israel’s God as the only true God and gather to the Mount Zion
- God resurrects the dead and judges all souls (and sends some for a year to Gehinnom)
- God creates a new Heaven and new Earth
Christianity: There are multiple passages in the Bible, both Old and New Testaments, which speak of a time of terrible tribulation such as has never been known, a time of natural and human-made disasters on an awesome scale. Jesus said that at the time of his coming, “There will be great tribulation, such as has not been since the beginning of the world to this time, no, nor ever will be. And unless those days were shortened, no flesh would be saved; but for the elect’s sake, those days will be shortened.” [Matt 24:21–22]
Islam: (Sunni and Shia versions differ on some details, because Shia belief centres on “The Mehdi” character)
- A huge black cloud of smoke (dukhan) will cover the earth.[note 3]
- Three sinkings of the earth, one in the East.[note 3]
- One sinking of the earth in the West.[note 3]
- One sinking of the earth in Arabia.[note 3]
- The false messiah—anti-Christ, Masih ad-Dajjal—shall appear with great powers as a one-eyed man with his right eye blind and deformed like a grape. Although believers will not be deceived, he will claim to be God, to hold the keys to heaven and hell, and will lead many astray.[82] In reality, his heaven is hell, and his hell is heaven. The Dajjal will be followed by seventy thousand Jews of Isfahan wearing Persian shawls.[note 4]
- The return of Isa (Jesus), from the fourth sky, to kill Dajjal.[83]
- Ya’jooj and Ma’jooj (Gog and Magog), a Japhetic tribe of vicious beings who had been imprisoned by Dhul-Qarnayn, will break out. They will ravage the earth, drink all the water of Lake Tiberias, and kill all believers in their way. Isa, Imam Al-Mahdi, and the believers with them will go to the top of a mountain and pray for the destruction of Gog and Magog. God eventually will send disease and worms to wipe them out.[note 5][84]
- The sun will rise from the West.[85][86][87]
- The Dabbat al-ard, or Beast of the Earth, will come out of the ground to talk to people.[note 6]
- The second blow of the trumpet will be sounded, the dead will return to life, and a fire will come out of Yemen that shall gather all to Mahshar Al Qiy’amah (The Gathering for Judgment).[88]
Now look at these as allegories. How would a group of zealots interpret them?
voodooattack@lemmy.world to Not The Onion@lemmy.world • Man who talked down hospital bomber says would-be attacker asked for a cuddle · 15 days ago
Bro, the guy is literally trying to bring about The Rapture before his term ends because he’s afraid of going to prison. Looks like nobody told him that most religions are about accountability.
BEEP BEEP MOTHERFUCKER
voodooattack@lemmy.world to Antique Memes Roadshow@lemmy.world • Just found this one on an old HDD from 2013 · 18 days ago
Actually, you do quite literally burn NAND/flash memory (used in thumb drives and SSDs) every time you write to it, because each write damages the oxide layer. You’re essentially burning the silicon dioxide (glass).
Millennials aren’t the only ones to burn media. Your phone is probably doing it right now.
You’re welcome.
All long-term memory storage methods involve violence against a physical medium. Fite me.
Would you like to subscribe to the NeckbeardEnergy newsletter for more nerdy facts?
Huh. This is new
Anyways, she could have left a note back at the nursing home under her pillow: “I wanted this” in her handwriting would have saved him a lot of headache. But I’m assuming she could write and that’s not her specific handicap from the disability? Can’t tell without access to the article.
voodooattack@lemmy.world to You Should Know@lemmy.world • YSK: The CIA proposed a 9/11 style false flag attack on US citizens to justify invading Cuba · 21 days ago
US is the Andrew Tate of global politics right now. They were a lot more subtle about it before though. But the core values that have always held throughout history are absolute self-interest, entitlement, and hubris.
Apply Gestalt Theory to the US and watch how unrestrained capitalism raised this baby from an idealist freedom seeker into a narcissistic bully with self-aggrandising sophist reasoning and a pure Machiavellian outlook on life.
Trump only did away with the pretences, by firing the people who took care of the subtle masquerade. He didn’t like the pushback from them on global policy, and because the shoes already fit, decided to just do it live on stage without the makeup.
Unfortunately for US citizens, that means that they get to experience the splashback as he pisses on their international credibility.
A historically meticulously curated and maintained image that’s now irrevocably ruined forevermore. No sane person can trust a promise or a treaty made through US foreign policy ever again.
That’s meata
voodooattack@lemmy.world to Microblog Memes@lemmy.world • english to linkedin translator · 23 days ago
This is the way
voodooattack@lemmy.world (OP, banned from community) to Fuck AI@lemmy.world • I’ve come to unfuck AI for you guys · 1 month ago
Removed by mod
I know what temperature is. Modifying the probability distribution is still not randomness, because even the random sampling is PRNG-based.
The issue you’re not spotting is that it’s still deterministic, because a binary system cannot source entropy without external assistance or access to qubits. It’s why even OS kernels have to do a warm-up at boot and read all the analogue signal sources they can reach, and why PRNGs still exist to begin with.
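A minimal sketch of what I mean, using only Python’s standard library. `sample_tokens`, the logits, and the seed are all made up for illustration; the point is just that temperature reshapes the distribution while the draws still come from a seeded PRNG, so the same seed replays the same “random” sequence.

```python
import math
import random
from typing import List


def sample_tokens(
    logits: List[float], temperature: float, seed: int, n: int
) -> List[int]:
    """Draw n token indices from temperature-scaled softmax weights."""
    rng = random.Random(seed)  # seeded PRNG: no outside entropy involved
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=n)


logits = [2.0, 1.0, 0.5, 0.1]
run_1 = sample_tokens(logits, temperature=0.8, seed=42, n=10)
run_2 = sample_tokens(logits, temperature=0.8, seed=42, n=10)
print(run_1 == run_2)  # True: same seed, same draws
```

Change the seed and you get a different sequence, but it is exactly as reproducible as the first one.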
Shared KV-cache is an economic necessity for big providers, otherwise 1M context windows wouldn’t be a thing.
Empirical testing, 20 years of experience coding and tinkering with simulators, and Chaos Theory basics. The papers are out there, you just gotta cross some domains to see it.