Negation is a direction, not a flag.

How does a transformer represent "not X"? The honest answer in 2026 is: it doesn't, at least not in any way the architecture cleanly supports. The self-attention residual $r + \sum_t \alpha_t v_t$ is a sum of values weighted by attention. A sum is, in the lattice of vector subspaces, a disjunction. And Birkhoff and von Neumann showed in 1936 that "OR" and "NOT" are not the same operator: not in classical logic, and not in the lattice of vector subspaces either.

This matters for sarcasm. Sarcasm is structurally a negation: "I love being stuck in traffic" means the opposite of what its words assemble. If the architecture can't represent negation cleanly, it can't detect sarcasm cleanly. I want to walk through the argument, the empirical setup, and a result that surprised me.

What we want negation to do

Start with the linguistic picture. When we say "play, not game," we mean: the concept play minus its sports-context association. The result should still be recognizably play — recreation, theatrical performance, pretending. But the sports flavor should be gone.

If we represent words as vectors, this maps to two requirements:

1. The result should preserve most of play's identity (high cosine similarity to play).
2. The result should have no component along the game direction (zero projection onto game).

Two desiderata. One operation.

Widdows (2003) and orthogonal projection

Dominic Widdows, in a 2003 paper on logical operators for word vectors, pointed out that this operation has a name in linear algebra. It's the projection of play onto the orthogonal complement of game.

Writing p and g for unit-normalised vectors and $\langle \cdot, \cdot \rangle$ for inner product, the Widdows orthogonal-negation operator is:

$p \text{ NOT } g := p - \langle p, g \rangle \, g$

This is the vector rejection of p from g. It removes the g component of p while leaving everything else alone.
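As a minimal sketch in plain Python, the operator is one line of arithmetic. The helper names `dot`, `normalise`, and `negate` are mine, and the three-dimensional vectors are toy values rather than real embeddings:

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale v to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def negate(p, g):
    """Widdows-style 'p NOT g': remove from p its component along g.

    Both inputs are unit-normalised first, matching the definition above.
    """
    p, g = normalise(p), normalise(g)
    c = dot(p, g)
    return [pi - c * gi for pi, gi in zip(p, g)]

# Toy vectors (hypothetical values chosen for illustration only).
play = [0.8, 0.5, 0.33]
game = [1.0, 0.2, 0.0]

result = negate(play, game)

# Check that the g component has been removed, up to floating-point error.
print(abs(dot(result, normalise(game))) < 1e-12)
```

Nothing here depends on the dimension, so the same function applies unchanged to full-size embedding vectors.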

Both desiderata fall out. The result has cosine similarity to p equal to $\sqrt{1 - \langle p, g \rangle^2}$, which is high as long as p and g aren't too aligned. And the result has zero projection onto g by construction.
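The similarity claim can be checked in a few lines, using only the unit-norm assumption and writing $c = \langle p, g \rangle$ (the shorthand $c$ is mine):

```latex
\begin{aligned}
\langle p - c\,g,\; g \rangle &= c - c = 0, \\
\langle p - c\,g,\; p \rangle &= 1 - c^2, \\
\lVert p - c\,g \rVert^2 &= 1 - 2c^2 + c^2 = 1 - c^2, \\
\cos(p \text{ NOT } g,\; p) &= \frac{1 - c^2}{\sqrt{1 - c^2}} = \sqrt{1 - c^2}.
\end{aligned}
```

The first line is the zero-projection requirement; the last is the identity-preservation requirement, which only degrades as $|c|$ approaches 1.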

This is the right operation for representing negation. It's been sitting in the literature for 22 years.

The residual is a Birkhoff–von Neumann disjunction

So why don't transformers use it? Because the self-attention update is structurally different.

The standard residual in a transformer block is:

$r' = r + \sum_t \alpha_t v_t$

where r is the running residual, $\alpha_t$ are the attention weights for position t, and $v_t$ are the value vectors. The update is a sum. Adding things to a vector enlarges it; it never removes a direction.
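For concreteness, here is a stripped-down sketch of that update for a single query position. It assumes the weights come from a softmax over raw scores (the text above only calls them attention weights), and it leaves out multiple heads, the output projection, and layer normalisation. The function names and toy numbers are mine:

```python
import math

def softmax(scores):
    """Turn raw attention scores into non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residual_update(r, scores, values):
    """Compute r' = r + sum_t alpha_t * v_t for a single query position."""
    alphas = softmax(scores)
    update = [sum(a * v[i] for a, v in zip(alphas, values))
              for i in range(len(r))]
    return [ri + ui for ri, ui in zip(r, update)]

# Toy inputs (hypothetical): a 3-d residual attending over two positions.
r = [1.0, 0.0, 0.0]
scores = [2.0, 0.0]
values = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

r_new = residual_update(r, scores, values)
print(r_new)
```

Because softmax weights are non-negative, the update is a weighted average of the value vectors added on top of r. In this toy case the first coordinate of r passes through unchanged while the other two grow.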

In Birkhoff and von Neumann's 1936 quantum-logic formalism, the disjunction $A \lor B$ of two subspaces is the span: the smallest subspace containing both. That is exactly what summing does: it enlarges the represented subspace. Negation $\neg A$, on the other hand, is the orthogonal complement, a subspace defined by removing a direction. The two operators are dual but not interchangeable; you cannot build one out of the other by summing.
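Written out for finite-dimensional subspaces $A$ and $B$ of an inner-product space (a restatement of the definitions above, with conjunction added for completeness):

```latex
\begin{aligned}
A \lor B &= \operatorname{span}(A \cup B) = \{\, a + b : a \in A,\ b \in B \,\}, \\
A \land B &= A \cap B, \\
\neg A &= A^{\perp} = \{\, x : \langle x, a \rangle = 0 \ \text{for all } a \in A \,\}.
\end{aligned}
```

In these terms, $p \text{ NOT } g$ is the orthogonal projection of $p$ onto $\neg \operatorname{span}\{g\}$.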

The implication: when a transformer encounters "not X" in input, it has to encode the negation somewhere. But the residual-as-sum can only enlarge. It can encode "X is being mentioned" (by adding the X direction). It cannot encode "X is being negated" (which would require removing the X direction). The best it can do is add some other direction — call it a "negation cue" — and rely on downstream layers to interpret the combination.

That's a fragile encoding. And it's what we see in practice: large language models handle simple negation reasonably, fail at composed negation (double negatives, idiomatic sarcasm), and fail catastrophically when the sarcasm cue is subtle and contextual.

Empirical: `play NOT game` on RoBERTa

To check this isn't only theoretical, I ran a small experiment on RoBERTa-base embeddings.

Take the contextualised embedding for play from a neutral sentence. Take the embedding for game from a similar position. Compute two candidates for "play minus game":

$v_{\text{subtract}} = p - g$ (naive)

$v_{\text{ortho}} = p - \langle p, g \rangle \, g$ (Widdows)

Then measure two things: cosine similarity to the original play (how much identity is preserved), and cosine similarity to a probe cluster of sports-context words (ball, team, field, match, ...). The negated result should be less sports-y than play, but ideally not anti-sports-y — "play minus sports flavor" is not the same as "the opposite of sports."
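The measurement itself can be sketched on toy data. The vectors below are hypothetical three-dimensional stand-ins, not RoBERTa embeddings, and scoring the probe cluster as the mean cosine over its members is my assumption about how the cluster similarity is aggregated:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def mean_cosine(v, cluster):
    """Average cosine similarity between v and each probe vector."""
    return sum(cosine(v, q) for q in cluster) / len(cluster)

# Hypothetical 3-d stand-ins; real contextual embeddings would go here.
p = normalise([0.5, 0.6, 0.6])   # "play"
g = normalise([1.0, 0.1, 0.1])   # "game"
sports = [[1.0, 0.0, 0.2], [0.9, 0.2, 0.0], [1.0, 0.1, 0.1]]  # probe cluster

c = dot(p, g)
v_subtract = [pi - gi for pi, gi in zip(p, g)]      # naive subtraction
v_ortho = [pi - c * gi for pi, gi in zip(p, g)]     # orthogonal rejection

for name, v in [("play", p), ("naive", v_subtract), ("ortho", v_ortho)]:
    print(f"{name:6s} cos(., play)={cosine(v, p):+.2f} "
          f"cos(., sports)={mean_cosine(v, sports):+.2f}")
```

On these toy vectors the naive candidate ends up clearly negative against the probe cluster, while the rejection stays close to zero and keeps more of its similarity to play. That is only a qualitative illustration of the two metrics, not a reproduction of the experiment.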

The numbers, averaged over ~40 sentence contexts:

| Operator | cos(·, play) | cos(·, sports cluster) |
|---|---|---|
| play itself | 1.00 | +0.41 |
| play − game (naive) | 0.57 | −0.39 |
| play ⊥ game (Widdows) | 0.92 | +0.06 |

The naive subtraction destroys 43% of play's identity and lands at negative 0.39 against the sports cluster — the result lives in the part of vector space that is actively opposed to sports. That's semantic nonsense for the word play.
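The sign flip is not peculiar to these embeddings. For any single pair of unit vectors p and g with $c = \langle p, g \rangle < 1$ (the shorthand $c$ is mine), naive subtraction has a closed form:

```latex
\begin{aligned}
\lVert p - g \rVert^2 &= 2 - 2c, \\
\cos(p - g,\; p) &= \frac{1 - c}{\sqrt{2 - 2c}} = \sqrt{\tfrac{1 - c}{2}}, \\
\cos(p - g,\; g) &= \frac{c - 1}{\sqrt{2 - 2c}} = -\sqrt{\tfrac{1 - c}{2}}.
\end{aligned}
```

So $p - g$ is always exactly as anti-aligned with g as it is aligned with p, whatever the value of $c$. Anything correlated with g, such as a sports probe cluster, tends to be dragged negative along with it. The table reports averages over contexts, so these per-pair identities should be read as explaining the pattern rather than predicting the exact figures.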

The orthogonal projection preserves 92% of play's identity and brings the sports correlation to near zero. The result is "play minus the sports flavor" — exactly what we wanted.

This matches the theory tightly. The arithmetic that's natural to the transformer (sum, subtraction) gives a broken answer. The arithmetic that's the right operator (orthogonal rejection) gives the right answer. The transformer just doesn't have the operator in its architecture.

What this means for sarcasm detection

Sarcasm involves semantic incongruity: the literal meaning of the words contradicts the speaker's intent. To detect it, a model needs to represent both the literal meaning and its negation simultaneously and notice they're at odds.

If the only "negation" available is via summing — adding negation-cue features — then "I love being stuck in traffic" looks like:

positive_sentiment(love) + neutral(being stuck in traffic) + (some) negation cue

The model has no clean way to subtract the love direction from the composite. It has to learn, via training data, that the combination of love + traffic is suspicious — a statistical pattern, not a semantic operation. That works for obvious cases. It fails when the cue is subtle.

The direction I'm exploring: give transformer architectures access to the orthogonal-rejection operator explicitly. Either as a small layer that performs the rejection on a selected direction, or as a mixture-of-experts head whose job is the rejection. Early experiments on BESTTIE (a 6-dialect English sarcasm benchmark) are encouraging but not yet ready to publish; that's the next year of work.
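One way to picture "a small layer that performs the rejection on a selected direction", purely as an illustration and not the design used in the experiments mentioned above: the function name `rejection_layer` and the scalar `gate` are hypothetical, and in a real model both the direction and the gate would presumably be learned rather than passed in by hand.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rejection_layer(r, direction, gate=1.0):
    """Hypothetical layer: r' = r - gate * <r, d_hat> * d_hat.

    gate=0 leaves r unchanged; gate=1 removes the direction entirely.
    """
    n = math.sqrt(dot(direction, direction))
    d_hat = [x / n for x in direction]
    coeff = gate * dot(r, d_hat)
    return [ri - coeff * di for ri, di in zip(r, d_hat)]

# Toy residual (hypothetical): a "love" component plus other content.
love = [1.0, 0.0, 0.0]
residual = [0.7, 0.4, 0.2]

kept = rejection_layer(residual, love, gate=0.0)
removed = rejection_layer(residual, love, gate=1.0)

print(dot(kept, love))     # projection onto "love" with the gate closed
print(dot(removed, love))  # projection onto "love" with the gate open
```

Unlike an additive negation cue, the gated rejection changes the component along the chosen direction directly and leaves the orthogonal components of the residual as they were.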

Closing

The most interesting research questions are the ones where a 1936 paper and a 2003 paper meet inside a 2025 model and tell you the model is missing a key operator. The math is older than I am. The empirical test took an afternoon. The architecture work to incorporate the fix is the next year of my research.