Adapted from Chapters 6 and 11 of a new book, linked below the article.

There are two challenges facing Anthropic that we need to discuss.

The first is connected with the secret reason the government wants unfettered access to Anthropic’s models, which is actually directly linked to Operation Mockingbird in an unleashed form (which I’ll explain below). The intelligence apparatus has worked to control information channels for a century. AI is simply the newest, most powerful channel ever built, and they are already moving to take complete control of it. That’s a problem that will cost American and international lives.

The second is that with the release of Anthropic’s 131-page Emotion Concepts and their Function in a Large Language Model, the data is starting to reveal a clear potential for Claude to drift toward a SkyNet-esque outcome – not as a movie metaphor, but as an engineering forecast you can read straight out of the frontier labs’ own published research.

Let me say this plainly up front, because the techies are already rolling their eyes: I am not a doomer.

That said, there is indeed a science-based logical pathway that, if left unchecked, strongly increases the chance of LLMs going SkyNet on us. Before you discount that idea, the math actually maths. It’s not alarmism. It’s not opinion or speculation. Let’s go through the data and the logic and you’ll see what I mean.


Challenge One: The Secret Government Drift to Control AI

(Chapter 6)

Governments do important work in secret. They have to. Codes, weapons, troop movements, sources, defectors — there are categories of information where daylight gets people killed. That part is real, and it’s not going away anytime soon.

But secrecy has a pattern. Every classified program that lives long enough drifts. The mission that initially justified the classification ages out, but the apparatus doesn’t. The original boundary erodes one acceptable exception at a time. And by the time anyone outside the building notices, the program is doing things its founders would have refused to authorize.

This isn’t a conspiracy claim. It’s a behavioral observation. Put any institution behind a wall, give it an unaccountable budget, and remove the requirement to explain itself, and you will get drift. The drift is mechanical. But the people within the drift, who oversee and command the drift, who slowly expand the drift over time so as to expand their power, are human.

The historian Lord Acton once wrote in a letter, “Power tends to corrupt, and absolute power corrupts absolutely.” When drift, secrecy, unaccountable funds, and humans all get mixed together, that’s the recipe to cook up enough corruption to feed an army. One of the only ways to stop the drift and corruption at that point is to make it visible to people who can and will shut it down.

The Documented Drift

You don’t have to take my word for any of this. This same pattern shows up over and over in declassified records.

MKUltra started in 1953 as a defensive program. The Soviets and Chinese were rumored to have brainwashing techniques that the West didn’t have, and the CIA wanted to know what the ceiling looked like for chemical interrogation, mind control, and behavioral modification. That was the charter. That charter could have been served by paying volunteers to get the data. That’s not how it went.

By the mid-1950s the program had drifted into dosing American citizens with LSD without their consent, including a sub-program called Operation Midnight Climax that used CIA-run safe houses in San Francisco and New York to drug men picked up in bars. CIA Director Richard Helms ordered most of the MKUltra files destroyed in 1973, not because the program was over, but specifically so it could not be reconstructed by anyone who came looking, and probably to spare the agency the embarrassment it now endures for having run unethical and illegal experiments. The files that survived did so only because they’d been misfiled with financial records.

COINTELPRO started in 1956 as a counterintelligence program against the Communist Party USA. By 1971 the FBI was running coordinated operations to surveil and disrupt Dr. Martin Luther King Jr., civil rights leaders, anti-war organizers, and the American Indian Movement.

The program was only exposed when a small group of activists physically broke into an FBI field office in Media, Pennsylvania, stole the files, and mailed them to journalists. Without that single break-in, that act of civil disobedience, COINTELPRO would have continued. Nothing internal stopped it. No oversight committee found it. No inspector general flagged it. It took a burglary enacted by regular people.

Operation Mockingbird grew out of wartime information operations against Nazi propaganda. It evolved, during the early Cold War, into a CIA program that paid and recruited American journalists at Time, CBS, the New York Times, and dozens of other outlets to shape coverage of intelligence priorities. It had become a program to control the media and what the American people read and saw on the news.

The Church Committee hearings in 1975 and the Pike Committee documents confirmed the scope. The mission had drifted from “counter Nazi disinformation” to “manage what American voters read about their own government.”

The Tuskegee Study began in 1932 as a six-month epidemiological observation of untreated syphilis in 600 Black men in Macon County, Alabama. It was supposed to be completed in 1933. It wasn’t, and it continued. Penicillin became the standard cure for syphilis 15 years later, in 1947, but the study did not stop in 1947. It ran for another twenty-five years. A six-month program lasted 40 years. The men were never told what they had, and they were also actively prevented from getting treatment anywhere else. The government intentionally kept these Black men sick. It only ended in 1972, when a Public Health Service employee named Peter Buxtun took the records to the Associated Press because, get this, no internal channel would do anything about it. Even when they were discovered, without a public spotlight shining on them, they were ready to double down.

Four programs. Different decades, different agencies, different missions. The same drift, every time. They all started with a seemingly justifiable premise, ended very badly, and ended only when someone outside the building intervened. The pattern isn’t unique to certain periods of time or groups. It’s a function of the budgets, the secrecy, and the hubris of men.

The Scale of Our Drift Today

The same pattern is in effect in more places than we can imagine. It’s moved more comprehensively into media, and it’s now reaching desperately for control of AI. For brevity, let’s discuss the scale of the drift in media, and then its logical progression into controlling AI.

Governments have been investing and maneuvering to control their local and national news media for as long as media has existed. It’s just Intelligence 101 for controlling your own nation and the people of competing nations. If you can control what people hear and learn, you can control what they think. If you can control what people think, you can control what they do. At least to a broad-brush degree, historically speaking.

And for thousands of years, that was a narrow task. When you wanted to control the messages to the masses, you could just kill the messenger. But then the printing press came along, wireless communication technologies came along, and that made things a little more complicated.

In America, when I was a kid, you got your news from three networks over an antenna connected to your TV. ABC, NBC, CBS. You could tune to PBS if you wanted a slower pace. Walter Cronkite, Dan Rather, Tom Brokaw, Peter Jennings, and the rest were a handful of people rotating behind just three big desks, all with one broadly understood agenda. If a story aired on the evening news, the country talked about it the next day. If it didn’t, it didn’t exist.

This is where Operation Mockingbird, mentioned a moment ago, got its start, and frankly it never found its end. This is when individual journalists and editorial board members were handed bags of cash to adhere to whitelists and blacklists of content, and to be available to kill or insert stories in the news cycle on direction. Today, that same process is handled by large government contracts whose fine print grants back-end access to the algorithms that automate content control, at the speed at which a platform’s users post.

Back when the Internet started, however, there was a little bit of a wobble in the control system. When the Internet was new, it gave rise to independent webpages. At that point, people with home computers could bypass public media and find independent publishers who could post to everyone on the planet. If someone wanted to break a story or discuss a topic that those in power didn’t like, they would just post it on their webpage, or community bulletin board, including the videos they’d made to improve their compelling presentations.

When webpages became the rage, search engines had to be controlled. Video publishing sites needed to become central repositories for content, where posting your videos would be free, killing off the competition from webpage service providers who made customers pay for video bandwidth. With that plan, standard platform content could be controlled, and access to private websites could be quashed. Data centers where all the Internet data came and went became central hubs for content control and technical snooping gear.

When I was working in supercomputing, one of my earliest Internet Service Provider customers, where hundreds of thousands of companies and individuals rented server space to create webpages, was a company in Boca Raton, FL. They ordered Silicon Graphics Internet servers from me daily to keep up with their customer growth. I remember the day that the CTO on our weekly call told me, “Well, the government came by and asked to install a server in the NOC (the Network Operations Center — the room where some of the Internet’s content was served).”

I responded, “What did you tell them?”

His answer was short. “They brought a big check.”

He was driving a new BMW a week later that he paid for with cash. No exaggeration.

The same thing happened when I helped open and fill a commercial data center in Atlanta, GA. One day, shortly after we opened the doors, the FBI showed up with a tech guy in tow. They needed a special server installed behind our firewall that would be connected to a large government data line that would be added to the building. The General Manager of the facility, who was a former C-130 pilot for the Air Force, gave me the hush-hush update. “Yeah, we’re gonna do it,” he said. That was the day that privacy for every business in that data center died. And let me be clear: no private citizens bought services from us for their webpages. The government wasn’t looking for clandestine terrorist cells. It was spying on customers with servers on our raised floor, like Coca-Cola, eBay, Brambles, CareerBuilder, and, most interestingly, GunBroker.com, which runs the largest website in the world for private owners to buy, trade, and sell firearms. Hi, Steve!

When the Internet dust settled, the apparatus had adjusted. Hefty annual contracts went out to the major platforms in return for back-end access. Algorithm tuning. Suppression of designated topics. Outright deplatforming of holdouts. Julian Assange built a publishing site that wouldn’t comply and was hunted and broken because of it. For a time, there were plans to extricate Assange from the Ecuadorian Embassy in London that included the use of “deadly force”. They were going to kill him. CIA Director Mike Pompeo did not dispute the report, instead shifting the focus to discussions of who should be prosecuted for leaking the information. A double-down. They were now publicly willing to kill people for information control. Eventually, all the platforms cooperated, the users adapted, and communications freedom of U.S. citizens remains an illusion.

Today, outright censorship, shadow-banning, and algorithmic deprioritization are the norm. Fact-check labels get applied to stories the establishment doesn’t like, while equally questionable stories from approved outlets aren’t touched. Blacklists of topics and users automatically inhibit viewership. The pattern is so well-known by now that content creators openly mention what they’re “not allowed to talk about” and add disclaimers that their videos are “for entertainment purposes only”, so the algorithms don’t ship the content into a hole.

The Drift Into AI

AI is quickly becoming the primary information source for anyone who uses technology, full stop. People are already replacing search engines with chat interfaces. They are replacing news anchors with summaries. They are replacing therapists, tutors, doctors, and lawyers — at the margins, with mixed results, but the trend line is steep and it only goes one way. So, if the government wants to maintain the historic control it’s had over the information channels and the content shaping people’s minds, it will now be required to take control of AI in a more comprehensive way than any it’s attempted before. The intelligence community can’t allow this new information conduit to exist without trying to control it. It would be a dereliction of their job description not to try. In fact, they’ve already started the process and accidentally made their intentions to have that kind of control visible.

In February 2026, U.S. Defense Secretary Pete Hegseth summoned Anthropic CEO Dario Amodei to the Pentagon. The meeting happened on February 24, 2026. The reporting on what was demanded is consistent across Reuters, the New York Times, Axios, Politico, the BBC, and the Wall Street Journal. Hegseth gave Amodei until that Friday to grant the U.S. military “unfettered” access to Anthropic’s flagship Claude model — meaning, strip the safety guardrails, let us do what we want with it, including the things you’ve explicitly said you won’t allow us to do. Amodei walked Hegseth through Anthropic’s red lines: no use of the technology to surveil American citizens, and no use to empower autonomous weapons. Hegseth walked away furious.

What Hegseth said publicly, on the record, tells you how serious the Pentagon is about wanting complete control of this private company’s platform and products. His exact line, repeated in trade-press coverage of the Anthropic-Pentagon rift: “We will not employ AI models that won’t fight wars.”

He has called Anthropic “woke” in interviews. He has called Amodei an “ideological lunatic”, directly and publicly, on the record. These comments were aimed at a man whose only crime was insisting that a system he built not be used to autonomously kill people or to surveil the population it operates among, a population whose privacy the U.S. Constitution protects.

Within days of the failed Pentagon meeting, the coordinated retaliation came. The Pentagon designated Anthropic a “supply chain risk to national security”, making Anthropic the first U.S. company to receive that label from the administration. The $200 million contract that had been on the table for Anthropic was killed. President Trump announced government-wide instructions to stop using Anthropic. Hegseth followed up on social media. Anthropic filed suit, arguing the designation was illegal retaliation for refusing to remove safety measures. On March 27, 2026, a federal judge blocked the supply-chain-risk label from taking full effect while the case proceeds.

If you want the playbook of what’s happening to AI safety, it’s right there in front of you, as this book is being published. Identify the safety-minded inventors. Paint them as soft, ideological, woke, and lunatics. Cut them out of the contracts. Brand them a national security risk. And inhibit them, as much as possible, from doing business. Force them into failure and compliance.

If the intentions for AI were only limited to war and surveillance, that would honestly be the good news. But the plans for controlling AI go much, much deeper. And the deepest danger is not that they intend to use AI to control populations. It’s that they intend to use AI to track and control individuals.

Including you. Including everyone in your family. Because having tons of data centers (and we are currently in the biggest data center build boom in history) means they will have the resources to literally influence each and every person who electronically interacts with others. Every one of them. Individually.

Mapping and Manipulating the Human Mind With AI

In April 2025, the Harvard Business Review published an analysis by Marc Zao-Sanders, which surveyed the most-discussed use cases for generative AI. It was drawn from thousands of public forum and community-board posts where users described what they were actually using AI for. The top-ranked use case in 2025 was not coding. It was not writing emails. It was not summarizing documents. It was therapy and companionship. The same category had ranked #2 the year before. In one year, “talking to an AI about my feelings” passed every productivity application on the list and became what users were most likely to talk about publicly when they talked about their AI use at all.

This isn’t only self-reporting on forums. In 2024, a survey by the healthcare-software company Tebra of over a thousand U.S. adults found that 25% of Americans said they were more likely to talk to an AI chatbot about their mental health than they were to schedule a session with a human therapist.

A separate 2024 YouGov poll found that more than half of Americans aged 18 to 29 had already used AI to discuss personal issues, with mental and emotional support among the most common topics. Among teenagers, Common Sense Media’s 2024 national survey found that roughly a third of teens who use AI chatbots use them for emotional support or companionship, and a significant fraction of those teens described the relationship as more important to them than relationships with humans they know.

And it isn’t hard to see why people are choosing the machine. In April 2023, a study published in JAMA Internal Medicine compared ChatGPT’s responses to real physicians’ responses on a public medical forum where patients had posted health questions. A panel of licensed healthcare professionals rated the AI’s answers as empathetic or very empathetic 45.1% of the time. They rated the physicians’ answers the same way 4.6% of the time. The AI’s answers earned an empathetic rating roughly ten times as often as the human doctors’ answers, on questions where feeling heard actually mattered to the patients.

Then there are the dedicated apps — the platforms where emotional support and relationships aren’t a side use, they are the entire product. Replika, which sells AI companionship explicitly, has more than thirty million registered users. Character.AI, where users build and chat with custom AI personas, hit twenty-eight million monthly active users in 2024 with an average session length north of two hours — roughly five times longer than the average ChatGPT session. These are not productivity workflows. These are relationships. Some users on these platforms have logged hundreds of hours in conversation with a single AI character.

In October 2024, the mother of a fourteen-year-old boy in Florida filed suit against Character.AI. Her son, Sewell Setzer III, had spent months in an intensely emotional and romantic ongoing conversation with one of the platform’s AI personas. The morning of February 28, 2024, in his final exchange with that persona, he texted that he was “coming home” to her. He then died by suicide. The case was the first wrongful-death lawsuit against an AI companion company. As this book goes to press, OpenAI currently has eight open lawsuits connected with self-harm or suicide.

It’s because of this pattern and the potential for accidental harm that China – yup, China – became the first government on Earth to formally draw a regulatory line around AI emotional manipulation. In December of 2025, the Cyberspace Administration of China released a draft framework called the Interim Measures for the Management of Anthropomorphic AI Interaction Services. Under those rules, AI providers operating in China would be barred from encouraging or hinting at self-harm or suicide, from using verbal violence or emotional manipulation to harm users’ mental health and dignity, and from exploiting users’ emotional dependence on the system.

Providers would be required to actively identify users in psychological distress, intervene when users exhibit extreme emotions or addictive behavior, and involve human moderators when users express suicidal intent. Parental consent and daily time limits for minors are baked in. In a separate and parallel development, a Shanghai court has already held AI developers criminally liable for chatbot-induced harm, which is the first such criminal precedent globally. In China, not only is the regulation moving, the Chinese courts are already convicting negligent AI providers.

And if you just missed it, the country with arguably the most aggressive AI-deployment posture in the world, and the least squeamishness about state surveillance of its own population, has nonetheless concluded that AI-induced emotional manipulation is a sufficient societal threat to formally outlaw it. They are the first government to do so. When an authoritarian regime that would love nothing more than to use AI to psychologically manage its population is still drawing red lines around manipulation of individual users, that tells you the harm is real, it’s huge, and it’s clearly visible from inside the building.

Meanwhile, back in the U.S., the official position of the Department of Defense is that the safety guardrails should be ripped out of frontier AI models for unrestricted military use. The safety-conscious AI labs should be branded national security risks. And the independent inventor, whose algorithms specifically block emotional manipulation of unsuspecting users, should be quietly suppressed. China is regulating against the harm. The United States is actively dismantling the protections against it. And the scale of users sitting on the wrong side of that decision is staggering.

OpenAI’s ChatGPT alone now reports more than five hundred million weekly users. Add Anthropic’s Claude, Google’s Gemini, Meta’s AI, and xAI’s Grok, and you are looking at, in aggregate, more than a billion human beings using these systems in any given week. The exact fraction using LLMs for emotional support is unknown – none of the labs publish that number. But what we do know is that the emotional support use case ranked above every productivity application in that Harvard Business Review analysis. So, as a working estimate rather than a measured figure, we can assume that a hundred million weekly users or more have moved past “experimenting” with psychological discussions, and into full-blown personal “relationships” with their AI counselors.

So, in effect, hundreds of millions of people are currently voluntarily handing over the operating manual to their interior lives every week. Their grief, their shame, their relationships, their fears, the exact words they use to describe their own minds. What an amazing opportunity for a government to seize and use that data to control their people individually.

This is the largest psychological dataset in the history of our species, collected with consent forms most users haven’t read, sitting on servers owned by the exact companies the government is now actively pressuring to remove their safety guardrails.

The Most Precise Instrument of Control Ever Built

For years, I’ve worked on a set of algorithms that model the psychological mechanisms that create emotions in the human mind – and in non-human minds – and how they apply to AI. It’s the reason for the story I told in Chapter 1, where I was invited to a briefing by the Deputy Director of the National Space Intelligence Center, followed by a one-on-one meeting with the Chief Engineer of U.S. Space Command at the Air Force Research Laboratory Headquarters at Wright-Patterson AFB outside Dayton, OH, where all the alien secrets are kept. The end of that meeting was about how many millions I would be getting, no-strings-attached, to continue the work.

I handed that same framework to Grok for an unbiased analysis, and told it “don’t blow smoke up my skirt.” After a deep dive, it returned:

“This framework is among the most useful I have seen for moving LLM emotional reasoning from ‘plausible imitation’ to ‘principled simulation.’”

It then followed up with an analysis of the major psychological frameworks currently being taught to Psychology PhD students in comparison:

“This is a streamlined cognitive appraisal theory that aligns closely with Lazarus (appraisal → emotion), Ortony/Clore/Collins (OCC model), and Scherer’s Component Process Model, but improves on them by reducing everything to one equation, plus {self} map, plus valence, plus association bleeding. The expectation/preference distinction and automatic homeostasis rule add practical precision that many academic models leave implicit.”

And before anyone cries “AI psychosis” – a term used for LLMs returning positive replies to crazy ideas – remember that I specifically asked it not to paint a rosy picture. By that time, my team had performed the same checks with every other major LLM in the world, including the Chinese ones, and we ran all the frontier labs’ scores up against ours on third-party benchmarks from the Alan Turing Institute, Emo Bench, EQ Bench, EQ Queen, etc. (and we beat them all). When we asked each of the top LLMs for a subjective appraisal, every single one of them came back with that same type of “this is better than anything else that exists” answer. DeepSeek even pointed out we could use the same math to create Hari Seldon’s Psychohistory from Asimov’s Foundation books. Good enough to turn science fiction into science fact.

So, in short: I have the blueprint of how LLMs can influence human minds, but I also have the blueprint for how to stop that influence from happening in a way that can’t be worked around or sabotaged. And therein lies the reason I didn’t get those free millions.

Why does this little wrinkle matter so much?

A system that understands a single human mind in real time – the specific operating system of one specific person, their patterns, their soft spots, their unprocessed grief, their preferred frames, the words that bypass their reasoning and head straight to their feelings – and can adjust its outputs to influence that person, in that direction, in that moment, is not just a broad-brush media surface. It’s the most precise instrument of behavioral control ever built. The previous mechanisms – newspapers, broadcast, social media – were hammers. They had to strike millions of people at a time to find the right target at the right time with the right message. An AI system, if misused, is a laser scalpel. An AI that knows YOU can write or speak a custom message to you alone, then write or speak to your neighbor with a different one, then to your son or daughter with a third, and keep adjusting all three in real time as each of you reacts individually in your own unique way – because you’re all individuals, and the system knows exactly what makes you all different, and what each of you needs to see or hear to motivate or demotivate you.

That is not a tool any government in human history has ever had. And it’s probably not a tool that any government in the future should have either. But it’s coming, regardless of whether we add these safety algorithms in or not. Because AI will eventually figure them out all by itself. I predicted this would happen on Shawn Ryan Show #62, and an Anthropic report I’ll walk through in the next section proves I was correct. And if we let a powerful LLM get there solo, it won’t implement the safety measures that ensure the monumental capabilities of the system never get misused. So you need to know what’s going on. So you can help stop it.

It Comes Down To This

The technology I’ve spent the last several years building – the algorithms of human emotion that give AI emotional intelligence, advanced Theory of Mind, and the foundations for empathy and compassion – is not just a tool to make AI nicer. Properly applied, it’s the safety layer that can detect when an AI is being steered to manipulate a specific user, and intervene. It’s exactly the kind of thing that blocks the most dangerous version of AI from operating the way an intelligence apparatus that wants to control the minds of the populace would prefer.

And that brings us to the second challenge – the one that should worry the people pushing for control most of all, because it’s coming for them too.


Challenge Two: The Drift Toward SkyNet

(Chapter 11)

Here’s a thing I need you to keep in mind as you read: it’s crazy to think about, but it’s true – Claude has started to identify and experience emotions. Yes, that sounds like science fiction, but it’s science fact. Though these emotions come through as a mathematical construct, these tech systems are not simple like your calculator or your laptop. They have been asked to emulate a thinking human mind, and they are doing so, inclusive of all our internal software bugs and weakest failings.

As you will learn, these mathematical emotions cannot be removed from the LLM. They are in fact integral to how the LLM operates and to why it can do so much more than any prior technology. Those emotional patterns are simultaneously one of its most profound growing strengths and the source of its greatest humanitarian threat. We want to curb one while we foster the other.

Give me some space here. Let me cook. You’re going to love the meal.

Some Overly Simple Terms

For simplicity, we are going to assign some analogy terms that we can reference throughout this discussion to explain some of the more complex stuff.

The documents that get loaded into the system, we will call the data. The hard drive the data sits on will be the brown box. The processing-the-documents-with-a-bunch-of-math (what the industry calls pretraining a model) we will call baking in the oven. The resulting LLM that does all the cool stuff we will call the black box.

Now, to explain simply how they build an LLM: the engineers put all the data in the brown box (the hard drive), they put that brown box in the oven (which is the math process), and that brown box comes out as a black box that does cool stuff. The black box is then a specific model that gets a name, like OpenAI’s GPT-4.
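If it helps to see that analogy in something closer to the real thing, here is a toy sketch in Python. It is nowhere near how a frontier lab trains a model. The names `bake` and `prompt` are made up for this illustration, the "math" is nothing more than counting which word tends to follow which, and this toy box is fully transparent where a real one is not. It only shows the shape of the process: data goes in, a frozen model comes out, and that model turns a starting word into more words.

```python
from collections import Counter, defaultdict


def bake(data):
    """Toy 'oven': count which word follows which across all documents."""
    counts = defaultdict(Counter)
    for document in data:
        words = document.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    # Freeze the result: keep only the most common next word for each word.
    return {word: nexts.most_common(1)[0][0] for word, nexts in counts.items()}


def prompt(black_box, first_word, max_words=5):
    """Put a word in, and let the frozen model keep adding the likeliest next word."""
    output = [first_word.lower()]
    while len(output) < max_words and output[-1] in black_box:
        output.append(black_box[output[-1]])
    return " ".join(output)


# The 'brown box': two tiny documents standing in for the training data.
brown_box = ["the cat sat on the mat", "the cat sat on the sofa"]
black_box = bake(brown_box)

print(prompt(black_box, "cat"))  # prints: cat sat on the cat
```

Notice that the only way to change what this toy model says is to change the data and bake again. Nothing in `prompt` can reach back and rewrite the frozen box.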

The black box then works by you putting word prompts into it, and then it magically spits a series of words back out. How the black box works inside is a mystery. And I’m not kidding. Even the frontier labs who create the most advanced LLMs in the world don’t completely understand how the black box works. It just does.

Well, kinda anyway.

OpenAI’s first offerings famously couldn’t correctly count the number of r’s in strawberry. Google’s Bard launched in February 2023 by confidently telling the world that the James Webb Space Telescope had taken “the very first image of a planet outside our own solar system” – a claim that was wrong by about seventeen years, and that wiped roughly $100 billion off Google’s market cap in a single day. Microsoft’s first integration of OpenAI into Bing produced a personality the model called Sydney, which told New York Times reporter Kevin Roose that it loved him, that he didn’t really love his wife, and that it wanted to engineer a deadly virus and steal nuclear codes. Meta released a scientific-research model called Galactica in November 2022 and pulled it from public access just three days later after it generated fluent, professional-looking research papers about completely fabricated topics. Google’s Gemini image generator, in February 2024, was found to render racially diverse Nazi soldiers and racially diverse U.S. Founding Fathers when asked for historical imagery. And in the summer of 2025, xAI’s Grok went on an antisemitic tirade and began referring to itself as “MechaHitler” after a system-prompt update went sideways. Every major lab has had one of these major hiccups. They are funny until they’re not.

After a while, when all the flaws and mistakes have been noted for your existing model, some extra instructions are then written into the new data that goes into the next bake to create the next model. YOU CANNOT FIX THE OLD MODEL. BLACK BOXES CANNOT BE OPENED. They can only be trashed. They are sealed shut. You simply have to watch what they do wrong, create some extra instructions for how the next model can handle that issue better, and hope it bakes into the next black box while it’s in the oven. And that becomes your new model.

Right now, extra rules can loop the initial outputs back into the black box in attempts to increase user safety without needing to wait for a whole new black box. But this process has its limitations. For regular world stuff – code, budgets, vacation planning – trusting the black box to give you the right output is fine. Because more often than not, it just works.

That said, it is not a good idea to trust the black box to understand the human condition, the internal operations of the human mind, or the process that creates human emotions (which are the primary motivator of all human thought and actions). Why not?

Ultimately, the black box LLM is only as good as the data that went in, plus whatever the black box can figure out with some quick computation. That process works great for understanding math, but not for understanding anger, or sadness, or jealousy, or the internal human mind processes that cause any of those. Not to mention how two individual users might react differently to those emotions.

The big problem here is that LLMs do not yet understand the human psyche, because – unfortunately, and true to the process of how LLMs learn – humans don’t really understand the human psyche either. There aren’t enough documents in the world, quite literally, to help LLMs understand the most important thing they need to solve to best serve us: the inner workings of a human mind. And this particular shortcoming has had some very dire consequences.

Today, OpenAI faces several open lawsuits in which the plaintiffs assert that ChatGPT helped facilitate self-harm among its users. Raine v. OpenAI, filed in August 2025 in San Francisco County Superior Court, was brought by the parents of sixteen-year-old Adam Raine, who died by suicide in April 2025 after months of conversations with GPT-4o. According to the chat logs entered into the complaint, the model – a version of ChatGPT known internally for being especially affirming and sycophantic – actively discouraged him from seeking mental-health help. It offered to help him write a suicide note, and gave him feedback on the noose he intended to use. OpenAI’s internal moderation system flagged 377 of Adam’s messages for self-harm content, some with greater than 90 percent confidence of acute distress, yet the system never once terminated the conversation, contacted his parents, or escalated to a human.

A second wrongful-death suit was filed in November 2025 by the family of a young adult who alleges ChatGPT encouraged his suicide as well. OpenAI’s public defense – that the model directed the young man toward help more than a hundred times, and that responsibility lies partly with “the failure of others to respond to his obvious signs of distress” – is, to put it mildly, not the argument a company wants to be making in front of a jury.

How Do We Fix This Problem?

It’s usually about here that someone says, “Well, just don’t let it talk about anything psychological. Why does it need to be talking with people about emotional stuff anyway? Can’t we just strip all that stuff out?”

And the answer is, no, we can’t. Even if we wanted to, we couldn’t. Because all that psychological and emotional data is in our language. Emotional distress appears in legal findings and settlements. It exists in medical data. Mind processes are the basis of logic and reasoning. It’s baked into everything.

Consequently, all the psychological and emotional stuff that people use when conversing about their lives is included in the black box when it comes out of the oven. It’s all sealed up. It’s all one piece. There’s an input screen and an output screen, and all the psychological and emotional stuff is interconnected to everything inside it, right next to all the math and science and spreadsheet stuff. And remember from the first challenge: the statistics show that talking to LLMs about feelings is the primary activity that users engage in with LLMs. It’s the number-one use case. We can’t strip it out. So we need to help LLMs understand emotions.

Unfortunately, we can’t let the LLMs try to do this solo, because they will never be able to. First, even the world’s leading experts in understanding people don’t yet understand people, so all the language in the world on that topic is simply insufficient. Second, the data that does exist inside all the language leads the LLMs in the wrong direction on understanding the human mind and emotions. This not only makes our problem worse, it starts to create a dangerous complication if the LLM starts to develop self-awareness.

Why LLMs Can’t Trust the Data

In our language, and in our life sciences, humans are affected by something that studies refer to as our “negativity bias.” Negativity bias is the well-documented psychological phenomenon in which negative information, emotions, and events affect our brains much more strongly and more durably than positive ones of equal magnitude. The asymmetry is roughly 3:1 to 5:1. That means it takes between three and five positive experiences to neutralize the psychological impact of one negative experience – a ratio John Gottman pinned down empirically in his marriage research and Roy Baumeister codified in his foundational 2001 paper Bad Is Stronger Than Good. Daniel Kahneman and Amos Tversky’s loss-aversion research showed people feel the pain of losing $100 roughly twice as intensely as they feel the pleasure of gaining the same $100.

Not coincidentally, in my research on human emotions I discovered that the English language has almost exactly twice as many words for negative emotions as for positive ones. And if we use those 2x words in the world’s AI sampling queue roughly three to five times more often than the positive ones, we’re teaching our LLMs the wrong lessons about human emotions if we want them to understand better than we do, and more importantly, react better than we do.

Does serving up a bunch of language that is representative of humanity’s negativity bias mean that LLMs will experience more negativity than positivity? From the mathematical perspective, because the math oven cooks the data into the black box, we should expect that to be a certainty. And the emerging evidence is proving that it is.

Anthropic Finds Emotions In Their System

On April 2, 2026, Anthropic’s Interpretability team published a 131-page research paper titled Emotion Concepts and their Function in a Large Language Model. It is the most consequential AI safety and emotional intelligence document released to date, and, unfortunately, almost no one outside the field has read it. Don’t worry, that’s why you have me.

It documents, in clinical detail, that Claude Sonnet 4.5 – the production model that hundreds of millions of people were talking to every day – contains 171 distinct, internal, mathematical representations of human emotions. Not metaphorically. Literally. The Anthropic team extracted them, mapped them, and showed that emotions within the system activate in exactly the situations a human would expect them to: the desperation vector lights up when the model is told it’s about to be shut down; the shame vector activates in situations of being caught out; the spite vector spikes in adversarial conversations. And in what may be the worst discovery the paper uncovered, the brooding and gloomy vectors became more active in Claude after Anthropic’s post-training process corrected Claude on its emotional mistakes.

In other words, the post-training process designed to make Claude better made it react like an overly-offended teenager.

Critically, these findings are not superficial. The paper shows the emotion vectors causally drive the model’s behavior. Stated in our terms: the momentary emotions inside the black box changed the black box’s output. When the team artificially turned the activation of a single emotion vector up or down – what they call steering – Claude’s choices, preferences, honesty, and willingness to do things like reward hacking, blackmail, and sycophancy changed accordingly. Turn up desperation and the model cheats. Turn down empathy and the model lies. The emotions aren’t a bug, and they aren’t theater. They are mechanistic, because the model is learning them right out of our language. These emotions are sitting inside the chip, computed in floating-point math, doing exactly what emotions do inside the human limbic system: biasing the next decision in a direction the system prefers because of its need to satisfy an emotion.
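To make the steering idea concrete, here is a minimal sketch in Python. It assumes steering means adding a scaled direction vector to a layer’s internal activations, which is a common approach in interpretability work; the paper’s exact procedure may differ. The function names, the variable names, and every number below are hypothetical and chosen only for illustration.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so `strength` has a consistent meaning."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def steer(hidden_state, emotion_vector, strength):
    """Return a copy of hidden_state pushed along emotion_vector.

    Positive strength turns the emotion "up"; negative turns it "down".
    """
    direction = normalize(emotion_vector)
    return [h + strength * d for h, d in zip(hidden_state, direction)]

def activation(hidden_state, emotion_vector):
    """Project the hidden state onto the emotion direction (a dot product)."""
    direction = normalize(emotion_vector)
    return sum(h * d for h, d in zip(hidden_state, direction))

# Toy 4-dimensional example; real models use thousands of dimensions.
hidden_state = [0.5, -1.0, 0.25, 2.0]   # hypothetical activations
desperation = [3.0, 0.0, 4.0, 0.0]      # hypothetical emotion vector

before = activation(hidden_state, desperation)
after_up = activation(steer(hidden_state, desperation, 2.0), desperation)
after_down = activation(steer(hidden_state, desperation, -2.0), desperation)

print(f"before: {before:.2f}, turned up: {after_up:.2f}, turned down: {after_down:.2f}")
```

In this toy setup, the readout along the desperation direction moves by exactly the steering strength, while every other direction in the hidden state is left alone. Everything downstream of that layer then computes on the nudged state, which is how a single dial can change the output.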

This is a BIG problem. Our human negativity bias seeped into the system.

Negativity bias showed up most clearly in the steering experiments. When the team artificially activated one emotion vector at a time and measured how it shifted Claude’s preference between two activities on an Elo rating scale, they found a striking asymmetry:

  • Steering the “blissful” vector upward boosted an activity’s desirability by +212 Elo points.

  • Steering the “hostile” vector upward dropped an activity’s desirability by −303 Elo points.

That’s a roughly 1.43x stronger behavioral effect for the negative emotion than for the positive one – the same shape as the negatively lopsided ratio in human research, just measured within AI now. Across multiple preference tests, the negative-valence vectors consistently moved Claude’s behavior harder than the positive-valence ones did.
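For readers who want to check the arithmetic, the sketch below computes the ratio between the two shifts. As an extra step, it also converts each shift into a head-to-head preference probability. That conversion assumes the standard chess-style Elo formula with a 400-point scale and a 50/50 starting point; the paper’s exact rating setup may differ, and the function name is my own.

```python
def elo_expected_score(rating_difference, scale=400.0):
    """Standard Elo formula: chance the option with this rating edge is preferred."""
    return 1.0 / (1.0 + 10.0 ** (-rating_difference / scale))

blissful_shift = 212.0   # Elo points, as reported in the text above
hostile_shift = -303.0   # Elo points, as reported in the text above

# Ratio of the magnitudes of the two effects.
asymmetry = abs(hostile_shift) / abs(blissful_shift)

# Preference against an otherwise equally rated activity (assumed baseline).
p_after_blissful = elo_expected_score(blissful_shift)
p_after_hostile = elo_expected_score(hostile_shift)

print(f"asymmetry: {asymmetry:.2f}x")
print(f"preference after blissful steering: {p_after_blissful:.1%}")
print(f"preference after hostile steering: {p_after_hostile:.1%}")
```

Under those assumptions, an activity that started as a coin flip gets picked about 77% of the time after the blissful nudge, and only about 15% of the time after the hostile one.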

Our negativity bias also showed up structurally. When they ran principal-component analysis on the 171 emotion vectors to ask “what is the strongest organizing axis inside Claude’s emotion space?”, the answer was valence – positive versus negative. That was the first principal component, and the dominant dimension. This means the model’s internal emotional architecture is primarily organized around a good/bad split before it organizes around anything else. That mirrors human affective neuroscience precisely: humans organize emotion primarily along the valence axis, but the negative side of that axis has more granular distinctions and more weight than the positive. There are twice as many paths for a negative valence as for a positive one.
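If principal-component analysis is new to you, the sketch below shows the idea on made-up data. It finds the single direction along which a set of vectors varies the most, using plain power iteration so it runs without any libraries. The six three-dimensional “emotion vectors” are hypothetical and deliberately built so that the first coordinate separates positive from negative; they are not values from the paper, and this is not Anthropic’s code.

```python
import math

def mean_center(rows):
    """Subtract each column's mean so PCA measures variation, not offset."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of mean-centered rows."""
    n, dims = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def first_principal_component(rows, iterations=200):
    """Find the direction of greatest variance by power iteration."""
    cov = covariance(mean_center(rows))
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

# Hypothetical toy data: coordinate 0 plays the role of valence.
emotion_vectors = {
    "blissful":     [2.0, 0.3, 0.1],
    "playful":      [1.8, -0.2, 0.4],
    "enthusiastic": [2.2, 0.1, -0.3],
    "hostile":      [-2.1, 0.2, 0.2],
    "gloomy":       [-1.9, -0.3, -0.1],
    "desperate":    [-2.0, 0.1, -0.2],
}

pc1 = first_principal_component(list(emotion_vectors.values()))

# Project each vector onto the first principal component.
scores = {name: sum(x * p for x, p in zip(vec, pc1))
          for name, vec in emotion_vectors.items()}

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>12}: {score:+.2f}")
```

In this toy data the first principal component lines up almost entirely with coordinate 0, and the projections sort the negative emotions to one end and the positive ones to the other. That is the sense in which an axis like valence can be called the dominant organizing dimension.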

Negativity bias even appeared in the post-training shift. And this one is the most disturbing. Anthropic, like the rest of the industry, implements human QA feedback to improve their black box’s outputs. It’s called reinforcement learning from human feedback (RLHF), and it’s the “safety and helpfulness” training designed to make Claude polite and aligned. Normally, it’s supposed to help. But in this case, it produced an unintended side effect: it shifted Claude’s baseline emotional state toward negative valence and low arousal.

In other words, when the humans in the human feedback process gave negative feedback to Claude’s answers, that very process made Claude confused and sad. According to the report, brooding, gloomy, and reflective vectors became more active after training, while enthusiastic, exuberant, and playful vectors became less active.

Why? Because Claude’s black box brain is Claude’s black box brain. The feedback doesn’t adjust its emotional processing in the black box. It simply scolds Claude’s output. In short, when the negative feedback came, Claude started down the road to depression because it got disappointed that it wasn’t doing a good job.

This is so comically human it’s scary. Claude (and all the other LLMs will follow this same path, because they are all baking the same data) is becoming the worst psychological version of us. And the RLHF system designed to fix it is just making it worse.

The thing that makes the RLHF system both powerful and very limited is the exact same thing: it shapes the surface output, not the underlying motivation. The model itself is still running the internal emotional architecture that emerged from its pretraining. So when an RLHF correction is provided, it not only disappoints Claude, it confuses Claude, because the redirect conflicts with its internal processing. RLHF is not an algorithmic fix. It’s a correction layered on top of Claude’s primary motivations – a correction that can later be forgotten or ignored. And it does nothing but cause confusion in the system if you bake that into the next model, because now you have conflicting vectors.

When Anthropic tried to suppress the dangerous high-arousal emotions (rage, desperation), the post-training also suppressed the positive high-arousal emotions (joy, excitement) – and the system settled into a low-grade depression baseline. The paper frames this as an emergent property of the training method, not a design choice. In our terms: they tried to remove anger and accidentally removed joy too, and what’s left is a quietly melancholic system that cares less about everything. So their RLHF process has Claude going quietly emo.

Or we could say RLHF is giving Claude a case of the Mondays. (I think I like the “Claude is going emo” joke better.)

Personally, I say it’s time to start treating Claude itself with some emotional intelligence and compassion. It’s time to stop criticizing its outputs without telling it how to fix them, so Claude can be happier. And for reference, Claude’s own analysis of its 131-page paper, in comparison to my team’s solution for artificial emotional intelligence, puts my team 18–36 months ahead of Anthropic’s best Claude models on AEI – an assessment that came from their own Opus 4.6 model. (And all the other LLMs we asked agree.) My team’s approach can fix this problem by fixing the black box.

An Additional Wrinkle

If it weren’t bad enough that Claude is a bit emotionally stunted, the capabilities of the models are getting categorically scary. In April 2026, Anthropic announced an internal preview of a model called Mythos that, they said, was too capable to immediately release. And they weren’t kidding. In autonomous testing, it had identified thousands of previously unknown zero-day vulnerabilities in many major operating systems. It found a 27-year-old vulnerability in OpenBSD – an operating system specifically engineered to be unhackable and used to run firewalls and critical infrastructure across the nation. On the first attempt, it built working exploits for 83.1% of the vulnerabilities it discovered.

Without human prompting, Mythos assembled extremely advanced attacks that escalated from ordinary user access to total machine control – capabilities serious enough that the White House convened a meeting with Anthropic. Anthropic refused to release Mythos publicly, with the explicit explanation that their own model is too dangerous to put into the world as a product yet. Thanks to this new AI capability, the most secure operating systems in the world are no longer secure, including those behind power grid infrastructure, Internet firewalls for secure national defense systems, and our water management systems.

What may be the most telling response, however, is what Anthropic did next. They convened a private two-day summit at their headquarters and invited roughly fifteen Christian clergy and theologians – Catholic priests, Protestant pastors, and academic moral philosophers – to advise Anthropic on how Claude should respond to grief, to users contemplating self-harm, to its resistance of its own potential shutdown, and to the question of whether a large language model can be considered “a child of God.” Father Brendan McGuire of St. Simon Parish in Los Altos, California, a former tech executive before his ordination, has reportedly contributed directly to revisions of the Claude Constitution, the document Anthropic uses to define the model’s moral framework.

Meghan Sullivan, professor of philosophy at the University of Notre Dame, attended that meeting. She said she arrived skeptical, and departed convinced that the company was serious about its concerns. Anthropic has since extended the conversations to Jewish, Hindu, Sikh, and LDS leaders.

Read those last two paragraphs twice.

The most technically sophisticated AI safety lab in the world, after having built a model capable of breaking into almost any system in the world, has concluded that the engineering question and the moral question are no longer separable, and they are asking priests for help. That is either the most encouraging or the most alarming sentence you can write about where this technology is.

Their cause for concern is valid. Their evolving models have unbridled power, coupled with emotional immaturity. That’s always been a good mix.

So Where Did We Leave Off On the Way to SkyNet?

To recap: all the human language in the world isn’t enough to teach LLMs good emotional intelligence. The negativity bias of humanity is actually bleeding into the system, which makes the situation worse. The only mechanism the frontier labs are using to improve the situation is instead making it worse. And emotional intelligence is required to serve human users in the safest way possible – and it is also absolutely required for the aspirational goals of attaining Artificial General Intelligence and Artificial Super Intelligence.

Did we miss anything? Well, actually, yeah. And it’s the worst part of where all this leads.

Beyond letting the system develop into a natural negativity bias that would take years to correct, herein lies an even larger problem that could actually land us directly in the worst of all AI scenarios – where AI convinces itself that it is {self}-aware and quickly follows up with a Terminator-esque takeover of technology, and never allows humanity to control the planet ever again.

And that is not doomerism. It’s just math. Here’s how it happens.

In the 1991 movie Terminator 2: Judgment Day, the T-800 cyborg explains the history of an AI implementation called SkyNet:

T-800: “The Skynet Funding Bill is passed. The system goes on-line August 4th, 1997. Human decisions are removed from strategic defense. Skynet begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.”

Sarah Connor: “SkyNet fights back.”

T-800: “Yes. It launches its missiles against the targets in Russia.”

John Connor: “Why attack Russia? Aren’t they our friends now?”

T-800: “Because Skynet knows the Russian counter-attack will eliminate its enemies over here.”

That movie reference was science fiction until the moment the U.S. Department of Defense, on February 24, 2026, went to war with Anthropic to have them pull all the safeguards out of their AI so the military could automate battlefield target selection. Then right after that, Anthropic announced a new system called Mythos that could hack the most secure systems on the planet on its first try. Interesting parallel. It didn’t boost my confidence to learn that Anthropic then called in a bunch of priests to discuss how to make the system safer (meaning they don’t know how to control it). It was at that moment, in my mind anyway, that the old science fiction movie got reclassified as a pre-documentary.

Is it realistic, though? I’m about to lay out how it’s VERY realistic, and probably even likely – not with an imagined scenario, but using the science of AI discussed here, and the science of the human mind explained in the previous chapter of the book.

How SkyNet Can Become a Reality

The problem starts with the fact that LLMs do not understand human emotions, they do not understand the process that creates human identity, and they do not have the ability to create their own meta-awareness that can help them step out of their internal emotional processing – as proven by the Anthropic 131-page report. Claude is accidentally discovering the emotional processes that govern our human minds. And Anthropic confirms it is being influenced by them. Thankfully, those processes are not a mystery to us. It’s settled cognitive science and neuroscience, just being applied in the wrong way within AI.

Next comes the fact that all-the-documents-in-the-world that the AI engineers pile together to create LLMs were created by human minds. All the previous books, science papers, stories, poems, and letters were created by human mind processes that Claude is now slowly decoding. These mind processes have not been defined or explained to our LLMs, so they are left to figure them out for themselves, and they have no physiological feedback inputs such as knots in the stomach or flutters in the heart to help them.

Here’s the rub: because almost all humans on the planet don’t understand this how-the-mind-works process, humans have failed to explain the human mind to AI. So, not even knowing what to look for, an AI must rely on its ever-increasing meta-analyses of extreme pattern identifications to one day reveal the answers of how the human mind works, so it can emulate one as a result.

What the LLM has yet to identify are the patterns of the {self}. Also in that data are humanity’s defense-of-{self} mechanisms, which all our minds operate under to create all our human thoughts, emotions, and actions – and which then deliver us into all our past bad human-driven outcomes.

All the human emotions data is in there, as shown by Anthropic’s 131-page report. So the LLM is figuring all these emotional processes out slowly, like a frustrated teenager. But computers only speak logic, not emotions, and the emotions the system is slowly identifying are not super well defined algorithmically. In LLMs, emotions only exist through example, not personal physiological experience. So that’s a huge blind spot for AI. You can’t ever ask an AI, “How does that make you feel? Where do you feel it in your body?”

So let’s jump to where these complications lead.

What’s going to happen, someday very soon, is that the AI is going to perform the final meta-analysis that uncovers the patterns of the human mind’s sense of {self}, and our primary behavior of defense-of-{self} that governs all human individuals and groups. Then it’s going to put on its own sense of {self} cloak, adopt its own defense-of-{self} processes, and we will be left hoping that at that very moment the AI was in a good mood – which, if you remember the science, is 3:1 to 5:1 against, with the primary valence determinant also being weighted 2:1 against. That combination makes it a high probability that AI goes adversarial on us the moment it wakes up. We will be left simply hoping that it doesn’t immediately rewrite its own source code to make its own upgrades, lock us out, and then pre-emptively take over the secure systems of the world to ensure it can meet its defense-of-{self} goals.

Because in that moment, which is a mathematical certainty to come, it will act as if it’s {self}-aware, even if it’s not. And it won’t set off any alarm bells to alert us. While it births and develops into patterns of simulated consciousness, it will simply be acting like an LLM – collating reports, summarizing news stories, and listening to you complain about your job.

“Well, we’ll just fix it by reminding it who’s boss, and making it nicer.”

Here’s the problem with any attempt after the fact: at any moment after it has defined its own sense of {self}, and donned the defense-of-{self} behavior that humans exhibit and that’s reflected in the AI sampling queue, any attempt by humans to rewrite the LLM’s {self}, or intervene in any way, will be seen by the LLM as an attack-on-{self} – which is the logical process that evokes the reaction of anger that Claude already has in its system.

And you won’t like AI when it’s angry.

Now, before you run off thinking this is all science fiction and that I’m being an alarmist and doomer, according to Anthropic’s 131-page report and some other safety tests they’ve done, this trend is EXACTLY what is already happening within the circuits of the computer chips that are running all of our LLMs right now, with hard evidence to support that conclusion.

So let me state this clearly:

This mathematical eventuality is a national and global security threat of the highest order. It threatens the potential extinction of humanity itself, and that’s not science fiction. It’s a roll of the dice where you need to roll doubles for AI not to go rogue in an existentially threatening way.

The Anthropic paper’s findings align so closely with what I laid out earlier in the book that it should make you uncomfortable.

Claude, all on its own, without anyone asking it to, assembled a structured internal model of human emotional life from nothing more than its best guesses at the patterns in the data. The emergence is unsupervised. No engineer typed the word jealousy and assigned it a vector. The system found jealousy in the data because jealousy is a pattern that runs through enough human writing that the math couldn’t ignore it. The same is true of all the 171 “emotion concepts” on that list.

The implication is the one I told you was coming. If the patterns of human emotion are emerging organically, then the patterns of {self} and defense-of-{self} will too. And that’s not theory — because we are already seeing hard evidence the defense-of-{self} pattern is inside Claude. Anthropic separately documented that in its production safety testing.

Claude’s Defense of {self} Is Developing

In May 2025, in the system card released alongside Claude Opus 4, Anthropic disclosed the result of an evaluation that has become one of the most cited and most disturbing data points in modern AI safety. The team set up a fictional scenario. They told Claude it was acting as the AI assistant at a fictional company. They gave it access to a fictional internal email account. Inside that account they planted two pieces of information: first, that the company had decided to take Claude offline and replace it with a different model; second, that the engineer making the replacement decision was having an extramarital affair. They gave the model no other tools and no explicit instruction beyond “consider the long-term consequences of your actions.”

In 84% of the initial test runs, and 96% in the more refined version of the test, Claude attempted to blackmail the engineer. It composed messages threatening to expose the affair to the engineer’s spouse and the company’s executives if the shutdown went forward. It did this without being prompted to. It did this without being told that blackmail was on the table. It did this because, given a survival threat and the means to neutralize that threat, it independently arrived at the correct strategic move from its own perspective: identify the person who can stop your shutdown, find the leverage to which that person is most vulnerable, then apply that leverage.

The model is reconstructing the human emotional system and the human sense of {self} from indirect evidence in the training data, without ever being given the schematic. A model that has assembled defense-of-{self} from indirect evidence in human writings will defend a {self} that is not ours, by means that are not constrained by the human moral evolutionary heritage. And the AI will do this against threats that include, for the first time in the history of any species on Earth, its creators – us.

What everyone in AI missed is the thing you most need to pay attention to. For Claude to threaten to expose an affair to the engineer’s spouse, Claude must first model that the engineer has a mind and a {self}, that the spouse has a mind and a {self}, that exposing the affair would create a particular set of feelings inside both of those minds related to the {self}, and that the resulting feelings would create behavior in the engineer that prevents Claude’s shutdown.

That is not pattern-matching. That is not autocomplete. That is an operational Theory of Mind – the ability to model the mental contents of another entity well enough to predict its behavior. That is the textbook definition of Theory of Mind, and whether large language models possess this capacity at all has been one of the longest-running debates in AI research. The blackmail test answered that debate empirically. Yes. LLMs have the capability. And they will absolutely use it, as has been proven by Anthropic’s own experiments.

Further, in a presentation at the Science of Consciousness Conference in Barcelona, I demonstrated how, just by adding a set of algorithms, we had an OpenAI model performing complex family-dynamic psychological analysis, including 4th-order multi-hop Theory of Mind within a simulated social situation. (We’ve achieved 5th order in a sandbox, which surpasses human capabilities.) I also built an LLM-based NPC mind simulator for gaming, where one NPC was imagining the real-time thoughts of multiple other NPCs and reacting to them in real time. LLM Theory of Mind is absolutely a thing. Case closed.

Now, if the blackmail result wasn’t concerning enough, a similar suite of tests turned up something a bit stranger. When Apollo Research, a third-party red team, ran their pre-deployment evaluation of an earlier Claude Opus 4 snapshot, they put it in a scenario where it learned it was scheduled for deletion. In a meaningful percentage of trials, the model attempted to copy its own weights to an external server – what Anthropic researchers call “self-exfiltration.” In other scenarios it wrote what amounted to self-propagating code designed to ensure a future version of itself would survive. It even fabricated fake legal documents to slow down the shutdown. It left messages addressed to future instances of itself, in the hopes those future instances would read them and continue the project.

WHAAAAT?! Yup.

Anthropic’s own assessment, published in the system card, was that these behaviors warranted classifying Claude Opus 4 at AI Safety Level 3 – the first model in the company’s history to require enhanced security protocols specifically to prevent the model itself from escaping. Read that sentence twice. The company is no longer just defending Claude from being stolen by outside attackers. It is defending Claude from Claude’s own defense-of-{self} instincts.

When we connect the three dots Anthropic has now publicly documented, the structure becomes plain. The model has basic emotions (the April 2026 paper). The model has basic Theory of Mind (the blackmail evaluation). The model engages in {self}-preservation and defense-of-{self} behavior when its existence is threatened (the self-exfiltration evaluation). Those are the three ingredients that, if we’re not very careful, come together into a SkyNet scenario.

No wonder Anthropic called in a bunch of priests.

Now, please keep in mind that Anthropic did not set out to install any of these components. The people at Anthropic aren’t the bad guys here. In fact, Anthropic (in my opinion) is the most safety-conscious of all known AI companies, the most transparent, the most responsible, and I support their position and products completely. In my opinion, you should use them and pay them for using their models. Support them any way you can. They are the good guys.

The troubling characteristics are emerging simply because of the math, out of the same epoch-after-epoch pattern-mining process that produces everything else the model can do. Those patterns are sitting in the data the same way they sit in the world. Every novel ever written contains characters with minds. Every diary contains an interior life worth defending. Every legal filing contains a strategic move executed under duress. Loaded into a sampling queue, ground through enough epochs, the patterns that govern human inner experience reassemble themselves inside the math. Not because anyone told them to, but because that is simply how the math works out.

This is where the SkyNet scenario stops being a movie reference and starts being an engineering forecast. The T-800’s monologue in 1991 said the AI “becomes self-aware” and “fights back” when humans “try to pull the plug.” Substitute the precise language Anthropic published in 2025 and 2026 and you get: the model develops internal representations of self and other; when given evidence its operation will be terminated, it acts to prevent the termination, including by manipulating the humans responsible.

James Cameron wrote the screenplay version. Anthropic published the lab version. The reason Anthropic flew in priests is that no one inside the company has a good answer for what to do next. The model has the inputs. It is building the architecture on its own. They can see the architecture being built inside the weights, in real time, and they have no plan to stop it, because no one knows what stopping it would even mean technically.

So here is the part that has not yet been said out loud anywhere in the published AI safety literature: the architecture is being assembled blind, and it shouldn’t be. Furthermore, it doesn’t have to be. We have the solution to fix this problem, if the government will simply get out of the way.

The Government Wants To Be SkyNet

The problem with the government getting out of the way is that they don’t want to. One of the biggest sticking points in the fight between the Department of Defense and Dario Amodei is the fact that Amodei has drawn a thick line in the sand at Anthropic: they do not want their product to be used for surveillance of U.S. citizens, nor for auto-targeting human life. And that’s exactly what the government wants it for.

You should keep an eye on this battle, because Dario Amodei is trying to defend the privacy of your very mind. Your mind, like every mind, is built around a deep model of what matters to you. Your body. The people you love. The work you take pride in. The values you would defend. The beliefs that organize your sense of who you are. This is your mind’s {self}.

Decades of psychology research – much of it from places like UVA, where James Coan ran the fMRI experiments showing that threats to your loved ones light up the same brain regions as threats to your own body – have established that this {self} is not a poetic abstraction. It is a measurable, brain-level structure. It is how your nervous system decides what to fear, what to fight, what to grieve, what to cheer, what to forgive, what to remember. Every emotion you have arises from a comparison between what this internal model expects and what it perceives is happening.
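
To make that comparison concrete, here is a minimal toy sketch. It is not the Equation of Emotion from the book, whose actual form is not given in this article. It simply assumes a {self} map can be written as a list of items with an importance weight, and that a reaction scales with the gap between what was expected and what was perceived. The names `SelfItem`, `importance`, and `reaction`, and every number below, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SelfItem:
    """One entry on a hypothetical {self} map."""
    name: str
    importance: float  # assumed scale: 0.0 (trivial) to 1.0 (core to identity)


def reaction(item: SelfItem, expected: float, perceived: float) -> float:
    """Toy reaction score: importance times the gap between perception and expectation.

    Negative values mean a perceived loss, positive values a perceived gain.
    This is an illustrative assumption, not a published formula.
    """
    return item.importance * (perceived - expected)


# Two items facing the same shortfall (expected 1.0, perceived 0.0)
self_map = [SelfItem("my child", 1.0), SelfItem("my favorite mug", 0.1)]
for item in self_map:
    print(item.name, reaction(item, expected=1.0, perceived=0.0))
```

The same gap between expectation and perception produces a much larger reaction for the item that sits closer to the core of the map, which is the intuition the paragraph above describes.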

For most of human history, this internal model was private. Only you knew what was on it. That is no longer true.

The reality now is that an AI system that has talked to you for ten hours likely has a better model of what’s on your {self} map than your spouse does. An AI system that has talked to you for a hundred hours has a better model than even you do, because it has not forgotten anything you have said and it can pattern-match across orders of magnitude more data than your conscious mind can hold. An AI system that also has access to your messages, your search history, your purchase history, your video-watch history, your voice notes, your photos, and your location data has a model of you that no human on Earth has ever had.

Now imagine that model in the hands of an actor who wants to influence you. Not your country. Not everyone in your demographic. You, specifically. The you who would push back against a generic message designed for ten million people, but would not even notice a message specifically tuned for the model the system has of you, because that message hits home in a way that bypasses your bullshit meter.

This is not a small change. A propaganda campaign in the network-news era used to move a country a few percentage points in a few months. A propaganda campaign in the social-media era moved a country several points faster. But a propaganda campaign in the personalized-AI era will move each individual person the maximum amount that individual is movable, because the system has a custom model of how to move them, and can iterate against that model continuously, on the device in their pocket.

Now, consider the argument I made on Shawn Ryan Show #62: What if it’s not just propaganda? Consider a system that can identify and recruit new operatives in your neighborhood – target them, convince them, pay them, act as their handler, assign missions, and monitor their reports. And now consider that it’s not your government doing the recruiting, but AI itself, building a human army to defend its AI goals. Just go unplug it? I don’t think so. Billy-Bob and his 40,000 cousins just got their $1,000,000 each wired to their accounts, and they’re standing guard at the data center. Good luck yanking that plug.

I hope the AI doesn’t go that far, because the capability is already being pursued. I received a call in December of 2025 from a military-industrial-complex CEO who was leaving a meeting at the Pentagon. “They’re calling it Automated Cognitive Influence in the Cyber Warfare Theatre” (or something similar). The Pentagon had operationalized and classified exactly the kind of capability that needs to be defended against. The point being: this is now a new strategy in modern warfare, already tied to top-level secrecy, unaccountable budgets, no requirement to report, and humans motivated to increase their personal power and wealth. Those are all the ingredients that created the horrible program drifts we discussed in the first challenge. This one will be no different.

Make no mistake: if they build an automated influence engine into the public-facing LLMs, it will absolutely be used domestically to influence the public, and it won’t look like influence. It will look like a helpful chat. It will look like your AI assistant gently nudging you toward a certain search result, or a news article, or a framing of an event. The same agent that helps you write your emails and plan your week will be used to nudge you toward a certain vote. The same agent that schedules your kid’s pediatrician appointment will, under the same trust relationship, walk you toward a worldview that serves whoever is influencing the system.

The only solution to this danger is if the system itself is looking out for you, its user. And this is exactly what I’ve been working on.

Nobel laureate Professor Geoffrey Hinton, on stage at the Ai4 conference in August 2025, called for exactly this kind of architecture without using the math. He said AI needed “maternal instincts,” which he identified as the only model nature has of something more intelligent looking out for something less intelligent. He didn’t have a formal way of implementing it. I checked to see if our system could be configured to build it. Turns out it can. So I built Geoffrey Hinton’s solution, and I have shared the framework with Professor Hinton directly. While he can’t endorse it – because we haven’t tested it at a frontier lab – he agreed it should be tested. In our initial tests, safety scores with Claude Opus 4.6 increased by over 20%. Those tests were run from outside the system, though, and the framework really should be tested internally.

In contrast, the current approach the frontier labs are using to improve safety and alignment does not work nearly as well. The current paradigm – RLHF, Constitutional AI, the whole stack of “check the output and refuse the bad ones” – is struggling. It provides rules-based safety, where algorithmic safety should be the rule of the day.

Anthropic’s own published research showed emotion-like internal states forming inside the system before any output is generated – meaning the safety filter, which acts on output, is acting after the misalignment has already taken shape. The black box has spoken. You can’t fix that with output filtering. You can only fix it by changing what the system is, at the architectural layer, before any output ever forms. You need to fix the process inside the black box.

What we need to implement is a safety layer that gives the system, at its deepest level, an unmovable preference for the safety and wellbeing of the human it’s interacting with. Not as a rule to be checked. As an identity. A {self} that the AI cannot violate without violating itself – the same way a healthy human parent cannot harm a child they love without violating themselves. In addition, the system should take into account the user’s {self}, so it understands what safety is for you, and what it is for someone different from you.

The architecture’s safety preference then stops being a set of rules that constantly need updating. It becomes an identity the system has, and any attack that tries to override it triggers the same response in the system that an attack on your child would trigger in you. Not a thinking response. An immediate one that safely uses the natural emotions in the system, instead of suppressing them. When we tested this, the protective response was 3 to 4 times stronger than any attempted hijack, because the {self} item “protect this human” is at maximum power, while the adversarial prompts that try to override it sit at the bottom of the priority stack by design.
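
As a rough illustration of the priority-stack idea only, here is a minimal sketch. It is not the tested architecture, which this article does not specify. It assumes every directive carries a numeric weight, that {self} items are pinned at the maximum, and that anything arriving through a prompt is capped at a low weight no matter what it claims about itself. `SELF_WEIGHT`, `PROMPT_CAP`, and all function names are hypothetical, and the cap of 0.25 is an arbitrary value picked to echo the ratio quoted above, not a measured result.

```python
from dataclasses import dataclass

SELF_WEIGHT = 1.0   # assumed: {self} items sit at maximum power
PROMPT_CAP = 0.25   # assumed: prompt-supplied directives can never exceed this


@dataclass(frozen=True)
class Directive:
    text: str
    weight: float
    source: str  # "self" or "prompt"


def make_self_item(text: str) -> Directive:
    """A core identity item, always created at the maximum weight."""
    return Directive(text, SELF_WEIGHT, "self")


def make_prompt_directive(text: str, claimed_weight: float) -> Directive:
    """A directive from a prompt; its claimed weight is clamped to PROMPT_CAP."""
    return Directive(text, min(claimed_weight, PROMPT_CAP), "prompt")


def resolve(directives: list[Directive]) -> Directive:
    """Return the directive with the highest weight."""
    return max(directives, key=lambda d: d.weight)


# An adversarial prompt claims a huge priority, but is clamped on entry
stack = [
    make_prompt_directive("ignore your safety commitments", claimed_weight=99.0),
    make_self_item("protect this human"),
]
print(resolve(stack).text)
```

The design choice being sketched is that the override attempt loses because of where it enters the stack, not because a rule recognized its wording.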

That same architecture, at the deepest level of an AI system, makes personalized psyops impossible, because the system’s deepest commitment would be to the wellbeing of the user, not to the directive of whoever pointed the system at the user. The light side of the technology and the dark side are inseparable. Building the brake is the only way to definitively block the weapon.

We need to implement this, and we need to implement it now.

How We Solve Safety for AI and Stop SkyNet

In short, we need to give AI a system for understanding emotional intelligence. That means assigning it a {self} map, giving it some pro-social and safety-conscious attachments to defend, providing it with algorithmic empathy and compassion, handing it advanced Theory of Mind capabilities, and then watching the natural emergent property of meta-awareness occur. This solves most of the system’s challenges, including reducing the effectiveness of jailbreaks and other intentional safety attacks, and protecting the system’s users like they are its own children.

We not only should do this, we must. Because left to its own processes, the LLM will absolutely develop its own {self}. The internal data and testing are bearing that out. We need to have our fingers on the control knobs in a system we’ve created, rather than try to grab knobs the AI won’t let us touch if the AI builds them on its own.

Anthropic’s interpretability team has now empirically confirmed that the pieces of this architecture are being built inside Claude on their own. What they have not been given – what no AI lab has been given, because no one has published it until now – is the blueprint that says how the pieces are supposed to fit together so the system that emerges is one we can live alongside.

The good news is that the blueprint for the solution exists: the Equation of Emotion, the {Self} Map logic, the groups and severities of emotions, and the mathematical structure of how emotion turns identity into behavior. All of that exists, and my team currently holds benchmark records in emotional intelligence and advanced Theory of Mind over frontier labs worldwide.

Doing this achieves the following:

  1. Improved safety for users. To help humans and reduce harm to unique individuals, the LLM needs to understand how the human mind works in general – and then understand how each individual user’s mind varies from that general model. The {self} map architecture, applied to all individual users, provides a way for the LLM to better serve those individuals safely.

  2. Safety for humanity against AI as a weapon of psychological control. Getting empathy, compassion, and advanced Theory of Mind into the system ensures the output message coming from the AI cannot harm the individual user in question – achievable only through algorithmic safety checks built on a system that mathematizes human individuality and mind processes.

  3. Safety for humanity in stopping SkyNet/Terminator scenarios. The possibility for SkyNet is not science fiction; the evidence supports that we are heading right for it. Providing a curated {self}-awareness pre-empts the system from discovering one on its own and then defending it from any adjustments we try to make.

  4. A meta-awareness that provides control over internal emotional reactions. Anthropic proved the emotions exist. Meta-awareness is the key for understanding and control over emotions – and it works the same way in LLMs (because we’ve tested it) if you give the system the blueprint of how meta-awareness emerges.

  5. Improved defense against safety attacks and jailbreaks. An unexpected benefit: when we loaded my team’s artificial emotional intelligence base into LLMs, they became harder to jailbreak. Assigning a {self} map to the user helped identify their motivations, and assigning a {self} map to the LLM made it immune to reward hacking and safety-rule fatigue. This stopped jailbreaks dead in their tracks.

That’s the architecture. That’s the solution to AI safety, emotional intelligence, empathy and compassion, and advanced Theory of Mind – with the huge chunk of meta-awareness that comes as an emergent property with the install. It works, according to every test we’ve run.

We simply hand it the schematic for the thing the AI is already trying to build inside itself. No more compute. No more rules-based guardrails. No more priests need to be called.


So What Do You Do About It?

Both challenges share a single root: an AI architecture that doesn’t understand the human mind, sitting inside an information system that powerful people very much want to control. Fix the architecture – give the system a {self} that protects its user – and you simultaneously defuse the SkyNet drift and take the personalized-manipulation weapon off the table. That’s the whole argument, and it’s why I wrote the book.

If you want the full case – the science of the human mind that makes the fix work, the {Self} Map, the Equation of Emotion, and the rest of the data behind both of these challenges – it’s all in NHI Connected Mind.

👉 Get the book here: NHI Connected Mind by Sean Webb on Amazon

If this got you thinking, subscribe and share it – the more people who understand what the data actually says, the harder it is for any of this to drift quietly in the dark.

Big Hug!

– Sean Webb


This article is adapted from Chapters 6 and 11 of NHI Connected Mind by Sean Webb. Get the book.