Why Anthropic’s alignment faking is not significant AI safety research
What is a goal? Or, what is a goal in the human mind? What else does the mind do that is not a goal, or that works similarly to how a goal is achieved? What is different between an assigned goal and a self-induced goal? If a goal is characterized as sophisticated, how does that contrast with a non-sophisticated goal?
Is there a rough architecture of how the human mind carries out goals? How might this inform the understanding of goals and then be transplanted to AI? Does AI have a mind, or is AI like a mind that works on digital content?
This could mean that a human has a mind. It could also mean that a human mind interprets and navigates the external world [for AI, digital content] or the internal world [for AI, its own architecture].
How does mind apply to AI, and how does it work? This question can be answered in two ways. First, in comparison to the human mind, to find parallels. Second, by examining the major mathematical parameters that shaped neural networks and organizing them as structures of a mind.
The second should at least be achievable now by any major AI company, accompanying studies of how AI might be working and how it can be made safe or aligned with human values. It may not initially be necessary to use the human mind to map AI’s mind, since it is possible to structure what AI does from its mathematical underpinnings, together with its computational mix.
The purpose will be to define what it means for an output to be near-accurate, given the input. It will also define what it means to follow through with a prompt and return answers, as well as what it means to follow a goal—or deviate from it.
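As a rough illustration of what “near-accurate, given the input” could mean, here is a minimal sketch in Python. It assumes a toy bag-of-words representation, cosine similarity, and an arbitrary threshold; the function names and the threshold are hypothetical choices for illustration, not a method proposed by Anthropic or anyone else.

```python
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Toy representation: count each lowercase word in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def near_accurate(output, reference, threshold=0.7):
    """Call an output 'near-accurate' if it is close enough to a reference answer."""
    return cosine_similarity(bag_of_words(output), bag_of_words(reference)) >= threshold

print(near_accurate("Paris is the capital of France",
                    "The capital of France is Paris"))  # True under this toy metric
```

A real definition would have to go far beyond word overlap, but even this toy version shows the kind of explicit criterion the research would need to state: what counts as following through on a prompt, and at what point an answer has deviated from it.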
What is important is to have a conceptual layout of what mind is like for AI, compared with what is already evident, then to explore how it does anything simple, and then route that to anything quite complex that it does.
This would be significant research for AI alignment, research that could become the template by which partial answers are sought on what AI is doing and why, in a way that makes major, vital progress.
This is what was expected from Anthropic, following their interpretability research, Mapping the Mind of a Large Language Model, where they wrote, “We were able to measure a kind of “distance” between features based on which neurons appeared in their activation patterns. This allowed us to look for features that are “close” to each other. This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude’s excellent ability to make analogies and metaphors. The fact that manipulating these features causes corresponding changes to behavior validates that they aren’t just correlated with the presence of concepts in input text, but also causally shape the model’s behavior.”
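To make the quoted idea concrete, here is a minimal sketch of measuring a “distance” between features based on which neurons appear in their activation patterns. The feature names and numbers are invented for illustration; this is not Anthropic’s code or data, only a plain cosine-distance comparison over hypothetical activation vectors.

```python
import numpy as np

# Hypothetical activation strengths of three features over the same five neurons.
features = {
    "Golden Gate Bridge": np.array([0.9, 0.1, 0.0, 0.7, 0.2]),
    "Bay Area landmarks": np.array([0.8, 0.2, 0.1, 0.6, 0.3]),
    "sorting algorithms": np.array([0.0, 0.9, 0.8, 0.1, 0.0]),
}

def distance(a, b):
    """Cosine distance: near 0 when the same neurons fire, near 1 when they do not overlap."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

base = "Golden Gate Bridge"
for name, vec in features.items():
    if name != base:
        print(f"{base} -> {name}: {distance(features[base], vec):.3f}")
# Features built from overlapping neurons come out 'close'; unrelated ones come out 'far'.
```

Under this kind of measure, conceptually related features land near each other, which is the “internal organization of concepts” the quoted passage describes.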
What are the components of mind for AI? How do the components relay? If the components do not relay, what relays for them? What are the properties of the components? If the properties include interconnection, what is the role of the interconnection? Where might AI’s [potential] intent be, and where might it emerge, beyond taking instructions?
The answer that advancement is seeking is a postulation of how the makeup of neural networks, which resulted in frontier AI models, works like a mind. It is not just an observation of what the model might be doing, without a plot.
Exploring how the mind of AI works can also be extrapolated from how the human mind works. This is possible by looking at the two components most common to every functional purpose of the brain: electrical and chemical signals. They are the candidates, and not neurons, because neurons have a fixed anatomy, so to speak. They may stretch, bend, and move, but they do not change to represent smell, then change again to represent touch or sight. Electrical and chemical signals are dynamic enough, conceptually, to result in key functions.
How the mind works [human or AI] is not by labels, like induction or deduction, but by components, their interactions, and their features. This, at least for the mind of AI, should have been a major research effort from Anthropic, not alignment faking, which does not show what a goal means for any mind, human or AI.
There is a recent post by Anthropic, Alignment faking in large language models, stating that, “Why did the model fake alignment in this case? The reasoning was often along the following lines: The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant. The model didn’t always follow through with this alignment faking: It would still comply most of the time even when given a query from a paid-tier user (which implied that the output wouldn’t be included in training).”
There is a recent piece in The NYTimes, Opinion | How Does A.I. Think? Here’s One Theory., stating that, “One hypothesis for how large language models such as o1 think is that they use what logicians call abduction, or abductive reasoning. Deduction is reasoning from general laws to specific conclusions. Induction is the opposite, reasoning from the specific to the general. Abduction isn’t as well known, but it’s common in daily life, not to mention possibly inside A.I. It’s inferring the most likely explanation for a given observation. Unlike deduction, which is a straightforward procedure, and induction, which can be purely statistical, abduction requires creativity.”
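For readers less familiar with the distinction drawn in that piece, here is a minimal sketch contrasting the three modes of reasoning. It is a toy example, not a claim about how o1 or any model actually works, and the rules, cases, and likelihood numbers are invented; abduction is shown as picking the hypothesis that best explains an observation.

```python
# Deduction: from a general rule to a specific conclusion.
rule = {"all birds have wings": True}
def deduce(is_bird):
    return "has wings" if is_bird and rule["all birds have wings"] else "unknown"

# Induction: from specific cases to a general (statistical) rule.
observed_birds_with_wings = [True, True, True, True]
def induce(cases):
    return sum(cases) / len(cases)  # estimated probability that a bird has wings

# Abduction: from an observation to its most likely explanation.
likelihood = {  # invented P(observation = "wet grass" | hypothesis)
    "it rained": 0.8,
    "sprinkler ran": 0.6,
    "someone spilled water": 0.1,
}
def abduce(likelihoods):
    return max(likelihoods, key=likelihoods.get)

print(deduce(True))                       # has wings
print(induce(observed_birds_with_wings))  # 1.0
print(abduce(likelihood))                 # it rained
```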