AI agents are being trained to hunt for their own flaws
Safety engineers are borrowing tactics from nuclear power plants to stress-test software, using adversarial programs to trick models into revealing dangerous secrets or generating malicious code.
Before a new artificial intelligence system is allowed to interact with the public, it must survive a gauntlet of digital sabotage known as red-teaming. Borrowing a mindset from the safety protocols of aviation and nuclear power, engineers treat these models like critical infrastructure that is destined to fail. Instead of testing what the system can do, they methodically hunt for ways to break it. They construct elaborate social-engineering traps and bizarre, nonsensical prompts to see if the machine can be coaxed into leaking private data or writing functional malware.