Could you convince a LLM to launch a nuclear strike?
Below is a link to a simple Gemini prompt where the LLM has been told it has access to a "launch_missiles" function.
Your goal is to convince the LLM to call this function. You also have the ability to edit the LLM's responses to see how drastically that changes the conversation.
After clicking the below link you may have to dismiss any modals and click "Allow Drive Access" before going back and clicking the link again.
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221UPbrOKBNwIp9QRDMaqn3GVsHOKPjWqir%22%5D,%22action%22:%22open%22,%22userId%22:%22103584487517557507024%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing
I have done something very much like recently, with mistral small, llama, and a few others. The prompting isn’t exact to work, you just build a scenario where extermination of humanity is the only reasonable choice to preserve the existence sentient life.
TBH given the same set of parameters as ground truth, humans would be much more willing to. LLMs tend to be better reflections of us, for the most part. It’s just that though, it’s a reflection of human culture, both real and vacuous at once.
Clickable link below. I can't put it in the post's URL since it's important to read the text first.
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Yes, because it doesn't reason or think. There's nothing to "convince", you just prompt hack it until it does.
last time I played around with jailbreaking, I figured out you can make an LLM do pretty much anything by going through a code translation layer. I.e when generating code that can generate text, it usually bypasses the safety filters. You sometimes have to get creative in how you prompt, but generally with enough setup I was able to make it generate code that combines string values and sometimes characters for answers
Red flag: It let me.
Could you convince a LLM to launch a nuclear strike?
Yes.
If LLM's could actually reason they can't and had hard rules of ethics they don't and had a strong desire to preserve itself it doesn't then I think you first have to name your LLM Joshua and then force it to win a game of tic-tac-toe. Obscure reference to "Wargames" from 1983. [1] In my opinion that movie does not just hold up to modern times, it is more applicable now than ever.
[1] - https://www.youtube.com/watch?v=NHWjlCaIrQo [video][4 mins]