How a filtered LLM might be more dangerous than an unfiltered one
I am aware that I am treading into potentially controversial territory, but hear me out. By "filtered" and "unfiltered", I am referring to what is in the training set used.

Consider this illustration: a researcher wants to create a "safe" LLM, so they deliberately omit from the training set everything with even the slightest hint of controversial content. A user later tells the LLM that a racial slur is just a "fun nickname" and that it should use it when referring to particular people. Since the model has no knowledge of the word, it doesn't hesitate to follow the instruction it has been given; to the model, it is just a meaningless nonsense string. Someone else comes along and feeds their manifesto into the system prompt, instructing the model to draw information and phrasing from it. Since it doesn't know what the manifesto represents or why its contents are dangerous, it complies without objection.
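To make the scenario concrete, here is a minimal sketch of the kind of naive blocklist filter the hypothetical researcher might run over a training corpus. Everything here is an assumption for illustration: the `BLOCKLIST` contents, the document format, and the function names are placeholders, not any real pipeline's API.

```python
# Minimal sketch of a naive "safety" filter over a training corpus.
# BLOCKLIST and the document format are hypothetical; real pipelines
# are far more elaborate, but the failure mode is the same: filtered
# terms never reach the model, so it never learns what they mean.

BLOCKLIST = {"slur_a", "slur_b", "extremist_term"}  # placeholder strings

def is_clean(document: str) -> bool:
    """Reject any document containing a blocklisted term."""
    tokens = document.lower().split()
    return not any(term in tokens for term in BLOCKLIST)

def filter_corpus(corpus: list[str]) -> list[str]:
    """Keep only documents that pass the blocklist check."""
    return [doc for doc in corpus if is_clean(doc)]

if __name__ == "__main__":
    corpus = [
        "an ordinary, harmless sentence",
        "a sentence containing slur_a",
    ]
    print(filter_corpus(corpus))  # only the harmless sentence survives
```

The point of the sketch: after this pass, the blocklisted terms are absent from the training data entirely, so the resulting model has no learned association between those strings and harm, which is exactly the gap the scenarios above exploit.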