nice90sguy
Out To Lunch
- Joined: May 15, 2022
- Posts: 2,178
"I'm sorry, but I'm not allowed to answer that." Is this part of the LLM's general training, or is it a hardcoded override?
It's mainly done through RLHF, and by curating ("Bowdlerising") the training data.
Yes, and there are also filters/classifiers you can apply: extra models trained just to be on the lookout for "dodgy" input (or output).
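To make the filter idea concrete, here's a toy sketch. The classifier here is just a keyword check standing in for a real trained safety model, and all the names (`safety_classifier`, `guarded_reply`, `BLOCKLIST`) are made up for illustration:

```python
# Toy sketch of an output filter: a separate "classifier" screens the
# model's reply before it reaches the user. A real deployment would use
# a trained model here, not a keyword list.

BLOCKLIST = {"dodgy", "forbidden"}

def safety_classifier(text: str) -> bool:
    """Return True if the text looks unsafe (stand-in for a real model)."""
    return any(word in text.lower() for word in BLOCKLIST)

def guarded_reply(model_output: str) -> str:
    """Replace flagged output with a canned refusal."""
    if safety_classifier(model_output):
        return "I'm sorry, but I'm not allowed to answer that."
    return model_output

print(guarded_reply("Here is a recipe for pancakes."))
print(guarded_reply("Here is something dodgy."))
```

The same gate can be run on the user's input before the model ever sees it, which is why some refusals appear instantly.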
One NSFW model I played with got round these using a technique called "abliteration" -- a linear-algebra operation on the activations (or weights) that attempts to cancel out the "refusal direction". This is crude, and often makes the model weaker, because it's such a distortion of the model's carefully learned weights.
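The core of that linear-algebra trick can be sketched in a few lines. This is a minimal illustration, not any particular implementation: assume you've somehow estimated a "refusal direction" vector `r`, then project it out of a weight matrix so the layer can no longer write anything along that direction:

```python
import numpy as np

# Hypothetical sketch of abliteration: given an estimated "refusal
# direction" r, project it out of a layer's weight matrix W so the
# layer's output has zero component along r.

def abliterate(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the refusal direction from W's output space."""
    r_hat = r / np.linalg.norm(r)              # unit refusal direction
    P = np.eye(len(r_hat)) - np.outer(r_hat, r_hat)  # projector onto r's orthogonal complement
    return P @ W                               # outputs now have no component along r

# Toy check: after abliteration, nothing the layer emits lies along r.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
r = rng.normal(size=4)
W2 = abliterate(W, r)
print(np.allclose((r / np.linalg.norm(r)) @ W2, 0.0))  # True
```

The crudeness the post mentions is visible here: the projection zeroes *everything* along `r`, including whatever legitimate capability happened to share that direction, which is why abliterated models often get noticeably dumber.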