That last experiment, where the LLM with its 'honesty' vector amplified is tasked with judging whether a user asking an example question has honest intentions, is interesting. It looks like the model doesn't quite grasp the ask, and instead just equivocates about the definition of 'honest.'
I wonder how a response with the 'thoroughness' vector turned up might have answered in that case. Would it have pointed out that it's impossible to know intention from words alone, because people can lie, though it's possible to at least guess - and that even then, judging the honesty of an intention could be interpreted several different ways?