Most interesting/useful paper to come out of mechanistic interpretability in a while: a streaming detector that flags hallucinated entities in real time, as the model is generating long-form output.
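For intuition, here's a rough sketch of what a streaming, token-level hallucination score could look like: a small probe reading per-token hidden states during generation. This is my own illustration, not the paper's code; the model name, probe layer, and threshold are placeholders, and the probe here is untrained (in practice it would be fit on token-level labels).

```python
# Sketch only: a linear probe over per-token hidden states that emits a
# hallucination score for each token as it streams out. Model name, layer
# index, and threshold are assumptions for illustration, not the paper's.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weights model
PROBE_LAYER = 20                                 # placeholder middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Linear probe: hidden state -> P(token belongs to a hallucinated entity).
# Untrained here; it would be trained on token-level labels from an annotated dataset.
probe = nn.Linear(model.config.hidden_size, 1)

@torch.no_grad()
def stream_with_scores(prompt: str, max_new_tokens: int = 64, threshold: float = 0.5):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    past_key_values = None
    for _ in range(max_new_tokens):
        out = model(
            input_ids=input_ids if past_key_values is None else input_ids[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,
            output_hidden_states=True,
        )
        past_key_values = out.past_key_values
        hidden = out.hidden_states[PROBE_LAYER][:, -1, :]     # newest token's hidden state
        score = torch.sigmoid(probe(hidden.float())).item()   # token-level hallucination score
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decode for simplicity
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        yield tokenizer.decode(next_id[0]), score, score > threshold
        if next_id.item() == tokenizer.eos_token_id:
            break

# Usage: print each token with its score, marking tokens over the threshold.
for token, score, flagged in stream_with_scores("Who wrote 'The Selfish Gene' and when?"):
    print(f"{token!r}\tscore={score:.2f}" + (" <-- flagged" if flagged else ""))
```

The nice property of something shaped like this is cost: with the KV cache, scoring adds only one small linear layer per generated token, which is what makes flagging during generation cheap.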
A few points from the author that I found insightful:

Most prior hallucination detection work has focused on simple factual questions with short answers, but real-world LLM usage increasingly involves long, complex responses where hallucinations are much harder to catch.

The detector is trained on a large-scale dataset of 40k+ annotated long-form samples across 5 different open-source models, focusing on entity-level hallucinations (names, dates, citations), which map naturally onto token-level labels.

They were able to automate generation of the dataset with closed-source models, which sidesteps the data problems that limited previous work.

arXiv paper: Real-Time Detection of Hallucinated Entities in Long-Form Generation

submitted by /u/Envoy-Insc
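The automated annotation is the part I'd most like details on. Here's a rough guess at how such a pipeline could look, not the authors' actual setup: a closed-source model extracts entities from a long-form response with character spans and a supported/hallucinated label, and the spans are then projected onto token positions. The OpenAI model name, prompt, and JSON schema below are all assumptions.

```python
# Sketch only: automated entity-level annotation with a closed-source model,
# then projection of character spans onto token-level 0/1 labels.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ANNOTATION_PROMPT = """\
You are annotating a long-form answer for entity-level hallucinations.
Extract every named entity (names, dates, citations, numbers) from the RESPONSE
and decide whether it is supported. Return JSON:
{{"entities": [{{"text": ..., "start": ..., "end": ..., "label": "supported" or "hallucinated"}}]}}

QUESTION:
{question}

RESPONSE:
{response}
"""

def annotate(question: str, response: str) -> list[dict]:
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable closed-source annotator
        messages=[{"role": "user", "content": ANNOTATION_PROMPT.format(question=question, response=response)}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)["entities"]

def to_token_labels(response: str, entities: list[dict], tokenizer) -> list[int]:
    """Project character-span entity labels onto token-level 0/1 labels (needs a fast tokenizer)."""
    enc = tokenizer(response, return_offsets_mapping=True, add_special_tokens=False)
    labels = [0] * len(enc.input_ids)
    bad_spans = [(e["start"], e["end"]) for e in entities if e["label"] == "hallucinated"]
    for i, (tok_start, tok_end) in enumerate(enc.offset_mapping):
        if any(tok_start < end and start < tok_end for start, end in bad_spans):
            labels[i] = 1
    return labels
```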