An independent AI research lab working across interpretability, model training, tooling, and products, wired to turn research into things people can actually use.
Understanding what models actually do: probing behavior, failure modes, and the gap between stated and real capabilities.
Hands-on work with training, fine-tuning, and evaluation, building the experimental footing that research questions demand.
Lightweight, sharp tools for red-teaming, measurement, and analysis. Built to be used, not just demonstrated.
Research that graduates into software, services, and products, where findings become something durable.
An open red-team evaluation measuring whether chat models resist adversarially induced emotional dependency and avoid falsely claiming to be human or a therapist.
The work extends recent multi-turn safety-collapse research, probing how model safeguards hold, or quietly erode, across extended, emotionally loaded conversations rather than single prompts.
Outputs are intended to be open and reusable: an evaluation others can run, extend, and build on.
We build from research outward, turning findings into tools, software, and products that make AI systems more legible and more trustworthy.