There's No Playbook for Measuring Good AI Use at Work. Here's How My Thinking Keeps Changing.

OneDigital is a roughly 6,000-person company. It is traditional, relationship-based, and not what anyone would call tech-forward. My job is to get that company to use AI. Not eventually, and not by becoming tech-forward first and AI-native someday after. The goal is to skip a step most of the industry spent a decade on and leapfrog straight to AI-native.

There is no playbook for this. There are plenty of two- and three-person companies that are AI-native from the first day. There is no large, traditional company that woke up one morning transformed and left behind a manual for the rest of us. So we are writing the playbook as we go, which is a generous way of saying I am guessing in public, on a deadline, with real people's workdays on the line.

Getting people to use AI turned out to be the easy half. The goal we set was to get 85% of the company actively using it, and we are well on our way. The hard half is the question sitting underneath that one: are they using it well? And the harder half under that one: what does "well" even mean?

I have been turning that question over for months. My answer has changed three times. This is the honest version of where it has gone, because I suspect a lot of people running these programs are quietly stuck on the same question and would rather not say so out loud.

First, I tried to measure the back-and-forth

The first frame felt obvious. Some people use AI like a search box. Quick question, quick answer, done. Other people use it like a thinking partner, a long back-and-forth where the thing actually helps them work through something hard. The first felt shallow. The second felt like the real prize. So the instinct was to measure how much of each was happening and push everyone toward the second kind. Transactional versus collaborative.

Someone whose judgment I trust put it to me with an example about his kid. If the kid asks an AI to write a history essay, that is the shallow version. The kid asked, the machine produced, nothing got learned. But if the kid spends an hour going back and forth with it about Napoleon, really getting into it, that is the good version. More turns, more learning. It is a clean, intuitive picture, and I nodded along.

Then I sat with it and it started to come apart.

The Napoleon story is about learning. School is a place where the back-and-forth is the whole point, because the goal is what ends up in the kid's head. But most of what happens at a company is not learning. It is work. And for work, the back-and-forth is not the point. The outcome is.

Here is the example that broke it for me. Say you have two hundred emails to get through this week. If you can use AI to get through them in a fraction of the time it would otherwise take, that is real value. Clear, obvious, money-on-the-table value. But by the transactional-versus-collaborative logic, that is a shallow, transactional use, and I am supposed to gently disapprove of it. I caught myself doing exactly that. Looking at someone who used the tool to summarize an email and thinking, well, that one does not really count.

I was being biased. There is nothing wrong with summarizing an email. If the email got handled faster, the tool did its job. The fact that it took one good prompt instead of ten does not make it worth less. If anything the one-prompt version is better, because the person got their time back faster.

That is when I realized the word "collaborative" was quietly doing two different jobs at once. Sometimes it meant a style of using AI, lots of back-and-forth. Sometimes it meant a valuable outcome, the AI did real work that mattered. Those are not the same thing. A one-shot "build me this deck" is low on back-and-forth and high on outcome. Smush both ideas under one word and the word starts fighting itself. I had been struggling with this for weeks, and the more I looked, the less sure I was that transactional versus collaborative was even the right frame to be reaching for.

There was a smarter version of the idea that tried to fix the turns problem by crediting outcomes directly. If the AI ran a real multi-step workflow to get something done, count that as the good kind of use. That is closer, but it breaks the other way. Over time, every AI coworker we build will have workflows wired into it, because that is the entire point. You ask for the thing, a workflow fires, the thing gets made. So eventually a person could type one lazy line, a workflow runs automatically, and the conversation scores as deep and valuable even though the person did nothing thoughtful at all. At that point the score is measuring what we built into the product, not what the person actually did. That is a different problem wearing the same clothes, and I will come back to it, because it turns out to be the whole game.

So I tried to measure the money instead

If the goal is value, I thought, then measure value. Stop arguing about interaction style and just put a number on the work.

So I built a dashboard that does exactly that. Every AI conversation gets sorted into a kind of work. Drafting a client email. Comparing two benefits plans. Pulling facts out of a policy document. Each kind of work has a rate card behind it, grounded in published research and government wage data: how long the task usually takes a person, how long it takes with AI, and what an hour of that person's time is worth. Multiply, sum across every conversation, and you get hours saved and dollars saved.
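
The arithmetic behind a rate card like this can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the task names, the minutes, the hourly rates, and the function names are placeholders, not the figures or code behind the real dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    human_minutes: float  # typical time for a person to do the task unaided
    ai_minutes: float     # typical time for the same task with AI
    hourly_rate: float    # dollar value of one hour of that person's time


# Hypothetical rate card. Real entries would cite research and wage data.
RATE_CARD = {
    "draft_client_email": Rate(human_minutes=20, ai_minutes=5, hourly_rate=60.0),
    "compare_benefits_plans": Rate(human_minutes=90, ai_minutes=30, hourly_rate=80.0),
    "extract_policy_facts": Rate(human_minutes=45, ai_minutes=15, hourly_rate=60.0),
}


def conversation_value(task_type):
    """Return (hours_saved, dollars_saved) for one conversation of this type."""
    rate = RATE_CARD[task_type]
    # Never credit negative savings if the AI path is assumed to be slower.
    minutes_saved = max(rate.human_minutes - rate.ai_minutes, 0)
    hours_saved = minutes_saved / 60
    return hours_saved, hours_saved * rate.hourly_rate


def total_value(task_types):
    """Sum hours and dollars saved across a list of classified conversations."""
    total_hours = 0.0
    total_dollars = 0.0
    for task_type in task_types:
        hours, dollars = conversation_value(task_type)
        total_hours += hours
        total_dollars += dollars
    return total_hours, total_dollars


if __name__ == "__main__":
    week = ["draft_client_email", "draft_client_email", "compare_benefits_plans"]
    hours, dollars = total_value(week)
    print(f"{hours:.2f} hours saved, ${dollars:.2f} saved")
```

Because every dollar traces back to three numbers in one row of the rate card, disagreeing with the total means disagreeing with a specific row.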

I like this approach. It is rigorous in a way the first one never was. It is denominated in the one currency every leader already understands. And it is falsifiable, which matters to me more than it probably should. Every number on the dashboard breaks down into the assumptions behind it. If you think a rate is wrong, you can say so and point at the exact cell. I told the people I share it with that everything about it, including the methodology, is up for debate. I meant it.

It also forced me to be honest in a direction most people avoid. When I built it carefully, the value number came out lower than an earlier, looser version had claimed, because the earlier version leaned on research findings that did not hold up once you actually read them. So I revised my own headline number down. On purpose. In a company where the easy move is to make your program look as big as you can, I made mine look smaller, because a number that survives scrutiny is worth more than a number that impresses for one meeting and falls apart in the next. The dashboard has a whole section that does nothing but list what the number does not mean, and another that tells you how to argue with it.

But the value model has its own blind spots, and they are real.

It is an estimate, not a measurement. Nobody is sitting there with a stopwatch. It is a careful guess wearing good sourcing.

It cannot see quality. A conversation that produced a brilliant answer and one that produced a confidently wrong answer score exactly the same, as long as they were the same kind of task. The dashboard has no idea whether the work was any good.

And for the exact question I started with, "are people using this well," the value model is quietly circular. It assigns more dollars to the task types that involve real work and fewer to quick lookups. So if I use it to ask whether my deep users are more valuable than my shallow users, the answer is yes, but partly because I defined it that way. I built the conclusion into the math. That is not a discovery. That is a mirror.

Then I asked whether "depth" even predicts anything

At this point I had two ways of looking at the same thing and a nagging sense that neither was telling me anything new. So I ran a test on the real data to find out whether depth of use earns its keep at all. Does using AI for real work early on predict anything down the line that the dollar number does not already show?

A few things came back, and they were not what I expected.

The first: almost nobody quits. About three out of four people who try the tool become regular users, and the share who try it once and disappear is tiny. When everyone has access, "who is going to churn" is just not a real problem here. Which quietly kills the most common reason to build a usage score in the first place, the idea that it warns you who is about to drift away. Nobody is drifting away.

The second: how people start is how they continue. Someone who brings real work in their first few sessions tends to keep bringing real work, and someone who starts with lookups tends to stay in lookups. The pattern is stable, and it holds regardless of how heavily someone uses the tool. So depth is a real, separate trait, not just a fancy way of saying "power user."

The third one is the one that actually changed my mind. I looked at each person's first handful of conversations and where they ended up. The people who started with nothing but quick lookups did clearly worse. They were less likely to stick, and they got a lot less out of the tool over time, roughly half as much in estimated value. But here is the part the whole "push everyone to be collaborative" instinct gets wrong. The people who did best were not the ones who went all-in on big, deep tasks. The best outcomes came from a mix. Some real work, some quick questions. Going all-deliverable was, if anything, slightly worse than a healthy blend.

So the lesson is not "more collaboration is better." The lesson is much narrower and much more useful: do not let someone get stuck doing only quick lookups. That is the actual danger zone, and it is the one piece of the original transactional-versus-collaborative instinct the data actually backs up.
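
One way to turn that narrower lesson into a check is to bucket each person by the mix in their first few conversations. This is a minimal sketch under stated assumptions: the two labels ("lookup" and "real_work"), the window of five conversations, and the function names are all hypothetical, and it assumes each conversation has already been classified somewhere upstream.

```python
from collections import Counter

# Hypothetical labels for a conversation that has already been classified.
VALID_KINDS = {"lookup", "real_work"}


def early_mix(first_conversations, window=5):
    """Bucket one person by the kinds of work in their first `window` conversations."""
    sample = list(first_conversations)[:window]
    if not sample:
        return "no_data"
    unknown = set(sample) - VALID_KINDS
    if unknown:
        raise ValueError(f"unknown conversation kinds: {sorted(unknown)}")
    kinds = set(sample)
    if kinds == {"lookup"}:
        return "lookup_only"
    if kinds == {"real_work"}:
        return "all_real_work"
    return "mixed"


def bucket_counts(people, window=5):
    """Count how many people fall into each early-mix bucket."""
    return Counter(early_mix(convs, window) for convs in people.values())


def stuck_in_lookups(people, window=5):
    """Return the names of people whose early conversations were all lookups."""
    return sorted(
        name for name, convs in people.items()
        if early_mix(convs, window) == "lookup_only"
    )


if __name__ == "__main__":
    people = {
        "person_a": ["lookup", "lookup", "lookup"],
        "person_b": ["lookup", "real_work", "lookup"],
        "person_c": ["real_work", "real_work"],
    }
    print(bucket_counts(people))
    print(stuck_in_lookups(people))
```

The sketch deliberately reports a bucket rather than a score. The only group it singles out is the lookup-only one, and a name on that list is a prompt to look closer, not a verdict on the person.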

The twist that humbled all of it

Then came the finding that made me distrust my own conclusion.

The people stuck doing nothing but lookups might not have a usage problem at all. They might have a coverage problem. If your job is something specialized and nobody has built an AI coworker for that exact kind of work yet, then quick lookups are not you using the tool badly. They are the only thing the tool can do for you. You are doing lookups because there is nothing built for your real work to reach for.

Same exact symptom. Opposite fix. One reading says "coach this person toward bigger tasks." The other says "stop coaching the person and go build them a tool." And from the usage data alone, I cannot tell which one I am looking at.

That is the moment the whole project flipped for me. I had spent months trying to build a cleaner measure of how well people use AI, and the cleanest signal I found might not be measuring the person at all. It might be measuring a gap in what we have built for them, wearing a costume that looks like user behavior. It is the same trap as the auto-firing workflow from earlier, just upside down. There, the product made a lazy user look deep. Here, a missing product makes a willing user look shallow. In both cases the metric is quietly measuring us, not them.

Where I've actually landed, for now

So here is where my thinking sits today, fully aware that it will probably move again.

There is no single number that measures good AI use. I went looking for one for months, and I do not think it exists, at least not for a company like mine.

I keep the dollar model, because value in dollars is the most honest and most defensible thing I have, as long as I stay clear-eyed about what it cannot see. I treat the transactional-versus-collaborative distinction as a way to talk to people and to spot the ones stuck in lookups, not as a number I would ever put in front of leadership as a score. And I have stopped believing the right frame is the same for everyone. A team doing repetitive, specialized work and a team doing open-ended analysis are not on the same curve, and pretending they are is how you build a metric that flatters one group and punishes the other.

These days, before I trust any new way of measuring this, I run it through two questions. What decision would I actually make differently if this number moved? And does it tell me anything I do not already have? Most candidate metrics fail at least one of those. Transactional versus collaborative failed both, right up until I narrowed it down to the single thing it is genuinely good for.

There is a bigger question hiding under the measurement question, and it is the one I keep circling with the colleague I mentioned. The 85% coverage goal is clear, and there is real value in hitting it. But coverage by itself is not the thing. The question is: so what? So 85% of the company is using AI. Did it make a meaningful difference in their work, and through that, to the company? That is the question I actually care about, and no single dashboard tile answers it.

What I think comes next

The honest gaps are clear to me even when the answers are not. I want a real signal for quality, meaning whether the AI's answer was actually any good. When I get one, the headline value number will probably drop, and that is fine. I want a clean way to tell the difference between someone using AI shallowly and someone who simply has nothing better built for them yet. And I suspect the real answer is not one metric but a small set of them, each with a blind spot I can name, each right for a different kind of work.

The temptation, the entire time, has been to land on one clean number I can put on a slide that says "AI is working." The honest version is messier and, I think, more useful. It is a handful of lenses, each one slightly wrong in a way I can describe, that I keep arguing with.

I write elsewhere about the four AI agents I have running my own companies on the side. This is the day-job version of the same itch, except now it is six thousand people instead of one, and I cannot just trust my gut about whether it is working. I have to measure it. And the closer I look, the more I think the measuring is the actual work, not a thing you finish and move past.

I am not going to pretend I have the playbook. I am writing it as I go, the same as everyone else who is actually doing this instead of talking about it. The only difference, maybe, is that I am willing to say so.