A Microsoft 365 Copilot license costs about $30 per user per month. For a hundred users, that is $36,000 a year before you count training time, change management, or the IT effort to get it deployed. Most rollouts we audit cannot answer the basic question of whether they got that money back.
The reason is rarely that Copilot is failing to deliver value. The reason is that the rollout never bothered to measure it. Six months in, leadership wants to know whether to renew. The IT team has anecdotes. There is no data.
This is the playbook we hand to clients who want a rollout that actually answers that question by week six.
Why measurement is the part most rollouts skip
The default Copilot rollout looks like this: buy 25 licenses, hand them to enthusiastic users, check in once, expand based on vibes. That works for showing leadership that something happened. It does not work for deciding whether to scale.
Real measurement requires three things that most rollouts skip: a baseline, a defined cohort, and a same-task comparison. Without all three, you are looking at usage numbers (which Microsoft helpfully provides) and inferring value from activity.
Activity is not value. Someone using Copilot for an hour a day might be five times faster, or might be having long conversations that produce nothing. The Microsoft 365 admin center cannot tell you the difference.
Weeks 1 to 2: Pre-rollout baseline
Before any license is assigned, measure how long the target tasks currently take. Pick three to five concrete tasks the rollout is meant to make faster:
- Drafting a sales proposal from a discovery call’s notes
- Writing the weekly status email from the project tracking spreadsheet
- Summarizing a Teams meeting recording into action items
- Building a first-pass deck from a customer brief
- Searching across SharePoint for specific policy or pricing details
For each task, time a small sample of people doing it the current way. Record the time and a quality rating from the recipient or reviewer. This is the baseline. It does not need to be statistically robust. It needs to exist.
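None of this needs special tooling. Here is a minimal sketch of the logging, assuming a shared CSV and made-up field names (adapt both to whatever your team will actually keep filling in):

```python
import csv
import os
from datetime import date

BASELINE_FILE = "copilot_baseline.csv"   # hypothetical file name, not a standard
FIELDS = ["date", "task_category", "role", "minutes", "quality_1_to_5", "copilot_used"]

def log_timed_run(task_category, role, minutes, quality_1_to_5, copilot_used=False):
    """Append one timed run. Weeks 1-2 rows use copilot_used=False; the same file
    is reused during the week 5-6 re-measurement with copilot_used=True, so the
    before/after comparison lives in one place."""
    new_file = not os.path.exists(BASELINE_FILE)
    with open(BASELINE_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "date": date.today().isoformat(),
            "task_category": task_category,
            "role": role,
            "minutes": minutes,
            "quality_1_to_5": quality_1_to_5,
            "copilot_used": copilot_used,
        })

# Example: a proposal draft that took 95 minutes and was rated 4/5 by the reviewer
log_timed_run("sales_proposal_draft", "account_executive", 95, 4)
```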
This is the step rollouts always want to skip. Skip it and you give up the ability to compare anything later.
Weeks 3 to 4: Pilot cohort, training, prompt library
Pick 25 to 50 users. Diversify across roles. Mix enthusiasts with skeptics. Skeptics are the more useful cohort because their feedback tells you whether Copilot is working for someone who is not motivated to make it work.
Spend the first half of week 3 on training. Not generic Copilot training: training specific to your three to five baseline tasks. Show the prompt patterns that work for each. Build a small internal prompt library and pin it in the team’s Teams channel.
Through weeks 3 and 4, users do their normal work, with Copilot, on the same task categories. Track usage via the Microsoft 365 admin center, but more importantly, run a brief weekly check-in (15 minutes per user) asking which tasks they used Copilot for and how it went.
Some users will go quiet after week 1. Some will stay quiet because they are using it heavily. Some will stay quiet because they tried it once, were unimpressed, and went back to old habits. The check-ins distinguish those cases.
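To keep those check-ins comparable week to week, log them as structured rows rather than free-form notes. A sketch along the same lines as the baseline log, with illustrative column names:

```python
import csv
from collections import defaultdict

CHECKIN_FILE = "copilot_checkins.csv"   # hypothetical: one row per user per weekly check-in
# expected columns: week, user, tasks_used_on (semicolon-separated), notes

def quiet_users(week: int) -> list[str]:
    """Users who checked in for the given week but reported zero Copilot tasks.

    The task count separates the quietly-heavy users (many tasks reported) from
    the quietly-lapsed ones (none); the notes column from the same check-in
    tells you why the lapsed ones went back to old habits.
    """
    tasks_reported: dict[str, int] = defaultdict(int)
    with open(CHECKIN_FILE, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["week"]) == week:
                tasks = [t for t in row["tasks_used_on"].split(";") if t.strip()]
                tasks_reported[row["user"]] += len(tasks)
    return [user for user, count in tasks_reported.items() if count == 0]
```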
Weeks 5 to 6: Re-measure and decide
Re-run the baseline timing exercise. Same task categories, same kind of measurement, with users who have had Copilot for a month. Compare against the pre-rollout baseline from weeks 1 and 2.
The numbers you want at the end:
- Time savings per task category. Often dramatic for some categories (drafting, summarization) and negligible for others (specific data lookups, structured editing). The category-level breakdown matters more than the average.
- Quality delta. Did the deliverables get better, worse, or stay the same? Quality is harder to measure but matters. A 50% time saving with a quality drop is not a win.
- Active usage rate in the cohort. What percentage of pilot users used Copilot in the past 7 days, sent at least 5 prompts, and produced output they shipped? Microsoft’s “active user” metric is too lenient; build your own (a rough sketch follows this list).
- A decision per task category. For each baseline task category, decide whether to keep using Copilot, drop it, or switch to task-specific tooling that fits better.
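Here is a rough sketch of that stricter active-user definition, assuming per-user prompt counts exported from the usage reports and a shipped-output flag captured in your own check-ins (Microsoft does not report that part for you):

```python
from dataclasses import dataclass

@dataclass
class PilotUser:
    name: str
    prompts_last_7_days: int           # from the usage report export
    shipped_output_last_7_days: bool   # from the weekly check-ins, not from Microsoft

def active_usage_rate(cohort: list[PilotUser], min_prompts: int = 5) -> float:
    """Share of the pilot cohort meeting the stricter definition: used Copilot in
    the past 7 days, sent at least `min_prompts` prompts, and shipped something
    it produced."""
    active = [
        u for u in cohort
        if u.prompts_last_7_days >= min_prompts and u.shipped_output_last_7_days
    ]
    return len(active) / len(cohort) if cohort else 0.0

# Example with a hypothetical four-person slice of the cohort
cohort = [
    PilotUser("A", 22, True),
    PilotUser("B", 3, True),    # below the prompt threshold
    PilotUser("C", 9, False),   # active, but nothing shipped
    PilotUser("D", 14, True),
]
print(f"{active_usage_rate(cohort):.0%} active by the stricter definition")  # 50%
```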
Now leadership has data. Either the per-user cost is paying back through measurable time savings on tasks the team actually does, or it is not. The decision to expand, hold, or contract is straightforward from there.
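To make the payback arithmetic explicit, here is a back-of-the-envelope check. The $30 license price is from above; the hours-saved figure and the loaded hourly cost below are placeholders you replace with your own measured and actual numbers:

```python
def monthly_payback_ratio(hours_saved_per_user_per_month: float,
                          loaded_hourly_cost: float,
                          license_cost_per_month: float = 30.0) -> float:
    """Value of measured time savings divided by the license cost.
    A ratio above 1.0 means the license pays for itself on time savings alone."""
    return (hours_saved_per_user_per_month * loaded_hourly_cost) / license_cost_per_month

# Placeholder numbers: 1.5 measured hours saved per month at a $60/hour loaded cost
print(monthly_payback_ratio(1.5, 60.0))  # 3.0 -> comfortably above break-even
```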
The metrics that look impressive but mean nothing
A few traps:
- Total Copilot prompts sent. Volume, not value. Easy to game by long unproductive sessions.
- “Copilot active users.” Microsoft’s threshold is low. A user who tried it once in 28 days counts as active.
- Time-saved estimates from users. Self-reports are optimistic. Time the task again instead.
- Survey scores for “satisfaction.” Polite users score it well to be supportive. The behavior data is more honest.
If the only metrics in your rollout report are these, the rollout has not been measured.
What success looks like at week six
A successful pilot, in the data, looks like this (a rough go/no-go sketch follows the list):
- Two to three task categories with clear time savings of 30% or more, sustained across week 4 and week 5
- Active usage rate above 60% of the pilot cohort
- Quality ratings from reviewers at least matching the pre-Copilot baseline
- A specific list of additional users or roles who would benefit, with the same task-category logic applied
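Pulled together, the go/no-go call is a check against those thresholds. A rough sketch, with inputs coming from the week 5 to 6 re-measurement and the stricter usage metric above:

```python
def ready_to_expand(savings_by_category: dict[str, float],
                    active_rate: float,
                    quality_delta: float) -> bool:
    """Week-six go/no-go against the pilot thresholds.

    savings_by_category: fractional time saving per task category, e.g. {"drafting": 0.45}
    active_rate: stricter active usage rate across the pilot cohort
    quality_delta: reviewer quality ratings vs. the pre-Copilot baseline (>= 0 means no drop)
    """
    strong_categories = [c for c, s in savings_by_category.items() if s >= 0.30]
    return len(strong_categories) >= 2 and active_rate > 0.60 and quality_delta >= 0

# Example with hypothetical pilot numbers
print(ready_to_expand(
    {"drafting": 0.45, "summarization": 0.38, "data_lookup": 0.05},
    active_rate=0.64,
    quality_delta=0.1,
))  # True
```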
That is enough to confidently expand. Without it, you have an experiment, not a rollout.
The rollouts that fail are not the ones where Copilot does not work. They are the ones where nobody bothered to find out whether it worked.