What’s Wrong with Pay for Performance?

The quality of teaching has a big impact on student performance. But a lot of other factors are also important. Over the course of a semester, student test scores in a good teacher’s class might go down or remain unchanged because of all those other factors. The scores in a bad teacher’s class might go up for the same reasons. If we reward and punish teachers based on test scores, therefore, much of the time we will be doing the wrong thing. That is, the reward system will often reward bad teachers and punish good ones.

The same principle applies to the practice of medicine. Paying doctors or hospitals based on outcomes would be fine if outcomes could be reliably measured and we knew how much each practitioner contributed to them. Until that is possible, we run the risk of inadvertently punishing the good practitioners and rewarding the bad ones.

Understanding the problems with pay-for-performance is important because Medicare will begin adjusting payments to physicians based on “the value of the care provided” in 2015. Education is full of examples in which lousy, inaccurate measures have unintended consequences. Unfortunately, lousy health care measures are pretty much all we have, or are likely to have, by 2015.

Before going further, let’s make a distinction between inputs and outputs. Inputs are often easier to measure, and many pay-for-performance schemes are actually paying for inputs. Yet it is the outputs that we really care about.

In education, inputs are things like the time teachers spend in the classroom, how many minutes are devoted to math, how many minutes are devoted to vocabulary, and how much a school district spends on books. These inputs may or may not be related to how much children learn. In health care, inputs are things like whether a medical history was taken, whether the results of an examination are recorded electronically, and the number of nurses per patient. As in education, these inputs may or may not be related to whether patients actually get well.

What happens when we try to pay based on outputs?

In a study for Mathematica Policy Research, Greg Peterson and Eric Schone suggest that the value-added models developed to determine teacher pay might also prove useful in health care. They provide a useful, non-technical rundown of the problems that the Centers for Medicare & Medicaid Services (CMS) will face once it moves beyond measuring inputs and begins searching for actual performance measures.

One problem is figuring out how to apportion measured improvement among the many physicians who may see a patient during an episode of care. Another is deciding how to apportion credit over time. A patient with a condition that is difficult to diagnose may see several specialists over several years. Once he is diagnosed, he may improve after surgery, other treatments, and continuing medications. How, exactly, is credit for his improvement to be apportioned?

There are also significant data problems. Measurements for many outcomes are simply not available, and when they are available, they may not be comparable from patient to patient. While one patient may describe a cut as a five on a 1 to 10 pain scale, another patient may describe the same cut as a two because he has a higher pain tolerance or more experience with pain. Paying a physician more because the second patient reports less pain is paying for differences in patient perceptions, not physician skill.

Even seemingly objective measures, such as rating physicians’ ability to treat diabetes using their patients’ HbA1c levels, have problems. One study using identical twins concluded that 62 percent of HbA1c variability is genetic. The variation introduced by factors that are beyond a physician’s control generates noisy data, making it difficult to separate a physician’s influence from that of genetics, environment, patient willingness to comply with medical recommendations, and the capital and staff a physician has to work with.

As is well known, value-added models in education have similar problems. Educational researchers have produced a substantial literature that is based on enormous, highly detailed datasets for teachers, schools and student achievement. They have shown that value-added models in education have problems so severe that Jesse Rothstein concluded that policies based on them will “reward or punish teachers who do not deserve it and fail to reward or punish teachers who do.” For three common value-added specifications, “accountability policies that rely on measures of short-term value added would do an extremely poor job of rewarding the teachers who are best for students’ longer-run outcomes.”

Although everyone agrees that good teachers can make a big difference, existing estimates suggest that 80 percent or more of student achievement is explained by something other than existing measures of teaching quality.

Eric Hanushek and Steven Rivkin conclude that representative estimates of teacher value-added range from 0.1 to 0.2 student achievement standard deviations. This implies that moving a student from a teacher in the 25th percentile to the 75th percentile of measured effectiveness would only move the student from the 50th to the 58th percentile in the achievement distribution.
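The percentile arithmetic can be checked with a short calculation. What follows is a minimal sketch, not the authors' own method. It assumes that teacher effects and student achievement are both normally distributed, that the student starts at the median, and that a teacher's effect simply adds to the student's score. The function name and the 0.15 midpoint are illustrative choices, not figures from the study.

```python
from statistics import NormalDist


def student_percentile_after_switch(teacher_sd, low_pct=0.25, high_pct=0.75):
    """Percentile reached by a median student who moves from a teacher at
    low_pct to a teacher at high_pct of the teacher-effect distribution.

    teacher_sd is the spread of teacher effects, expressed in student
    achievement standard deviations (the 0.1 to 0.2 range cited above).
    Assumes normal distributions and an additive teacher effect.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Distance between the two teachers, in teacher-effect standard deviations.
    gap_in_teacher_sds = std_normal.inv_cdf(high_pct) - std_normal.inv_cdf(low_pct)
    # The same distance, converted to student achievement standard deviations.
    gain_in_student_sds = gap_in_teacher_sds * teacher_sd
    # A median student sits at z = 0, so the new z-score equals the gain.
    return 100 * std_normal.cdf(gain_in_student_sds)


if __name__ == "__main__":
    for teacher_sd in (0.10, 0.15, 0.20):
        pct = student_percentile_after_switch(teacher_sd)
        print(f"teacher SD {teacher_sd:.2f}: student moves to about percentile {pct:.1f}")
```

Under these assumptions the midpoint of the range, 0.15, reproduces the move to roughly the 58th percentile. The low and high ends of the range give roughly the 55th and 61st percentiles.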

Furthermore, the measured performance of a particular teacher does not appear to be especially persistent. In another study, Daniel McCaffrey and his colleagues estimate that 30 to 60 percent of the variation in measured teacher effects is due to transitory noise and that less than half of a measured effect persists. Goldhaber points out that recent evidence suggests that teacher value-added also depends upon peer effectiveness, the quality of the match between teachers and schools, changes in school demographics, experience, and absences of both teachers and their peers. He also notes that “incorporating too much prior information [into value-added models] increases the risk of bias from performance that does not persist over time.”
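To get a feel for what a 30 to 60 percent noise share means for rewards, consider a deliberately simple model. It is an illustration only, not the model used in the McCaffrey study. It assumes that a teacher's measured effect is a stable true effect plus independent noise, that both are normally distributed, and that a bonus goes to every teacher whose measured effect is above the median. Both function names are hypothetical.

```python
import math


def year_to_year_correlation(noise_share):
    """Correlation between one teacher's measured effects in two different
    years, assuming a stable true effect and independent noise each year."""
    return 1.0 - noise_share


def share_on_correct_side_of_median(noise_share):
    """Fraction of teachers whose measured effect falls on the same side of
    the median as their true effect.

    For two jointly normal, zero-mean variables with correlation rho, the
    probability that they share a sign is 1/2 + arcsin(rho) / pi.
    """
    signal_share = 1.0 - noise_share
    # Correlation between the true effect and the noisy measurement of it.
    rho = math.sqrt(signal_share)
    return 0.5 + math.asin(rho) / math.pi


if __name__ == "__main__":
    for noise_share in (0.30, 0.60):
        correct = share_on_correct_side_of_median(noise_share)
        print(
            f"noise share {noise_share:.0%}: "
            f"year-to-year correlation {year_to_year_correlation(noise_share):.2f}, "
            f"wrong side of the median {1 - correct:.0%}"
        )
```

Under these assumptions, somewhere between roughly one teacher in five and more than one in four would land on the wrong side of the median in any given year, which is the sense in which a bonus scheme would reward or punish the wrong people.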

Finally, academics have been unable to show that many of the observable measures thought to be significant contributors to teacher value-added have much effect on student achievement. In their summary of the relationship between the observable characteristics of teachers and student performance, Douglas Staiger and Jonah Rockoff conclude that although teachers do improve after several years of experience, there is little reason to believe that teacher academic background does much to affect student performance. Teach for America, a highly selective program that draws applicants from top universities, fields teachers whose students score slightly better in math but no better in reading.

Rivkin, Hanushek and Kain find that while achievement gains are systematically related to observable teacher and school characteristics, the effects are small. There is no evidence that master’s degrees improve teacher skills, and there is little evidence that teacher skills improve after the first three years of experience. Class size has modest effects on mathematics and reading growth, but the effect is limited to the younger grades and is so small that the benefits from decreasing class size are likely to be outweighed by the costs. There is no evidence that more restrictive certification standards or teacher education requirements will raise the quality of instruction.

The good news is that work from the 1970s suggests that principals’ subjective ratings do a fairly good job of identifying good teachers. That may explain why the private and charter schools in which principals have the power to hire and fire are more likely to improve achievement by disadvantaged students than their relatively powerless public counterparts.

The superiority of subjective measures may also explain why private medicine, where peers, patients and professional associations subjectively evaluate a physician’s value-added, does a better job of providing quality care than the quality measures adopted in national systems run by governments.

Judging from the progress on value-added models in education, CMS might do more good by freeing doctors and patients to reach their own conclusions and by redirecting its resources toward reducing the national debt.



Comments (18)

  1. Ken says:

    Good post Linda. Very well done.

  2. Studebaker says:

    If you read books on quality management competition is what drives quality. P4P is an attempt to create synthetically what would be the outcome of competition. Often, all that results is the wrong kind of competition.

    A good analogy of what can go wrong is using standardized tests on school kids to measure learning, and thus reward teacher quality. Rather than teach children to think critically, teachers are encouraged to 1) teach the tests; 2) cheat by changing the answers on the test answer sheets; 3) encourage the problematic kids to stay home on test day.

  3. EBC says:

    A P4P system would be too difficult to implement. It also would not take into consideration the natural variation within patients or the host of intervening factors that affect patient outcomes. A smoker is much less likely to recoup from a heart attack than a nonsmoker. Would we account for smoking in the input or simply penalize a doctor for a less-than-optimal outcome for the smoker?

    Just like teacher pay-for-performance, in the end, you only have so much to work with. When a doctor or a teacher has given their best effort based on their training, why should they be paid based on exogenous influences?

  4. Gabriel Odom says:

    While I cannot discount the fact that there are similarities between teachers and doctors, I feel that the differences are significant:

    Teachers seek to constantly make minor adjustments to student behaviour, so that the student can create and maintain an attitude of success throughout the school year. Teachers have direct access to the student nearly 200 days out of the year. Teachers are, in most cases, nearly entirely preventative rather than curative.

    Doctors have 1 day a year to advise major lifestyle adjustments. Doctors can only be reactionary and curative in their behaviour. Rather than being able to focus on preventative care, doctors are expected to be miracle workers – providing adequate care for a living being in a few days per year.

    This P4P procedure may work, but only if healthcare in the US can shift from curative/alleviative care to preventative care.

  5. Greg says:


  6. Angel says:

    It is crucial to practice more preventative care for any of these types of proposals to actually work.

  7. Alieta Eck, MD says:

    Yesterday I spent 20 minutes examining and talking with a 23 year old woman who lost her mother to ovarian cancer a few years ago. She had so many fears and questions that I spent the time to make sure she would be better equipped to face her future.

    There are no numbers to assess my value to this patient. BP, HgbA1C, cholesterol, weight are all normal, so there is no way for the government to measure my value. Only the patient can assess my value and I know she will be back.

    Being a good physician and helping patients navigate through life are impossible to quantify. P4P is a ridiculous concept. Patients and families would be better served in a system where they could vote with their feet and their own hard-earned dollars.

  8. Bruce Landes, MD says:

    Excellent. I will be sharing this with the board of our 1500-physician IPA. We have turned down several health-plan proposals for P4P in the past decade, partly for the reasons stated above but also because the offerings increased administrative overhead costs for our physicians more than the “reward” that was being offered for “success”.

    One little editorial point: in “Class size has modest effects on mathematics and reading growth but it is limited to the younger grades and the effect is so small that the benefits from increasing class size are likely to be outweighed by its costs.”

    I believe that it should be, “DEcreasing class size”


  9. Devon Herrick says:

    There is probably some low-hanging fruit. But improvements beyond that will be incremental and more difficult to implement. Hand washing to control passing on germs or viruses is an easy fix. Safeguards to prevent wrong-side surgery are not hard to implement. Checklists like pilots use might help. Anesthesiologists have done much to improve the quality of their work through incremental improvement. The problem is that there is no real competition spurring incremental quality improvement in all of health care.

  10. Robert Sade says:

    Nice job, Linda. You got it exactly right again.

    One additional point could be made, namely, that generating outcome data could be useful in the future if current errors and inaccuracies can be mitigated, but, because one size does not fit all, their utility will not be in developing national or institutional policy. Rather, such data will best be used by patients and their personal physicians or medical advisors to apply to their individual circumstances, adding an objective component to subjective judgment.

  11. Greg Scandlen says:

    I have often said that I could parachute into any mid-sized town in the United States and find out in an afternoon who the best pediatrician is. Bureaucrats usually respond by saying I would only find out who the people THINK the best is, not supported by any objective measurement. True enough, but that is good enough for me, and probably better than any statistical measurement.

    How do we convert this notion into a payment system? By letting patients pay the doctor. We would pay less to the kid right out of medical school and more to the veteran who has seen it all. Less to the cold, uncaring SOB and more to the Marcus Welby. Less to Zeke Emanuel and more to Alieta Eck. Docs would be happier, patients would be happier, health care would be better, and the statisticians can find some other line of work.

  12. Linda Gorman says:

    Dr. Landes–Yes, that is an error. It should read the “benefits from decreasing class size.”

    Thank you for catching it.

  13. Jordan says:

    Excellent as always Linda.

  14. L., BRODY, M.D. says:

    GREAT ARTICLE LINDA. It reminds me when I was a federally employed MD many years ago, all the tough cases were routed to me, and other “experienced” physicians avoided the complex patients, either medically or psychologically complex. The experienced physicians eventually taught me that you get just as much statistical credit for treating a sore thumb, as you do for vague abdominal complaints, which take investigation and follow-up and could be anything from undiagnosed cancer to a tummy ache.

    What has happened to the Educational system, will happen to the Medical System, but big government gets the cash flow.

  15. Ron Bachman says:

    If we move to P4P the best metrics to use are those that have few clinical critics. I would suggest blood pressure, cholesterol levels, Body Mass Index, nicotine use, and waist size. P4P to providers (and pay for compliance to patients) can be aligned around these metrics. By using these metrics we avoid focusing on and potentially discriminating against one disease or condition. Nearly all medical conditions will improve if these metrics improve.

  16. Beverly Gossage says:

    Excellent, Linda! And what Greg said.

  17. Paul Nelson says:

    Many years ago, the English National Health Service developed an elaborate P4P system for their Primary Physicians. I am aware that some ten years later, it’s all been abandoned since accessibility, acceptability, efficiency and effectiveness issues did not change. They have also abandoned their entire electronic health record with a Policy decision in September of 2011. A transition was planned by dividing the country into five regions, each with its own leadership process to design a new electronic health record and Primary Health Care system, a commitment to innovation. Several of their leaders gave a seminar at the Datapalooza in Washington, June 2012.

  18. wanda j. Jones says:

    Linda and Friends….This is, indeed, a brilliant, timely and useful essay. I encourage you to beef it up into a full-blown policy article for one of the journals that Federal policy staff read, if they read anything.

    It has bothered me since this whole PPACA process started that it was clear that policy staff had been mesmerized by the possibility of measuring what physicians do and paying accordingly, but did not have the mental depth to work through what that would mean. Not only are there millions of medical and health professionals, and hundreds of millions of healthcare encounters with patient/customers, but only between 10 and 20% of outcomes is attributable to what the healthcare system does. Health status is maybe 40% what people do themselves, and you know all the other breakdowns. Even if healthcare could be charged with the full responsibility, it is beyond belief that the Federal government has the ability to obtain, analyze and interpret such data fairly and consistently without causing the professional class to modulate how their work is reported. The administrative cost of this process would far outweigh its value, and divert budget from more important purposes.

    Before this whole ACO model gets going, let’s ridicule this P4P movement. It’s just not worth the hassle. I love it that the ENGLISH have already killed it off; since we think they are so much better than our system, maybe we should pay attention. I urge all happy readers of John’s site to move the arrow on this idea from a “Gee Whiz, isn’t that great?” to “Have you ever heard of anything so unworkable and expensive?” Time for the grown-ups to push the adolescents aside and focus more on making it easier for providers and professionals to get out of the straitjacket of 20th century regulations and policies that just stop innovation in its tracks.


    Wanda Jones, President
    New Century Healthcare Institute
    San Francisco