18 August 2012

When Models Look Extra Ugly


Democrats are predictably excoriating Paul Ryan these days for his "attack" on Medicare. His voucher plan for the government-sponsored health care insurance for the elderly (and the not-nearly elderly; you can join at 65) diminishes benefits going forward.

Democrats are comparing Ryan's plan to perfection, or to the current system. Neither, of course, is an apt comparison. Medicare is rushing headlong off the financial cliff without reform, and threatening to take the entire American economy with it. Relative to insolvency, which is the ultimate fate of the program under everyone else's plan (i.e., no plan at all), Ryan's proposal is a revelation.

A similar debate is shaking the sabermetric community these days in the wake of a pair of disturbing arithmetic developments. But the conclusion is the same: a new idea should be measured not against perfection, but against the old paradigm.

First, some context. Ever since Voros McCracken unveiled the idea in 1999 that pitchers have limited control over what happens once their pitch is put into the field of play -- a conclusion now much more nuanced and qualified -- seamheads have been whipping up concoctions to strip luck and defense out of pitching results to determine how well pitchers are actually pitching.

One of the many formulae brewed by the stat wizards is FIP -- fielding independent pitching. Without excavating spreadsheets filled with higher math, it can be described this way: given the number of batters a pitcher has fanned, walked, hit with pitches and allowed to homer, how many runs should we expect to score against him given average defense and ballpark?

In July, Reds closer Aroldis Chapman broke FIP (in the words of CBS Sports). He faced 52 batters in 14 1/3 innings and whiffed 31 of them. He allowed just six hits, two walks, a HBP and nary a run, nailing down 13 saves. Batters posted a desultory .122/.173/.143 line against him. In other words, Chapman was Superman for a 30-day period. (He's been no slouch under other moons.)

Putting those numbers through the meat grinder yielded a FIP for Chapman of -0.99. That is, Chapman could be expected, given his performance, to yield minus-one run per nine innings. Certifiably nuts.
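For the curious, here is a rough sketch of the meat grinder. It uses the commonly published form of FIP; the league constant of 3.10 is my assumption (it changes every season), and the function and variable names are made up for illustration. Because the exact figure depends on that constant, don't expect this to reproduce the -0.99 quoted above, only the absurd negative sign.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Commonly published FIP formula.

    constant is an ASSUMED league constant (it varies by season);
    it is not taken from the original post.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Chapman's July line as given in the post: no runs allowed, so no homers.
chapman_ip = 14 + 1 / 3  # 14 1/3 innings
chapman_fip = fip(hr=0, bb=2, hbp=1, k=31, ip=chapman_ip)

print(round(chapman_fip, 2))  # negative: strikeouts swamp everything else
```

With no home runs and only three free passes, the 31 strikeouts drag the numerator so far below zero that no plausible constant can pull the result back into positive territory.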

Others have noted odd sightings in Baseball-Reference.com's WAR (Wins Above Replacement) calculations. WAR attempts to review all of a player's performance and all of the context and measure him against a Triple-A replacement at his position. In mid-August, Cubs keystoner Darwin Barney had a higher WAR than Brewers slugger Ryan Braun. Just for context, Barney is hitting .268/.309/.386, or about 12% below the MLB average. Braun, at .301/.380/.526, is 53% above average. Braun is also superior in the baserunning department and has grounded into fewer double plays. Moreover, Wrigley Field and Miller Park are about equally kind to hitters, and both batters suffer equally in their inability to bat against their own team's woeful pitching staff.

What WAR sees that we don't is defense. It credits Barney with defense worth double that of any other second-sacker, while Braun's left-field stylings more resemble the staccato flight of a pigeon. Subjectively, there appears to be a grain of truth here, but clearly not enough that anyone in Wisconsin would trade Braun for every Barney ever known, including the purple dinosaur. WAR is suffering convulsions and has been put on bed rest, at the very least.

These two developments have led some to declare that FIP and WAR are in their death throes. The projection models have failed, they say, and it's time to put them out of their misery. But in the words of that great philosopher Quick Draw McGraw, "Now hold on there just a doggone minute, Baba Looey."

No model is perfection, not even Adriana Lima. Clearly, outlier performances like Chapman's make mincemeat of statistical models. As I've mentioned before, quantitative analysis of defense is just in the bloom of its youth, beholden to bursts of impaired judgment. I rarely rely on WAR or WARP and pay much more attention to offensive valuations, leavened by a general sense of a player's glovework.

Nonetheless, FIP is an extremely useful tool, and though the value it applies to Chapman's performance is nonsense, it's not really wrong. FIP says Chapman was virtually unhittable, and by golly, he was. Only 18 batters out of 52 could even put the ball in play.
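The ball-in-play count is simple subtraction from the July line quoted earlier. A quick sketch, with variable names of my own choosing and the assumption that every plate appearance ended in a strikeout, a walk, a hit batsman or a ball in play:

```python
# Chapman's July totals as reported in the post.
batters_faced = 52
strikeouts = 31
walks = 2
hit_by_pitch = 1

# ASSUMPTION: nothing else (e.g. catcher's interference) ended a plate appearance.
balls_in_play = batters_faced - strikeouts - walks - hit_by_pitch
strikeout_rate = strikeouts / batters_faced

print(balls_in_play)            # 18
print(f"{strikeout_rate:.1%}")  # 59.6%
```

Nearly three of every five batters never got the bat on the ball in fair territory, which is all FIP is trying to say, however clumsily.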

The real question then is, are these models less imperfect than the old tools? WAR certainly measures something bigger and more relevant than Triple Crown stats do, but whether it measures relative positional value better than Triple Crown stats measure hitting performance would require some sort of study. There's no doubt that FIP tells us much more about a pitcher than W-L and ERA; it's been shown to be vastly better at projecting performance.

All of which presents the same moral that every other development in quantitative analysis has: the state of the art is improving, but will never be perfect. After all, these models measure the performances of people.
