On The Temptation Of Microsplits

It seems like it should be important to know how a certain batter has done against a certain pitcher. It is not. It is not important.

Oct 14, 2011 - The other day, Brewers manager Ron Roenicke was drafting a lineup to put up against Cardinals righty Chris Carpenter. When it came to picking a center fielder, Roenicke had three options: Nyjer Morgan, Carlos Gomez or Mark Kotsay. He wound up going with Kotsay - not just because Morgan seemed to be in a bit of a slump, but also because Kotsay is left-handed, and was 4-for-11 against Carpenter in his career.

When I read that explanation, I sighed. I imagine that many people sighed. Of course, Kotsay would go on to homer against Carpenter and draw a pair of walks, but that's kind of beside the point. In retrospect, I probably should have chosen a better example.

Maybe it's my imagination, but I feel like I'm being bombarded with microsplits this month more than ever before. By "microsplits", I'm referring to any small-sample statistical split in general, but specifically, for this post, I'm referring to batter vs. pitcher data. Batter vs. pitcher data is everywhere I turn. It's all over Twitter, from reputable and less-reputable sources alike. It's being cited by managers. It's being discussed on the air. All the time, it's being discussed on the air. This guy has this many hits in this many at bats against this other guy. Over and over.

In a vacuum, that's okay. Batter vs. pitcher data is information. There's nothing wrong with information. The problem is how that information is interpreted. That information is treated like it's meaningful, where by "meaningful" I mean "predictive". A lot of people act like, because a matchup went a certain way in the past, it will keep going that way in the future.

And that isn't true. With batter vs. pitcher data, that isn't true. Dave Cameron wrote a good post about this at FanGraphs a short while back, and you should read it. In short: this data isn't predictive. Thorough examination has shown that this data isn't predictive. It's not like batter vs. pitcher data is completely, 100 percent irrelevant, but it has to be so heavily regressed that you might as well not have the data at all. You're better off looking at the observed overall performances by a given hitter and pitcher.
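
If you want to see what "heavily regressed" looks like in practice, here's a minimal sketch. It uses a simple shrink-toward-the-overall-number formula, which is a stand-in picked for illustration and not necessarily the method used in the research. The .260 overall expectation, the 500 at-bat regression weight and the function name are all made-up assumptions; only the 4-for-11 comes from the Kotsay example above.

```python
def regressed_average(hits, at_bats, expected_avg, regression_ab):
    """Blend a small matchup sample with a broader expectation.

    regression_ab acts like a pile of phantom at-bats at expected_avg;
    the bigger it is, the less the matchup sample moves the estimate.
    """
    return (hits + expected_avg * regression_ab) / (at_bats + regression_ab)


# Kotsay's 4-for-11 against Carpenter comes from the post.
# The .260 expectation and the 500 at-bat weight are assumptions for illustration.
raw = 4 / 11
estimate = regressed_average(hits=4, at_bats=11, expected_avg=0.260, regression_ab=500)

print(f"raw matchup average: {raw:.3f}")
print(f"regressed estimate:  {estimate:.3f}")
```

Under those assumptions, the 4-for-11 nudges the estimate by only a couple of points of batting average, which is the "might as well not have the data" part.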

Okay, so for many of you, this isn't news. It isn't exactly an Internet revelation that batter vs. pitcher data is of little use. But I think it's worth considering why such data is still treated as significant, even though it's essentially been proven that it is not.

The first reason, and the main reason, is that, intuitively, batter vs. pitcher data seems perfect. It seems like exactly the data you should want. Let's say you're a manager putting together a starting lineup. The other team is starting a lefty on the mound. When you're making your lineup, you don't think about your hitters' overall performances - you think about their performances against lefties. You do that because it gives you a better idea of how they'll perform against this particular lefty. But what if they already have an established performance against this particular lefty? In theory, shouldn't that give you an even better idea of how they'll do? What better way to predict how someone will do against someone else than by examining how that specific matchup has gone in the past?

That isn't how it works. But it feels like that should be how it works. It makes so much intuitive sense that it can be hard to believe it doesn't make actual sense.

A second reason, and a lesser reason, is that I think people are wired to not care too much about sample size. It would be one thing if a batter had faced a pitcher 1,000 times. Then that information would be significant. More commonly, a batter has faced a pitcher 10 or 20 or 30 times, and so that information is not significant. The sample size is far too small, spread over too many years, for anything to be made of it.
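
To put rough numbers on the sample-size point, here's a small sketch that treats every at-bat as an independent coin flip with the same chance of a hit. That's a simplifying assumption, and so is the .260 "true" average; neither is a real player's number, and the function names are just for illustration.

```python
from math import comb, sqrt


def prob_at_least(hits, at_bats, true_avg):
    """Exact binomial chance of getting at least `hits` hits in `at_bats` tries."""
    return sum(
        comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(hits, at_bats + 1)
    )


def standard_error(at_bats, true_avg):
    """Typical random wobble in an observed average over `at_bats` at-bats."""
    return sqrt(true_avg * (1 - true_avg) / at_bats)


TRUE_AVG = 0.260  # assumed, not a real player's number

print(f"chance of 4+ hits in 11 AB: {prob_at_least(4, 11, TRUE_AVG):.2f}")
for n in (11, 30, 1000):
    print(f"{n:>5} AB: average wobbles by about +/- {standard_error(n, TRUE_AVG):.3f}")
```

With these assumed numbers, an ordinary .260 hitter goes 4-for-11 or better roughly three times in ten by luck alone. The wobble in the observed average is well over 100 points at 11 at-bats, and it only shrinks to about 14 points once you get to 1,000.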

But worrying that a sample is too small doesn't come naturally. People make quick judgments based on very limited information all the time. Think about your opinions of other people you've just met. Think about cities or countries you've visited once or twice. Yelp is a website built around members publishing reviews of establishments, often based on a single experience. That's crazy! But we're always doing it. We seldom wait for a sample to be big enough in life, and many seldom wait for a sample to be big enough in baseball.

We want for batter vs. pitcher data to matter. It seems too perfect for it to not matter. It will never matter. Never, for as long as baseball is played as it's currently played. It's just a meaningless microsplit. There's that old joke about statheads worrying about how a batter does against lefties on Tuesday nights in domes between the fourth and sixth innings. The ingredients of the joke change, but the joke itself stays the same: statheads worry about ridiculous microsplits. In reality, it isn't the statheads who concern themselves with ridiculous microsplits.

Jeff Sullivan

Editor

I started blogging about the Seattle Mariners at Leone For Third in December of 2003, and I joined SBN and founded Lookout Landing in January 2005. I can see outside from my room, which is good...


Comments

The only worse thing is hearing "he's 2 for 13 against this pitcher, so he's due"

and knowing that they really think this increases the likelihood of getting a hit in this at-bat

by cfj3 on Oct 14, 2011 4:41 PM EDT

My second favorite abused stat

Batter v. Team, either in season or over career.

You don’t have to understand regression or small sample size to understand how meaningless that stat is. Hey look, this batter hits .300 against teams wearing blue and white!

HangingSliders.com
A Smart & Sassy Baseball Blog
@hangingsliders
facebook.com/hangingsliders

by Wendy Thurm on Oct 14, 2011 4:46 PM EDT

The best* batter vs pitcher splits are the ones like “Smith is hitting .400 against Jones in his career,” and you find out he went 2 for 5 six years ago.

* By best, I mean worst.

by Phrozen on Oct 14, 2011 7:03 PM EDT

There’s that old joke about statheads worrying about how a batter does against lefties on Tuesday nights in domes between the fourth and sixth innings. The ingredients of the joke change, but the joke itself stays the same: statheads worry about ridiculous microsplits. In reality, it isn’t the statheads who concern themselves with ridiculous microsplits.

I hate this joke so, so much.

Juan "Doesn't Cheat The Game" Perez, future CF for the World Champion San Francisco Giants.
"And besides, if I wanted to participate in a mindless patriotic ritual where my voice isn’t really heard, I would vote." - Chris Marcil

by marcello on Oct 15, 2011 3:17 AM EDT

The postseason is full of microsplits

About the LCS or so, the networks actually stop showing you a guy’s regular season batting line and start showing you his postseason batting line.

Which is like five games’ worth. I mean, who gives a s**t? (Well, apparently, viewers do, so that they can pour pointless vituperation out on “chokers.” But I digress.)

The point, I think, is that while no information is literally of zero value, it requires a substantial opportunity cost to process information. Crappy information crowds out good information. Once you remember to factor that in, it really is literally true that there is “something wrong” with microsplits.

"We don't want our people to be preoccupied with seminude, crazy men jumping up and down who are chasing an inflated object," said Sheik Mohamed Osman Arus, head of operations for the Hizbul Islam insurgent group.

by PaulThomas on Oct 15, 2011 4:14 PM EDT

easy road...

hi jeff,

i think you took the easy road and stated the conventional SABR theory.
maybe you should try being a bit more contrarian sometimes (for this audience).

comments:

  • i have “The book” and will review that chapter, but on the surface it seems you have such a small sample size to determine a baseline, how is there a sufficient sample size AFTER the baseline to conclude anything? and is the sample after the “baseline” (presumably 3 years) still accurate? i.e. are both players still in their prime, etc. (from year 1 to year 6)
  • if a player in a GIVEN YEAR is 2 for 4 with 2 HRs against a relief pitcher and then the next time he faces him hits a HR, is that just standard probability? or is this a good match up for the hitter? the pitcher’s out pitch is in the hitter’s zone perhaps. batter picks up ball well and rarely chases this pitcher’s balls out of the zone; pitcher’s fastball and offspeed pitches just happen to match well with the batter’s bat speed?
  • note: when you have a small sample size, that is when you want/need to see the actual data points. whether or not you can conclude anything is another topic… but were the balls hit hard, did batter chase balls out of zone…

YOUR BIG FINISH:
>> We want for batter vs. pitcher data to matter. It seems too perfect for it to not
>> matter. It will never matter. Never, for as long as baseball is played as
>> it’s currently played. It’s just a meaningless microsplit.

  • btw, rickey henderson in his career is 0-9 with 9 strikeouts against rich gossage (maybe 2 walks). if you have rickey on your bench to pinch hit against the goose (we’ll ignore other pinch hitting options now), would that factor into your decision? if the 9 ABs were all in the same year would it still NOT matter to you??? (9 is a small sample size…)
  • btw #2: rickey didn’t do too well against Nolan Ryan either, which supports Dave C’s comments that the type of pitcher is more “significant” than the individual pitcher.
  • also (from memory only) derek jeter had a terrible EXTENDED streak against Mike Timlin; (timlin “owned” him); i may be more inclined to put value in splits vs. relief pitchers than starting pitchers. but assume Jeter was 1 for 18, you think the odds of him getting a hit the next time up are above or below Jeter’s lifetime BA? wonder what betting line vegas would give it.

anyway, enjoy the playoffs.

by no_name on Oct 15, 2011 4:48 PM EDT
