Yesterday, something called Flavorwire had a post comparing the Metacritic and Pitchfork scores of albums recently released by women over forty. The post suggested that since Pitchfork gave lower scores than Metacritic’s aggregations to all but one of these, they must be biased against old women. I came across it via a post by Barthel, who (along with A Grammar) rightly pointed out how poor the statistical analysis was, since it lacked any control comparisons. Barthel then used an unspecified and “extremely narrow cherry-picked sample” to conclude that Pitchfork is just biased against everyone, and old women aren’t special.
With my Jeopardy! stuff starting to bore me, I thought this would be a good new project. So today, I created a nice little 5442x89 dataset (available on request) using Metacritic. Each observation is an album, and 85 of the columns are the scores Metacritic attributed to the reviews of different publications. So basically, we have several (nowhere near 85) scores for each album.
Although we are going to hold off on Pitchfork’s age and gender effects (for which I would need to get some age and gender data), this is exactly the right data to examine whether Pitchfork’s scores are systematically lower than Metacritic’s.
As you can see from the above Hexagonal Binning Plot, they totally are. I hand-drew in the boulevard of equality, where all the points would be located if Pitchfork and Metacritic agreed every time. Since things skew well below the line, Pitchfork gives lower ratings than Metacritic. This is based on 3,467 reviews of albums covered by both Pitchfork and Metacritic.
Here are a few means to drive this home. In the 3,467 overlapping entries, Pitchfork reviews average a score of 67.67 (scaled to 100) versus an average Metacritic score of 73.16, an average difference of 5.49 points. If, as Barthel suggested, you factor in the standard deviations of all the reviews that Metacritic uses for each album, Pitchfork scores albums 0.293 standard deviations below the critical consensus represented in the Metacritic score. In fact, Pitchfork is lower than Metacritic 67% of the time.
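For anyone who wants to redo this arithmetic on their own data, here is a minimal Python sketch of the four summary numbers above. Everything in it is illustrative: the toy rows are invented, the `albums` layout and the `compare` helper are hypothetical names, and it assumes "factoring in the standard deviations" means dividing each album's gap by the sample SD of that album's individual review scores. The real numbers came out of R, not this code.

```python
from statistics import mean, stdev

# Toy rows, invented purely for illustration (NOT from the real dataset).
# Each album has Metacritic's aggregate score plus the individual review
# scores keyed by publication, all on a 0-100 scale.
albums = [
    {"metascore": 80, "reviews": {"Pitchfork": 70, "Pub A": 80, "Pub B": 90}},
    {"metascore": 60, "reviews": {"Pitchfork": 65, "Pub A": 55, "Pub B": 60}},
    {"metascore": 75, "reviews": {"Pitchfork": 60, "Pub A": 80, "Pub B": 85}},
    {"metascore": 70, "reviews": {"Pub A": 70, "Pub B": 72}},  # no Pitchfork review
]

def compare(albums, pub="Pitchfork"):
    """Summarize how one publication's scores sit relative to the metascore."""
    # Keep only the albums that the publication actually reviewed.
    both = [a for a in albums if pub in a["reviews"]]
    # Gap per album: publication score minus Metacritic score.
    gaps = [a["reviews"][pub] - a["metascore"] for a in both]

    # Assumption: standardize each gap by the sample SD of that album's reviews.
    # Albums with a single review or zero spread are skipped here.
    z_gaps = []
    for a, gap in zip(both, gaps):
        scores = list(a["reviews"].values())
        if len(scores) > 1 and stdev(scores) > 0:
            z_gaps.append(gap / stdev(scores))

    return {
        "n": len(both),
        "mean_pub": mean([a["reviews"][pub] for a in both]),
        "mean_meta": mean([a["metascore"] for a in both]),
        "mean_gap": mean(gaps),
        "mean_z_gap": mean(z_gaps),
        "share_lower": sum(g < 0 for g in gaps) / len(gaps),
    }

result = compare(albums)
for name, value in result.items():
    print(name, round(value, 3))
```

A negative `mean_gap` and a `share_lower` above one half are what "systematically lower" looks like in this setup.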
Footnotes:
- There is some obvious endogeneity here, since a Pitchfork score factors into a Metacritic score. If I weren’t interested in preserving Metacritic’s weights, and if I were more proficient in R, it would probably make sense to take the Pitchfork-less average of the other reviews. Of course, that would only show a larger negative bias in the Pitchfork reviews.
- Fun fact: the largest discrepancy between Pitchfork and the critical consensus was Northern State’s 2003 release Dying In Stereo, which Pitchfork gave a 0.8 (8 on our scale) versus the 77 Metacritic score (that was 4.23 SDs).
- After collecting and organizing the data in Python, I was again working in R (including for the chart). I’m a complete novice in R, so it is likely I have made a mistake.
- The hexagonal binning plot is just a fancypants scatter plot for when the data overlaps too much to see regular dots. The darker the hexagon, the more observations are concentrated there. The scale on the left lays this out for you.
- Barthel and A Grammar are both excellent tumblrs.
- Clearly, this is a dataset I’m going to use more than once. So, look out for that.
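
On the endogeneity footnote: the Pitchfork-less average is easy enough to sketch in Python. This is a rough illustration only. The helper names `mean_without` and `leave_one_out_gap` and the toy scores are made up, and it uses a plain unweighted mean of the other reviews, which is exactly the step that throws away Metacritic's weights.

```python
from statistics import mean

def mean_without(reviews, pub="Pitchfork"):
    """Unweighted mean of every review except the named publication's."""
    others = [score for name, score in reviews.items() if name != pub]
    return mean(others) if others else None

def leave_one_out_gap(reviews, pub="Pitchfork"):
    """Publication score minus the average of everyone else's scores."""
    if pub not in reviews:
        return None  # the publication did not review this album
    baseline = mean_without(reviews, pub)
    if baseline is None:
        return None  # nobody else reviewed it, so there is nothing to compare
    return reviews[pub] - baseline

# Toy album, invented for illustration (not from the real dataset).
reviews = {"Pitchfork": 60, "Pub A": 80, "Pub B": 85}

gap_including = reviews["Pitchfork"] - mean(reviews.values())
gap_excluding = leave_one_out_gap(reviews)
print(gap_including, gap_excluding)
```

With an unweighted mean of n reviews, the leave-one-out gap is the original gap multiplied by n/(n-1), so it always has the same sign and a larger size. That is why dropping Pitchfork from the consensus can only make the measured negative bias bigger.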